title
Lecture 6 - Support Vector Machines | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)

description
For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3Gchxyg
Andrew Ng, Adjunct Professor of Computer Science: https://www.andrewng.org/
To follow along with the course schedule and syllabus, visit: http://cs229.stanford.edu/syllabus-autumn2018.html

detail
{'title': 'Lecture 6 - Support Vector Machines | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)', 'heatmap': [{'end': 1705.401, 'start': 1600.632, 'weight': 1}, {'end': 1898.118, 'start': 1843.615, 'weight': 0.834}, {'end': 2964.868, 'start': 2811.363, 'weight': 0.972}, {'end': 3307.568, 'start': 3255.588, 'weight': 0.841}, {'end': 3840.301, 'start': 3737.128, 'weight': 0.853}, {'end': 4129.452, 'start': 4023.226, 'weight': 0.72}, {'end': 4761.893, 'start': 4706.751, 'weight': 0.925}], 'summary': 'The lecture series covers various topics including naive bayes, laplace smoothing, estimation of win probability, machine learning algorithms, word embeddings, support vector machines (svm), and linear classifiers, providing insights into their applications in spam classification, text classification, and parameter estimation with examples and practical considerations for implementing efficient anti-spam algorithms.', 'chapters': [{'end': 72.874, 'segs': [{'end': 72.874, 'src': 'embed', 'start': 28.394, 'weight': 0, 'content': [{'end': 34.236, 'text': 'Uh, you need to add to the Naive Bayes algorithm we described on Monday to really make it work.', 'start': 28.394, 'duration': 5.842}, {'end': 38.597, 'text': 'um for, say, email spam classification or- or for text classification.', 'start': 34.236, 'duration': 4.361}, {'end': 43.759, 'text': "Uh, and then we'll talk about a different version of Naive Bayes that's even better than the one we've been discussing so far.", 'start': 39.137, 'duration': 4.622}, {'end': 49.302, 'text': 'Um, talk a little bit about, uh, advice for applying machine learning algorithms.', 'start': 44.559, 'duration': 4.743}, {'end': 54.646, 'text': 'So this will be useful to you as you get started on your, uh, CS329 class projects as well.', 'start': 49.362, 'duration': 5.284}, {'end': 58.149, 'text': 'This is a strategy of how to choose an algorithm and what to do first and what to do second.', 'start': 54.666, 'duration': 3.483}, {'end': 63.293, 'text': "Uh, and then we'll start with, um, intro to support vector machines.", 'start': 58.889, 'duration': 4.404}, {'end': 72.874, 'text': 'Okay, Um, so to recap, uh, The Naive Bayes algorithm is a generative learning algorithm in which,', 'start': 63.633, 'duration': 9.241}], 'summary': 'Enhance naive bayes for email spam or text classification. 
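(Illustrative aside, not part of the lecture materials: the recap in the transcript below describes mapping an email onto a dictionary and filling in zeros and ones depending on which words appear. A minimal sketch of that indicator-feature representation, using a tiny hypothetical dictionary and toy emails, might look like this.)

```python
# Sketch (not course code) of the multivariate Bernoulli / indicator feature
# representation for an email. A real system would use a dictionary of roughly
# 10,000 words; this toy dictionary is hypothetical.
dictionary = ["buy", "drugs", "now", "lecture", "homework"]

def email_to_features(email_text):
    """Map an email to a 0/1 vector: x_j = 1 if dictionary word j appears at all."""
    words_in_email = set(email_text.lower().split())
    return [1 if word in words_in_email else 0 for word in dictionary]

print(email_to_features("Buy drugs now, buy now"))    # [1, 1, 1, 0, 0]
print(email_to_features("Homework for the lecture"))  # [0, 0, 0, 1, 1]
```
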
also, learn about a better version of naive bayes and get advice for applying machine learning algorithms.', 'duration': 44.48, 'max_score': 28.394, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg28394.jpg'}], 'start': 4.482, 'title': 'Naive bayes and spam classification', 'summary': 'Discusses the use of naive bayes for spam classification, including laplace smoothing, improved naive bayes, and advice for applying machine learning algorithms in cs329 class projects.', 'chapters': [{'end': 72.874, 'start': 4.482, 'title': 'Naive bayes and spam classification', 'summary': 'Discusses the use of naive bayes as a generative learning algorithm to build a spam classifier, including the addition of laplace smoothing and a better version of naive bayes, and provides advice for applying machine learning algorithms, relevant to cs329 class projects.', 'duration': 68.392, 'highlights': ['The Naive Bayes algorithm is used to build a spam classifier for email or text classification, and Laplace smoothing is described as an essential addition to make it work effectively.', 'The chapter provides advice for applying machine learning algorithms, which is useful for CS329 class projects, offering a strategy for choosing an algorithm and sequencing tasks.', 'An introduction to support vector machines is mentioned as the next topic to be covered, indicating the progression of the learning content.', 'The discussion of Naive Bayes and its application to spam classification is the central theme of the chapter, demonstrating its relevance and importance in the context of machine learning.']}], 'duration': 68.392, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg4482.jpg', 'highlights': ['The discussion of Naive Bayes and its application to spam classification is the central theme of the chapter, demonstrating its relevance and importance in the context of machine learning.', 'The Naive Bayes algorithm is used to build a spam classifier for email or text classification, and Laplace smoothing is described as an essential addition to make it work effectively.', 'The chapter provides advice for applying machine learning algorithms, which is useful for CS329 class projects, offering a strategy for choosing an algorithm and sequencing tasks.', 'An introduction to support vector machines is mentioned as the next topic to be covered, indicating the progression of the learning content.']}, {'end': 657.857, 'segs': [{'end': 96.872, 'src': 'embed', 'start': 72.874, 'weight': 0, 'content': [{'end': 80.299, 'text': 'given a piece of email or Twitter message or some piece of text, um, take a dictionary and put in zeros and ones,', 'start': 72.874, 'duration': 7.425}, {'end': 84.807, 'text': 'depending on whether different words appear in a particular email.', 'start': 80.299, 'duration': 4.508}, {'end': 90.409, 'text': "And so this becomes your feature representation for, say, an email that you're trying to classify as spam or not spam.", 'start': 84.987, 'duration': 5.422}, {'end': 96.872, 'text': "Um, so using the indicator function notation, um, oh, I shouldn't use j.", 'start': 91.01, 'duration': 5.862}], 'summary': 'Using a dictionary to represent emails with zeros and ones for classification.', 'duration': 23.998, 'max_score': 72.874, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg72874.jpg'}, {'end': 150.245, 'src': 'embed', 'start': 125.254, 
'weight': 1, 'content': [{'end': 132.737, 'text': 'Uh, so Gaussian Discriminant Analysis models these two terms with a Gaussian and Bernoulli respectively, and Naive Bayes uses a different model.', 'start': 125.254, 'duration': 7.483}, {'end': 134.798, 'text': 'And with Naive Bayes in particular.', 'start': 133.437, 'duration': 1.361}, {'end': 142.041, 'text': 'P of x given y is modeled as a um product of the conditional probabilities of the different features.', 'start': 134.798, 'duration': 7.243}, {'end': 143.882, 'text': 'given the class label y,', 'start': 142.041, 'duration': 1.841}, {'end': 150.245, 'text': 'And so the parameters of the Naive Bayes model are um, phi subscript y is the class prior.', 'start': 144.762, 'duration': 5.483}], 'summary': 'Gaussian discriminant analysis uses gaussian and bernoulli models, while naive bayes uses a different model with conditional probability features.', 'duration': 24.991, 'max_score': 125.254, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg125254.jpg'}, {'end': 227.982, 'src': 'embed', 'start': 171.029, 'weight': 2, 'content': [{'end': 187.016, 'text': 'and so if you derive the maximum likelihood estimates, um, you will find that the maximum likelihood estimates of you know phi y is this right?', 'start': 171.029, 'duration': 15.987}, {'end': 190.258, 'text': 'Just a fraction of training examples.', 'start': 187.036, 'duration': 3.222}, {'end': 192.399, 'text': 'um, that was equal to spam.', 'start': 190.258, 'duration': 2.141}, {'end': 197.541, 'text': 'And the maximum likelihood estimates of this.', 'start': 195.4, 'duration': 2.141}, {'end': 221.397, 'text': 'And this is just the indicator function, notation.', 'start': 217.634, 'duration': 3.763}, {'end': 222.578, 'text': 'way of writing.', 'start': 221.397, 'duration': 1.181}, {'end': 227.982, 'text': 'um, look at all of your uh emails with label y equals 0 and count up.', 'start': 222.578, 'duration': 5.404}], 'summary': 'Derive maximum likelihood estimates for phi y and spam emails.', 'duration': 56.953, 'max_score': 171.029, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg171029.jpg'}, {'end': 337.711, 'src': 'embed', 'start': 307.768, 'weight': 3, 'content': [{'end': 311.649, 'text': 'Um, one of the top machine learning conferences is the conference NIPS.', 'start': 307.768, 'duration': 3.881}, {'end': 313.97, 'text': 'NIPS stands for Neural Information Processing Systems.', 'start': 311.949, 'duration': 2.021}, {'end': 320.632, 'text': "Um, uh, and, uh, let's say that in your dictionary, you know, you have 10,000 words in your dictionary.", 'start': 314.33, 'duration': 6.302}, {'end': 328.875, 'text': "Let's say that the NIPS conference, the word NIPS corresponds to word number, uh, 6017, right, in your, you know, in your 10,000 word dictionary.", 'start': 320.652, 'duration': 8.223}, {'end': 337.711, 'text': "But up until now, presumably, you've not had a lot of emails from your friends asking hey,", 'start': 331.107, 'duration': 6.604}], 'summary': 'Nips conference is a top machine learning event with 10,000-word dictionary.', 'duration': 29.943, 'max_score': 307.768, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg307768.jpg'}, {'end': 605.89, 'src': 'embed', 'start': 550.94, 'weight': 4, 'content': [{'end': 556.783, 'text': 'to estimate the probability of something as 0, just because you have not seen it
once yet, right?', 'start': 550.94, 'duration': 5.843}, {'end': 568.718, 'text': 'Um. so, what I want to do is, uh, describe to you Laplace smoothing, which is a technique that helps, um, address this problem, okay?', 'start': 557.543, 'duration': 11.175}, {'end': 571.918, 'text': "And um, let's- let's, uh.", 'start': 569.218, 'duration': 2.7}, {'end': 578.98, 'text': 'uh, in order to motivate Laplace smoothing, let me, um, use a uh, uh, uh, yeah.', 'start': 571.918, 'duration': 7.062}, {'end': 588.614, 'text': "Let me use a different example for now, right? Um, Let's see.", 'start': 579.6, 'duration': 9.014}, {'end': 589.455, 'text': 'All right.', 'start': 589.175, 'duration': 0.28}, {'end': 593.879, 'text': 'So, you know, several years ago, this is- this is older data, but several years ago.', 'start': 590.016, 'duration': 3.863}, {'end': 595.16, 'text': 'So- so let me put aside Naive Bayes.', 'start': 593.919, 'duration': 1.241}, {'end': 596.241, 'text': 'I want to talk about Laplace-Rouvain.', 'start': 595.18, 'duration': 1.061}, {'end': 598.123, 'text': "I'll come back to Laplace-Rouvain and Naive Bayes.", 'start': 596.261, 'duration': 1.862}, {'end': 605.89, 'text': 'So several years ago, I was tracking the progress of the Stanford football team, um, just a few years ago now.', 'start': 598.643, 'duration': 7.247}], 'summary': 'Introducing laplace smoothing to address probability estimation problem with a motivating example from stanford football team.', 'duration': 54.95, 'max_score': 550.94, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg550940.jpg'}], 'start': 72.874, 'title': 'Naive bayes and laplace smoothing', 'summary': 'Details the use of naive bayes model for classifying spam emails, including feature representation, generative model, and maximum likelihood estimates, and discusses the application of cs229 class projects to academic conferences. 
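(Illustrative aside: the Laplace smoothing idea motivated here, and worked through with the Stanford football example further down in this transcript, comes down to adding 1 to each outcome's count and adding the number of possible outcomes to the denominator. A small sketch under those assumptions, not code from the course, follows.)

```python
# Hedged sketch of Laplace smoothing: with 0 wins and 4 losses, maximum
# likelihood says P(win) = 0/4 = 0, which is statistically a bad estimate.
# Laplace smoothing pretends each outcome was seen one extra time.
def laplace_estimate(count_k, total, num_outcomes):
    """P(outcome k) = (count_k + 1) / (total + num_outcomes)."""
    return (count_k + 1) / (total + num_outcomes)

wins, losses = 0, 4
p_win_mle = wins / (wins + losses)                        # 0.0 -- too extreme
p_win_laplace = laplace_estimate(wins, wins + losses, 2)  # (0+1)/(4+2) = 1/6
print(p_win_mle, p_win_laplace)
```
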
it also presents laplace smoothing as a technique to address the issue of estimating probabilities as 0 when encountering new words in emails, using a stanford football team example.', 'chapters': [{'end': 291.44, 'start': 72.874, 'title': 'Naive bayes classification model', 'summary': 'Discusses the use of naive bayes model for classifying spam emails, detailing the feature representation, generative model, and maximum likelihood estimates.', 'duration': 218.566, 'highlights': ['Naive Bayes model uses zeros and ones to represent features in emails for spam classification, with the parameters including phi subscript y, phi subscript j given y equals 0, and phi subscript j given y equals 1.', 'The generative model for Naive Bayes involves modeling the terms p of x given y and p of y, where p of x given y is expressed as a product of conditional probabilities of different features given the class label y.', 'The maximum likelihood estimates for the Naive Bayes model involve calculating the fraction of training examples equal to spam and the fraction of emails with label y equals 0 in which a specific word xj appeared.']}, {'end': 451.736, 'start': 291.44, 'title': 'Cs229 class projects and naive bayes algorithm', 'summary': 'Discusses the submission of cs229 class projects to academic conferences, with a focus on the nips conference and highlights the limitation of using maximum likelihood estimates of parameters in the naive bayes algorithm.', 'duration': 160.296, 'highlights': ['The NIPS conference is mentioned as one of the top machine learning conferences, and some CS229 class projects get submitted as conference papers every year.', "The limitations of using maximum likelihood estimates of parameters in the naive Bayes algorithm are discussed, emphasizing that it is statistically a bad idea to estimate the chance of something as 0 just because it hasn't been seen yet.", "The discussion highlights the issue of estimating parameters using the example of the word 'NIPS' in emails, where the estimate of the probability of seeing the word given that it's a spam email is 0, and the consequences of using these estimates in the naive Bayes algorithm."]}, {'end': 657.857, 'start': 452.822, 'title': 'Laplace smoothing for email classification', 'summary': 'Discusses the issue of estimating probabilities as 0 when encountering new words in emails, presenting laplace smoothing as a technique to address this problem and illustrating it with a stanford football team example.', 'duration': 205.035, 'highlights': ['The technique of Laplace smoothing is introduced to address the problem of estimating probabilities as 0 when encountering new words in emails, preventing divide by 0 errors and statistically addressing the issue (e.g., estimating the probability of something as 0 just because it has not been seen once yet).', "The example of tracking the Stanford football team's progress and their away game outcomes is used to illustrate the concept of Laplace smoothing in a different context, providing a practical application of the technique in a real-world scenario."]}], 'duration': 584.983, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg72874.jpg', 'highlights': ['Naive Bayes model uses zeros and ones to represent features in emails for spam classification', 'The generative model for Naive Bayes involves modeling the terms p of x given y and p of y', 'The maximum likelihood estimates for the Naive Bayes model involve calculating the fraction of 
training examples equal to spam', 'The NIPS conference is mentioned as one of the top machine learning conferences', 'The technique of Laplace smoothing is introduced to address the problem of estimating probabilities as 0', 'The limitations of using maximum likelihood estimates of parameters in the naive Bayes algorithm are discussed', "The example of tracking the Stanford football team's progress and their away game outcomes is used to illustrate the concept of Laplace smoothing"]}, {'end': 1057.597, 'segs': [{'end': 764.614, 'src': 'embed', 'start': 710.453, 'weight': 2, 'content': [{'end': 715.177, 'text': "They lost four games, but you say nope, the chance that they're winning is zero, absolute certainty, right?", 'start': 710.453, 'duration': 4.724}, {'end': 717.299, 'text': 'And- and just statistically, this is not.', 'start': 715.217, 'duration': 2.082}, {'end': 719.561, 'text': 'um, this is not a good idea.', 'start': 717.299, 'duration': 2.262}, {'end': 737.059, 'text': "Um, and so, with Laplace smoothing, what we're going to do is uh, imagine that we saw the positive outcomes, the number of wins, you know,", 'start': 719.581, 'duration': 17.478}, {'end': 744.124, 'text': 'just add 1 to the number of wins we actually saw, and also the number of losses.', 'start': 737.059, 'duration': 7.065}, {'end': 744.964, 'text': 'add 1, right?', 'start': 744.124, 'duration': 0.84}, {'end': 749.047, 'text': 'So if you actually saw 0 wins, pretend, you saw 1 and you saw 4 losses.', 'start': 745.004, 'duration': 4.043}, {'end': 750.928, 'text': 'pretend, you saw 1 more than you actually saw.', 'start': 749.047, 'duration': 1.881}, {'end': 758.073, 'text': 'And so Laplace smoothing, gonna end up adding 1 to the numerator and adding 2 to the denominator.', 'start': 751.649, 'duration': 6.424}, {'end': 763.112, 'text': 'And so this ends up being 1 over 6.', 'start': 758.914, 'duration': 4.198}, {'end': 764.614, 'text': "and that's actually a more reasonable.", 'start': 763.112, 'duration': 1.502}], 'summary': 'Using laplace smoothing, the chance of winning is 1/6, a more reasonable estimate.', 'duration': 54.161, 'max_score': 710.453, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg710453.jpg'}, {'end': 846.18, 'src': 'embed', 'start': 819.713, 'weight': 4, 'content': [{'end': 824, 'text': "And this is actually an optimal estimate under- I'll say- I'll say the set of assumptions.", 'start': 819.713, 'duration': 4.287}, {'end': 824.981, 'text': "We don't need to worry about it.", 'start': 824.02, 'duration': 0.961}, {'end': 831.888, 'text': 'But it turns out that we assume that, you are Bayesian with a uniform Bayesian prior on the chance of the sun rising tomorrow.', 'start': 825.041, 'duration': 6.847}, {'end': 838.854, 'text': 'Uh so, if the chance of the sun rising tomorrow is uniformly distributed, you know, in the unit interval anywhere from 0 to 1, then,', 'start': 831.908, 'duration': 6.946}, {'end': 843.097, 'text': 'after a set of observations of this coin toss of whether the sun rises,', 'start': 838.854, 'duration': 4.243}, {'end': 846.18, 'text': 'this is actually a Bayesian optimal estimate of the chance of the sun rising tomorrow.', 'start': 843.097, 'duration': 3.083}], 'summary': 'Bayesian optimal estimate for sun rising chance after observations.', 'duration': 26.467, 'max_score': 819.713, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg819713.jpg'}, {'end': 956.395, 'src':
'embed', 'start': 906.997, 'weight': 0, 'content': [{'end': 914.382, 'text': 'And for Laplace smoothing, you add 1 to the numerator and, um, add k to the denominator.', 'start': 906.997, 'duration': 7.385}, {'end': 928.874, 'text': 'Okay? So for Naive Bayes, the way this mod- modifies your parameter estimates is this.', 'start': 914.402, 'duration': 14.472}, {'end': 933.686, 'text': "um, I'm just gonna copy over the formula from above.", 'start': 930.465, 'duration': 3.221}, {'end': 950.068, 'text': "Right? Um, so that's the maximum likelihood estimate.", 'start': 946.584, 'duration': 3.484}, {'end': 956.395, 'text': 'And with Laplace smoothing, you add 1 to the numerator and add 2 to the denominator.', 'start': 950.488, 'duration': 5.907}], 'summary': 'Laplace smoothing adds 1 to numerator and 2 to denominator for naive bayes.', 'duration': 49.398, 'max_score': 906.997, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg906997.jpg'}], 'start': 658.417, 'title': 'Estimating win probability and bayesian estimation', 'summary': 'Covers the use of maximum likelihood and laplace smoothing to estimate win probability, highlighting limitations and providing a 1/6 probability of winning. it also discusses laplace smoothing for parameter estimation in naive bayes algorithm, offering a simple and computation-efficient approach.', 'chapters': [{'end': 764.614, 'start': 658.417, 'title': 'Estimating win probability with maximum likelihood', 'summary': 'Discusses the use of maximum likelihood and laplace smoothing to estimate the probability of winning for a team with a streak of 4 losses, highlighting the limitations of maximum likelihood and the more reasonable approach of laplace smoothing, resulting in a 1/6 probability of winning.', 'duration': 106.197, 'highlights': ['Laplace smoothing involves adding 1 to the number of wins and 2 to the denominator, resulting in a more reasonable probability estimate of 1/6.', 'Maximum likelihood estimation yields a zero probability of winning for a team with 0 wins and 4 losses, demonstrating its limitations in such scenarios.']}, {'end': 1057.597, 'start': 764.614, 'title': 'Bayesian estimation & laplace smoothing', 'summary': "Discusses laplace's optimal estimate for the chance of the sun rising and the application of laplace smoothing for parameter estimation in naive bayes algorithm, providing a simple and computation-efficient approach to estimating probabilities.", 'duration': 292.983, 'highlights': ["Laplace's optimal estimate for the chance of the sun rising tomorrow is derived based on Bayesian statistics, assuming a uniform Bayesian prior on the chance of the sun rising, after a set of observations.", 'Laplace smoothing modifies parameter estimates by adding 1 to the numerator and k to the denominator for k-way random variables, ensuring that the estimates of probabilities are never exactly 0 or 1.', 'The application of Laplace smoothing in Naive Bayes algorithm simplifies parameter estimation by efficiently counting occurrences and results in a computation-efficient algorithm for classification.']}], 'duration': 399.18, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg658417.jpg', 'highlights': ['Laplace smoothing modifies parameter estimates by adding 1 to the numerator and k to the denominator for k-way random variables, ensuring that the estimates of probabilities are never exactly 0 or 1.', 'The application of Laplace smoothing in Naive 
Bayes algorithm simplifies parameter estimation by efficiently counting occurrences and results in a computation-efficient algorithm for classification.', 'Laplace smoothing involves adding 1 to the number of wins and 2 to the denominator, resulting in a more reasonable probability estimate of 1/6.', 'Maximum likelihood estimation yields a zero probability of winning for a team with 0 wins and 4 losses, demonstrating its limitations in such scenarios.', "Laplace's optimal estimate for the chance of the sun rising tomorrow is derived based on Bayesian statistics, assuming a uniform Bayesian prior on the chance of the sun rising, after a set of observations."]}, {'end': 2101.208, 'segs': [{'end': 1113.769, 'src': 'embed', 'start': 1082.882, 'weight': 2, 'content': [{'end': 1086.322, 'text': 'you want to estimate what is the chance that this house will be sold within the next 30 days.', 'start': 1082.882, 'duration': 3.44}, {'end': 1087.683, 'text': "So it's a classification problem.", 'start': 1086.422, 'duration': 1.261}, {'end': 1093.344, 'text': 'Um, so if one of the features is the size of the house x right?', 'start': 1088.363, 'duration': 4.981}, {'end': 1100.149, 'text': 'then one way to turn the feature into a discrete feature would be to choose a few buckets.', 'start': 1093.806, 'duration': 6.343}, {'end': 1113.769, 'text': 'So if the size is less than 400, square feet versus, you know, 400 to 800 or 800 to 1, 200 or greater than 1, 200 square feet,', 'start': 1100.63, 'duration': 13.139}], 'summary': 'Estimate chance of house sale within 30 days using discrete size buckets.', 'duration': 30.887, 'max_score': 1082.882, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg1082882.jpg'}, {'end': 1216.386, 'src': 'embed', 'start': 1183.591, 'weight': 0, 'content': [{'end': 1191.397, 'text': 'But if you have the discretizing variables, you know, most people will start off with, uh, discretizing things into 10 values.', 'start': 1183.591, 'duration': 7.806}, {'end': 1209.801, 'text': 'Now, um, right.', 'start': 1202.955, 'duration': 6.846}, {'end': 1214.164, 'text': 'And so this is how you can apply Naive Bayes to other problems as well, including classifying, for example,', 'start': 1209.901, 'duration': 4.263}, {'end': 1216.386, 'text': 'if a house is likely to be sold in the next 30 days.', 'start': 1214.164, 'duration': 2.222}], 'summary': 'Applying naive bayes to classify if a house is likely to be sold in the next 30 days.', 'duration': 32.795, 'max_score': 1183.591, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg1183591.jpg'}, {'end': 1339.92, 'src': 'embed', 'start': 1312.773, 'weight': 3, 'content': [{'end': 1319.674, 'text': 'Um, uh, and- and in this feature representation um, you know, each feature is either 0 or 1, right?', 'start': 1312.773, 'duration': 6.901}, {'end': 1324.695, 'text': "And that's part of why it throws away the information that, uh, with the one word,", 'start': 1320.134, 'duration': 4.561}, {'end': 1328.356, 'text': 'drugs appeared twice and maybe should be given more weight for your- in your classifier.', 'start': 1324.695, 'duration': 3.661}, {'end': 1339.92, 'text': 'Um, does a different representation, uh, which is specific to text.', 'start': 1329.276, 'duration': 10.644}], 'summary': 'Feature representation uses 0 or 1, discards word frequency for text classification.', 'duration': 27.147, 'max_score': 1312.773, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg1312773.jpg'}, {'end': 1501.739, 'src': 'embed', 'start': 1474.515, 'weight': 1, 'content': [{'end': 1478.239, 'text': "Um, and the new representation we're gonna talk about is called the multinomial.", 'start': 1474.515, 'duration': 3.724}, {'end': 1484.856, 'text': 'event model.', 'start': 1484.235, 'duration': 0.621}, {'end': 1487.199, 'text': 'Uh, these two names are- are- are-.', 'start': 1484.996, 'duration': 2.203}, {'end': 1493.748, 'text': 'frankly, these two names are quite confusing, but these are the names that, uh, I think actually one of my friends, Andrew McCallum uh,', 'start': 1487.199, 'duration': 6.549}, {'end': 1496.932, 'text': 'as far as I know, wrote the paper that named these two algorithms.', 'start': 1493.748, 'duration': 3.184}, {'end': 1501.739, 'text': 'but- but I think these are- these are the names we seem to use.', 'start': 1496.932, 'duration': 4.807}], 'summary': 'Introducing the multinomial event model, named by andrew mccallum, in the discussion.', 'duration': 27.224, 'max_score': 1474.515, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg1474515.jpg'}, {'end': 1705.401, 'src': 'heatmap', 'start': 1600.632, 'weight': 1, 'content': [{'end': 1605.836, 'text': 'okay?. Um, and it turns out that um.', 'start': 1600.632, 'duration': 5.204}, {'end': 1612.598, 'text': 'well, with this model, the parameters are same as before.', 'start': 1605.836, 'duration': 6.762}, {'end': 1615.16, 'text': 'phi y is probably y equals 1..', 'start': 1612.598, 'duration': 2.562}, {'end': 1631.672, 'text': 'And also, um, the other parameter of this model, phi k given y equals 0, is a chance of Xj equals k given y equals 0, right?', 'start': 1615.16, 'duration': 16.512}, {'end': 1634.874, 'text': 'And- and just to make sure you understand the notation, see if this makes sense.', 'start': 1631.792, 'duration': 3.082}, {'end': 1649.539, 'text': 'So this probability is the chance of word blank being blank if label y equals 0.', 'start': 1634.894, 'duration': 14.645}, {'end': 1662.829, 'text': "So what goes in those two blanks? Actually, what goes in the second bank? 
Uh, let's see.", 'start': 1649.539, 'duration': 13.29}, {'end': 1693.932, 'text': 'Yes So the chance of the third word in the e-mail being the word drugs, the chance of the second word in the e-mail, being bi or whatever.', 'start': 1681.321, 'duration': 12.611}, {'end': 1696.654, 'text': 'And one part of um.', 'start': 1694.552, 'duration': 2.102}, {'end': 1705.401, 'text': "what we've implicitly assumed maybe why this is tricky is that, um, we assume that this probability doesn't depend on J right?", 'start': 1696.654, 'duration': 8.747}], 'summary': 'Discussion on model parameters and probability calculation for email classification.', 'duration': 104.769, 'max_score': 1600.632, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg1600632.jpg'}, {'end': 1898.118, 'src': 'heatmap', 'start': 1843.615, 'weight': 0.834, 'content': [{'end': 1850.978, 'text': 'this basically says look at all the words in all of your non-spam emails, all the emails with y equals 0,', 'start': 1843.615, 'duration': 7.363}, {'end': 1853.099, 'text': 'and look at all of the words in all of the emails.', 'start': 1850.978, 'duration': 2.121}, {'end': 1854.359, 'text': 'and so all of those words.', 'start': 1853.099, 'duration': 1.26}, {'end': 1856.64, 'text': 'what fraction of those words is the word drugs?', 'start': 1854.359, 'duration': 2.281}, {'end': 1863.182, 'text': "And that's, uh, your estimate of the chance of the word drugs appearing in the non-spam email in some position in that email.", 'start': 1857, 'duration': 6.182}, {'end': 1873.836, 'text': 'Right And so, um, in math, the denominator is sum over your training set indicator is not spam times the number of words in that email.', 'start': 1863.888, 'duration': 9.948}, {'end': 1880.781, 'text': 'So the denominator ends up being the total number of words in all of your non-spam emails in your training set.', 'start': 1874.676, 'duration': 6.105}, {'end': 1890.049, 'text': 'Um, and the numerator is sum over your training set, sum from i equals 1 through m, indicator y equals 0.', 'start': 1881.882, 'duration': 8.167}, {'end': 1893.714, 'text': 'So, you know, count up only the things for non-spam email.', 'start': 1890.049, 'duration': 3.665}, {'end': 1898.118, 'text': 'And for the non-spam email, j equals 1 through ni.', 'start': 1894.395, 'duration': 3.723}], 'summary': "Estimate the chance of the word 'drugs' appearing in non-spam emails using a mathematical formula.", 'duration': 54.503, 'max_score': 1843.615, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg1843615.jpg'}], 'start': 1058.098, 'title': 'Naive bayes in classification', 'summary': 'Delves into discretizing continuous features for classification, using naive bayes to estimate the likelihood of a house being sold within 30 days, and explores feature representations for text classification. 
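(Illustrative aside: the multinomial event model estimates discussed in this stretch, where phi of word k given the class is the fraction of all word tokens in, say, non-spam emails that are word k, with Laplace smoothing adding 1 to the numerator and the vocabulary size to the denominator, can be sketched roughly as below. The variable names and toy data are assumptions, not the course's code.)

```python
# Rough sketch of the multinomial event model parameter estimate:
# phi_{k|y=0} = (# times word k appears across all non-spam emails + 1)
#               / (total word tokens in non-spam emails + V).
from collections import Counter

V = 5  # toy vocabulary size; the lecture's running example uses ~10,000 words

# Each email is a list of word indices; its length varies with the email.
non_spam_emails = [[0, 1, 2], [2, 3, 3, 4]]  # hypothetical training emails

token_counts = Counter()
total_tokens = 0
for email in non_spam_emails:
    token_counts.update(email)
    total_tokens += len(email)

phi_k_given_y0 = {k: (token_counts[k] + 1) / (total_tokens + V) for k in range(V)}
print(phi_k_given_y0)  # e.g. word 3 appears twice out of 7 tokens -> (2+1)/(7+5)
```
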
it also covers the multinomial event model, discussing parameter calculation and laplace smoothing.', 'chapters': [{'end': 1163.761, 'start': 1058.098, 'title': 'Discretization for classification in naive bayes', 'summary': 'Discusses discretizing continuous features for a classification problem using the example of estimating the chance of a house being sold within 30 days, and applying naive bayes with multinomial probabilities.', 'duration': 105.663, 'highlights': ['Discretizing a continuous feature for a classification problem involves dividing the feature into discrete values, as demonstrated with the example of categorizing house sizes into four buckets based on square footage.', 'In the context of applying Naive Bayes to the classification problem, the multinomial probability distribution can be used to estimate the likelihood of a feature taking on one of the discrete values, providing a practical approach for the problem at hand.']}, {'end': 1447.73, 'start': 1163.761, 'title': 'Naive bayes and text classification', 'summary': 'Explains the concept of discretizing variables and applying naive bayes to classify if a house is likely to be sold in the next 30 days, as well as the different feature representations for text classification.', 'duration': 283.969, 'highlights': ['Naive Bayes can be applied to classify if a house is likely to be sold in the next 30 days, based on discretized variables, often into 10 values, and a different variation of Naive Bayes is better for text classification.', 'Discretizing variables is often done into 10 buckets, and the representation for a text feature vector can be four-dimensional instead of the traditional 10,000 dimensional vector.', 'The feature representation for text using Naive Bayes throws away the information about the frequency of words and each feature is either 0 or 1, but a different representation specific to text uses an n-dimensional feature vector.', 'The length of the feature vector for an email varies depending on the length of the email, and different algorithms for text classification are developed with confusing names.']}, {'end': 2101.208, 'start': 1447.93, 'title': 'Multinomial event model', 'summary': 'Introduces the multinomial event model, explaining its differences from the multivariate bernoulli event model and detailing the calculation of parameters and laplace smoothing for the model.', 'duration': 653.278, 'highlights': ['The multinomial event model is introduced as a new representation, distinct from the multivariate Bernoulli event model, and explained in relation to generative modeling and Naive Bayes assumption.', 'Parameters for the multinomial event model are calculated, including the probability of word occurrences in non-spam emails, with explanations of the calculations, numerator, and denominator.', 'Laplace smoothing implementation is discussed, explaining the addition of 1 to the numerator and the number of possible outcomes to the denominator, along with handling words not in the dictionary.']}], 'duration': 1043.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg1058098.jpg', 'highlights': ['Naive Bayes can be applied to classify if a house is likely to be sold in the next 30 days, based on discretized variables, often into 10 values.', 'The multinomial event model is introduced as a new representation, distinct from the multivariate Bernoulli event model, and explained in relation to generative modeling and Naive Bayes assumption.', 
'Discretizing a continuous feature for a classification problem involves dividing the feature into discrete values, as demonstrated with the example of categorizing house sizes into four buckets based on square footage.', 'The feature representation for text using Naive Bayes throws away the information about the frequency of words and each feature is either 0 or 1, but a different representation specific to text uses an n-dimensional feature vector.']}, {'end': 2722.972, 'segs': [{'end': 2175.202, 'src': 'embed', 'start': 2133.187, 'weight': 0, 'content': [{'end': 2137.049, 'text': 'Um, it turns out, Naive Bayes algorithm is actually not very competitive of other learning algorithms.', 'start': 2133.187, 'duration': 3.862}, {'end': 2147.655, 'text': 'Uh, so for most problems, you find that logistic regression, um, will work, uh, better in terms of delivering a higher accuracy than Naive Bayes.', 'start': 2137.409, 'duration': 10.246}, {'end': 2149.86, 'text': 'The-, the-.', 'start': 2149.019, 'duration': 0.841}, {'end': 2156.946, 'text': 'the advantages of Naive Bayes is, uh, first is computationally very efficient and, second, is relatively quick to implement right?', 'start': 2149.86, 'duration': 7.086}, {'end': 2163.732, 'text': "And- and also doesn't require an iterative gradient descent thing and the number of lines of code needed to implement Naive Bayes is relatively small.", 'start': 2157.086, 'duration': 6.646}, {'end': 2175.202, 'text': 'So if you are, um facing a problem where your goal is to implement something quick and dirty, then Naive Bayes is- is maybe a reasonable choice.', 'start': 2164.612, 'duration': 10.59}], 'summary': 'Naive bayes is less competitive than logistic regression, but computationally efficient and quick to implement.', 'duration': 42.015, 'max_score': 2133.187, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg2133187.jpg'}, {'end': 2238.789, 'src': 'embed', 'start': 2215.73, 'weight': 2, 'content': [{'end': 2223.638, 'text': 'And if your goal is not to invent a brand new learning algorithm but to take existing algorithms and apply them, then, rule of thumb,', 'start': 2215.73, 'duration': 7.908}, {'end': 2229.183, 'text': 'I suggest to you is um uh, when you get started on the machine learning project,', 'start': 2223.638, 'duration': 5.545}, {'end': 2234.607, 'text': 'start by implementing something quick and dirty instead of implementing the most complicated possible learning algorithm.', 'start': 2229.183, 'duration': 5.424}, {'end': 2238.789, 'text': 'stop influencing something quickly and, uh, train the algorithm,', 'start': 2234.607, 'duration': 4.182}], 'summary': 'Start with a quick and dirty implementation of existing algorithms for machine learning projects.', 'duration': 23.059, 'max_score': 2215.73, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg2215730.jpg'}, {'end': 2543.382, 'src': 'embed', 'start': 2515.34, 'weight': 4, 'content': [{'end': 2520.684, 'text': "But you implement spam filter and you see that it's not misclassifying a lot of examples of these misspelled words.", 'start': 2515.34, 'duration': 5.344}, {'end': 2525.608, 'text': "then I would say don't bother, go work on something else instead, or at least at least treat that as a lower priority.", 'start': 2520.684, 'duration': 4.924}, {'end': 2533.916, 'text': 'So one of the uses of um, uh, GDA, Gaussian distribution analysis, as well as Naive Bayes, is that is 
uh,', 'start': 2526.971, 'duration': 6.945}, {'end': 2536.358, 'text': "they're not going to be the most accurate algorithms.", 'start': 2533.916, 'duration': 2.442}, {'end': 2543.382, 'text': "If you want the highest classification accuracy, uh, there are other algorithms, like logistic regression or SVS, which we'll talk about next,", 'start': 2536.438, 'duration': 6.944}], 'summary': 'Implement spam filter, prioritize accuracy over gda and naive bayes.', 'duration': 28.042, 'max_score': 2515.34, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg2515340.jpg'}], 'start': 2101.208, 'title': 'Machine learning algorithms', 'summary': 'Covers naive bayes algorithm, highlighting its computational efficiency, quick implementation, and simplicity, but noting its lower accuracy compared to logistic regression. it also emphasizes the importance of applying existing machine learning algorithms to real-world projects and suggests starting with quick and dirty implementations for better understanding and performance improvement. additionally, it underscores the importance of implementing efficient anti-spam algorithms such as naive bayes or gaussian distribution analysis to quickly identify misclassifications before considering more accurate yet complex algorithms like logistic regression or neural networks.', 'chapters': [{'end': 2175.202, 'start': 2101.208, 'title': 'Naive bayes algorithm', 'summary': 'Discusses the naive bayes algorithm, highlighting its advantages such as computational efficiency, quick implementation, and simplicity, but also noting that it is not very competitive compared to logistic regression in terms of accuracy.', 'duration': 73.994, 'highlights': ['Naive Bayes algorithm is not very competitive compared to logistic regression in terms of accuracy, with logistic regression delivering higher accuracy for most problems.', 'The advantages of Naive Bayes include computational efficiency, quick implementation, and requiring relatively fewer lines of code compared to other algorithms.', 'Naive Bayes does not require an iterative gradient descent process, making it suitable for quick and dirty implementations.']}, {'end': 2475.087, 'start': 2175.202, 'title': 'Machine learning algorithm application', 'summary': 'Emphasizes the importance of applying existing machine learning algorithms to real-world projects and suggests starting with quick and dirty implementations for better understanding and performance improvement.', 'duration': 299.885, 'highlights': ['Investing in new machine learning algorithms is important and beneficial for various applications.', 'Majority of class projects focus on applying existing learning algorithms to projects of interest.', 'Start with quick and dirty implementations of machine learning algorithms to better understand and improve performance.', 'Emphasizes the iterative approach to algorithm development for improved performance and understanding.', 'Provides an analogy of incremental programming for the approach to developing machine learning algorithms.', 'Discusses specific challenges and approaches in spam classification for email filtering, including addressing misspellings and spoofed email headers.']}, {'end': 2722.972, 'start': 2475.087, 'title': 'Implementing efficient anti-spam algorithms', 'summary': 'Emphasizes the importance of implementing a basic and quick anti-spam algorithm, such as naive bayes or gaussian distribution analysis, to quickly identify misclassifications, before 
considering more accurate yet complex algorithms like logistic regression or neural networks.', 'duration': 247.885, 'highlights': ['The importance of implementing a basic and quick anti-spam algorithm before considering more accurate yet complex algorithms like logistic regression or neural networks.', 'The advantages of Gaussian distribution analysis and Naive Bayes in terms of quick training, computation efficiency, and simplicity of implementation.', 'The weakness of Naive Bayes algorithm in treating all words as completely separate from each other and the mention of alternative word representation techniques like word embeddings.']}], 'duration': 621.764, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg2101208.jpg', 'highlights': ['Logistic regression delivers higher accuracy compared to Naive Bayes', 'Naive Bayes offers computational efficiency and quick implementation', 'Start with quick and dirty implementations of machine learning algorithms', 'Importance of applying existing machine learning algorithms to real-world projects', 'Implement efficient anti-spam algorithms like Naive Bayes or Gaussian distribution analysis']}, {'end': 3163.597, 'segs': [{'end': 2750.052, 'src': 'embed', 'start': 2724.722, 'weight': 0, 'content': [{'end': 2731.106, 'text': 'Um, but you can also read up on word embeddings or look at some of the videos or resources from CS230 if you want to learn about that.', 'start': 2724.722, 'duration': 6.384}, {'end': 2737.07, 'text': 'Uh, so the word embeddings technique this is a technique from neural networks really will reduce the number of training examples.', 'start': 2731.647, 'duration': 5.423}, {'end': 2741.233, 'text': 'you need to learn a good text classifier, because it comes in with more knowledge baked in.', 'start': 2737.07, 'duration': 4.163}, {'end': 2743.975, 'text': 'Right Cool.', 'start': 2742.174, 'duration': 1.801}, {'end': 2750.052, 'text': 'Anything else? Cool.', 'start': 2744.636, 'duration': 5.416}], 'summary': 'Word embeddings technique reduces training examples for neural networks.', 'duration': 25.33, 'max_score': 2724.722, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg2724722.jpg'}, {'end': 2964.868, 'src': 'heatmap', 'start': 2811.363, 'weight': 0.972, 'content': [{'end': 2818.906, 'text': 'Uh, and so you want an algorithm to find, you know, like a non-linear decision boundary, right?', 'start': 2811.363, 'duration': 7.543}, {'end': 2826.088, 'text': 'So the support vector machine will be an algorithm to help us find potentially very, very non-linear decision boundaries like this.', 'start': 2819.326, 'duration': 6.762}, {'end': 2830.84, 'text': 'Now, one way to build a classifier like this would be to use logistic regression.', 'start': 2826.656, 'duration': 4.184}, {'end': 2839.128, 'text': 'Um, but if this is x1, this is x2, right? 
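(Illustrative aside: before the transcript turns to how logistic regression can still produce non-linear boundaries, here is a hedged sketch of the kind of hand-built feature map phi of x it goes on to describe, with x1, x2, x1 squared, x2 squared, x1 times x2, and so on. The exact set of monomials chosen here is illustrative only.)

```python
# Sketch (not course code): mapping a 2-D input (x1, x2) to a higher-dimensional
# feature vector phi(x) so that a *linear* classifier in phi-space (e.g. logistic
# regression) can learn a non-linear boundary, such as an ellipse, in (x1, x2) space.
def phi(x1, x2):
    return [x1, x2, x1**2, x2**2, x1 * x2, x1**3, x2**3]

print(phi(2.0, -1.0))  # [2.0, -1.0, 4.0, 1.0, -2.0, 8.0, -1.0]
```
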
Uh, so logistic regression will fit a straight line to data.', 'start': 2830.86, 'duration': 8.268}, {'end': 2842.251, 'text': 'A Gaussian disjoint analysis will end up with a straight line disjoint boundary.', 'start': 2839.148, 'duration': 3.103}, {'end': 2848.595, 'text': 'So one way to apply the logistic regression like this would be to take your feature vector x1, x2,', 'start': 2842.751, 'duration': 5.844}, {'end': 2860.283, 'text': 'and map it to a high dimensional feature vector with you know, x1, x2, x1 squared, x2 squared, x1, x2, maybe x1 cubed, x2 cubed and so on,', 'start': 2848.595, 'duration': 11.688}, {'end': 2863.145, 'text': 'and have a new feature vector which we call phi of x.', 'start': 2860.283, 'duration': 2.862}, {'end': 2867.382, 'text': 'that- that has these higher dimensional features right?', 'start': 2864.181, 'duration': 3.201}, {'end': 2875.165, 'text': 'Now, um, it turns out if you do this and then apply logistic regression to this augmented feature vector, uh,', 'start': 2867.662, 'duration': 7.503}, {'end': 2878.746, 'text': 'then logistic regression can learn nonlinear decision boundaries.', 'start': 2875.165, 'duration': 3.581}, {'end': 2882.967, 'text': 'Uh, with this set of features, logistic regression can actually learn the decision boundary.', 'start': 2878.886, 'duration': 4.081}, {'end': 2885.588, 'text': "that's the-, that's the-, that's the shape of an ellipse, right?", 'start': 2882.967, 'duration': 2.621}, {'end': 2890.694, 'text': 'Um, but manually choosing these features is a little bit of a pain, right?', 'start': 2885.908, 'duration': 4.786}, {'end': 2891.455, 'text': "I, I, I don't know.", 'start': 2890.814, 'duration': 0.641}, {'end': 2895.3, 'text': "What I, I, I actually don't know what you know type of, uh.", 'start': 2891.835, 'duration': 3.465}, {'end': 2902.867, 'text': 'uh set of features could get you a decision boundary like that right, rather than just an ellipse, a more complex decision boundary.', 'start': 2896.605, 'duration': 6.262}, {'end': 2912.769, 'text': 'Um, and what we will see with uh support vector machines is that we will be able to derive an algorithm that can take, say, input features x1, x2,', 'start': 2903.627, 'duration': 9.142}, {'end': 2923.543, 'text': 'map them to a much higher dimensional set of features uh and then apply linear classifier uh in a way similar to logistic regression,', 'start': 2912.769, 'duration': 10.774}, {'end': 2928.572, 'text': 'but different in details that allows you to learn very non-linear decision boundaries.', 'start': 2923.543, 'duration': 5.029}, {'end': 2933.316, 'text': 'Um, and I think, uh, you know, a support vector machine.', 'start': 2930.155, 'duration': 3.161}, {'end': 2939.717, 'text': 'one of the- actually one of the reasons, uh, uh, support vector machines are used today is- is a relatively turnkey algorithm.', 'start': 2933.316, 'duration': 6.401}, {'end': 2943.458, 'text': "And what I mean by that is it doesn't have too many parameters to fiddle with.", 'start': 2939.817, 'duration': 3.641}, {'end': 2951.379, 'text': 'Uh, even for logistic regression or for, uh, linear regression, you know you might have to tune the gradient descent parameter.', 'start': 2943.698, 'duration': 7.681}, {'end': 2952.4, 'text': 'uh, tune the learning rate.', 'start': 2951.379, 'duration': 1.021}, {'end': 2955.6, 'text': "sorry, tune the learning rate alpha, and that's just another thing to fiddle with right?", 'start': 2952.4, 'duration': 3.2}, {'end': 2958.821, 'text': "Try a few 
values and hope you didn't mess up how you set that value.", 'start': 2955.62, 'duration': 3.201}, {'end': 2964.868, 'text': 'Um support vector machine today has the very uh robust,', 'start': 2959.461, 'duration': 5.407}], 'summary': 'Support vector machines can learn nonlinear decision boundaries with higher dimensional features, providing a more complex boundary shape, and are considered a relatively turnkey algorithm with fewer parameters to tune.', 'duration': 153.505, 'max_score': 2811.363, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg2811363.jpg'}, {'end': 2952.4, 'src': 'embed', 'start': 2912.769, 'weight': 1, 'content': [{'end': 2923.543, 'text': 'map them to a much higher dimensional set of features uh and then apply linear classifier uh in a way similar to logistic regression,', 'start': 2912.769, 'duration': 10.774}, {'end': 2928.572, 'text': 'but different in details that allows you to learn very non-linear decision boundaries.', 'start': 2923.543, 'duration': 5.029}, {'end': 2933.316, 'text': 'Um, and I think, uh, you know, a support vector machine.', 'start': 2930.155, 'duration': 3.161}, {'end': 2939.717, 'text': 'one of the- actually one of the reasons, uh, uh, support vector machines are used today is- is a relatively turnkey algorithm.', 'start': 2933.316, 'duration': 6.401}, {'end': 2943.458, 'text': "And what I mean by that is it doesn't have too many parameters to fiddle with.", 'start': 2939.817, 'duration': 3.641}, {'end': 2951.379, 'text': 'Uh, even for logistic regression or for, uh, linear regression, you know you might have to tune the gradient descent parameter.', 'start': 2943.698, 'duration': 7.681}, {'end': 2952.4, 'text': 'uh, tune the learning rate.', 'start': 2951.379, 'duration': 1.021}], 'summary': 'Support vector machines are used for non-linear decision boundaries with few parameters to tune.', 'duration': 39.631, 'max_score': 2912.769, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg2912769.jpg'}, {'end': 3051.804, 'src': 'embed', 'start': 3026.362, 'weight': 4, 'content': [{'end': 3036.574, 'text': "And what that means is we're going to start off with datasets, um that we assume look like this and that are linearly separable, right?", 'start': 3026.362, 'duration': 10.212}, {'end': 3040.957, 'text': 'And so the optimal margin classifier is the basic building block of a support vector machine.', 'start': 3036.674, 'duration': 4.283}, {'end': 3044.8, 'text': "And, uh, uh, uh, we'll first derive an algorithm.", 'start': 3041.657, 'duration': 3.143}, {'end': 3051.804, 'text': "Uh, they'll be- they'll have some similarities to a logistic regression but- but allows us to scale uh, uh,", 'start': 3045.42, 'duration': 6.384}], 'summary': 'Introduction to support vector machines for linearly separable datasets.', 'duration': 25.442, 'max_score': 3026.362, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg3026362.jpg'}], 'start': 2724.722, 'title': 'Word embeddings, support vector machines, and svm concepts', 'summary': 'Discusses the advantages of word embeddings in reducing training examples for text classification, highlights the use of support vector machines, and explains their ability to learn non-linear decision boundaries using high-dimensional feature vectors and requiring fewer parameters to adjust.', 'chapters': [{'end': 2800.752, 'start': 2724.722, 'title': 'Word embeddings and 
support vector machines', 'summary': 'Discusses the benefits of word embeddings in reducing the number of training examples needed for a good text classifier and mentions the use of support vector machines for classification.', 'duration': 76.03, 'highlights': ['The word embeddings technique from neural networks reduces the number of training examples needed for a good text classifier.', 'Support vector machines (SVMs) are used for classification problems.']}, {'end': 3163.597, 'start': 2809.983, 'title': 'Support vector machines', 'summary': 'Explains the concept of support vector machines, including their ability to learn non-linear decision boundaries using high dimensional feature vectors and the turnkey nature of the algorithm, which requires fewer parameters to fiddle with compared to other algorithms.', 'duration': 353.614, 'highlights': ['Support Vector Machines can learn non-linear decision boundaries using high dimensional feature vectors', 'Support Vector Machines are relatively turnkey and require fewer parameters to fiddle with', 'Explanation of optimal margin classifier and the concept of kernels']}], 'duration': 438.875, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg2724722.jpg', 'highlights': ['Word embeddings reduce training examples needed for text classification', 'Support Vector Machines (SVMs) used for classification problems', 'SVMs learn non-linear decision boundaries using high dimensional feature vectors', 'SVMs require fewer parameters to adjust', 'Explanation of optimal margin classifier and concept of kernels']}, {'end': 4134.673, 'segs': [{'end': 3244.468, 'src': 'embed', 'start': 3192.2, 'weight': 0, 'content': [{'end': 3197.462, 'text': 'But if you look at what actually happens in practice in machine learning, um uh,', 'start': 3192.2, 'duration': 5.262}, {'end': 3202.365, 'text': 'the set of algorithms actually used in practice is actually much wider than neural networks and deep learning.', 'start': 3197.462, 'duration': 4.903}, {'end': 3205.767, 'text': 'So- so we do not live in a neural networks only world.', 'start': 3202.425, 'duration': 3.342}, {'end': 3208.509, 'text': 'We actually use many, many tools in machine learning.', 'start': 3205.807, 'duration': 2.702}, {'end': 3217.415, 'text': "It's just that, uh, deep learning attracts the attention of the media in some dis- in some way that's quite disproportionate to what I find useful.", 'start': 3208.589, 'duration': 8.826}, {'end': 3223.758, 'text': "you know, I love them, you know, but, but they're not, they're not the only thing in the world.", 'start': 3218.115, 'duration': 5.643}, {'end': 3230.321, 'text': 'Uh, and so, yeah, and again late last night, I was talking to an engineer, uh, uh, about, uh, factor analysis,', 'start': 3224.018, 'duration': 6.303}, {'end': 3231.982, 'text': "which you'll learn about later in CSU39, right?", 'start': 3230.321, 'duration': 1.661}, {'end': 3238.685, 'text': "It's an unsupervised learning algorithm and there's an application, um uh, that that one of my teams is working on in manufacturing,", 'start': 3232.222, 'duration': 6.463}, {'end': 3244.468, 'text': "where I'm gonna use factor analysis or something very similar to it, which, which is totally not a neural network technique, right?", 'start': 3238.685, 'duration': 5.783}], 'summary': "Machine learning involves a wide range of algorithms beyond neural networks and deep learning, as highlighted by the speaker's use of factor analysis in an 
unsupervised learning application in manufacturing.", 'duration': 52.268, 'max_score': 3192.2, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg3192200.jpg'}, {'end': 3340.009, 'src': 'heatmap', 'start': 3255.588, 'weight': 6, 'content': [{'end': 3255.828, 'text': 'All right.', 'start': 3255.588, 'duration': 0.24}, {'end': 3260.792, 'text': "So let's start developing the optimal margin classifier.", 'start': 3256.649, 'duration': 4.143}, {'end': 3276.167, 'text': 'So, um, first let me define the functional margin, which is, uh, informally, the functional margin of a classifier.', 'start': 3267.498, 'duration': 8.669}, {'end': 3281.953, 'text': 'is how well, how- how confidently and accurately do you classify an example??', 'start': 3276.167, 'duration': 5.786}, {'end': 3283.294, 'text': "Um, so, here's what I mean.", 'start': 3282.353, 'duration': 0.941}, {'end': 3290.582, 'text': "Uh, we're gonna go to binary classification, and we're gonna use logistic regression, right? So.", 'start': 3284.255, 'duration': 6.327}, {'end': 3296.343, 'text': "So let's- let's start by motivating this with logistic regression.", 'start': 3292.662, 'duration': 3.681}, {'end': 3302.206, 'text': 'So this logistic classifier, h of Theta equals the logistic function applied to Theta transpose X.', 'start': 3296.684, 'duration': 5.522}, {'end': 3307.568, 'text': 'And so, um, if you turn this into a binary classification if-, if-.', 'start': 3302.206, 'duration': 5.362}, {'end': 3314.27, 'text': 'if you have this algorithm predict not a probability, but predict 0 or 1, then what this classifier will do is, uh,', 'start': 3307.568, 'duration': 6.702}, {'end': 3322.697, 'text': 'predict 1 if Theta transpose X is greater than 0..', 'start': 3314.27, 'duration': 8.427}, {'end': 3330.162, 'text': 'right um and predict 0, otherwise okay?', 'start': 3322.697, 'duration': 7.465}, {'end': 3340.009, 'text': 'Because Theta transpose X greater than 0, this means that um G of Theta transpose X is greater than 0.5, right?', 'start': 3330.703, 'duration': 9.306}], 'summary': 'Developing optimal margin classifier using logistic regression for binary classification.', 'duration': 37.803, 'max_score': 3255.588, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg3255588.jpg'}, {'end': 3477.892, 'src': 'embed', 'start': 3441.568, 'weight': 5, 'content': [{'end': 3446.513, 'text': 'Because, uh, if this is true, then the algorithm is doing very well on this example, okay?', 'start': 3441.568, 'duration': 4.945}, {'end': 3467.568, 'text': "So, um, So, the functional margin which we'll define in a second, uh, captures this idea that, uh, if you classify as a large functional margin,", 'start': 3448.955, 'duration': 18.613}, {'end': 3471.189, 'text': 'it means that these two statements are true, right?', 'start': 3467.568, 'duration': 3.621}, {'end': 3476.791, 'text': "Um, to look ahead a little bit, there's a different thing.", 'start': 3472.95, 'duration': 3.841}, {'end': 3477.892, 'text': "we'll define in a second.", 'start': 3476.791, 'duration': 1.101}], 'summary': 'The algorithm is performing well with a large functional margin.', 'duration': 36.324, 'max_score': 3441.568, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg3441568.jpg'}, {'end': 3588.501, 'src': 'embed', 'start': 3566.626, 'weight': 3, 'content': [{'end': 3578.979, 'text': "Okay?. 
Um, and so what I'd like to do in the uh next several, I guess next- next 20 minutes is formalize the definition of the functional margin,', 'start': 3566.626, 'duration': 12.353}, {'end': 3584.96, 'text': 'formalize definition of geometric margin, and it will pose the- the, I guess, the optimal margin classifier,', 'start': 3578.979, 'duration': 5.981}, {'end': 3588.501, 'text': 'which is basically an algorithm that tries to maximize the geometric margin.', 'start': 3584.96, 'duration': 3.541}], 'summary': 'Formalizing functional and geometric margins in next 20 minutes.', 'duration': 21.875, 'max_score': 3566.626, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg3566626.jpg'}, {'end': 3840.301, 'src': 'heatmap', 'start': 3737.128, 'weight': 0.853, 'content': [{'end': 3747.275, 'text': 'Um, so, for the SVM, the parameters of the SVM will be the parameters w and b, and the hypothesis applied to x will be g of this,', 'start': 3737.128, 'duration': 10.147}, {'end': 3748.515, 'text': "and we're dropping the.", 'start': 3747.275, 'duration': 1.24}, {'end': 3750.497, 'text': 'x0 equals 1 constraint.', 'start': 3748.515, 'duration': 1.982}, {'end': 3753.659, 'text': 'So separate out w and b as follows.', 'start': 3750.897, 'duration': 2.762}, {'end': 3757.441, 'text': 'So this is a standard notation used to develop support vector machines.', 'start': 3753.719, 'duration': 3.722}, {'end': 3764.546, 'text': 'Um, and one way to think about this is if the parameters are, you know, Theta 0, Theta 1, Theta 2, Theta 3,,', 'start': 3757.681, 'duration': 6.865}, {'end': 3768.188, 'text': 'then this is the new b and this is a new w.', 'start': 3764.546, 'duration': 3.642}, {'end': 3774.51, 'text': 'Okay So you just separate out the, the, the, uh, Theta 0 which was previously multiplying into X0.', 'start': 3768.188, 'duration': 6.322}, {'end': 3782.232, 'text': 'Right And so, um, uh, yeah, right.', 'start': 3774.71, 'duration': 7.522}, {'end': 3795.297, 'text': "And so this term here becomes sum from i equals 1 through n of wi xi plus b, right, since we've gotten rid of X0.", 'start': 3782.252, 'duration': 13.045}, {'end': 3816.83, 'text': 'All right.', 'start': 3816.57, 'duration': 0.26}, {'end': 3821.636, 'text': 'So let me formalize the definition of a functional margin.', 'start': 3818.452, 'duration': 3.184}, {'end': 3834.478, 'text': 'So, um, Uh, so the parameters w and b define the linear classifier right?', 'start': 3827.242, 'duration': 7.236}, {'end': 3840.301, 'text': 'So you know, uh, with the formulas you just wrote down, the parameters w and b defines a, uh.', 'start': 3834.698, 'duration': 5.603}], 'summary': 'Explaining svm parameters w and b, separating theta 0, developing support vector machines.', 'duration': 103.173, 'max_score': 3737.128, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg3737128.jpg'}, {'end': 4129.452, 'src': 'heatmap', 'start': 4023.226, 'weight': 0.72, 'content': [{'end': 4033.21, 'text': 'Uh, so-, so-, so long as the um functional margin, so long as this Gamma hat i is greater than 0,, it means that, uh,', 'start': 4023.226, 'duration': 9.984}, {'end': 4038.212, 'text': 'either this is bigger than 0 or this is less than 0,, depending on the sign of the label.', 'start': 4033.21, 'duration': 5.002}, {'end': 4043.595, 'text': 'And it means that the algorithm gets this one example correct, at least right?', 'start': 4038.613, 'duration': 4.982}, {'end': 4046.816, 
'text': 'And if much greater than 0, then it means You know.', 'start': 4043.815, 'duration': 3.001}, {'end': 4052.117, 'text': "so if it's greater than 0, it means in in the logistic regression case it means that the prediction is at least a little bit above 0.5,", 'start': 4046.816, 'duration': 5.301}, {'end': 4055.338, 'text': 'a little bit below 0.5 probability, so that at least gets it right.', 'start': 4052.117, 'duration': 3.221}, {'end': 4059.239, 'text': "And if it's much greater than 0, much less than 0, then that means it's.", 'start': 4055.498, 'duration': 3.741}, {'end': 4068.121, 'text': 'you know, the probability output in the logistic regression case is sort of very close to 1 or very close to 0..', 'start': 4059.239, 'duration': 8.882}, {'end': 4069.922, 'text': 'So one other definition.', 'start': 4068.121, 'duration': 1.801}, {'end': 4083.513, 'text': "I'm gonna define the functional margin with respect to the training set to be.", 'start': 4080.569, 'duration': 2.944}, {'end': 4092.826, 'text': 'Gamma hat equals min over i of Gamma hat i, where here i equals ranges over your training examples.', 'start': 4083.513, 'duration': 9.313}, {'end': 4096.183, 'text': 'So, um, this is a worst-case notion.', 'start': 4094.261, 'duration': 1.922}, {'end': 4099.246, 'text': 'But so this definition of a functional margin.', 'start': 4096.283, 'duration': 2.963}, {'end': 4105.412, 'text': 'on the left, we define functional margin with respect to a single training example, which is how are you doing on that one training example?', 'start': 4099.246, 'duration': 6.166}, {'end': 4113.02, 'text': "And we'll define the functional margin with respect to the entire training set as how well are you doing on the worst example in your training set.", 'start': 4106.153, 'duration': 6.867}, {'end': 4121.17, 'text': "uh, this is a little bit of a brittle notion and where for now, for today, we're assuming that the training set is linearly separable.", 'start': 4114.288, 'duration': 6.882}, {'end': 4125.731, 'text': "So we're gonna assume that the training set, you know, looks like this, uh.", 'start': 4121.21, 'duration': 4.521}, {'end': 4128.011, 'text': 'and that you can separate it with a straight line.', 'start': 4125.731, 'duration': 2.28}, {'end': 4129.452, 'text': 'uh, that will relax this later.', 'start': 4128.011, 'duration': 1.441}], 'summary': "Functional margin reflects algorithm's correctness on training examples in logistic regression case.", 'duration': 106.226, 'max_score': 4023.226, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg4023226.jpg'}], 'start': 3176.195, 'title': 'Machine learning practices and svm', 'summary': 'Covers the prevalence of machine learning algorithms, including beyond neural networks and deep learning, and discusses the wider set of tools used in practice. 
it also delves into the development and formalization of support vector machines, aiming to maximize the geometric margin for optimal classification.', 'chapters': [{'end': 3244.468, 'start': 3176.195, 'title': 'Machine learning in practice', 'summary': 'Discusses the prevalence of machine learning algorithms beyond neural networks and deep learning, highlighting the wider set of tools used in practice and the disproportionate media attention towards deep learning.', 'duration': 68.273, 'highlights': ['The set of algorithms actually used in practice in machine learning is much wider than neural networks and deep learning.', 'Deep learning attracts disproportionate media attention but is not the only technique used in practice.', 'An example is given of using factor analysis, a non-neural network technique, in an application for manufacturing.']}, {'end': 3441.488, 'start': 3244.488, 'title': 'Optimal margin classifier', 'summary': 'Discusses the development of the optimal margin classifier and how logistic regression is used for binary classification, aiming to predict 1 if theta transpose x is greater than 0 and 0 if it is less than 0, with the goal of achieving very accurate and confident predictions.', 'duration': 197, 'highlights': ['The functional margin of a classifier measures how confidently and accurately it classifies an example, aiming for Theta transpose X to be much greater than 0 for y i = 1 to achieve very close to 1 output probability and very accurate prediction.', 'Logistic regression is used for binary classification to predict 1 if Theta transpose X is greater than 0 and 0 if it is less than 0, focusing on achieving very accurate and confident predictions.', 'The algorithm predicts 1 if Theta transpose X is greater than 0 and 0 otherwise, with the goal of having the output probability very close to 1 when y i = 1 and very close to 0 when y i = 0.']}, {'end': 4134.673, 'start': 3441.568, 'title': 'Support vector machines', 'summary': 'Discusses the formalization of the functional and geometric margins in svm, defining the functional and geometric margins, and the development of the optimal margin classifier, aiming to maximize the geometric margin.', 'duration': 693.105, 'highlights': ['The chapter discusses the formalization of the functional and geometric margins in SVM', 'Defining the functional and geometric margins', 'Development of the optimal margin classifier aiming to maximize the geometric margin']}], 'duration': 958.478, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg3176195.jpg', 'highlights': ['The set of algorithms used in practice in machine learning is wider than neural networks and deep learning.', 'Deep learning attracts disproportionate media attention but is not the only technique used in practice.', 'An example is given of using factor analysis, a non-neural network technique, in an application for manufacturing.', 'The chapter discusses the formalization of the functional and geometric margins in SVM.', 'Development of the optimal margin classifier aiming to maximize the geometric margin.', 'The functional margin of a classifier measures how confidently and accurately it classifies an example.', 'Logistic regression is used for binary classification to predict 1 if Theta transpose X is greater than 0 and 0 if it is less than 0.']}, {'end': 4851.875, 'segs': [{'end': 4216.613, 'src': 'embed', 'start': 4179.801, 'weight': 2, 'content': [{'end': 4186.365, 'text': 'So- so one- one way to cheat on 
the functional margin is just by scaling the parameters by 2, or instead of 2,', 'start': 4179.801, 'duration': 6.564}, {'end': 4192.167, 'text': "maybe you can multiply all your parameters by 10, and then you've actually increased the functional margin of your training examples 10x.", 'start': 4186.365, 'duration': 5.802}, {'end': 4195.729, 'text': "But, uh, this doesn't actually change the decision boundary, right?", 'start': 4192.787, 'duration': 2.942}, {'end': 4198.851, 'text': "It doesn't actually change any classification, just to multiply all of your parameters.", 'start': 4195.749, 'duration': 3.102}, {'end': 4201.393, 'text': 'by a factor of 10.', 'start': 4199.551, 'duration': 1.842}, {'end': 4216.613, 'text': 'Um, so one thing you could do is, uh, replace- one thing you could do, um, would be to normalize the length of your parameters.', 'start': 4201.393, 'duration': 15.22}], 'summary': "Scaling parameters can cheat functional margin, but doesn't change decision boundary or classification.", 'duration': 36.812, 'max_score': 4179.801, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg4179801.jpg'}, {'end': 4306.313, 'src': 'embed', 'start': 4268.752, 'weight': 5, 'content': [{'end': 4272.255, 'text': "Okay So we'll come back and use this property in a little bit.", 'start': 4268.752, 'duration': 3.503}, {'end': 4285.785, 'text': 'All right.', 'start': 4285.545, 'duration': 0.24}, {'end': 4288.967, 'text': "So to find the functional margin, let's define the geometric margin.", 'start': 4285.865, 'duration': 3.102}, {'end': 4293.511, 'text': "And you'll see in a second how the geometric and functional margin relate to each other.", 'start': 4289.308, 'duration': 4.203}, {'end': 4304.551, 'text': "Um, so let's- let's- let's define the geometric margin with respect to a single example, which is, um, so let's see.", 'start': 4294.261, 'duration': 10.29}, {'end': 4306.313, 'text': "Let's say you have a classifier.", 'start': 4304.711, 'duration': 1.602}], 'summary': 'Defining the geometric margin and its relation to the functional margin in the context of a classifier.', 'duration': 37.561, 'max_score': 4268.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg4268752.jpg'}, {'end': 4478.055, 'src': 'embed', 'start': 4427.903, 'weight': 4, 'content': [{'end': 4429.484, 'text': 'So let me just write down what that is.', 'start': 4427.903, 'duration': 1.581}, {'end': 4450.393, 'text': 'So the geometric margin of.', 'start': 4443.988, 'duration': 6.405}, {'end': 4457.478, 'text': 'you know the classifier of the hyperplane defined by WB.', 'start': 4450.393, 'duration': 7.085}, {'end': 4471.429, 'text': 'with respect to one example, XIYI, this is going to be Gamma I equals W, transpose X plus B over the norm of W.', 'start': 4457.478, 'duration': 13.951}, {'end': 4476.253, 'text': "Um, and let's see, I'm not proving why this is the case.", 'start': 4472.731, 'duration': 3.522}, {'end': 4478.055, 'text': 'The proof is given in the lecture notes.', 'start': 4476.354, 'duration': 1.701}], 'summary': 'Geometric margin of the classifier is explained using the equation gamma i = w^t x + b / ||w||.', 'duration': 50.152, 'max_score': 4427.903, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg4427903.jpg'}, {'end': 4654.067, 'src': 'embed', 'start': 4580.182, 'weight': 1, 'content': [{'end': 4584.945, 'text': 'um, and that is your 
geometric margin on the training set.', 'start': 4580.182, 'duration': 4.763}, {'end': 4586.726, 'text': 'Oh and- and so I hope the-.', 'start': 4585.485, 'duration': 1.241}, {'end': 4588.227, 'text': 'sorry, I hope the notation is clear, right?', 'start': 4586.726, 'duration': 1.501}, {'end': 4605.026, 'text': 'So, Gamma hat was the functional margin and Gamma is a geometric margin, okay?', 'start': 4588.247, 'duration': 16.779}, {'end': 4654.067, 'text': 'And so, um, what the optimal margin classifier does is, um, choose the parameters w and b to maximize the geometric margin, okay?', 'start': 4606.286, 'duration': 47.781}], 'summary': 'Optimal margin classifier maximizes geometric margin with parameters w and b.', 'duration': 73.885, 'max_score': 4580.182, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg4580182.jpg'}, {'end': 4773.774, 'src': 'heatmap', 'start': 4706.751, 'weight': 0, 'content': [{'end': 4733.643, 'text': 'subject to that, subject to that, every training example, um uh, must have geometric margin, uh, uh, greater than or equal to Gamma, right?', 'start': 4706.751, 'duration': 26.892}, {'end': 4735.745, 'text': 'So you want Gamma to be as big as possible.', 'start': 4733.703, 'duration': 2.042}, {'end': 4739.167, 'text': 'subject to that, every single training example must have at least that geometric margin.', 'start': 4735.745, 'duration': 3.422}, {'end': 4743.05, 'text': 'This causes you to maximize the worst case geometric margin.', 'start': 4739.187, 'duration': 3.863}, {'end': 4749.063, 'text': "And it turns out this is, um, not- in this form, this isn't a convex optimization problem.", 'start': 4744.44, 'duration': 4.623}, {'end': 4753.107, 'text': "So it's difficult to solve this when I run a gradient descent and ensure there are no local optima and so on.", 'start': 4749.124, 'duration': 3.983}, {'end': 4761.893, 'text': 'But it turns out that by a few steps of rewriting, you can reformulate this problem as um into an equivalent problem,', 'start': 4753.607, 'duration': 8.286}, {'end': 4773.774, 'text': 'which is to minimize the norm of w subject to the geometric margin Um.', 'start': 4761.893, 'duration': 11.881}], 'summary': 'Maximize geometric margin, reformulate as a convex optimization problem.', 'duration': 29.334, 'max_score': 4706.751, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg4706751.jpg'}, {'end': 4818.842, 'src': 'embed', 'start': 4791.252, 'weight': 3, 'content': [{'end': 4796.254, 'text': 'And what we show in the lecture notes is that, uh, through a few steps, uh,', 'start': 4791.252, 'duration': 5.002}, {'end': 4803.756, 'text': 'you can rewrite this optimization problem into the following equivalent form, which is to try to minimize the norm of w?', 'start': 4796.254, 'duration': 7.502}, {'end': 4804.657, 'text': 'uh subject to this', 'start': 4803.756, 'duration': 0.901}, {'end': 4812.78, 'text': 'And maybe one piece of intuition to take away is um, uh, you know, the smaller w is the bigger right the- the-,', 'start': 4804.737, 'duration': 8.043}, {'end': 4815.981, 'text': 'the less of a normalization division effect you have, right?', 'start': 4812.78, 'duration': 3.201}, {'end': 4818.842, 'text': 'Uh, but the details are given in the lecture notes.', 'start': 4816.361, 'duration': 2.481}], 'summary': 'Optimization problem can be rewritten to minimize norm of w, details in notes.', 'duration': 27.59, 'max_score': 4791.252, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg4791252.jpg'}], 'start': 4134.673, 'title': 'Linear classifier and geometric margin', 'summary': 'Discusses the linear classifier, geometric margin, and the optimization problem to maximize the geometric margin, emphasizing the importance of parameter normalization and potential cheating by scaling parameters.', 'chapters': [{'end': 4306.313, 'start': 4134.673, 'title': 'Functional margin and parameter scaling', 'summary': 'Discusses the concept of functional margin, the potential for cheating by scaling parameters, and the importance of parameter normalization, emphasizing that scaling parameters by a constant factor does not alter the classification. it also introduces the definition of geometric margin.', 'duration': 171.64, 'highlights': ['The chapter emphasizes that scaling the parameters by a constant factor does not alter the classification, preventing potential cheating on the functional margin.', 'It introduces the concept of parameter normalization, suggesting imposing a constraint on the norm of w or replacing w and b with their normalized forms to prevent cheating on the functional margin.', 'The chapter defines the geometric margin and hints at its relationship with the functional margin, setting the stage for further exploration.']}, {'end': 4851.875, 'start': 4310.02, 'title': 'Linear classifier and geometric margin', 'summary': 'Discusses the linear classifier defined by parameters w and b, the decision boundary, geometric margin, and the optimization problem to maximize the geometric margin subject to constraints, for the optimal margin classifier.', 'duration': 541.855, 'highlights': ['The geometric margin is defined as Gamma I equals W transpose X plus B over the normal W, used to measure the Euclidean distance between a training example and the decision boundary.', 'The optimal margin classifier maximizes the geometric margin by choosing parameters w and b to maximize the distance to all examples, posing the problem as maximizing Gamma w and b of Gamma subject to every training example having a geometric margin greater than or equal to Gamma.', 'The optimization problem to maximize the geometric margin is reformulated as minimizing the norm of w subject to the geometric margin, resulting in a convex optimization problem with good numerical optimization packages available for solving it.']}], 'duration': 717.202, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lDwow4aOrtg/pics/lDwow4aOrtg4134673.jpg', 'highlights': ['The optimization problem to maximize the geometric margin is reformulated as minimizing the norm of w subject to the geometric margin, resulting in a convex optimization problem with good numerical optimization packages available for solving it.', 'The optimal margin classifier maximizes the geometric margin by choosing parameters w and b to maximize the distance to all examples, posing the problem as maximizing Gamma w and b of Gamma subject to every training example having a geometric margin greater than or equal to Gamma.', 'The chapter emphasizes that scaling the parameters by a constant factor does not alter the classification, preventing potential cheating on the functional margin.', 'It introduces the concept of parameter normalization, suggesting imposing a constraint on the norm of w or replacing w and b with their normalized forms to prevent cheating on the functional margin.', 'The geometric margin is defined as 
Gamma I equals W transpose X plus B over the norm of W, used to measure the Euclidean distance between a training example and the decision boundary.', 'The chapter defines the geometric margin and hints at its relationship with the functional margin, setting the stage for further exploration.']}], 'highlights': ['Word embeddings reduce training examples needed for text classification', 'Support Vector Machines (SVMs) used for classification problems', 'Naive Bayes can be applied to classify if a house is likely to be sold in the next 30 days, based on discretized variables, often into 10 values', 'The optimization problem to maximize the geometric margin is reformulated as minimizing the norm of w subject to the geometric margin, resulting in a convex optimization problem with good numerical optimization packages available for solving it', 'The set of algorithms used in practice in machine learning is wider than neural networks and deep learning', 'The Naive Bayes algorithm is used to build a spam classifier for email or text classification, and Laplace smoothing is described as an essential addition to make it work effectively', 'Logistic regression delivers higher accuracy compared to Naive Bayes', 'The discussion of Naive Bayes and its application to spam classification is the central theme of the chapter, demonstrating its relevance and importance in the context of machine learning', 'The application of Laplace smoothing in Naive Bayes algorithm simplifies parameter estimation by efficiently counting occurrences and results in a computation-efficient algorithm for classification', 'Naive Bayes offers computational efficiency and quick implementation']}
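
The margin definitions and the max-margin problem described in words above can be restated compactly. This is a reconstruction in standard notation, assuming labels y^(i) in {-1, +1} and the linearly separable case the lecture assumes for now; it is not a verbatim copy of the board work.

% Functional margin of (w, b) on example (x^{(i)}, y^{(i)}), with y^{(i)} \in \{-1, +1\}:
\hat{\gamma}^{(i)} = y^{(i)} \left( w^{\top} x^{(i)} + b \right)
% Geometric margin (unchanged if (w, b) is rescaled by any c > 0):
\gamma^{(i)} = \frac{y^{(i)} \left( w^{\top} x^{(i)} + b \right)}{\lVert w \rVert}
% Margins with respect to the whole training set (worst case over examples):
\hat{\gamma} = \min_{i} \hat{\gamma}^{(i)}, \qquad \gamma = \min_{i} \gamma^{(i)}
% Optimal margin classifier, and the convex reformulation mentioned in the summary:
\max_{\gamma, w, b} \; \gamma \quad \text{s.t.} \quad y^{(i)} \left( w^{\top} x^{(i)} + b \right) \ge \gamma \lVert w \rVert \;\; \forall i
\qquad \Longleftrightarrow \qquad
\min_{w, b} \; \tfrac{1}{2} \lVert w \rVert^{2} \quad \text{s.t.} \quad y^{(i)} \left( w^{\top} x^{(i)} + b \right) \ge 1 \;\; \forall i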
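
To see concretely why scaling the parameters "cheats" the functional margin (multiplying w and b by 10 multiplies every functional margin by 10) while leaving the decision boundary and the geometric margin untouched, here is a minimal NumPy sketch; the toy data, parameter values, and function names are illustrative only, not from the lecture.

import numpy as np

def functional_margins(w, b, X, y):
    # gamma_hat_i = y_i * (w^T x_i + b): positive exactly when example i is classified correctly
    return y * (X @ w + b)

def geometric_margins(w, b, X, y):
    # gamma_i = y_i * (w^T x_i + b) / ||w||: invariant to rescaling (w, b)
    return functional_margins(w, b, X, y) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])  # toy, linearly separable inputs
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {-1, +1}
w, b = np.array([1.0, 1.0]), 0.0

print(functional_margins(w, b, X, y).min())            # worst-case functional margin: 3.0
print(functional_margins(10 * w, 10 * b, X, y).min())  # "cheated" functional margin: 30.0
print(geometric_margins(w, b, X, y).min())             # worst-case geometric margin: ~2.12
print(geometric_margins(10 * w, 10 * b, X, y).min())   # unchanged: ~2.12

The two min values are the training-set margins (gamma hat and gamma) in the worst-case sense used above.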
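
The rewriting step mentioned in the summary, from "maximize the worst-case geometric margin" to "minimize the norm of w subject to margin constraints", turns the problem into a convex quadratic program that off-the-shelf solvers handle. A sketch of that reformulated problem using the cvxpy package on the same hypothetical toy data (cvxpy is an assumption of this sketch, not something the lecture specifies; any QP solver would do):

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])  # same illustrative toy data
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Equivalent convex form: minimize ||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print(w.value, b.value)
# For the optimal (w, b), the worst-case geometric margin on the training set is 1 / ||w||
print(1.0 / np.linalg.norm(w.value))

At the optimum the constraint holds with equality for the closest examples, so the recovered margin 1 / ||w|| is exactly the quantity the original max-gamma problem was maximizing.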
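
The recap highlights also credit Laplace smoothing with keeping the Naive Bayes parameter estimates well defined when a word never appears in one class. As a reminder of what that add-one estimate looks like for a binary feature, a small hypothetical helper; the names and toy data are illustrative only.

import numpy as np

def laplace_smoothed_phi(word_present, labels, j, cls):
    # Add-one (Laplace) smoothed estimate of P(x_j = 1 | y = cls) for a binary feature:
    # (documents of class cls containing word j + 1) / (documents of class cls + 2)
    mask = labels == cls
    return (word_present[mask, j].sum() + 1) / (mask.sum() + 2)

# Even if word j never appears in the spam documents, the estimate stays nonzero:
word_present = np.array([[0, 1], [0, 1], [0, 0], [1, 0]])  # rows = documents, columns = words
labels = np.array([1, 1, 1, 0])                            # 1 = spam, 0 = non-spam
print(laplace_smoothed_phi(word_present, labels, j=0, cls=1))  # (0 + 1) / (3 + 2) = 0.2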