title

Locally Weighted & Logistic Regression | Stanford CS229: Machine Learning - Lecture 3 (Autumn 2018)

description

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/2ZdTL4x
Andrew Ng
Adjunct Professor of Computer Science
https://www.andrewng.org/
To follow along with the course schedule and syllabus, visit:
http://cs229.stanford.edu/syllabus-autumn2018.html
An outline of this lecture includes:
Linear Regression Recap
Locally Weighted Regression
Probabilistic Interpretation
Logistic Regression
Newton's method
00:00 Introduction - recap discussion on supervised learning
05:38 Locally weighted regression
05:53 Parametric learning algorithms and non-parametric learning algorithms
21:32 Probabilistic Interpretation
46:18 Logistic Regression
1:05:57 Newton's method
#aicourse #andrewng

detail

Overall summary: the lecture covers supervised learning concepts and algorithms, including locally weighted regression, the probabilistic interpretation of linear regression, logistic regression, and Newton's method, with attention to their applications and computational considerations in machine learning.

Introduction and linear regression recap (00:00)

The lecture opens with the plan for the day: locally weighted regression, which students can play with themselves in Problem Set 1 (released later in the week); a probabilistic interpretation of linear regression, on which logistic regression will depend; logistic regression itself; and Newton's method for fitting logistic regression.

Recapping Wednesday's notation: (x_i, y_i) denotes a single training example, where x_i is (n + 1)-dimensional. With two features, say the size of a house and the number of bedrooms, x_i is 3-dimensional, because a fake feature x_0 is introduced and always set to the value 1. In regression, y_i is always a real number; m denotes the number of training examples and n the number of features. The hypothesis h_theta(x) is a linear function of the features x, including the feature x_0 = 1, and J(theta) is the cost function minimized to find the parameters theta of the straight-line fit to the data.

If the data are not fit well by a straight line, one option is to choose different features, such as a quadratic term or some other non-linear form (feature selection is covered later in the quarter). The alternative introduced in this lecture is locally weighted regression, also called locally weighted linear regression.
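The recap above can be sketched in code. The following is a minimal illustration rather than anything from the lecture itself (the toy housing numbers are invented), assuming the usual conventions: x_0 = 1 is prepended to each input, h_theta(x) = theta^T x, and J(theta) = (1/2) * sum_i (h_theta(x_i) - y_i)^2.

```python
def add_intercept(x):
    """Prepend the fake feature x0 = 1 to a raw feature list."""
    return [1.0] + list(x)

def h(theta, x):
    """Hypothesis: a linear function of the features, h_theta(x) = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))

def cost_J(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    return 0.5 * sum((h(theta, x) - yi) ** 2 for x, yi in zip(X, y))

# Invented toy data: [size in 1000 sqft, number of bedrooms] -> price.
raw = [([2.1, 3.0], 400.0), ([1.6, 2.0], 330.0), ([3.0, 4.0], 540.0)]
X = [add_intercept(features) for features, _ in raw]
y = [price for _, price in raw]

theta = [10.0, 150.0, 30.0]  # arbitrary parameters, just to evaluate J
print(cost_J(theta, X, y))
```

Gradient descent or the normal equations from the previous lecture would then pick theta to minimize J.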
Parametric and non-parametric learning algorithms (05:53)

Machine learning sometimes distinguishes between parametric and non-parametric learning algorithms. In a parametric learning algorithm, a fixed set of parameters theta_i is fit to the data; linear regression is parametric. No matter how big the training set is, once the parameters theta_i are fit, the training set can be erased from computer memory and predictions made using theta_i alone.

Locally weighted regression is the course's first exposure to a non-parametric learning algorithm. Here the amount of data and parameters that must be kept around grows, in this case linearly, with the size of the training set: all of the data must stay in computer memory or on disk just to make predictions. That may not be great with a really massive dataset.
How locally weighted regression works

To make a prediction at a value x, locally weighted regression looks in a local neighborhood at the training examples close to that point x and fits a straight line mainly to those nearby examples; the idea is presented informally and then formalized. The main modification to the cost function is an added weighting term: each error term is multiplied by a weight w_i, a value between 0 and 1 that says how much attention to pay to the example (x_i, y_i) when fitting the local line. The defining property of the weighting function is that if x_i - x is small (x being the location where the prediction is made and x_i the input of the i-th training example), the weight is close to 1, while if the example x_i is far from where the prediction is made, its error term is multiplied by 0 or by a constant very close to 0.
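Putting the pieces together, the weighted fit can be sketched from scratch for a single input feature plus an intercept. Two caveats: the Gaussian-shaped weight exp(-(x_i - x)^2 / (2 * tau^2)) is the form the lecture writes down when introducing the bandwidth tau, and the weighted least-squares problem is solved here with a closed-form 2x2 solve, which is just one convenient way to do it.

```python
import math

def lwr_predict(x_query, xs, ys, tau=1.0):
    """Locally weighted linear regression prediction at x_query.

    Fits theta0 + theta1 * x by minimizing the weighted cost
    sum_i w_i * (theta0 + theta1 * x_i - y_i)^2, where examples near
    x_query get weight close to 1 and far-away examples close to 0.
    Note the whole training set (xs, ys) is needed for every prediction.
    """
    w = [math.exp(-(xi - x_query) ** 2 / (2 * tau ** 2)) for xi in xs]
    # Weighted normal equations for the 2-parameter line, solved directly.
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, xs))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
    s_wy = sum(wi * yi for wi, yi in zip(w, ys))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    det = s_w * s_wxx - s_wx ** 2
    theta0 = (s_wxx * s_wy - s_wx * s_wxy) / det
    theta1 = (s_w * s_wxy - s_wx * s_wy) / det
    return theta0 + theta1 * x_query
```

Calling lwr_predict at many query points traces out a smooth non-linear curve even though each individual fit is a straight line.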
Although the weighting curve is bell-shaped like a Gaussian, it is not related to the Gaussian density function; its role is simply to direct attention to nearby examples when fitting the local straight line.
Choosing the bandwidth tau

For a Gaussian-shaped weighting function, the width is controlled by the bandwidth parameter tau, a parameter (or hyperparameter) of the algorithm. Depending on the choice of tau, the bell-shaped curve is fatter or thinner, so the algorithm looks in a bigger or narrower window when deciding which examples to use for fitting the local straight line. The choice of tau affects overfitting and underfitting (terms defined later in the quarter, and explored directly in the problem set): if tau is too broad, the fit over-smooths the data; if it is too thin, the fit becomes too jagged.

Two audience questions follow. First, locally weighted linear regression is usually not great at extrapolation beyond the range of the data, although the formulas still work; many learning algorithms share this weakness. Second, tau can be made variable, with quite complicated ways to choose it based on, for example, how many points lie in the local region; there is a huge literature of different formulas.

When is the algorithm a good fit? When the number of features n is quite small, say two or three, there is a lot of data, and you do not want to think about which features to use. For datasets shaped like the ones drawn in lecture, locally weighted linear regression is a pretty good algorithm.
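A quick sketch of how tau changes the weighting window (the numbers here are arbitrary examples, not from the lecture):

```python
import math

def weight(x_i, x_query, tau):
    """Bell-shaped weight: near 1 when x_i is close to x_query and near 0
    when it is far; the bandwidth tau sets how wide 'close' is."""
    return math.exp(-(x_i - x_query) ** 2 / (2 * tau ** 2))

# The same example, one unit away from the query point, under three bandwidths:
for tau in (0.1, 1.0, 10.0):
    print(tau, weight(1.0, 0.0, tau))
```

With tau = 0.1 the example one unit away is effectively ignored (weight near 0, risking a jagged fit), while with tau = 10 it counts almost fully (weight near 1, risking over-smoothing).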
Computationally, locally weighted regression remains manageable for thousands of training examples, but millions of examples may require more sophisticated algorithms.
Probabilistic interpretation of linear regression (21:32)

Why the squared error, rather than, say, the fourth power or the absolute value? The probabilistic interpretation of linear regression answers this question, and it puts the course in good standing for logistic regression and, later in the week, generalized linear models.

Assume that the true price of every house is y_i = theta^T x_i + epsilon_i, where epsilon_i is an error term that includes unmodeled effects and random noise. In other words, every house's price is a linear function of the size of the house and the number of bedrooms, plus a term for effects that are not modeled, such as the seller being in an unusually good or bad mood that day, which pushes the price higher or lower.

Assume further that epsilon_i is distributed Gaussian with mean 0 and variance sigma^2, written epsilon_i ~ N(0, sigma^2); the tilde is pronounced "is distributed as", and "normal distribution" and "Gaussian distribution" mean the same thing. The density of epsilon_i is then

    p(epsilon_i) = 1 / (sqrt(2 * pi) * sigma) * exp(-epsilon_i^2 / (2 * sigma^2)).

Unlike the bell-shaped weighting curve used earlier for locally weighted regression, this density does integrate to 1.
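The density just written can be checked numerically; the following sketch (a crude midpoint Riemann sum, purely for illustration) confirms that it integrates to 1.

```python
import math

def gaussian_density(eps, sigma):
    """N(0, sigma^2) density assumed for the error term epsilon_i:
    p(eps) = 1 / (sqrt(2*pi) * sigma) * exp(-eps^2 / (2 * sigma^2))."""
    return math.exp(-eps ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def integral(sigma, lo=-50.0, hi=50.0, n=100000):
    """Midpoint Riemann sum of the density over [lo, hi]."""
    step = (hi - lo) / n
    return sum(gaussian_density(lo + (k + 0.5) * step, sigma) * step for k in range(n))

print(integral(2.0))  # approximately 1.0
```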
plus unmodeled effects and random noise.', 'duration': 40.653, 'max_score': 1359.536, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1359536.jpg'}, {'end': 1612.392, 'src': 'embed', 'start': 1574.196, 'weight': 3, 'content': [{'end': 1581.003, 'text': "But, uh, this assumption that these epsilon i's are iid, that is, independently and identically distributed, um,", 'start': 1574.196, 'duration': 6.807}, {'end': 1586.889, 'text': 'is one of those assumptions that, you know, is probably not absolutely true, but maybe good enough that if you make this assumption,', 'start': 1581.003, 'duration': 5.886}, {'end': 1589.351, 'text': 'you get a pretty good model, okay?', 'start': 1586.889, 'duration': 2.462}, {'end': 1593.155, 'text': "Um, and so let's see.", 'start': 1590.252, 
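As a concrete sketch of the model just described — y i equals Theta transpose x i plus epsilon i, with epsilon i drawn from a Gaussian with mean 0 and variance sigma squared — the snippet below samples one training example and evaluates the Gaussian density of its error term. All of the numbers (theta, sigma, the house's features) are made-up illustration values, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up illustration values: theta = [intercept, price per sq ft, price per bedroom]
theta = np.array([50.0, 0.2, 10.0])
sigma = 15.0  # standard deviation of the unmodeled-effects noise term

# One training example: x = [1, size, #bedrooms]; the leading 1 is the intercept feature
x = np.array([1.0, 2104.0, 3.0])

# y(i) = theta^T x(i) + epsilon(i),  epsilon(i) ~ N(0, sigma^2), iid across examples
epsilon = rng.normal(0.0, sigma)
y = theta @ x + epsilon

# Gaussian density of the error: 1/(sqrt(2 pi) sigma) * exp(-epsilon^2 / (2 sigma^2));
# unlike the locally weighted regression bell curve, this integrates to 1
def gaussian_density(eps, sigma):
    return np.exp(-eps**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

print(y, gaussian_density(epsilon, sigma))
```

The density function is vectorized, so it can be evaluated on a whole grid of error values at once.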
'duration': 2.903}, {'end': 1612.392, 'text': 'Under this set of assumptions, this implies that, the density or the probability of yi given xi and Theta, this is going to be this.', 'start': 1593.756, 'duration': 18.636}], 'summary': "The assumption that epsilon i's are iid (independently and identically distributed) may not be absolutely true, but good enough for a pretty good model.", 'duration': 38.196, 'max_score': 1574.196, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1574196.jpg'}, {'end': 1679.721, 'src': 'heatmap', 'start': 1574.196, 'weight': 2, 'content': [{'end': 1581.003, 'text': "But, uh, this assumption that these epsilon i's are iid, that is, independently and identically distributed, um,", 'start': 1574.196, 'duration': 6.807}, {'end': 1586.889, 'text': 'is one of those assumptions that, you know, is probably not absolutely true, but maybe good enough that if you make this assumption,', 'start': 1581.003, 'duration': 5.886}, {'end': 1589.351, 'text': 'you get a pretty good model, okay?', 'start': 1586.889, 'duration': 2.462}, {'end': 1593.155, 'text': "Um, and so let's see.", 'start': 1590.252, 'duration': 2.903}, {'end': 1612.392, 'text': 'Under this set of assumptions, this implies that, the density or the probability of yi given xi and Theta, this is going to be this.', 'start': 1593.756, 'duration': 18.636}, {'end': 1627.245, 'text': "Okay Um, and I'll, I'll take this and write it another way.", 'start': 1612.412, 'duration': 14.833}, {'end': 1648.19, 'text': "In other words, um, given X and Theta, what's the density?", 'start': 1642.306, 'duration': 5.884}, {'end': 1650.832, 'text': "what-? what's the probability of a particular house's price?", 'start': 1648.19, 'duration': 2.642}, {'end': 1662.079, 'text': "Well, it's going to be Gaussian, with mean given by Theta transpose x i, or Theta transpose x, and variance um, given by Sigma squared, okay?", 'start': 1651.672, 'duration': 10.407}, {'end': 1671.679, 'text': 'Um, and so, uh, because the way that the price of a house is determined is by taking Theta transpose x as the, you know,', 'start': 1663.297, 'duration': 8.382}, {'end': 1677.401, 'text': 'the quote true price of the house and then adding noise or adding error of variance sigma squared to it.', 'start': 1671.679, 'duration': 5.722}, {'end': 1679.721, 'text': 'And so um the.', 'start': 1677.721, 'duration': 2}], 'summary': 'Assumption of iid errors is not absolutely true, but good enough for a pretty good model under set of assumptions.', 'duration': 28.049, 'max_score': 1574.196, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1574196.jpg'}, {'end': 1757.71, 'src': 'embed', 'start': 1731.248, 'weight': 4, 'content': [{'end': 1738.595, 'text': 'but if you were to write this notation this way, this would be conditioning on Theta, but Theta is not a random variable.', 'start': 1731.248, 'duration': 7.347}, {'end': 1743.06, 'text': "So you shouldn't condition on Theta, which is why I'm gonna write a semicolon.", 'start': 1738.675, 'duration': 4.385}, {'end': 1749.466, 'text': 'And so 
the way you read this is the probability of yi given xi and parameterized, excuse me, parameterized by.', 'start': 1743.1, 'duration': 6.366}, {'end': 1757.71, 'text': "Theta is equal to that formula, okay? Um, if- if- if you don't understand this distinction, again, don't worry too much about it.", 'start': 1750.747, 'duration': 6.963}], 'summary': 'Explaining probability notation and conditioning on theta.', 'duration': 26.462, 'max_score': 1731.248, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1731248.jpg'}], 'start': 1285.478, 'title': 'Linear regression and gaussian distribution in housing price modeling', 'summary': 'Discusses the justification for using squared error in linear regression, emphasizing the probabilistic interpretation and assumptions. it also covers the application of gaussian distribution in housing price modeling, focusing on mean, variance, and conditional probability notation.', 'chapters': [{'end': 1419.576, 'start': 1285.478, 'title': 'Justification for using squared error in linear regression', 'summary': 'Presents the probabilistic interpretation of linear regression, justifying the use of squared error and highlighting the assumptions under which least squares using squared error falls out very naturally, particularly in the context of housing price prediction.', 'duration': 134.098, 'highlights': ['The probabilistic interpretation of linear regression is presented, providing a justification for using squared error in linear regression, setting the stage for logistic regression and generalizing linear models later in the week.', "The assumption that every house's price is a linear function of the size of the house and number of bedrooms, plus an error term capturing unmodeled effects and random noise, is highlighted as a key factor in justifying the use of squared error in housing price prediction."]}, {'end': 1784.404, 'start': 1419.576, 'title': 'Gaussian distribution and 
housing prices', 'summary': 'Explains the gaussian distribution, its application in housing price modeling, and the assumption of independently and identically distributed error terms in the model, with a focus on the mean and variance of the distribution, as well as the notation for conditional probability.', 'duration': 364.828, 'highlights': ['The assumption of independently and identically distributed error terms is made, acknowledging its potential deviation from reality, but justifying its use for modeling purposes.', "Explanation of the Gaussian distribution's application in determining housing prices, with the mean given by Theta transpose X and the variance given by Sigma squared.", 'Clarification of the notation for conditional probability, using the semicolon to indicate parameterization by Theta, and the significance of terminology in statistics.']}], 'duration': 498.926, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1285478.jpg', 'highlights': ['The probabilistic interpretation of linear regression justifies using squared error, setting the stage for logistic regression and generalizing linear models.', "The assumption that every house's price is a linear function of the size of the house and number of bedrooms, plus an error term, justifies the use of squared error in housing price prediction.", "The Gaussian distribution's application in determining housing prices, with the mean given by Theta transpose X and the variance given by Sigma squared, is explained.", 'The assumption of independently and identically distributed error terms is made, justifying its use for modeling purposes.', 'Clarification of the notation for conditional probability, using the semicolon to indicate parameterization by Theta, is provided.']}, {'end': 2178.176, 'segs': [{'end': 1815.784, 'src': 'embed', 'start': 1784.444, 'weight': 5, 'content': [{'end': 1788.768, 'text': "If you get this notation wrong in your 
homework, don't worry about it, we won't penalize you, but I'll try to be consistent.", 'start': 1784.444, 'duration': 4.324}, {'end': 1793.613, 'text': "Um, but this just means that Theta, in this view, is not a random variable, it's just.", 'start': 1789.669, 'duration': 3.944}, {'end': 1797.476, 'text': 'Theta is a set of parameters that parameterizes this probability distribution.', 'start': 1793.613, 'duration': 3.863}, {'end': 1806.961, 'text': "okay?. Um, And the way to read the second equation is um, when you write these equations, you usually don't write them with parentheses,", 'start': 1797.476, 'duration': 9.485}, {'end': 1815.784, 'text': 'but the way to pause this equation is to say that this thing as a random variable the random variable Y given X and parameterized by Theta,', 'start': 1806.961, 'duration': 8.823}], 'summary': "Theta is not a random variable; it's a set of parameters for probability distribution.", 'duration': 31.34, 'max_score': 1784.444, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1784444.jpg'}, {'end': 1910.955, 'src': 'embed', 'start': 1824.947, 'weight': 3, 'content': [{'end': 1825.267, 'text': 'All right.', 'start': 1824.947, 'duration': 0.32}, {'end': 1828.628, 'text': 'Um, any questions about this?', 'start': 1827.087, 'duration': 1.541}, {'end': 1853.886, 'text': 'So it turns out that if you are willing to make those assumptions, then linear regression, um, falls out, almost naturally,', 'start': 1836.594, 'duration': 17.292}, {'end': 1856.408, 'text': 'of the assumptions we just made.', 'start': 1853.886, 'duration': 2.522}, {'end': 1869.094, 'text': 'And in particular, under the assumptions we just made, um, the, likelihood of the parameters Theta.', 'start': 1857.909, 'duration': 11.185}, {'end': 1886.024, 'text': 'So this is pronounced the likelihood of the parameters Theta, uh, L of Theta, which is defined as the probability of the data right?', 'start': 
1869.755, 'duration': 16.269}, {'end': 1895.69, 'text': "So this is the probability of all the values of y, of y1 up to ym, given all the x's and given uh, the parameters Theta- parameterized by Theta.", 'start': 1886.044, 'duration': 9.646}, {'end': 1910.955, 'text': 'um, This is equal to the product from i, equals 1 through M of P of y i, given x i parameterized by Theta.', 'start': 1895.69, 'duration': 15.265}], 'summary': 'Linear regression stems from assumptions and has a likelihood function.', 'duration': 86.008, 'max_score': 1824.947, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1824947.jpg'}, {'end': 2005.71, 'src': 'heatmap', 'start': 1943.132, 'weight': 0.884, 'content': [{'end': 1963.772, 'text': 'Okay?. Now, um, again one more piece of terminology.', 'start': 1943.132, 'duration': 20.64}, {'end': 1965.753, 'text': 'Uh, you know.', 'start': 1963.792, 'duration': 1.961}, {'end': 1970.894, 'text': "another question I often get is, you know, hey, Andrew, what's the difference between likelihood and probability?", 'start': 1965.753, 'duration': 5.141}, {'end': 1976.256, 'text': 'right?. And so the likelihood of the parameters is exactly the same thing as the probability of the data.', 'start': 1970.894, 'duration': 5.362}, {'end': 1983.08, 'text': 'Uh, but the reason we sometimes talk about likelihood and sometimes talk about probability is, um, we think of likelihood.', 'start': 1976.976, 'duration': 6.104}, {'end': 1989.403, 'text': "So this- this is some function, right? 
This thing is a function of the data as well as a function of the parameters Theta.", 'start': 1983.2, 'duration': 6.203}, {'end': 1996.127, 'text': 'And if you view this number, whatever this number is, if you view this thing as a function of the parameters holding the data fixed,', 'start': 1989.923, 'duration': 6.204}, {'end': 1998.007, 'text': 'then we call that the likelihood.', 'start': 1996.687, 'duration': 1.32}, {'end': 2005.71, 'text': "So if you think of the training set, the data is a fixed thing, and then vary the parameters Theta, then I'm gonna use the term likelihood.", 'start': 1998.147, 'duration': 7.563}], 'summary': 'Likelihood and probability are different terminologies for parameter and data relationship.', 'duration': 62.578, 'max_score': 1943.132, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1943132.jpg'}, {'end': 2005.71, 'src': 'embed', 'start': 1970.894, 'weight': 1, 'content': [{'end': 1976.256, 'text': 'right?. And so the likelihood of the parameters is exactly the same thing as the probability of the data.', 'start': 1970.894, 'duration': 5.362}, {'end': 1983.08, 'text': 'Uh, but the reason we sometimes talk about likelihood and sometimes talk about probability is, um, we think of likelihood.', 'start': 1976.976, 'duration': 6.104}, {'end': 1989.403, 'text': "So this- this is some function, right? 
This thing is a function of the data as well as a function of the parameters Theta.", 'start': 1983.2, 'duration': 6.203}, {'end': 1996.127, 'text': 'And if you view this number, whatever this number is, if you view this thing as a function of the parameters holding the data fixed,', 'start': 1989.923, 'duration': 6.204}, {'end': 1998.007, 'text': 'then we call that the likelihood.', 'start': 1996.687, 'duration': 1.32}, {'end': 2005.71, 'text': "So if you think of the training set, the data is a fixed thing, and then vary the parameters Theta, then I'm gonna use the term likelihood.", 'start': 1998.147, 'duration': 7.563}], 'summary': 'Likelihood of parameters = probability of data. Likelihood is a function of data and parameters.', 'duration': 34.816, 'max_score': 1970.894, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1970894.jpg'}, {'end': 2178.176, 'src': 'embed', 'start': 2149.386, 'weight': 0, 'content': [{'end': 2156.587, 'text': "If something is- if there's an error that's made up of lots of low noise sources which are not too correlated, then by the central limit theorem,", 'start': 2149.386, 'duration': 7.201}, {'end': 2157.528, 'text': 'it will be Gaussian.', 'start': 2156.587, 'duration': 0.941}, {'end': 2162.429, 'text': "So if you think about how the- the noise perturbations are the mood of the seller, what's the school district?", 'start': 2157.568, 'duration': 4.861}, {'end': 2168.19, 'text': "you know what's the weather like, access to transportation, and all of these sources are not too correlated, and you add them up,", 'start': 2162.429, 'duration': 5.761}, {'end': 2169.51, 'text': 'then the distribution will be Gaussian.', 'start': 2168.19, 'duration': 1.32}, {'end': 2172.072, 'text': 'Um, and- and I think, well, yeah.', 'start': 2170.25, 'duration': 1.822}, {'end': 2178.176, 'text': 'So really because of the central limit theorem, I think the Gaussian has become a default noise 
distribution.', 'start': 2174.473, 'duration': 3.703}], 'summary': 'Central limit theorem indicates gaussian distribution for uncorrelated noise sources.', 'duration': 28.79, 'max_score': 2149.386, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ2149386.jpg'}], 'start': 1784.444, 'title': 'Probability and parameterization', 'summary': 'Covers the concept of theta as parameters for probability distribution, likelihood of parameters, implications for linear regression, distinction between likelihood and probability, function of parameters and data, and rationale for choosing gaussian distribution based on the central limit theorem.', 'chapters': [{'end': 1932.091, 'start': 1784.444, 'title': 'Understanding probability distributions and parameterization', 'summary': 'Discusses the concept of theta as a set of parameters for probability distribution, the likelihood of parameters theta, and the implications for linear regression under certain assumptions.', 'duration': 147.647, 'highlights': ['The concept of Theta as a set of parameters for probability distribution is emphasized, ensuring a consistent understanding of its role in the context (e.g., not a random variable).', 'The likelihood of the parameters Theta (L of Theta) is defined as the probability of the data, with a specific emphasis on the product of probabilities due to the independence assumption made for the error terms.', 'The implications for linear regression under the assumptions made are highlighted, indicating its natural emergence from the specified assumptions.']}, {'end': 2178.176, 'start': 1932.972, 'title': 'Understanding likelihood and probability', 'summary': 'Explains the distinction between likelihood and probability, the function of parameters and data, and the rationale for choosing gaussian distribution based on the central limit theorem.', 'duration': 245.204, 'highlights': ['The likelihood of the parameters is exactly the same as 
the probability of the data, but the distinction lies in viewing the function as a function of the data or as a function of the parameters, holding the other fixed.', 'The likelihood and probability are used based on whether the function is viewed as a function of the parameters (likelihood) or the data (probability).', 'The choice of Gaussian distribution for the error term is based on the central limit theorem, which states that most error distributions are Gaussian when the error is made up of low noise sources that are not too correlated.']}], 'duration': 393.732, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ1784444.jpg', 'highlights': ['The choice of Gaussian distribution for the error term is based on the central limit theorem, which states that most error distributions are Gaussian when the error is made up of low noise sources that are not too correlated.', 'The likelihood and probability are used based on whether the function is viewed as a function of the parameters (likelihood) or the data (probability).', 'The likelihood of the parameters is exactly the same as the probability of the data, but the distinction lies in viewing the function as a function of the data or as a function of the parameters, holding the other fixed.', 'The implications for linear regression under the assumptions made are highlighted, indicating its natural emergence from the specified assumptions.', 'The likelihood of the parameters Theta (L of Theta) is defined as the probability of the data, with a specific emphasis on the product of probabilities due to the independence assumption made for the error terms.', 'The concept of Theta as a set of parameters for probability distribution is emphasized, ensuring a consistent understanding of its role in the context (e.g., not a random variable).']}, {'end': 2557.667, 'segs': [{'end': 2405.975, 'src': 'embed', 'start': 2345.714, 'weight': 0, 'content': [{'end': 2351.558, 
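The central limit theorem argument above — an error made up of many small, not-too-correlated noise sources tends to be Gaussian — is easy to check empirically. In the sketch below, each individual effect is uniform (decidedly non-Gaussian on its own); the uniform sources are purely a hypothetical stand-in for effects like the seller's mood or the school district:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each "house error" is the sum of many small, roughly independent effects.
# Each effect here is uniform on [-1, 1], which is not Gaussian by itself.
n_effects, n_houses = 50, 100_000
effects = rng.uniform(-1.0, 1.0, size=(n_houses, n_effects))
epsilon = effects.sum(axis=1)

# By the central limit theorem the sum is approximately Gaussian:
# mean ~ 0, variance ~ n_effects * Var(Uniform[-1, 1]) = 50 * (1/3)
print(epsilon.mean(), epsilon.var())
```

Plotting a histogram of `epsilon` would show the familiar bell shape, even though no single source is bell-shaped.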
'text': 'So the value of Theta that maximizes the log likelihood should be the same as the value of Theta that maximizes the likelihood.', 'start': 2345.714, 'duration': 5.844}, {'end': 2358.964, 'text': "And if you derive the log likelihood, um, we conclude that if you're using maximum likelihood estimation,", 'start': 2352.399, 'duration': 6.565}, {'end': 2362.327, 'text': "what you'd like to do is choose a value of Theta that maximizes this thing.", 'start': 2358.964, 'duration': 3.363}, {'end': 2369.537, 'text': "Right? But, uh, this first term is just a constant, Theta doesn't even appear in this first term.", 'start': 2363.154, 'duration': 6.383}, {'end': 2375.62, 'text': "And so what you'd like to do is choose a value of Theta that maximizes the second term.", 'start': 2370.317, 'duration': 5.303}, {'end': 2378.161, 'text': "Uh, notice there's a minus sign there.", 'start': 2376.4, 'duration': 1.761}, {'end': 2391.927, 'text': "And so what you'd like to do is, um, uh, i.e., you know, choose Theta to minimize this term.", 'start': 2378.881, 'duration': 13.046}, {'end': 2405.975, 'text': 'Right. Oh so, sigma squared is just a constant right?', 'start': 2402.714, 'duration': 3.261}], 'summary': 'Maximize theta for log likelihood in maximum likelihood estimation', 'duration': 60.261, 'max_score': 2345.714, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ2345714.jpg'}, {'end': 2458.523, 'src': 'embed', 'start': 2426.321, 'weight': 2, 'content': [{'end': 2438.445, 'text': 'Okay?. So this little proof shows that um, choosing the value of Theta to minimize the least squares errors, like you saw last Wednesday,', 'start': 2426.321, 'duration': 12.124}, {'end': 2448.975, 'text': "that's just finding the maximum likelihood estimate for the parameters Theta under the set of assumptions we made that the error terms are Gaussian and iid.", 'start': 2438.445, 'duration': 10.53}, {'end': 2451.857, 'text': 'Okay? 
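The equivalence just derived — the log likelihood l(Theta) is a constant (not involving Theta) minus J(Theta) over sigma squared, so maximizing l(Theta) is exactly minimizing the least-squares cost J(Theta) — can be checked numerically. The toy dataset below is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy training set (made-up numbers, purely for illustration)
m = 20
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])  # intercept + 2 features
theta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ theta_true + rng.normal(0.0, sigma, size=m)

def J(theta):
    # least-squares cost: one half the sum of squared residuals
    return 0.5 * np.sum((y - X @ theta) ** 2)

def log_likelihood(theta):
    # l(theta) = m * log(1 / (sqrt(2 pi) sigma))  -  (1 / sigma^2) * J(theta);
    # the first term is constant in theta, so argmax l = argmin J
    return m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - J(theta) / sigma**2

# The least-squares solution (normal equations) is the maximum likelihood estimate
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ls, log_likelihood(theta_ls))
```

Any other value of Theta gives a strictly lower log likelihood and a strictly higher cost, which is the content of the proof on the board.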
Um, go ahead.', 'start': 2449.956, 'duration': 1.901}, {'end': 2454.3, 'text': 'Oh, thank you.', 'start': 2453.839, 'duration': 0.461}, {'end': 2455.761, 'text': 'Yes, great.', 'start': 2454.88, 'duration': 0.881}, {'end': 2457.983, 'text': 'Yeah Thanks.', 'start': 2456.882, 'duration': 1.101}, {'end': 2458.523, 'text': 'Yeah, go ahead.', 'start': 2458.063, 'duration': 0.46}], 'summary': 'Choosing theta to minimize least squares errors is finding the maximum likelihood estimate for theta under gaussian and iid assumptions.', 'duration': 32.202, 'max_score': 2426.321, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ2426321.jpg'}], 'start': 2178.817, 'title': 'Maximum likelihood estimation and cost function optimization', 'summary': 'Covers the concepts of maximum likelihood estimation (mle) and its application in parameter estimation, emphasizing the use of log likelihood. it also discusses optimizing the cost function in linear regression to maximize performance and the relationship between terms and constants involved.', 'chapters': [{'end': 2362.327, 'start': 2178.817, 'title': 'Maximum likelihood estimation', 'summary': 'Discusses the concept of maximum likelihood estimation (mle) and its application in estimating parameters, emphasizing the use of log likelihood to simplify algebra and derive the value of theta that maximizes the likelihood.', 'duration': 183.51, 'highlights': ['The concept of maximum likelihood estimation (MLE) is explored, emphasizing the selection of Theta to maximize the likelihood of the data.', 'The use of log likelihood, denoted as lowercase l, is highlighted for simplifying algebra and deriving the value of Theta that maximizes the likelihood.', 'The process of maximizing the log likelihood is explained as equivalent to maximizing the likelihood, demonstrating the value of Theta that maximizes the log likelihood.']}, {'end': 2426.301, 'start': 2363.154, 'title': 'Optimizing cost 
function in linear regression', 'summary': 'Discusses the process of minimizing the cost function j(θ) to maximize the second term and optimize the performance of linear regression models, emphasizing the importance of choosing the right value for θ and understanding the relationship between the terms and constants involved.', 'duration': 63.147, 'highlights': ['Choosing a value of Theta that maximizes the second term is crucial in optimizing the performance of linear regression models.', 'Minimizing the cost function J(Θ) is equivalent to maximizing the second term, emphasizing the importance of selecting the right value for Θ.', 'The first term in the cost function J(Θ) does not contain Theta, highlighting the need to focus on maximizing the second term for optimization.', 'Sigma squared is emphasized as a constant, further underlining the significance of maximizing the second term for optimal results.']}, {'end': 2557.667, 'start': 2426.321, 'title': 'Maximum likelihood estimation in least squares', 'summary': 'Explains how choosing the value of theta to minimize least squares errors is equivalent to finding the maximum likelihood estimate for the parameters theta under the assumptions of gaussian and iid error terms, with minimal consideration for non-iid datasets.', 'duration': 131.346, 'highlights': ['The derivation shows that choosing the value of Theta to minimize least squares errors is equivalent to finding the maximum likelihood estimate for the parameters Theta under the assumptions of Gaussian and iid error terms, promoting the use of least squares in statistics.', 'In cases where the assumptions about the training set being non-iid are known to be very non-IID, more sophisticated models could be built, but it is often not considered due to computational efficiency and practicality.', 'The use of more sophisticated models for non-iid datasets is rare and may only be necessary in special cases with severe violations of assumptions or insufficient 
data.']}], 'duration': 378.85, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ2178817.jpg', 'highlights': ['The process of maximizing the log likelihood is explained as equivalent to maximizing the likelihood, demonstrating the value of Theta that maximizes the log likelihood.', 'Choosing a value of Theta that maximizes the second term is crucial in optimizing the performance of linear regression models.', 'The derivation shows that choosing the value of Theta to minimize least squares errors is equivalent to finding the maximum likelihood estimate for the parameters Theta under the assumptions of Gaussian and iid error terms, promoting the use of least squares in statistics.']}, {'end': 3639.234, 'segs': [{'end': 2799.136, 'src': 'embed', 'start': 2764.554, 'weight': 0, 'content': [{'end': 2769.916, 'text': 'you know for a classification problem that the values are, you know, 0 or 1, right?', 'start': 2764.554, 'duration': 5.362}, {'end': 2776.059, 'text': 'And so to output negative values or values even greater than 1, seems, seems strange.', 'start': 2769.956, 'duration': 6.103}, {'end': 2788.864, 'text': "Um, so what I'd like to share with you now is really probably by far the most commonly used classification algorithm, uh, called logistic regression.", 'start': 2776.619, 'duration': 12.245}, {'end': 2799.136, 'text': 'I wanna say the two learning algorithms I probably use the most often are linear regression and logistic regression.', 'start': 2794.113, 'duration': 5.023}], 'summary': 'Logistic regression is the most commonly used classification algorithm, with linear regression being another frequently used one.', 'duration': 34.582, 'max_score': 2764.554, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ2764554.jpg'}, {'end': 2846.374, 'src': 'embed', 'start': 2819.648, 'weight': 4, 'content': [{'end': 2824.671, 'text': 'And this is mathematical 
notation for the values for h of x or h prime.', 'start': 2819.648, 'duration': 5.023}, {'end': 2829.535, 'text': 'you know, h subscript Theta of x, uh lies in the set from 0 to 1, right?', 'start': 2824.671, 'duration': 4.864}, {'end': 2833.498, 'text': 'The 0 to 1 square bracket is a set of all real numbers from 0 to 1..', 'start': 2829.575, 'duration': 3.923}, {'end': 2841.65, 'text': 'So this says we want the hypothesis to output values in, you know, between 0 and 1, so in- in the set of all numbers between- from 0 to 1.', 'start': 2833.498, 'duration': 8.152}, {'end': 2846.374, 'text': "Um, and so we're going to choose the following form of the hypothesis.", 'start': 2841.65, 'duration': 4.724}], 'summary': 'Hypothesis output values between 0 and 1.', 'duration': 26.726, 'max_score': 2819.648, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ2819648.jpg'}, {'end': 3083.743, 'src': 'embed', 'start': 3053.191, 'weight': 3, 'content': [{'end': 3058.73, 'text': "So I'm going to assume that the data has the following distribution.", 'start': 3053.191, 'duration': 5.539}, {'end': 3067.096, 'text': 'The probability of y being 1, uh, again, from the breast cancer prediction that we had from uh, the first lecture, right,', 'start': 3059.23, 'duration': 7.866}, {'end': 3074.581, 'text': 'it would be the chance of a tumor being cancerous or being, um, uh, malignant, chance of y being 1, given the size of the tumor.', 'start': 3067.096, 'duration': 7.485}, {'end': 3083.743, 'text': "that's the feature x parameterized by Theta, that this is equal to the output of your hypothesis.", 'start': 3074.581, 'duration': 9.162}], 'summary': 'Assuming data distribution for breast cancer prediction.', 'duration': 30.552, 'max_score': 3053.191, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3053191.jpg'}, {'end': 3180.443, 'src': 'embed', 'start': 3157.225, 'weight': 5, 
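A minimal sketch of the sigmoid (logistic) hypothesis just introduced — h Theta of x equals g of Theta transpose x, where g of z is 1 over 1 plus e to the minus z, which always lands strictly between 0 and 1. The parameter and feature values here are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}): squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)
    return sigmoid(theta @ x)

theta = np.array([-3.0, 0.1])   # made-up parameters
x = np.array([1.0, 25.0])       # [intercept feature, tumor size]
p_malignant = h(theta, x)       # a number strictly between 0 and 1
print(p_malignant)
```

Because g saturates at 0 and 1, the hypothesis never outputs the negative values or values greater than 1 that made plain linear regression awkward for classification.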
'content': [{'end': 3166.091, 'text': "Um, and now, bearing in mind that y right by definition, because it's a binary classification problem.", 'start': 3157.225, 'duration': 8.866}, {'end': 3170.154, 'text': 'But bearing in mind that y can only take on two values, 0 or 1, um,', 'start': 3166.711, 'duration': 3.443}, {'end': 3178.561, 'text': "there's a nifty sort of little algebra way to take these two equations and write them in one equation.", 'start': 3170.154, 'duration': 8.407}, {'end': 3180.443, 'text': 'And this will make some of the math a little bit easier.', 'start': 3178.641, 'duration': 1.802}], 'summary': 'In binary classification, y can only take on the values 0 or 1, simplifying the math.', 'duration': 23.218, 'max_score': 3157.225, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3157225.jpg'}, {'end': 3324.591, 'src': 'embed', 'start': 3297.017, 'weight': 7, 'content': [{'end': 3305.461, 'text': "one of these two terms switches off because it's exponentiated to the power of 0, um, and anything to the power of 0 is just equal to 1, right?", 'start': 3297.017, 'duration': 8.444}, {'end': 3308.742, 'text': 'So one of these terms is just, you know, 1, uh,', 'start': 3305.481, 'duration': 3.261}, {'end': 3314.345, 'text': 'thus leaving the other term and thus selecting the- the appropriate equation depending on whether y is 0 or 1..', 'start': 3308.742, 'duration': 5.603}, {'end': 3324.591, 'text': "Okay? 
So with that, um, uh, so with this little, uh, I don't know, notational trick, it'll make the later derivations simpler.", 'start': 3314.345, 'duration': 10.246}], 'summary': "Using exponentiation to simplify terms, selecting equations based on y's value", 'duration': 27.574, 'max_score': 3297.017, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3297017.jpg'}, {'end': 3461.883, 'src': 'heatmap', 'start': 3385.307, 'weight': 0.703, 'content': [{'end': 3392.493, 'text': 'i times 1 minus h of x i to the power of 1 minus y i.', 'start': 3385.307, 'duration': 7.186}, {'end': 3397.977, 'text': 'okay?, Where, uh, all I did was take this definition of p of y given x, parameterized by Theta.', 'start': 3392.493, 'duration': 5.484}, {'end': 3404.982, 'text': 'uh, you know from that, after we did that little exponentiation trick and wrote it in here, right?', 'start': 3397.977, 'duration': 7.005}, {'end': 3419.043, 'text': "Um, And then, uh, with maximum likelihood estimation, we'll want to find the value of Theta that maximizes the likelihood.", 'start': 3405.883, 'duration': 13.16}, {'end': 3420.884, 'text': 'maximize the likelihood of the parameters.', 'start': 3419.043, 'duration': 1.841}, {'end': 3430.026, 'text': 'And so um, same as what we did for linear regression to make the algebra you know to- to make the algebra a bit more simple,', 'start': 3421.764, 'duration': 8.262}, {'end': 3433.667, 'text': "we're going to take the log of the likelihood and so compute the log likelihood.", 'start': 3430.026, 'duration': 3.641}, {'end': 3440.508, 'text': "And so that's equal to, um, Let's see.", 'start': 3434.407, 'duration': 6.101}, {'end': 3461.883, 'text': 'Right And so if you take the log of that, uh, you end up with- you end up with that.', 'start': 3440.989, 'duration': 20.894}], 'summary': 'Using maximum likelihood estimation to find theta that maximizes likelihood.', 'duration': 76.576, 'max_score': 3385.307, 
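The exponentiation trick just described — writing p of y given x as h to the y times 1 minus h to the 1 minus y, so that one factor switches off depending on whether y is 0 or 1 — and the resulting log likelihood can be written out directly. The tiny tumor dataset below is invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# p(y | x; theta) = h(x)^y * (1 - h(x))^(1 - y): since y is 0 or 1, one of the
# two factors is raised to the 0th power and switches off (anything^0 = 1).
def prob(y, h):
    return h**y * (1.0 - h)**(1 - y)

# Log likelihood: l(theta) = sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]
def log_likelihood(theta, X, y):
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny made-up dataset: columns are [intercept, tumor size]; label 1 = malignant
X = np.array([[1.0, 0.5], [1.0, 2.3], [1.0, 2.9], [1.0, 0.1]])
y = np.array([0.0, 1.0, 1.0, 0.0])
theta = np.array([-2.0, 1.5])
print(log_likelihood(theta, X, y))
```

Taking the log of the product of per-example probabilities gives the sum above, the same log trick used for linear regression.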
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3385307.jpg'}, {'end': 3532.711, 'src': 'embed', 'start': 3505.865, 'weight': 1, 'content': [{'end': 3514.647, 'text': "And then what you need to do is have an algorithm such as gradient ascent (we'll talk about that in a sec) to try to find the value of Theta that maximizes the log likelihood.", 'start': 3505.865, 'duration': 8.782}, {'end': 3521.789, 'text': "And then, having chosen the value of Theta when a new patient walks into the doctor's office, you would, you know,", 'start': 3515.407, 'duration': 6.382}, {'end': 3532.711, 'text': 'take the features of the new tumor and then use H of Theta to estimate the chance that the tumor of the new patient who walks in tomorrow is malignant.', 'start': 3521.789, 'duration': 10.922}], 'summary': 'Using an iterative algorithm such as gradient ascent to find Theta, which is then used to estimate the chance that a new tumor is malignant.', 'duration': 26.846, 'max_score': 3505.865, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3505865.jpg'}, {'end': 3631.23, 'src': 'embed', 'start': 3599.425, 'weight': 8, 'content': [{'end': 3603.13, 'text': 'And the second change is, previously you were trying to minimize the squared error.', 'start': 3599.425, 'duration': 3.705}, {'end': 3604.792, 'text': "That's why we had the minus.", 'start': 3603.59, 'duration': 1.202}, {'end': 3610.479, 'text': "And today you're trying to maximize the log likelihood, which is why there's a plus sign.", 'start': 3605.212, 'duration': 5.267}, {'end': 3623.806, 'text': 'Okay?. 
And so um, so gradient descent, you know, is trying to climb down this hill, whereas gradient ascent has a, um,', 'start': 3610.959, 'duration': 12.847}, {'end': 3631.23, 'text': "uh has a- has a concave function like this and it's trying to right climb up the hill rather than climb down the hill.", 'start': 3623.806, 'duration': 7.424}], 'summary': 'Transition from minimizing squared error to maximizing log likelihood in gradient descent and ascent.', 'duration': 31.805, 'max_score': 3599.425, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3599425.jpg'}], 'start': 2558.428, 'title': 'Logistic regression for classification', 'summary': 'Discusses the limitations of linear regression for classification, introduces logistic regression for generating values between 0 and 1, and explains its application for classification problems. it also covers binary classification assumptions and calculations, as well as maximum likelihood estimation and gradient ascent algorithm.', 'chapters': [{'end': 3010.186, 'start': 2558.428, 'title': 'Logistic regression for classification', 'summary': 'Discusses the limitations of using linear regression for classification, introducing the logistic regression algorithm as a solution to output values between 0 and 1, addressing the need for a suitable hypothesis form and explaining its application for classification problems.', 'duration': 451.758, 'highlights': ['Introduction of logistic regression as the most commonly used classification algorithm and its role in outputting values between 0 and 1.', 'Explanation of the need for the hypothesis to output values between 0 and 1, and the choice of the sigmoid function to force the output within this range.', 'Discussion on the choice of the logistic function and its shape to output values between 0 and 1, compared to other potential functions.']}, {'end': 3314.345, 'start': 3010.206, 'title': 'Binary classification example', 'summary': 
'Discusses the assumptions and calculations involved in a binary classification problem, providing insight into the probability distribution and parameterization.', 'duration': 304.139, 'highlights': ['The probability distribution of y being 1, given the size of the tumor, is discussed in the context of breast cancer prediction.', 'The method of compressing the two assumptions about P(y) into one equation is explained to simplify the math involved in the binary classification problem.', 'The nifty algebraic method of compressing the assumptions into one equation is highlighted, demonstrating the switching off of terms based on whether y is 0 or 1.']}, {'end': 3639.234, 'start': 3314.345, 'title': 'Maximum likelihood estimation', 'summary': 'Discusses the concept of maximum likelihood estimation, likelihood of parameters, log likelihood, and the algorithm of gradient ascent for maximizing the log likelihood.', 'duration': 324.889, 'highlights': ['The algorithm for choosing Theta to maximize the log likelihood is gradient ascent or batch gradient ascent, updating the parameters Theta j according to Theta j plus the partial derivative with respect to the log likelihood.', 'The process involves defining the likelihood and log likelihood, and then using an algorithm such as gradient descent to find the value of Theta that maximizes the log likelihood.', 'The difference between the algorithm for linear regression and maximum likelihood estimation lies in optimizing the log likelihood instead of the squared cost function, and trying to maximize the log likelihood instead of minimizing the squared error.']}], 'duration': 1080.806, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ2558428.jpg', 'highlights': ['Introduction of logistic regression as the most commonly used classification algorithm and its role in outputting values between 0 and 1.', 'The algorithm for choosing Theta to maximize the log likelihood is 
gradient ascent or batch gradient ascent, updating the parameters Theta j according to Theta j plus the learning rate Alpha times the partial derivative of the log likelihood with respect to Theta j.', 'The process involves defining the likelihood and log likelihood, and then using an algorithm such as gradient ascent to find the value of Theta that maximizes the log likelihood.', 'The probability distribution of y being 1, given the size of the tumor, is discussed in the context of breast cancer prediction.', 'Explanation of the need for the hypothesis to output values between 0 and 1, and the choice of the sigmoid function to force the output within this range.', 'The method of compressing the two assumptions about P(y) into one equation is explained to simplify the math involved in the binary classification problem.', 'Discussion on the choice of the logistic function and its shape to output values between 0 and 1, compared to other potential functions.', 'The nifty algebraic method of compressing the assumptions into one equation is highlighted, demonstrating the switching off of terms based on whether y is 0 or 1.', 'The difference between the algorithm for linear regression and maximum likelihood estimation lies in optimizing the log likelihood instead of the squared cost function, and trying to maximize the log likelihood instead of minimizing the squared error.']}, {'end': 4769.457, 'segs': [{'end': 3734.545, 'src': 'embed', 'start': 3691.809, 'weight': 2, 'content': [{'end': 3703.075, 'text': "You update Theta j according to, oh actually, I'm sorry, I forgot the learning rate.", 'start': 3691.809, 'duration': 11.266}, {'end': 3704.876, 'text': "Yeah, it's your learning rate Alpha.", 'start': 3703.395, 'duration': 1.481}, {'end': 3708.577, 'text': 'Okay Learning rate Alpha times this.', 'start': 3705.356, 'duration': 3.221}, {'end': 3722.317, 'text': 'Okay uh, because this term here is the partial derivative with respect to Theta j of the log likelihood.', 'start': 3708.597, 'duration': 13.72}, 
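The batch gradient ascent rule summarized here, Theta_j := Theta_j + Alpha times the partial derivative of the log likelihood with respect to Theta_j, which for the sigmoid hypothesis works out to adding Alpha * sum_i (y_i - h(x_i)) * x_ij, can be sketched in NumPy. The toy dataset and learning rate below are made up for illustration, not taken from the lecture:

```python
import numpy as np

def sigmoid(z):
    # h_theta(x) = 1 / (1 + e^(-theta^T x))
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.05, iters=2000):
    # Theta_j := Theta_j + alpha * d/dTheta_j l(Theta); in vector form the
    # gradient of the log likelihood is X^T (y - h), hence the plus sign
    # (we climb the hill rather than descend it).
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * X.T @ (y - h)
    return theta

# Toy data: intercept column plus one feature; the label flips past x ~ 2.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5],
              [1.0, 2.5], [1.0, 3.0], [1.0, 3.5]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
theta = batch_gradient_ascent(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
print((preds == y).all())  # the learned boundary separates the toy data
```

Because the log likelihood for this hypothesis is concave, the ascent has a single global maximum to climb toward, which is the property the lecture emphasizes.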
{'end': 3732.664, 'text': 'Okay? And the full calculus and so on derivation is given in the lecture notes.', 'start': 3727.921, 'duration': 4.743}, {'end': 3734.545, 'text': 'Okay? Um, yeah.', 'start': 3733.264, 'duration': 1.281}], 'summary': 'Updating theta j using learning rate alpha for log likelihood derivative.', 'duration': 42.736, 'max_score': 3691.809, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3691809.jpg'}, {'end': 3785.646, 'src': 'embed', 'start': 3758.005, 'weight': 0, 'content': [{'end': 3762.147, 'text': 'because if you choose the logistic function rather than some other function 0 to 1,', 'start': 3758.005, 'duration': 4.142}, {'end': 3766.788, 'text': "you're guaranteed that the likelihood function has only one global maximum, uh.", 'start': 3762.147, 'duration': 4.641}, {'end': 3769.65, 'text': "And this- there's actually a big class of algorithms.", 'start': 3767.288, 'duration': 2.362}, {'end': 3777.018, 'text': 'Actually, what you see on Wednesday is this big class of algorithms of which linear regression is one example, logistic regression is another example.', 'start': 3769.67, 'duration': 7.348}, {'end': 3782.503, 'text': 'And for all of the algorithms in this class, there are no local optimal problems when you- when you derive them this way.', 'start': 3777.338, 'duration': 5.165}, {'end': 3785.646, 'text': 'So you see that on Wednesday when we talk about generalizing the models.', 'start': 3782.543, 'duration': 3.103}], 'summary': 'Logistic function ensures one global maximum for likelihood function in a class of algorithms including linear and logistic regression.', 'duration': 27.641, 'max_score': 3758.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3758005.jpg'}, {'end': 3884.928, 'src': 'embed', 'start': 3855.89, 'weight': 8, 'content': [{'end': 3861.114, 'text': 'Actually, this is a general property of a much 
bigger class of algorithms called generalized linear models.', 'start': 3855.89, 'duration': 5.224}, {'end': 3870.459, 'text': 'Um, Although, yeah, interesting historical divergence, uh, uh, uh, because of the confusion between these two algorithms.', 'start': 3862.114, 'duration': 8.345}, {'end': 3876.043, 'text': 'in the early history of machine learning there was some debate about, you know, between academics saying no, I invented that.', 'start': 3870.459, 'duration': 5.584}, {'end': 3877.003, 'text': 'no, I invented that.', 'start': 3876.043, 'duration': 0.96}, {'end': 3884.928, 'text': "But- and then- and then- and then it goes, no, it's actually different algorithms, right? Um, all right.", 'start': 3877.283, 'duration': 7.645}], 'summary': 'Generalized linear models are part of a larger class of algorithms, sparking historical debate in early machine learning.', 'duration': 29.038, 'max_score': 3855.89, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3855890.jpg'}, {'end': 3936.435, 'src': 'embed', 'start': 3906.533, 'weight': 6, 'content': [{'end': 3911.657, 'text': 'Uh, there is no known way to just have a closed form equation that lets you find the best value of Theta,', 'start': 3906.533, 'duration': 5.124}, {'end': 3919.704, 'text': "which is why you always have to use an algorithm uh iterative optimization algorithm such as gradient ascent or uh and we'll see in a second Newton's method.", 'start': 3911.657, 'duration': 8.047}, {'end': 3936.435, 'text': "Cool So, um, this is a great lead-in to, um, the last topic for today, which is Newton's method.", 'start': 3922.426, 'duration': 14.009}], 'summary': "No closed form equation for finding best theta, need iterative optimization algorithm like gradient ascent or newton's method.", 'duration': 29.902, 'max_score': 3906.533, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3906533.jpg'}, 
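The quantity being maximized by these iterative methods is the likelihood built from the single-equation form discussed earlier, p(y | x; Theta) = h(x)^y * (1 - h(x))^(1 - y), where one factor "switches off" depending on whether y is 0 or 1. That trick can be checked directly; a minimal sketch with arbitrary numbers:

```python
import math

def sigmoid(z):
    # h_theta(x) = 1 / (1 + e^(-theta^T x))
    return 1.0 / (1.0 + math.exp(-z))

def p_y_given_x(h, y):
    # Single-equation form of the two Bernoulli cases:
    # y = 1 -> (1 - h)^0 == 1, leaving h; y = 0 -> h^0 == 1, leaving 1 - h.
    return (h ** y) * ((1.0 - h) ** (1 - y))

h = sigmoid(0.8)  # arbitrary hypothesis output for one example
print(p_y_given_x(h, 1) == h)        # True: reduces to h when y = 1
print(p_y_given_x(h, 0) == 1.0 - h)  # True: reduces to 1 - h when y = 0
```

Taking logs of the product of these terms over the training set gives the log likelihood that gradient ascent and Newton's method optimize; since no closed-form maximizer exists, an iterative algorithm is required.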
{'end': 4004.157, 'src': 'embed', 'start': 3975.824, 'weight': 3, 'content': [{'end': 3979.947, 'text': 'so, uh, there are problems where you might need, you know, say, 100 iterations or 1,', 'start': 3975.824, 'duration': 4.123}, {'end': 3984.989, 'text': "000 iterations of gradient ascent that if you run this algorithm called Newton's method,", 'start': 3979.947, 'duration': 5.042}, {'end': 3989.291, 'text': 'you might need only 10 iterations to get a very good value of Theta.', 'start': 3984.989, 'duration': 4.302}, {'end': 3991.872, 'text': 'Um, but each iteration will be more expensive.', 'start': 3989.311, 'duration': 2.561}, {'end': 3993.172, 'text': "We'll talk about pros and cons in a second.", 'start': 3991.892, 'duration': 1.28}, {'end': 3997.934, 'text': "But um, let's see how- let's- let's describe this algorithm,", 'start': 3993.853, 'duration': 4.081}, {'end': 4004.157, 'text': 'which is sometimes much faster than gradient ascent for optimizing the value of Theta.', 'start': 3997.934, 'duration': 6.223}], 'summary': "Newton's method may require 10 iterations for optimal theta, faster than gradient ascent.", 'duration': 28.333, 'max_score': 3975.824, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3975824.jpg'}, {'end': 4058.94, 'src': 'heatmap', 'start': 4004.833, 'weight': 0.726, 'content': [{'end': 4010.355, 'text': "Okay So, um, what we'd like to do is, uh, all right.", 'start': 4004.833, 'duration': 5.522}, {'end': 4016.018, 'text': "So let me, let me use a simplified one-dimensional problem to describe Newton's method.", 'start': 4010.756, 'duration': 5.262}, {'end': 4035.363, 'text': "Um, so I'm gonna solve a slightly different problem with Newton's method, which is, say, you have some function, f, right, and you want to find, uh,", 'start': 4022.561, 'duration': 12.802}, {'end': 4041.946, 'text': 'Theta such that f of Theta is equal to 0, okay?', 'start': 4035.363, 'duration': 
6.583}, {'end': 4045.268, 'text': "So this is a problem that Newton's method solves.", 'start': 4043.027, 'duration': 2.241}, {'end': 4058.94, 'text': "And the way we're gonna, uh, use this later is what you really want is to maximize L of Theta Right?", 'start': 4045.808, 'duration': 13.132}], 'summary': "Describing newton's method for solving f(theta) = 0 and maximizing l(theta).", 'duration': 54.107, 'max_score': 4004.833, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ4004833.jpg'}, {'end': 4503.468, 'src': 'heatmap', 'start': 4436.347, 'weight': 1, 'content': [{'end': 4440.729, 'text': 'because we want to find the place where the first derivative of L is 0,', 'start': 4436.347, 'duration': 4.382}, {'end': 4455.597, 'text': 'then this becomes Theta t plus 1 gets updated as Theta t minus L prime of Theta t over L double prime of Theta t.', 'start': 4440.729, 'duration': 14.868}, {'end': 4459.519, 'text': "So it's really, uh, the first derivative, divided by the second derivative.", 'start': 4455.597, 'duration': 3.922}, {'end': 4484.44, 'text': "okay?. 
Um, so, Newton's method is a very fast algorithm and uh, it has.", 'start': 4459.519, 'duration': 24.921}, {'end': 4489.764, 'text': "um, Newton's method enjoys a property called quadratic convergence.", 'start': 4484.44, 'duration': 5.324}, {'end': 4495.128, 'text': "Not a great name, don't worry, don't worry too much about what it means.", 'start': 4492.426, 'duration': 2.702}, {'end': 4503.468, 'text': "But- but informally what it means is that, um, if on one iteration, Newton's method is 0.01 error.", 'start': 4495.528, 'duration': 7.94}], 'summary': "Newton's method achieves quadratic convergence with 0.01 error on one iteration.", 'duration': 43.949, 'max_score': 4436.347, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ4436347.jpg'}, {'end': 4656.799, 'src': 'embed', 'start': 4627.921, 'weight': 7, 'content': [{'end': 4632.504, 'text': 'So it becomes a squared matrix with the dimension equal to the parameter vector Theta.', 'start': 4627.921, 'duration': 4.583}, {'end': 4643.75, 'text': 'And the Hessian matrix is defined as the matrix of partial derivatives, um, Right.', 'start': 4633.784, 'duration': 9.966}, {'end': 4646.472, 'text': 'So, um and so.', 'start': 4643.83, 'duration': 2.642}, {'end': 4653.497, 'text': "the disadvantage of Newton's method is that in high dimensional problems, if Theta is a vector,", 'start': 4646.472, 'duration': 7.025}, {'end': 4656.799, 'text': "that each step of Newton's method is much more expensive.", 'start': 4653.497, 'duration': 3.302}], 'summary': "Hessian matrix is defined by partial derivatives, newton's method is expensive in high dimensional problems", 'duration': 28.878, 'max_score': 4627.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ4627921.jpg'}, {'end': 4747.764, 'src': 'embed', 'start': 4723.327, 'weight': 5, 'content': [{'end': 4729.832, 'text': 'Okay?. 
But if the number of parameters is not too big so that the computational cost per iteration is manageable,', 'start': 4723.327, 'duration': 6.505}, {'end': 4735.756, 'text': "then Newton's method converges in a very small number of iterations and and can be much faster algorithm than gradient descent.", 'start': 4729.832, 'duration': 5.924}, {'end': 4737.555, 'text': 'All right.', 'start': 4737.295, 'duration': 0.26}, {'end': 4742.68, 'text': "So, um, that's it for, uh, Newton's methods.", 'start': 4738.956, 'duration': 3.724}, {'end': 4745.382, 'text': "Um, on Wednesday- I guess I'm running out of time.", 'start': 4743.02, 'duration': 2.362}, {'end': 4747.764, 'text': 'On Wednesday, you hear about generalized linear models.', 'start': 4745.762, 'duration': 2.002}], 'summary': "Newton's method is faster than gradient descent for small parameter numbers.", 'duration': 24.437, 'max_score': 4723.327, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ4723327.jpg'}], 'start': 3641.958, 'title': "Logistic regression algorithm and newton's method", 'summary': "Covers logistic regression algorithm, including updating theta, log likelihood function's concave nature, and guarantee of single global maximum, as well as the relationship between h of theta, normal equations, and iterations of newton's method for optimizing theta in a faster manner, and the quadratic convergence of newton's method for increased accuracy.", 'chapters': [{'end': 3840.377, 'start': 3641.958, 'title': 'Logistic regression algorithm', 'summary': 'Explains the logistic regression algorithm, including the process of updating theta, the concave nature of the log likelihood function, and the guarantee of a single global maximum in the likelihood function for logistic regression.', 'duration': 198.419, 'highlights': ['The likelihood function for logistic regression always has a concave shape, ensuring a single global maximum, eliminating local optimal 
problems.', 'The process of updating Theta in logistic regression involves a learning rate Alpha and the partial derivative with respect to Theta j of the log likelihood.', 'The algorithm for logistic regression is similar to linear regression, but the concave nature and global maximum guarantee make it suitable for classification problems.']}, {'end': 4459.519, 'start': 3840.417, 'title': "Generalized linear models & newton's method", 'summary': "Highlights the relationship between the definition of h of theta, the inapplicability of the normal equations to logistic regression, and the application and iterations of newton's method, which is useful for optimizing the value of theta in a faster and more efficient manner, ultimately leading to a detailed explanation of the algorithm using a simplified one-dimensional problem.", 'duration': 619.102, 'highlights': ["Newton's method allows for faster convergence than gradient ascent, with potential to require significantly fewer iterations for optimizing the value of Theta, albeit with more expensive individual iterations.", "The normal equations do not provide a closed form solution for finding the best value of Theta in logistic regression, necessitating the use of iterative optimization algorithms such as gradient ascent or Newton's method.", 'The historical divergence and confusion between different algorithms in the early history of machine learning led to debates among academics regarding their inventions, eventually clarifying the distinction between the algorithms.']}, {'end': 4769.457, 'start': 4459.519, 'title': "Newton's method convergence", 'summary': "Discusses newton's method, which exhibits quadratic convergence, resulting in a significant increase in accuracy after each iteration, making it a faster algorithm than gradient descent for problems with a manageable computational cost.", 'duration': 309.938, 'highlights': ["Newton's method exhibits quadratic convergence, meaning that the error is roughly squared on each iteration (e.g. from 0.01 to 0.0001), resulting in rapid convergence near the optimum.", "In high-dimensional problems, Newton's method becomes more expensive due to the computational cost of inverting large matrices, making it less suitable for problems with a large number of parameters.", "For problems with a manageable computational cost per iteration and a small number of parameters, Newton's method converges in a very small number of iterations and can be much faster than gradient descent."]}], 'duration': 1127.499, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/het9HFqo1TQ/pics/het9HFqo1TQ3641958.jpg', 'highlights': ['The likelihood function for logistic regression always has a concave shape, ensuring a single global maximum, eliminating local optimal problems.', "Newton's method exhibits quadratic convergence, meaning that the error is roughly squared on each iteration (e.g. from 0.01 to 0.0001), resulting in rapid convergence near the optimum.", 'The process of updating Theta in logistic regression involves a learning rate Alpha and the partial derivative with respect to Theta j of the log likelihood.', "Newton's method allows for faster convergence than gradient ascent, with potential to require significantly fewer iterations for optimizing the value of Theta, albeit with more expensive individual iterations.", 'The algorithm for logistic regression is similar to linear regression, but the concave nature and global maximum guarantee make it suitable for classification problems.', "For problems with a manageable computational cost per iteration and a small number of parameters, Newton's method converges in a very small number of iterations and can be much faster than gradient descent.", "The normal equations do not provide a closed form solution for finding the best value of Theta in logistic regression, necessitating the use of iterative optimization algorithms such as gradient ascent or Newton's method.", "In high-dimensional problems, Newton's 
method becomes more expensive due to the computational cost of inverting large matrices, making it less suitable for problems with a large number of parameters.", 'The historical divergence and confusion between different algorithms in the early history of machine learning led to debates among academics regarding their inventions, eventually clarifying the distinction between the algorithms.']}], 'highlights': ["Newton's method exhibits quadratic convergence, roughly squaring the error on each iteration.", 'Introduction of logistic regression as the most commonly used classification algorithm.', 'Locally weighted regression is a non-parametric learning algorithm whose storage and computation grow linearly with the size of the training set.', 'The likelihood function for logistic regression always has a concave shape, ensuring a single global maximum.', 'The process of maximizing the log likelihood is explained as equivalent to maximizing the likelihood.']}
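The Newton updates summarized above, Theta := Theta - l'(Theta)/l''(Theta) in one dimension and Theta := Theta - H^(-1) * gradient with H the Hessian in the vector case, can be sketched for logistic regression. This is an illustrative implementation under the assumption that the Hessian of the log likelihood is H = -X^T diag(h(1-h)) X, on made-up, non-separable data; it is not code from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newtons_method(X, y, iters=10):
    # Theta_{t+1} = Theta_t - H^{-1} grad, where grad = X^T (y - h) is the
    # gradient of the log likelihood and H = -X^T diag(h(1-h)) X its Hessian.
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        H = -(X.T * (h * (1.0 - h))) @ X
        # Solving an n-by-n system each step is the per-iteration cost
        # the lecture warns about when Theta is high-dimensional.
        theta -= np.linalg.solve(H, grad)
    return theta

# Made-up, non-separable 1-feature data (intercept column included),
# so the maximum likelihood estimate is finite.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = newtons_method(X, y)
grad_norm = np.linalg.norm(X.T @ (y - sigmoid(X @ theta)))
print(grad_norm < 1e-6)  # gradient vanishes after only a handful of iterations
```

With only 10 iterations the gradient at the returned Theta is already numerically zero, illustrating the rapid convergence the lecture describes; the trade-off is that each step inverts (here, solves against) an n-by-n Hessian rather than taking a cheap gradient step.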