title
Lecture 09 - The Linear Model II
description
The Linear Model II - More about linear models. Logistic regression, maximum likelihood, and gradient descent. Lecture 9 of 18 of Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. View course materials in the iTunes U Course App - https://itunes.apple.com/us/course/machine-learning/id515364596 - and on the course website - http://work.caltech.edu/telecourse.html
Produced in association with Caltech Academic Media Technologies under the Attribution-NonCommercial-NoDerivs Creative Commons License (CC BY-NC-ND). To learn more about this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/
This lecture was recorded on May 1, 2012, in Hameetman Auditorium at Caltech, Pasadena, CA, USA.
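The lecture's three main pieces - the logistic regression model theta(w^T x) = e^s/(1+e^s), the maximum-likelihood (cross-entropy) in-sample error, and fixed-learning-rate gradient descent - can be summarized in a short sketch. The code below is an illustrative assumption, not course code: the function names, the toy data, and the parameter choices (eta, n_iters) are hypothetical, but the formulas follow the quantities discussed in the lecture.

```python
import numpy as np

def logistic(s):
    # theta(s) = e^s / (1 + e^s), the soft threshold ("sigmoid")
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy_error(w, X, y):
    # E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n)), with y_n in {-1, +1}
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w, X, y):
    # grad E_in(w) = -(1/N) * sum_n y_n * x_n / (1 + exp(y_n * w^T x_n))
    return -(X * (y / (1.0 + np.exp(y * (X @ w))))[:, None]).mean(axis=0)

def logistic_regression(X, y, eta=0.1, n_iters=1000):
    # Batch gradient descent with a fixed learning rate eta on the convex
    # cross-entropy surface: w <- w - eta * grad E_in(w).
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= eta * gradient(w, X, y)
    return w

# Toy usage (hypothetical data): x0 = 1 absorbs the bias term; labels are +/-1.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, 3.0]])
y = np.array([-1.0, 1.0, -1.0, 1.0])
w = logistic_regression(X, y)
print(w, cross_entropy_error(w, X, y))
print(logistic(X @ w))  # interpreted as P(y = +1 | x), the probability output
```

The output of logistic(X @ w) is read as a probability, which is the interpretation the lecture builds the error measure and learning algorithm around.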
detail
{'title': 'Lecture 09 - The Linear Model II', 'heatmap': [{'end': 2201.034, 'start': 2145.782, 'weight': 0.742}, {'end': 2407.777, 'start': 2356.03, 'weight': 0.777}, {'end': 2722.72, 'start': 2615.913, 'weight': 0.709}, {'end': 3561, 'start': 3402.224, 'weight': 0.718}], 'summary': 'The lecture covers bias-variance trade-off, learning curves, linear and nonlinear models, logistic regression, error measures, gradient descent optimization, and neural network challenges, exploring their implications and applications in credit risk assessment and heart attack prediction.', 'chapters': [{'end': 783.126, 'segs': [{'end': 89.472, 'src': 'embed', 'start': 64.035, 'weight': 0, 'content': [{'end': 71.601, 'text': "And if you have a bigger hypothesis set, perhaps big enough to include your target, you don't have very much of a bias, perhaps none at all.", 'start': 64.035, 'duration': 7.566}, {'end': 79.527, 'text': 'But on the other hand, you do have variance depending on which hypothesis you zoom in, based on the data set you have.', 'start': 72.221, 'duration': 7.306}, {'end': 89.472, 'text': 'In the bias-variance decomposition, we basically had two hops from the final hypothesis you produced, based on a particular data,', 'start': 80.986, 'duration': 8.486}], 'summary': 'Bias-variance decomposition shows tradeoff between bias and variance in hypothesis set.', 'duration': 25.437, 'max_score': 64.035, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs64035.jpg'}, {'end': 178.7, 'src': 'embed', 'start': 151.731, 'weight': 1, 'content': [{'end': 157.034, 'text': 'Not surprisingly, as you increase the number of examples, The out-of-sample error goes down.', 'start': 151.731, 'duration': 5.303}, {'end': 160.955, 'text': 'If you have more examples to learn from, you are likely to perform better out-of-sample.', 'start': 157.134, 'duration': 3.821}, {'end': 168.377, 'text': 'And another interesting observation is that when you have fewer examples, the in-sample error goes down.', 'start': 161.975, 'duration': 6.402}, {'end': 173.378, 'text': 'And that is because you are fitting fewer examples, and you have the same resources to fit.', 'start': 168.737, 'duration': 4.641}, {'end': 174.779, 'text': 'So you tend to fit them better.', 'start': 173.698, 'duration': 1.081}, {'end': 178.7, 'text': 'And the discrepancy between them describes the generalization error.', 'start': 175.499, 'duration': 3.201}], 'summary': 'Increasing examples reduces out-of-sample error; fewer examples decrease in-sample error, affecting generalization error.', 'duration': 26.969, 'max_score': 151.731, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs151731.jpg'}, {'end': 283.139, 'src': 'embed', 'start': 256.795, 'weight': 2, 'content': [{'end': 267.079, 'text': 'And therefore we basically have a certain rule that you need examples in proportion to the VC dimension or to the effective degrees of freedom in order to do generalization.', 'start': 256.795, 'duration': 10.284}, {'end': 269.879, 'text': 'And the more you have, the better performance you get.', 'start': 267.579, 'duration': 2.3}, {'end': 271.72, 'text': 'This is the key observation.', 'start': 270.079, 'duration': 1.641}, {'end': 276.135, 'text': "So today, I'm going to start a series of techniques.", 'start': 273.314, 'duration': 2.821}, {'end': 283.139, 'text': 'And today is special, because the techniques of the linear models have already been covered in 
part.', 'start': 276.996, 'duration': 6.143}], 'summary': 'Examples in proportion to vc dimension for better generalization. linear model techniques already covered.', 'duration': 26.344, 'max_score': 256.795, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs256795.jpg'}, {'end': 371.909, 'src': 'embed', 'start': 324.417, 'weight': 3, 'content': [{'end': 325.898, 'text': 'It will be called logistic regression.', 'start': 324.417, 'duration': 1.481}, {'end': 332.722, 'text': 'And then, for all of these linear models, we have this nice trick called nonlinear transforms,', 'start': 327.119, 'duration': 5.603}, {'end': 342.906, 'text': 'that allows us to use the learning algorithms of linear models, which are very simple ones, and apply them to nonlinear transformation.', 'start': 332.722, 'duration': 10.184}, {'end': 349.149, 'text': 'And if you remember, the observation here was that linearity in the parameters was the key issue for deriving the algorithm.', 'start': 343.246, 'duration': 5.903}, {'end': 353.791, 'text': "So let's see what we finished, and what we didn't finish in these topics.", 'start': 350.409, 'duration': 3.382}, {'end': 357.701, 'text': 'Linear classification is pretty much done.', 'start': 355.72, 'duration': 1.981}, {'end': 360.383, 'text': 'We know the algorithm, perceptron or pocket.', 'start': 358.081, 'duration': 2.302}, {'end': 362.144, 'text': 'There are obviously more sophisticated algorithms.', 'start': 360.403, 'duration': 1.741}, {'end': 365.365, 'text': 'And we did the generalization analysis.', 'start': 362.924, 'duration': 2.441}, {'end': 371.909, 'text': 'We got the VC dimension of perceptrons explicitly, and therefore we are able to predict the generalization ability of linear classification.', 'start': 365.406, 'duration': 6.503}], 'summary': 'Logistic regression for linear classification finished, vc dimension obtained for perceptrons.', 'duration': 47.492, 'max_score': 324.417, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs324417.jpg'}, {'end': 487.596, 'src': 'embed', 'start': 459.358, 'weight': 5, 'content': [{'end': 464.84, 'text': 'And now we are going to transform this into another space using a transformation we called phi.', 'start': 459.358, 'duration': 5.482}, {'end': 468.662, 'text': 'This takes us into the Z space, or the feature space.', 'start': 465.501, 'duration': 3.161}, {'end': 473.245, 'text': 'So each of these guys is derived from the row input X.', 'start': 469.223, 'duration': 4.022}, {'end': 479.95, 'text': 'And the transformation we have can be quite general, if you look at it.', 'start': 474.485, 'duration': 5.465}, {'end': 487.596, 'text': 'Any one of these coordinates can be an arbitrary nonlinear transformation of the entire vector x.', 'start': 480.79, 'duration': 6.806}], 'summary': 'Transformation called phi takes input x to z space, allowing arbitrary nonlinear transformations.', 'duration': 28.238, 'max_score': 459.358, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs459358.jpg'}, {'end': 530.609, 'src': 'embed', 'start': 504.269, 'weight': 6, 'content': [{'end': 509.533, 'text': 'As a matter of fact, when we move to support vector machines, we will be able to go to infinite dimensional feature space,', 'start': 504.269, 'duration': 5.264}, {'end': 512.554, 'text': 'which is an interesting generalization.', 'start': 509.533, 'duration': 3.021}, 
{'end': 515.116, 'text': 'So each of them is a general transformation.', 'start': 512.876, 'duration': 2.24}, {'end': 518.34, 'text': 'And therefore the small phi.', 'start': 515.678, 'duration': 2.662}, {'end': 525.385, 'text': 'i is a member of the big transformation capital, Phi, that takes the vector x and produces the vector z working in the z space.', 'start': 518.34, 'duration': 7.045}, {'end': 527.207, 'text': "So that's the transform.", 'start': 526.346, 'duration': 0.861}, {'end': 530.609, 'text': 'An example for that which we used was second order.', 'start': 528.007, 'duration': 2.602}], 'summary': 'Support vector machines can operate in infinite dimensional feature space, providing a powerful generalization.', 'duration': 26.34, 'max_score': 504.269, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs504269.jpg'}, {'end': 685.247, 'src': 'embed', 'start': 658.079, 'weight': 8, 'content': [{'end': 663.04, 'text': 'And we realize that d plus 1 free parameters correspond directly to a VC dimension.', 'start': 658.079, 'duration': 4.961}, {'end': 673.102, 'text': 'In the case of the Z case, the feature space, we have potentially a longer vector, much longer possibly.', 'start': 664.22, 'duration': 8.882}, {'end': 676.603, 'text': 'And the dimensionality here is d tilde.', 'start': 673.822, 'duration': 2.781}, {'end': 678.503, 'text': "That's the notation we give for it.", 'start': 676.903, 'duration': 1.6}, {'end': 681.605, 'text': 'And the vector that will apply here will be w tilde.', 'start': 679.063, 'duration': 2.542}, {'end': 685.247, 'text': 'That will be a much longer vector in general than w.', 'start': 681.625, 'duration': 3.622}], 'summary': 'D+1 free parameters correspond to vc dimension in feature space, possibly much longer.', 'duration': 27.168, 'max_score': 658.079, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs658079.jpg'}], 'start': 1.678, 'title': 'Bias-variance trade-off and linear models', 'summary': 'Discusses bias-variance decomposition and the trade-off between bias and variance, introduces learning curves, and covers linear models including logistic regression and nonlinear transforms. it emphasizes generalization issues and algorithms for linear classification and regression, along with the impact of nonlinear transforms and support vector machines on generalization performance and vc dimension.', 'chapters': [{'end': 276.135, 'start': 1.678, 'title': 'Bias-variance decomposition and generalization', 'summary': 'Discusses the bias-variance decomposition, illustrating a trade-off between bias and variance, and introduces the concept of learning curves to analyze the in-sample and out-of-sample error as the sample size increases, demonstrating a proportional relationship between the number of examples and generalization performance.', 'duration': 274.457, 'highlights': ['The bias-variance decomposition illustrates a trade-off between bias and variance, where a small hypothesis set leads to significant bias, while a larger hypothesis set reduces bias but introduces variance. 
The bias-variance decomposition demonstrates the trade-off between bias and variance, where a small hypothesis set results in significant bias, while a larger hypothesis set reduces bias but introduces variance.', 'Learning curves depict the relationship between in-sample and out-of-sample error as the number of examples increases, showing that increasing the sample size decreases out-of-sample error and leads to decreased in-sample error due to better fitting with fewer examples. Learning curves illustrate the relationship between in-sample and out-of-sample error, indicating that increasing the sample size decreases out-of-sample error and decreases in-sample error due to better fitting with fewer examples.', 'The chapter emphasizes that the number of examples required to achieve a certain performance is proportional to the VC dimension or the effective degrees of freedom, leading to better generalization performance with a larger number of examples. The chapter stresses that the number of examples needed for a certain performance is proportional to the VC dimension or the effective degrees of freedom, resulting in better generalization performance with a larger number of examples.']}, {'end': 503.949, 'start': 276.996, 'title': 'Linear models: logistic regression & nonlinear transforms', 'summary': 'Covers the completion of linear models, including logistic regression and nonlinear transforms, with an emphasis on generalization issues and algorithms for linear classification and regression.', 'duration': 226.953, 'highlights': ['The completion of linear models includes logistic regression and nonlinear transforms, addressing generalization issues and algorithms for linear classification and regression.', 'Linear classification and regression algorithms are covered, including perceptron, pocket, and pseudo-inverse, with explicit VC dimension and generalization analysis.', 'An emphasis on generalization issues for nonlinear transforms is highlighted, including the transformation from X space to Z space using a transformation called phi, which allows for arbitrary nonlinear transformations and variable feature lengths.']}, {'end': 783.126, 'start': 504.269, 'title': 'Nonlinear transforms and generalization', 'summary': 'Discusses the use of nonlinear transforms, specifically focusing on support vector machines and their ability to create more sophisticated surfaces in the x space while using linear techniques, along with the price paid for generalization, illustrated through the increase in vc dimension when using a feature space transformation.', 'duration': 278.857, 'highlights': ['Support vector machines enable infinite dimensional feature space, providing a general transformation and the ability to create more sophisticated surfaces in the X space while using linear techniques. Support vector machines allow for an infinite dimensional feature space and the creation of more sophisticated surfaces in the X space using linear techniques.', 'The price paid for generalization in terms of VC dimension significantly increases when using a feature space transformation, potentially limiting the ability to generalize effectively. The increase in VC dimension when using a feature space transformation can limit the ability to generalize effectively.', 'The dimensionality of the weight vector in the feature space can be much longer than in the original space, leading to a substantial increase in VC dimension and a potential hindrance to generalization. 
The dimensionality of the weight vector in the feature space can be much longer than in the original space, leading to a substantial increase in VC dimension and a potential hindrance to generalization.']}], 'duration': 781.448, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs1678.jpg', 'highlights': ['The bias-variance decomposition illustrates a trade-off between bias and variance, where a small hypothesis set leads to significant bias, while a larger hypothesis set reduces bias but introduces variance.', 'Learning curves depict the relationship between in-sample and out-of-sample error as the number of examples increases, showing that increasing the sample size decreases out-of-sample error and leads to decreased in-sample error due to better fitting with fewer examples.', 'The number of examples required to achieve a certain performance is proportional to the VC dimension or the effective degrees of freedom, leading to better generalization performance with a larger number of examples.', 'The completion of linear models includes logistic regression and nonlinear transforms, addressing generalization issues and algorithms for linear classification and regression.', 'Linear classification and regression algorithms are covered, including perceptron, pocket, and pseudo-inverse, with explicit VC dimension and generalization analysis.', 'An emphasis on generalization issues for nonlinear transforms is highlighted, including the transformation from X space to Z space using a transformation called phi, which allows for arbitrary nonlinear transformations and variable feature lengths.', 'Support vector machines enable infinite dimensional feature space, providing a general transformation and the ability to create more sophisticated surfaces in the X space while using linear techniques.', 'The price paid for generalization in terms of VC dimension significantly increases when using a feature space transformation, potentially limiting the ability to generalize effectively.', 'The dimensionality of the weight vector in the feature space can be much longer than in the original space, leading to a substantial increase in VC dimension and a potential hindrance to generalization.']}, {'end': 1440.654, 'segs': [{'end': 828.557, 'src': 'embed', 'start': 783.126, 'weight': 0, 'content': [{'end': 790.793, 'text': "So let's apply this to two cases where we use nonlinear transformations in order to appreciate in practical terms what is the price we pay.", 'start': 783.126, 'duration': 7.667}, {'end': 796.637, 'text': 'The first non-separable case is a pretty easy one.', 'start': 793.895, 'duration': 2.742}, {'end': 798.199, 'text': "It's almost separable.", 'start': 797.178, 'duration': 1.021}, {'end': 803.21, 'text': 'Except for some points that you can consider, maybe outliers.', 'start': 799.409, 'duration': 3.801}, {'end': 806.631, 'text': 'This red point is in the blue region, this blue in the red region.', 'start': 803.35, 'duration': 3.281}, {'end': 809.252, 'text': 'But otherwise, everything can be classified linearly.', 'start': 807.011, 'duration': 2.241}, {'end': 814.093, 'text': 'So one may think of this case, this case is really linearly separable, and we just have a bunch of outliers.', 'start': 809.712, 'duration': 4.381}, {'end': 819.575, 'text': "Maybe we shouldn't use nonlinear transforms, just settle for the linear transforms.", 'start': 814.373, 'duration': 5.202}, {'end': 820.615, 'text': 'We will talk about that.', 'start': 
819.775, 'duration': 0.84}, {'end': 825.356, 'text': 'So this is one class of things that we go when we look at nonlinear transforms.', 'start': 820.955, 'duration': 4.401}, {'end': 828.557, 'text': 'The other one is genuinely nonlinear.', 'start': 826.376, 'duration': 2.181}], 'summary': 'Nonlinear transformations have different practical implications, including dealing with outliers and genuinely nonlinear cases.', 'duration': 45.431, 'max_score': 783.126, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs783126.jpg'}, {'end': 875.512, 'src': 'embed', 'start': 852.585, 'weight': 4, 'content': [{'end': 862.05, 'text': 'You can use a linear model in the X space, in the input space that you have, and then accept that the in-sample error will be positive.', 'start': 852.585, 'duration': 9.465}, {'end': 864.192, 'text': "It's not going to be 0.", 'start': 862.23, 'duration': 1.962}, {'end': 865.492, 'text': "So in this case, here's the picture.", 'start': 864.192, 'duration': 1.3}, {'end': 871.876, 'text': 'There is an in-sample error, because this guy is erroneously classified, and this guy is erroneously classified by your hypothesis.', 'start': 866.012, 'duration': 5.864}, {'end': 875.512, 'text': 'So this is option number 1.', 'start': 873.256, 'duration': 2.256}], 'summary': 'Using a linear model in the input space results in non-zero in-sample error.', 'duration': 22.927, 'max_score': 852.585, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs852585.jpg'}, {'end': 923.842, 'src': 'embed', 'start': 898.544, 'weight': 1, 'content': [{'end': 903.967, 'text': 'And in order to actually be able to classify a human surface, believe it or not, you are not going to be able to do it with a second-order surface.', 'start': 898.544, 'duration': 5.423}, {'end': 906.409, 'text': 'or a third-order surface.', 'start': 905.268, 'duration': 1.141}, {'end': 910.352, 'text': 'You will have to go to a fourth-order surface in order to get it all right.', 'start': 907.37, 'duration': 2.982}, {'end': 913.234, 'text': 'And when you do that, this is what you get.', 'start': 911.112, 'duration': 2.122}, {'end': 923.842, 'text': "Now, you don't need the VC analysis to realize that this is an overkill, and this doesn't have a very good chance of generalizing.", 'start': 916.616, 'duration': 7.226}], 'summary': "To classify a human surface accurately, a fourth-order surface is needed, but it's an overkill and has low generalization chance.", 'duration': 25.298, 'max_score': 898.544, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs898544.jpg'}, {'end': 967.676, 'src': 'embed', 'start': 938.416, 'weight': 5, 'content': [{'end': 943.9, 'text': 'So in this case, that is a straightforward application of the approximation-generalization tradeoff.', 'start': 938.416, 'duration': 5.484}, {'end': 945.801, 'text': 'We went to a more complex model.', 'start': 944.36, 'duration': 1.441}, {'end': 951.085, 'text': 'We were able to approximate the data better, but we are generalizing worse.', 'start': 946.522, 'duration': 4.563}, {'end': 953.266, 'text': 'So this has been completely covered already.', 'start': 951.385, 'duration': 1.881}, {'end': 958.41, 'text': 'So there is no surprise in this, other than the fact is to understand that at times,', 'start': 953.466, 'duration': 4.944}, {'end': 967.676, 'text': 'you might as well settle for a small training error in 
order not to use too high a complexity for the hypothesis set.', 'start': 958.41, 'duration': 9.266}], 'summary': 'Balancing complexity and generalization improves model performance.', 'duration': 29.26, 'max_score': 938.416, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs938416.jpg'}, {'end': 1404.519, 'src': 'embed', 'start': 1377.529, 'weight': 2, 'content': [{'end': 1381.131, 'text': 'And you can immediately make them not valid by doing these things.', 'start': 1377.529, 'duration': 3.602}, {'end': 1386.853, 'text': 'So this is the subject of data snooping.', 'start': 1382.651, 'duration': 4.202}, {'end': 1390.834, 'text': "And I'm not minimizing the idea of choosing a model.", 'start': 1387.433, 'duration': 3.401}, {'end': 1392.375, 'text': 'There will be ways to choose the model.', 'start': 1390.894, 'duration': 1.481}, {'end': 1395.296, 'text': 'When we talk about validation, model selection will be the order of the day.', 'start': 1392.415, 'duration': 2.881}, {'end': 1398.017, 'text': 'But it will be a legitimate means of model selection.', 'start': 1395.576, 'duration': 2.441}, {'end': 1401.218, 'text': "It's a model selection that does not contaminate the data.", 'start': 1398.037, 'duration': 3.181}, {'end': 1403.379, 'text': 'The data here was used to choose the model.', 'start': 1401.678, 'duration': 1.701}, {'end': 1404.519, 'text': "Therefore, it's contaminated.", 'start': 1403.399, 'duration': 1.12}], 'summary': 'Data snooping can contaminate model selection by using data to choose the model.', 'duration': 26.99, 'max_score': 1377.529, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs1377529.jpg'}], 'start': 783.126, 'title': 'Nonlinear transformations and model complexity', 'summary': 'Discusses the implications of nonlinear transformations on classification and the trade-offs between linear and nonlinear approaches. it also explores the tradeoff between model complexity and generalization, emphasizing the impact on in-sample errors, overfitting, and model performance validity.', 'chapters': [{'end': 897.584, 'start': 783.126, 'title': 'Nonlinear transformations and classification', 'summary': 'Discusses the implications of nonlinear transformations on classification, highlighting the trade-offs between linear and nonlinear approaches in separable and non-separable cases, and the impact on in-sample errors.', 'duration': 114.458, 'highlights': ['In non-separable cases, using nonlinear transformations may lead to a need for high-dimensional space to minimize in-sample errors, such as insisting on E n being 0. In non-separable cases, using nonlinear transformations may lead to a need for high-dimensional space to minimize in-sample errors, such as insisting on E n being 0.', 'Nonlinear transformations may be unnecessary in cases where data is almost separable, except for outliers, as linear classification may suffice. Nonlinear transformations may be unnecessary in cases where data is almost separable, except for outliers, as linear classification may suffice.', 'Using a linear model in the input space may result in positive in-sample error when dealing with almost linearly separable data, due to misclassification of points. 
Using a linear model in the input space may result in positive in-sample error when dealing with almost linearly separable data, due to misclassification of points.']}, {'end': 1440.654, 'start': 898.544, 'title': 'Model complexity and generalization tradeoff', 'summary': "Discusses the tradeoff between model complexity and generalization, emphasizing that higher model complexity may lead to overfitting and the pitfalls of data snooping, impacting the validity of the model's performance.", 'duration': 542.11, 'highlights': ['The tradeoff between model complexity and generalization is illustrated through the need for a fourth-order surface to classify a human surface, emphasizing the impact of higher complexity on generalization. The need for a fourth-order surface to classify a human surface demonstrates the tradeoff between model complexity and generalization, highlighting the impact of higher complexity on generalization.', 'The concept of overfitting due to excessively complex models is explained, with the warning that a more complex model may lead to poor generalization, demonstrated through the example of using a nonlinear transformation to simplify the model. The concept of overfitting due to excessively complex models is explained, with the warning that a more complex model may lead to poor generalization, demonstrated through the example of using a nonlinear transformation to simplify the model.', "The discussion of data snooping and its impact on model validity is emphasized, highlighting the dangers of looking at the data before choosing a model and its implications on the model's performance. The discussion of data snooping and its impact on model validity is emphasized, highlighting the dangers of looking at the data before choosing a model and its implications on the model's performance."]}], 'duration': 657.528, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs783126.jpg', 'highlights': ['In non-separable cases, using nonlinear transformations may lead to a need for high-dimensional space to minimize in-sample errors, such as insisting on E n being 0.', 'The tradeoff between model complexity and generalization is illustrated through the need for a fourth-order surface to classify a human surface, emphasizing the impact of higher complexity on generalization.', "The discussion of data snooping and its impact on model validity is emphasized, highlighting the dangers of looking at the data before choosing a model and its implications on the model's performance.", 'Nonlinear transformations may be unnecessary in cases where data is almost separable, except for outliers, as linear classification may suffice.', 'Using a linear model in the input space may result in positive in-sample error when dealing with almost linearly separable data, due to misclassification of points.', 'The concept of overfitting due to excessively complex models is explained, with the warning that a more complex model may lead to poor generalization, demonstrated through the example of using a nonlinear transformation to simplify the model.']}, {'end': 2305.618, 'segs': [{'end': 1466.349, 'src': 'embed', 'start': 1443.038, 'weight': 2, 'content': [{'end': 1450.041, 'text': 'Now we move into the main topic of the lecture, which is logistic regression, which is a very important linear model.', 'start': 1443.038, 'duration': 7.003}, {'end': 1456.624, 'text': 'And it complements the two models we have seen so far, linear classification, the perceptron, and 
linear regression.', 'start': 1450.421, 'duration': 6.203}, {'end': 1457.665, 'text': 'And there are three pieces.', 'start': 1456.664, 'duration': 1.001}, {'end': 1459.586, 'text': "First, I'm going to describe the model.", 'start': 1458.245, 'duration': 1.341}, {'end': 1461.627, 'text': "What is the hypothesis set that I'm trying to implement?", 'start': 1459.646, 'duration': 1.981}, {'end': 1466.349, 'text': 'And then we are going to devise an error measure for it, which is a pretty interesting error measure.', 'start': 1462.467, 'duration': 3.882}], 'summary': 'Introduction to logistic regression, a key linear model complementing previous models, with a focus on describing the model and devising an error measure.', 'duration': 23.311, 'max_score': 1443.038, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs1443038.jpg'}, {'end': 1669.833, 'src': 'embed', 'start': 1639.922, 'weight': 0, 'content': [{'end': 1641.863, 'text': 'So it has something of the linear regression.', 'start': 1639.922, 'duration': 1.941}, {'end': 1648.226, 'text': 'And the main utility of logistic regression is that the output is going to be interpreted as a probability.', 'start': 1643.084, 'duration': 5.142}, {'end': 1652.789, 'text': 'And that will cover a lot of problems where we want to estimate the probability of something.', 'start': 1649.067, 'duration': 3.722}, {'end': 1656.364, 'text': "So let's be specific.", 'start': 1654.363, 'duration': 2.001}, {'end': 1659.947, 'text': "Let's look at the logistic function theta, the nonlinearity I talked about.", 'start': 1656.965, 'duration': 2.982}, {'end': 1662.308, 'text': 'It looks like this.', 'start': 1661.568, 'duration': 0.74}, {'end': 1669.833, 'text': 'It can serve as a probability, because it goes here from 0 to 1.', 'start': 1664.41, 'duration': 5.423}], 'summary': 'Logistic regression outputs interpreted as probability, useful for estimating probabilities in problems.', 'duration': 29.911, 'max_score': 1639.922, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs1639922.jpg'}, {'end': 1856.179, 'src': 'embed', 'start': 1794.704, 'weight': 1, 'content': [{'end': 1801.087, 'text': 'And then I can say 0.8, 0.7, 0.3, and let the bank decide what to do according to this probability.', 'start': 1794.704, 'duration': 6.383}, {'end': 1803.608, 'text': 'Do we extend credit? How much credit to do? 
And so on.', 'start': 1801.347, 'duration': 2.261}, {'end': 1805.229, 'text': 'So there is a utility for that.', 'start': 1804.128, 'duration': 1.101}, {'end': 1808.49, 'text': 'And the soft threshold reflects the uncertainty.', 'start': 1806.389, 'duration': 2.101}, {'end': 1815.514, 'text': 'Seldom do we know the binary decision with certainty, and it might be more information to give you the uncertainty as part of the deal.', 'start': 1808.971, 'duration': 6.543}, {'end': 1818.135, 'text': 'And that is reflected in this soft threshold.', 'start': 1815.774, 'duration': 2.361}, {'end': 1825.459, 'text': "It's also called sigmoid for a simple reason, because it looks like a flattened-out S.", 'start': 1819.136, 'duration': 6.323}, {'end': 1827.399, 'text': 'So this is an S.', 'start': 1826.079, 'duration': 1.32}, {'end': 1830.3, 'text': "So you'll hear sigmoidal function, or soft threshold, and whatnot.", 'start': 1827.399, 'duration': 2.901}, {'end': 1833.761, 'text': 'And there is more than one sigmoidal function or soft threshold.', 'start': 1830.64, 'duration': 3.121}, {'end': 1834.942, 'text': 'So I told you this one formula.', 'start': 1833.781, 'duration': 1.161}, {'end': 1835.762, 'text': 'There are other formulas.', 'start': 1834.962, 'duration': 0.8}, {'end': 1839.903, 'text': 'In fact, when we go to neural networks, there will be another formula that is very closely related.', 'start': 1836.442, 'duration': 3.461}, {'end': 1841.924, 'text': 'And we can invent other formulas as well.', 'start': 1840.223, 'duration': 1.701}, {'end': 1843.984, 'text': 'So this is the logistic function.', 'start': 1842.864, 'duration': 1.12}, {'end': 1844.585, 'text': 'This is the model.', 'start': 1844.004, 'duration': 0.581}, {'end': 1845.565, 'text': 'So we know what the model does.', 'start': 1844.605, 'duration': 0.96}, {'end': 1848.586, 'text': 'The main idea is the probability interpretation.', 'start': 1846.585, 'duration': 2.001}, {'end': 1853.359, 'text': 'So we have the model.', 'start': 1851.198, 'duration': 2.161}, {'end': 1856.179, 'text': 'The model is you take the linear signal,', 'start': 1854.059, 'duration': 2.12}], 'summary': 'The transcript discusses using probabilities and soft thresholds in decision making, including the sigmoidal function, and its relevance to neural networks.', 'duration': 61.475, 'max_score': 1794.704, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs1794704.jpg'}, {'end': 2049.84, 'src': 'embed', 'start': 2022.287, 'weight': 5, 'content': [{'end': 2025.849, 'text': 'You can think of it as a risk score, if you will.', 'start': 2022.287, 'duration': 3.562}, {'end': 2031.532, 'text': 'Remember the credit score? 
We have the credit score, and then compared it to a threshold to decide, extend credit, not extend credit.', 'start': 2025.909, 'duration': 5.623}, {'end': 2033.113, 'text': 'So this is a risk score.', 'start': 2031.852, 'duration': 1.261}, {'end': 2040.217, 'text': 'Although we translate it to probability to make it meaningful, I can tell you, you add this up, and you are 700, you are in trouble.', 'start': 2033.393, 'duration': 6.824}, {'end': 2042.598, 'text': 'You are minus 200, you are in good shape.', 'start': 2040.957, 'duration': 1.641}, {'end': 2043.838, 'text': 'In general.', 'start': 2043.238, 'duration': 0.6}, {'end': 2046.579, 'text': 'But obviously, in order to interpret them in an operational way,', 'start': 2044.058, 'duration': 2.521}, {'end': 2049.84, 'text': 'you need to put them through the logistic in order to get a probability which can be interpreted.', 'start': 2046.579, 'duration': 3.261}], 'summary': 'A risk score translated to a probability, with 700 indicating trouble and -200 indicating good shape.', 'duration': 27.553, 'max_score': 2022.287, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs2022287.jpg'}, {'end': 2201.034, 'src': 'heatmap', 'start': 2145.782, 'weight': 0.742, 'content': [{'end': 2147.444, 'text': 'That is what noisy targets are.', 'start': 2145.782, 'duration': 1.662}, {'end': 2159.687, 'text': "And they have the form of a certain probability that the person gets a heart attack and a certain probability that they don't get a heart attack given their data.", 'start': 2148.682, 'duration': 11.005}, {'end': 2165.069, 'text': 'And this is generated by the target that I want to learn.', 'start': 2162.148, 'duration': 2.921}, {'end': 2168.35, 'text': "So I'm going to call the probability the target function itself.", 'start': 2165.469, 'duration': 2.881}, {'end': 2174.132, 'text': 'So the probability that someone gets a heart attack is f of x.', 'start': 2169.41, 'duration': 4.722}, {'end': 2177.814, 'text': "And the probability that they don't, it's a binary thing, has to be 1 minus f of x.", 'start': 2174.132, 'duration': 3.682}, {'end': 2181.75, 'text': "And I'm trying to learn f,", 'start': 2179.13, 'duration': 2.62}, {'end': 2191.332, 'text': 'notwithstanding the fact that the examples I am getting are giving me just sample values of y that happen to be generated by f.', 'start': 2181.75, 'duration': 9.582}, {'end': 2197.393, 'text': 'I want to take the examples, and then generate h that approximates the hidden target function.', 'start': 2191.332, 'duration': 6.061}, {'end': 2201.034, 'text': "Understood the game? 
That's why it's genuine probability.", 'start': 2198.873, 'duration': 2.161}], 'summary': 'Learning to approximate target function using noisy data.', 'duration': 55.252, 'max_score': 2145.782, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs2145782.jpg'}, {'end': 2286.305, 'src': 'embed', 'start': 2262.468, 'weight': 6, 'content': [{'end': 2270.634, 'text': 'So the question now becomes how do I choose the weights such that the logistic regression hypothesis reflects the target function?', 'start': 2262.468, 'duration': 8.166}, {'end': 2274.637, 'text': 'knowing that the target function is the way the examples were generated?', 'start': 2270.634, 'duration': 4.003}, {'end': 2275.517, 'text': "That's the game.", 'start': 2274.917, 'duration': 0.6}, {'end': 2279.76, 'text': "So let's talk about the error measure.", 'start': 2278.479, 'duration': 1.281}, {'end': 2286.305, 'text': 'Now again, remember in error measures, we had the proper way of generating an error measure.', 'start': 2280.741, 'duration': 5.564}], 'summary': 'Choosing weights for logistic regression to reflect the target function and error measure explanation.', 'duration': 23.837, 'max_score': 2262.468, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs2262468.jpg'}], 'start': 1443.038, 'title': 'Logistic regression models and applications', 'summary': 'Covers logistic regression models, error measures, learning algorithms, and their utility in interpreting outputs as probabilities. it also discusses soft threshold concepts, probability interpretation in decision-making, and logistic regression applications in credit card risk assessment and predicting heart attack probabilities based on various factors. additionally, it explains the use of logistic regression for risk assessment and the challenge of learning a hypothesis that approximates the hidden target function.', 'chapters': [{'end': 1753.216, 'start': 1443.038, 'title': 'Logistic regression model', 'summary': 'Covers the logistic regression model, including its components: model description, error measure, learning algorithm, and its utility in interpreting outputs as probabilities, providing a representative overview of machine learning.', 'duration': 310.178, 'highlights': ['The logistic regression model is different from linear classification, the perceptron, and linear regression, with a new hypothesis set, error measure, and learning algorithm. It introduces a new model, error measure, and learning algorithm distinct from previous linear models.', "Logistic regression applies a nonlinearity, the logistic function, to the linear signal, resulting in a real-valued output interpreted as a probability, with the nonlinearity serving as a probability function that ranges from 0 to 1. The logistic function transforms the signal into a real-valued output interpreted as a probability, providing a meaningful level of certainty about events based on the signal's value.", "The logistic function's formula involves exponentials and ratios, demonstrating how the function maps large positive signals to probabilities close to 1, large negative signals to probabilities close to 0, and signals of 0 to a probability of 0.5. 
The formula of the logistic function is explained, showing how it maps different signal values to corresponding probabilities ranging from 0 to 1."]}, {'end': 2022.247, 'start': 1753.216, 'title': 'Logistic regression and soft threshold', 'summary': 'Discusses the concept of soft threshold and logistic regression, emphasizing the probability interpretation in decision-making, with a focus on credit card applications and predicting the probability of heart attacks based on various factors.', 'duration': 269.031, 'highlights': ['Logistic regression and soft threshold are used to provide a probability interpretation in decision-making, allowing for better uncertainty reflection and utility in scenarios such as credit card applications. Probability interpretation, credit card applications', 'The concept of soft threshold, also known as sigmoid function, softens binary decisions, allowing for the prediction of probabilities in scenarios such as assessing the risk of heart attacks based on factors like cholesterol level and age. Risk assessment, probability prediction, factors influencing heart attacks', 'The linear signal in logistic regression is processed using importance weights for factors like age, cholesterol level, and other relevant features to predict the probability of heart attacks within a specified time horizon. Processing linear signal, importance weights, predicting probability of heart attacks']}, {'end': 2305.618, 'start': 2022.287, 'title': 'Understanding logistic regression for risk assessment', 'summary': 'Explains how logistic regression is used to assess risk by translating risk scores into probabilities, and the challenge of learning a hypothesis that approximates the hidden target function, with a focus on interpreting outputs as genuine probabilities.', 'duration': 283.331, 'highlights': ['Logistic regression is used to translate risk scores into probabilities, such that a score of 700 indicates trouble and a score of -200 indicates good shape. The risk score is translated into probabilities, with a score of 700 indicating trouble and a score of -200 indicating good shape.', 'The logistic regression output is treated as a probability, even during learning, due to the nature of the data, which does not directly provide probabilities. The output of logistic regression is treated as a probability during learning, as the given data does not provide direct probabilities.', 'Learning a hypothesis that approximates the hidden target function, f(x), is the main challenge in logistic regression, where sample values of y are generated by f(x). 
The main challenge in logistic regression is learning a hypothesis that approximates the hidden target function, f(x), using sample values of y generated by f(x).']}], 'duration': 862.58, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs1443038.jpg', 'highlights': ['Logistic regression applies a nonlinearity, the logistic function, to the linear signal, resulting in a real-valued output interpreted as a probability, with the nonlinearity serving as a probability function that ranges from 0 to 1.', "The logistic function's formula involves exponentials and ratios, demonstrating how the function maps large positive signals to probabilities close to 1, large negative signals to probabilities close to 0, and signals of 0 to a probability of 0.5.", 'The logistic regression model is different from linear classification, the perceptron, and linear regression, with a new hypothesis set, error measure, and learning algorithm. It introduces a new model, error measure, and learning algorithm distinct from previous linear models.', 'Logistic regression and soft threshold are used to provide a probability interpretation in decision-making, allowing for better uncertainty reflection and utility in scenarios such as credit card applications.', 'The concept of soft threshold, also known as sigmoid function, softens binary decisions, allowing for the prediction of probabilities in scenarios such as assessing the risk of heart attacks based on factors like cholesterol level and age.', 'Logistic regression is used to translate risk scores into probabilities, such that a score of 700 indicates trouble and a score of -200 indicates good shape.', 'Learning a hypothesis that approximates the hidden target function, f(x), is the main challenge in logistic regression, where sample values of y are generated by f(x).']}, {'end': 3082.367, 'segs': [{'end': 2349.59, 'src': 'embed', 'start': 2324.064, 'weight': 0, 'content': [{'end': 2328.606, 'text': "Well, it turns out that in this case, the error measure that I'm going to describe has both properties.", 'start': 2324.064, 'duration': 4.542}, {'end': 2330.747, 'text': "It's plausible and friendly.", 'start': 2328.906, 'duration': 1.841}, {'end': 2332.667, 'text': "It's a very popular error measure.", 'start': 2331.087, 'duration': 1.58}, {'end': 2334.208, 'text': "So let's construct it.", 'start': 2333.208, 'duration': 1}, {'end': 2345.148, 'text': 'For each point x and y, and remember that y is binary, plus or minus 1, that is generated by the target function f.', 'start': 2336.004, 'duration': 9.144}, {'end': 2346.188, 'text': 'y is generated by it.', 'start': 2345.148, 'duration': 1.04}, {'end': 2349.59, 'text': 'We have the following plausible error measure.', 'start': 2347.509, 'duration': 2.081}], 'summary': 'The error measure described is both plausible and friendly, and it is popular in binary classification.', 'duration': 25.526, 'max_score': 2324.064, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs2324064.jpg'}, {'end': 2421.282, 'src': 'heatmap', 'start': 2356.03, 'weight': 1, 'content': [{'end': 2360.252, 'text': 'Likelihood is a very established notion in statistics, not without controversy.', 'start': 2356.03, 'duration': 4.222}, {'end': 2363.273, 'text': "But nonetheless, it's very widely applied.", 'start': 2360.292, 'duration': 2.981}, {'end': 2372.277, 'text': 'And the idea of it is that I am going to grade different hypotheses according 
to the likelihood that they are actually the target that generated the data.', 'start': 2364.034, 'duration': 8.243}, {'end': 2374.598, 'text': "So let's be specific.", 'start': 2373.177, 'duration': 1.421}, {'end': 2382.427, 'text': "We assume that your current hypothesis, let's say that this was actually the target function, just for the moment.", 'start': 2376.124, 'duration': 6.303}, {'end': 2386.689, 'text': 'You have the data, right? The data was generated by the target function.', 'start': 2383.488, 'duration': 3.201}, {'end': 2391.832, 'text': 'So you can ask what is the probability of generating this data if your assumption is true?', 'start': 2387.35, 'duration': 4.482}, {'end': 2396.114, 'text': 'If that probability is very small, then your assumption must be poor.', 'start': 2393.073, 'duration': 3.041}, {'end': 2401.171, 'text': 'And if that probability is high, then your assumption has more plausibility.', 'start': 2397.828, 'duration': 3.343}, {'end': 2407.777, 'text': 'So I can use this to build a comparative way to saying that this is a more plausible hypothesis than another,', 'start': 2401.892, 'duration': 5.885}, {'end': 2414.282, 'text': 'because the data becomes more likely under a scenario of this hypothesis, rather than this hypothesis being the actual target function.', 'start': 2407.777, 'duration': 6.505}, {'end': 2416.498, 'text': 'So this is the idea.', 'start': 2415.697, 'duration': 0.801}, {'end': 2421.282, 'text': 'You ask yourself, how likely? And the difference, I said about controversy.', 'start': 2417.679, 'duration': 3.603}], 'summary': 'Likelihood is used to grade hypotheses based on probability of generating data, determining plausibility.', 'duration': 65.252, 'max_score': 2356.03, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs2356030.jpg'}, {'end': 2722.72, 'src': 'heatmap', 'start': 2615.913, 'weight': 0.709, 'content': [{'end': 2620.277, 'text': 'And the case minus 1, I have minus this, which is 1 minus, and that gives me this formula.', 'start': 2615.913, 'duration': 4.364}, {'end': 2622.479, 'text': "So it's summarized by this very simple case.", 'start': 2620.578, 'duration': 1.901}, {'end': 2628.185, 'text': 'So I have one example, x and y, and I want to get the likelihood of this w, given a single example.', 'start': 2622.82, 'duration': 5.365}, {'end': 2630.207, 'text': 'This would be the measure for that likelihood.', 'start': 2628.605, 'duration': 1.602}, {'end': 2632.408, 'text': "That's good.", 'start': 2632.028, 'duration': 0.38}, {'end': 2636.369, 'text': 'Now we have the likelihood of the entire data set.', 'start': 2633.609, 'duration': 2.76}, {'end': 2644.052, 'text': 'So someone gives you a bunch of patients, and whether they had a heart attack within 12 months of the measurements, and quite a number of them.', 'start': 2637.03, 'duration': 7.022}, {'end': 2647.473, 'text': 'And now I would like to say what is the likelihood of this entire data set?', 'start': 2644.412, 'duration': 3.061}, {'end': 2651.074, 'text': 'The assumption, as always, is the independence from one example to another.', 'start': 2648.073, 'duration': 3.001}, {'end': 2657.916, 'text': "And therefore, if I want to get the likelihood of the full data set, I'm going to simply magnify this.", 'start': 2651.614, 'duration': 6.302}, {'end': 2665.832, 'text': "This would be, I'm multiplying the likelihood of individual ones from n equals 1 to N, covering the data set.", 'start': 2660.29, 'duration': 5.542}, 
{'end': 2671.274, 'text': 'So now I need a formula for that.', 'start': 2669.274, 'duration': 2}, {'end': 2674.096, 'text': "It's ready, because I already have a formula for P of y given x.", 'start': 2671.334, 'duration': 2.762}, {'end': 2675.116, 'text': 'All I need to do is plug it.', 'start': 2674.096, 'duration': 1.02}, {'end': 2677.437, 'text': 'And when I plug it, I end up with this thing.', 'start': 2675.436, 'duration': 2.001}, {'end': 2682.379, 'text': "That's a very nice formula, because now you realize, I have a bunch of examples.", 'start': 2679.076, 'duration': 3.303}, {'end': 2685.781, 'text': "They have different plus or minus 1 that will come in here, different xn's.", 'start': 2682.419, 'duration': 3.362}, {'end': 2690.725, 'text': 'The same w of my hypothesis contributes to all of these terms.', 'start': 2686.322, 'duration': 4.403}, {'end': 2693.047, 'text': 'So now you can find that there will be a compromise.', 'start': 2691.345, 'duration': 1.702}, {'end': 2696.349, 'text': "If I choose w to favor one example, I'm messing up the other.", 'start': 2693.647, 'duration': 2.702}, {'end': 2697.77, 'text': 'So I have to find a compromise.', 'start': 2696.369, 'duration': 1.401}, {'end': 2700.252, 'text': 'And the compromise is likely to reflect that.', 'start': 2697.81, 'duration': 2.442}, {'end': 2706.336, 'text': 'I am catching something for the underlying probability distribution that generated these examples in the first place.', 'start': 2700.252, 'duration': 6.084}, {'end': 2713.136, 'text': "Now let's go for what happens when we maximize this likelihood.", 'start': 2709.654, 'duration': 3.482}, {'end': 2715.457, 'text': "We'll write it down, and then we'll take it.", 'start': 2713.396, 'duration': 2.061}, {'end': 2722.72, 'text': 'And the maximizing of likelihood will translate to the minimizing of an error measure as we know it, or as we have been familiar with.', 'start': 2715.797, 'duration': 6.923}], 'summary': 'The likelihood of the entire data set is obtained by multiplying the likelihood of individual examples, leading to a compromise in choosing the hypothesis.', 'duration': 106.807, 'max_score': 2615.913, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs2615913.jpg'}], 'start': 2306.098, 'title': 'Analytic error measure and likelihood-based hypothesis grading', 'summary': 'Discusses the construction of a popular error measure for binary classification tasks and explains the concept of grading hypotheses based on likelihood, deriving a likelihood measure for a hypothesis given a single data point, and the use of likelihood to derive the in-sample error measure for logistic regression.', 'chapters': [{'end': 2349.59, 'start': 2306.098, 'title': 'Analytic error measure for optimization', 'summary': 'Discusses the construction of a popular error measure for binary classification tasks, which is both plausible and friendly to the optimizer.', 'duration': 43.492, 'highlights': ['The error measure is friendly to the optimizer, making it easy to minimize. The error measure is designed to be friendly to the optimizer, ensuring that the process of minimizing it is straightforward.', 'The error measure is both plausible and friendly, making it a popular choice. The error measure is not only plausible but also friendly to the optimizer, contributing to its widespread popularity in binary classification tasks.', 'The error measure is constructed for binary classification tasks. 
The error measure is specifically tailored for binary classification tasks, where the target function generates binary outputs.']}, {'end': 3082.367, 'start': 2349.81, 'title': 'Likelihood-based hypothesis grading', 'summary': 'Explains the concept of grading hypotheses based on likelihood, deriving a likelihood measure for a hypothesis given a single data point, and the use of likelihood to derive the in-sample error measure for logistic regression.', 'duration': 732.557, 'highlights': ['The chapter explains the concept of grading hypotheses based on likelihood The likelihood of a hypothesis is evaluated based on the probability of generating the data under that assumption, allowing for a comparative way to determine the plausibility of different hypotheses.', 'Deriving a likelihood measure for a hypothesis given a single data point A likelihood measure for a hypothesis given a single data point is computed by assuming the data was generated by the hypothesis, and computing the probability based on the assumption.', 'The use of likelihood to derive the in-sample error measure for logistic regression Likelihood is used to derive the in-sample error measure for logistic regression, with the cross-entropy error measure being introduced as a measure to minimize the error in predictions.']}], 'duration': 776.269, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs2306098.jpg', 'highlights': ['The error measure is specifically tailored for binary classification tasks, where the target function generates binary outputs.', 'The likelihood of a hypothesis is evaluated based on the probability of generating the data under that assumption, allowing for a comparative way to determine the plausibility of different hypotheses.', 'The error measure is not only plausible but also friendly to the optimizer, contributing to its widespread popularity in binary classification tasks.']}, {'end': 3693.856, 'segs': [{'end': 3231.298, 'src': 'embed', 'start': 3199.283, 'weight': 2, 'content': [{'end': 3200.763, 'text': "So let's see what is there.", 'start': 3199.283, 'duration': 1.48}, {'end': 3206.084, 'text': 'First, let me show you what the error measure for logistic regression looks like.', 'start': 3201.323, 'duration': 4.761}, {'end': 3209.445, 'text': 'As you vary the weight, the value of the error differs.', 'start': 3206.224, 'duration': 3.221}, {'end': 3215.527, 'text': 'But it has this great property that it has one minimum, and otherwise it goes like that.', 'start': 3209.885, 'duration': 5.642}, {'end': 3217.548, 'text': 'A function that goes like that.', 'start': 3216.267, 'duration': 1.281}, {'end': 3223.212, 'text': "It's called convex, and it goes with convex optimization, which is very easy.", 'start': 3218.968, 'duration': 4.244}, {'end': 3227.515, 'text': 'Because obviously, wherever you start, you will go to the same valley and whatnot.', 'start': 3223.332, 'duration': 4.183}, {'end': 3231.298, 'text': 'You can imagine a more sophisticated nonlinear surface, where you do this and that.', 'start': 3227.816, 'duration': 3.482}], 'summary': 'Logistic regression error measure has convex optimization, with one minimum and easy convergence.', 'duration': 32.015, 'max_score': 3199.283, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3199283.jpg'}, {'end': 3278.041, 'src': 'embed', 'start': 3237.624, 'weight': 0, 'content': [{'end': 3241.287, 'text': 'And we will only tackle them when we 
need them, when we talk about the error measure for neural networks.', 'start': 3237.624, 'duration': 3.663}, {'end': 3246.788, 'text': "Right now, we have a very friendly guy, so we're going to describe gradient descent in terms of this friendly guy.", 'start': 3241.647, 'duration': 5.141}, {'end': 3253.43, 'text': "So what do you do with gradient descent? First, we admit that it's a general method for nonlinear optimization.", 'start': 3247.348, 'duration': 6.082}, {'end': 3258.991, 'text': 'And what you do is start at a point, initialization, pretty much like you initialize a perceptron.', 'start': 3254.15, 'duration': 4.841}, {'end': 3263.792, 'text': 'And then you take a step, and you try to make an improvement using that step.', 'start': 3259.451, 'duration': 4.341}, {'end': 3269.533, 'text': 'So the step is to take the step along the steepest slope.', 'start': 3264.372, 'duration': 5.161}, {'end': 3275.54, 'text': 'The steepest slope is not an easy notion to see in two dimensions, because I go right or left.', 'start': 3270.876, 'duration': 4.664}, {'end': 3276.78, 'text': "There aren't too many directions.", 'start': 3275.58, 'duration': 1.2}, {'end': 3278.041, 'text': "So let's do the following.", 'start': 3277.041, 'duration': 1}], 'summary': 'Describing gradient descent as a method for optimization and improvement in neural networks.', 'duration': 40.417, 'max_score': 3237.624, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3237624.jpg'}, {'end': 3324.933, 'src': 'embed', 'start': 3297.284, 'weight': 5, 'content': [{'end': 3301.705, 'text': "First thing to remember in optimization, you don't get to see the surface.", 'start': 3297.284, 'duration': 4.421}, {'end': 3304.826, 'text': "You don't have a bird's-eye view, and you look at it.", 'start': 3302.726, 'duration': 2.1}, {'end': 3306.727, 'text': 'That region looks good.', 'start': 3305.667, 'duration': 1.06}, {'end': 3307.287, 'text': "Let's go there.", 'start': 3306.747, 'duration': 0.54}, {'end': 3308.368, 'text': "That doesn't happen.", 'start': 3307.707, 'duration': 0.661}, {'end': 3312.449, 'text': 'You only have local information at the point you evaluated.', 'start': 3309.008, 'duration': 3.441}, {'end': 3318.151, 'text': 'So the best thing to imagine is that you are sitting on the surface, and then you close your eyes.', 'start': 3313.309, 'duration': 4.842}, {'end': 3324.933, 'text': 'And all you do is feel around you, and then decide this is a more promising direction than this.', 'start': 3319.271, 'duration': 5.662}], 'summary': 'Optimization involves utilizing local information to navigate the surface for promising directions.', 'duration': 27.649, 'max_score': 3297.284, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3297284.jpg'}, {'end': 3561, 'src': 'heatmap', 'start': 3402.224, 'weight': 0.718, 'content': [{'end': 3403.485, 'text': 'So this is the amount of moves.', 'start': 3402.224, 'duration': 1.261}, {'end': 3405.927, 'text': 'The only unknown I have is what is v.', 'start': 3403.505, 'duration': 2.422}, {'end': 3409.23, 'text': 'I already decided on the size, but I want to know which direction to go.', 'start': 3405.927, 'duration': 3.303}, {'end': 3417.857, 'text': 'And the formula would be, the next weight, which is w1, will be the current weight, plus the move.', 'start': 3410.531, 'duration': 7.326}, {'end': 3420.319, 'text': 'And I already decided on the move.', 'start': 3418.457, 
'duration': 1.862}, {'end': 3424.022, 'text': 'So now, under this condition, you are trying to derive what is v hat.', 'start': 3420.799, 'duration': 3.223}, {'end': 3427.466, 'text': 'That is the direction.', 'start': 3426.866, 'duration': 0.6}, {'end': 3430.848, 'text': 'If you solve for it in one method or another, that gives you gradient descent.', 'start': 3427.666, 'duration': 3.182}, {'end': 3434.669, 'text': 'In another method, it will give you conjugate gradient, which has second-order stuff in it, and so on.', 'start': 3431.208, 'duration': 3.461}, {'end': 3436.25, 'text': 'So that is always the question.', 'start': 3434.909, 'duration': 1.341}, {'end': 3439.611, 'text': "So let's actually try to solve for it.", 'start': 3436.85, 'duration': 2.761}, {'end': 3446.553, 'text': 'We said that we are going to go in the direction of the steepest descent.', 'start': 3442.312, 'duration': 4.241}, {'end': 3450.955, 'text': 'So we are really talking about change in the value of the error.', 'start': 3446.633, 'duration': 4.322}, {'end': 3460.006, 'text': 'The change of the value of the error, if I move from w0 to w1, would be e in at some point, minus e in at another point.', 'start': 3451.755, 'duration': 8.251}, {'end': 3462.488, 'text': "Which two points? It's w0 and w1.", 'start': 3460.127, 'duration': 2.361}, {'end': 3467.171, 'text': 'If I decide to move to this guy, this is the amount here.', 'start': 3464.629, 'duration': 2.542}, {'end': 3474.715, 'text': 'So what I want to do, I want this guy to be negative, as negative as possible, because I want to go down, by the proper choice of w1.', 'start': 3467.611, 'duration': 7.104}, {'end': 3475.716, 'text': 'But w1 is not free.', 'start': 3474.735, 'duration': 0.981}, {'end': 3481.74, 'text': "It's dictated by the method, and it has the very specific form that it is the original guy, plus the move I made.", 'start': 3475.796, 'duration': 5.944}, {'end': 3483.541, 'text': 'So this is what I would like to make.', 'start': 3482.2, 'duration': 1.341}, {'end': 3485.202, 'text': 'I would like to make this as small as possible.', 'start': 3483.641, 'duration': 1.561}, {'end': 3496.892, 'text': 'Now, if I can write this down using the Taylor series expansion with one term, this is E of the original point plus a move, minus the original point.', 'start': 3487.643, 'duration': 9.249}, {'end': 3499.615, 'text': 'That would also be the derivative times the difference.', 'start': 3497.293, 'duration': 2.322}, {'end': 3502.218, 'text': 'So the derivative times the difference here will be the gradient.', 'start': 3500.076, 'duration': 2.142}, {'end': 3505.585, 'text': 'Transpose times the vector times eta.', 'start': 3502.924, 'duration': 2.661}, {'end': 3508.047, 'text': 'And I just took eta outside here to make it clear.', 'start': 3505.685, 'duration': 2.362}, {'end': 3512.029, 'text': 'So this would be the move according to the first-order approximation of the surface.', 'start': 3508.287, 'duration': 3.742}, {'end': 3514.43, 'text': 'If the surface was linear, this would be exact.', 'start': 3512.409, 'duration': 2.021}, {'end': 3516.011, 'text': 'But the surface is not linear.', 'start': 3514.75, 'duration': 1.261}, {'end': 3520.573, 'text': 'And therefore, I have other terms which are of the order eta squared and up.', 'start': 3516.571, 'duration': 4.002}, {'end': 3525.795, 'text': "And the assumption for gradient descent is that I'm going to neglect this fellow, as if it didn't exist.", 'start': 3521.293, 'duration': 4.502}, {'end': 
3531.157, 'text': 'When you go to conjugate gradient, you will have the second guy, and you will neglect the third.', 'start': 3526.876, 'duration': 4.281}, {'end': 3534.539, 'text': 'And you can see the idea.', 'start': 3531.838, 'duration': 2.701}, {'end': 3537.32, 'text': "So now all I'm doing.", 'start': 3535.399, 'duration': 1.921}, {'end': 3541.482, 'text': 'how do I choose the direction in order to make this as negative as possible?', 'start': 3537.32, 'duration': 4.162}, {'end': 3553.224, 'text': 'By simple observation, I realize that this quantity for any choice of V hat will be greater than or equal to This fellow.', 'start': 3544.203, 'duration': 9.021}, {'end': 3556.812, 'text': "So this guy is gone, right? So I'm only dealing with this guy.", 'start': 3553.365, 'duration': 3.447}, {'end': 3561, 'text': "So I'm taking the inner product between a vector and a unit vector.", 'start': 3557.092, 'duration': 3.908}], 'summary': 'Deriving the direction for gradient descent and conjugate gradient methods.', 'duration': 158.776, 'max_score': 3402.224, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3402224.jpg'}, {'end': 3460.006, 'src': 'embed', 'start': 3431.208, 'weight': 3, 'content': [{'end': 3434.669, 'text': 'In another method, it will give you conjugate gradient, which has second-order stuff in it, and so on.', 'start': 3431.208, 'duration': 3.461}, {'end': 3436.25, 'text': 'So that is always the question.', 'start': 3434.909, 'duration': 1.341}, {'end': 3439.611, 'text': "So let's actually try to solve for it.", 'start': 3436.85, 'duration': 2.761}, {'end': 3446.553, 'text': 'We said that we are going to go in the direction of the steepest descent.', 'start': 3442.312, 'duration': 4.241}, {'end': 3450.955, 'text': 'So we are really talking about change in the value of the error.', 'start': 3446.633, 'duration': 4.322}, {'end': 3460.006, 'text': 'The change of the value of the error, if I move from w0 to w1, would be e in at some point, minus e in at another point.', 'start': 3451.755, 'duration': 8.251}], 'summary': 'Discussing conjugate gradient method for optimizing error value.', 'duration': 28.798, 'max_score': 3431.208, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3431208.jpg'}, {'end': 3657.576, 'src': 'embed', 'start': 3627.955, 'weight': 4, 'content': [{'end': 3635.401, 'text': "Now, we have said it's a fixed size, and that was a way for us to make sure that the linear approximation holds.", 'start': 3627.955, 'duration': 7.446}, {'end': 3638.023, 'text': 'We are going to modulate eta to be small.', 'start': 3635.561, 'duration': 2.462}, {'end': 3639.404, 'text': 'But you can see that there is a compromise.', 'start': 3638.083, 'duration': 1.321}, {'end': 3649.371, 'text': 'I can get a close-to-perfect approximation for linear by taking the size to be very small, but then it will take me forever to get to the minimum.', 'start': 3639.784, 'duration': 9.587}, {'end': 3649.992, 'text': "I'll be moving.", 'start': 3649.591, 'duration': 0.401}, {'end': 3657.576, 'text': 'Or I could be taking a bigger step, which looks very promising, but then the linear approximation may not apply.', 'start': 3651.973, 'duration': 5.603}], 'summary': 'Modulating eta to be small compromises linear approximation and convergence speed.', 'duration': 29.621, 'max_score': 3627.955, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3627955.jpg'}], 'start': 3083.187, 'title': 'Learning and iterative optimization', 'summary': 'Explores learning algorithms and gradient descent for minimizing error measures, emphasizing the iterative nature of the process. it also discusses iterative methods like gradient descent and conjugate gradient for optimization, focusing on the use of fixed step size and the compromise involved in modulating step size.', 'chapters': [{'end': 3318.151, 'start': 3083.187, 'title': 'Learning algorithms and gradient descent', 'summary': 'Explores the process of minimizing error measures through learning algorithms, focusing on the concept of gradient descent and its application to nonlinear optimization, particularly in logistic regression, and emphasizes the iterative nature of the process.', 'duration': 234.964, 'highlights': ['The chapter emphasizes the iterative nature of minimizing error measures through learning algorithms, particularly in logistic regression, and introduces the concept of gradient descent as a general method for nonlinear optimization. The chapter discusses the iterative nature of minimizing error measures through learning algorithms, particularly in logistic regression, and introduces the concept of gradient descent as a general method for nonlinear optimization.', 'The process of minimizing error measures through learning algorithms, particularly in logistic regression, is explained, highlighting the application of gradient descent and the concept of convex optimization. The process of minimizing error measures through learning algorithms, particularly in logistic regression, is explained, highlighting the application of gradient descent and the concept of convex optimization.', 'The application of gradient descent as a method for nonlinear optimization is detailed, emphasizing the process of starting at a point, taking a step, and making improvements along the steepest slope, with a focus on local information and the iterative nature of the process. 
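As a companion to the gradient descent description in this chapter (start at an initial point, then take fixed-size steps of size eta along the steepest slope, where the first-order Taylor expansion shows the best unit direction is minus the gradient divided by its norm), here is a minimal numerical sketch. It is not the course's code; the quadratic test surface, the step size, and the function names are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_steps=100):
    """Fixed-size steps along the direction of steepest descent.

    Each move is eta * v_hat, where v_hat = -grad / ||grad|| is the unit
    vector that makes the first-order change in the error as negative as
    possible (the Taylor-expansion argument in the lecture).
    """
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        g = grad(w)
        norm = np.linalg.norm(g)
        if norm < 1e-12:          # essentially flat: nothing left to descend
            break
        w += eta * (-g / norm)    # fixed step size, steepest direction
    return w

# Toy convex surface E(w) = ||w||^2, whose gradient is 2w.
w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0])
print(w_min)   # ends up close to the single minimum at the origin
```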
']}, {'end': 3693.856, 'start': 3319.271, 'title': 'Iterative methods in optimization', 'summary': 'The chapter discusses iterative methods like gradient descent and conjugate gradient for optimization, emphasizing the use of fixed step size and the compromise involved in modulating step size.', 'duration': 374.585, 'highlights': ['The chapter discusses iterative methods like gradient descent and conjugate gradient for optimization The transcript introduces iterative methods such as gradient descent and conjugate gradient for optimization.', 'Emphasizes the use of fixed step size and the compromise involved in modulating step size The chapter emphasizes the use of a fixed step size for moving in the w space and highlights the compromise involved in modulating the step size, balancing between linear approximation and convergence time.', 'Optimization evaluates the value and the time taken to arrive at the solution The transcript states that in optimization, the evaluation is based on the value arrived at and the time taken to reach the solution, highlighting the importance of balancing speed and accuracy in iterative methods.']}], 'duration': 610.669, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3083187.jpg', 'highlights': ['The chapter emphasizes the iterative nature of minimizing error measures through learning algorithms, particularly in logistic regression, and introduces the concept of gradient descent as a general method for nonlinear optimization.', 'The application of gradient descent as a method for nonlinear optimization is detailed, emphasizing the process of starting at a point, taking a step, and making improvements along the steepest slope, with a focus on local information and the iterative nature of the process.', 'The process of minimizing error measures through learning algorithms, particularly in logistic regression, is explained, highlighting the application of gradient descent and the concept of convex optimization.', 'The chapter discusses iterative methods like gradient descent and conjugate gradient for optimization.', 'Emphasizes the use of fixed step size and the compromise involved in modulating step size.', 'Optimization evaluates the value and the time taken to arrive at the solution, highlighting the importance of balancing speed and accuracy in iterative methods.']}, {'end': 4072.08, 'segs': [{'end': 3787.568, 'src': 'embed', 'start': 3726.942, 'weight': 0, 'content': [{'end': 3732.646, 'text': 'So, if you look at it, you realize that the best compromise is to have initially a large eta,', 'start': 3726.942, 'duration': 5.704}, {'end': 3735.648, 'text': 'because the thing is very steep and I want to take advantage of it.', 'start': 3732.646, 'duration': 3.002}, {'end': 3739.751, 'text': "And just become more careful when I'm closer to the minimum, so that I don't bounce.", 'start': 3736.188, 'duration': 3.563}, {'end': 3744.534, 'text': 'So a rule of thumb, this is not a mathematically proved thing.', 'start': 3741.391, 'duration': 3.143}, {'end': 3745.794, 'text': "It's an observation in surfaces.", 'start': 3744.554, 'duration': 1.24}, {'end': 3751.718, 'text': 'So it looks like a very good idea, instead of having a fixed step, to have eta
increase with the slope.', 'start': 3746.455, 'duration': 5.263}, {'end': 3755.281, 'text': "If I'm in a very high slope, I just go a lot, because I'm going down.", 'start': 3752.099, 'duration': 3.182}, {'end': 3759.844, 'text': "And then if I'm now close to the minimum, I'd better be careful in order not to miss the minimum and overshoot.", 'start': 3755.601, 'duration': 4.243}, {'end': 3763.866, 'text': "And because of this, here's an easy implementation of this idea.", 'start': 3760.824, 'duration': 3.042}, {'end': 3770.815, 'text': 'taking the direction, which will not change.', 'start': 3767.813, 'duration': 3.002}, {'end': 3773.217, 'text': "Here's the direction.", 'start': 3772.497, 'duration': 0.72}, {'end': 3774.758, 'text': 'And we are going to eta.', 'start': 3773.858, 'duration': 0.9}, {'end': 3776.139, 'text': 'And this is the formula for it.', 'start': 3774.898, 'duration': 1.241}, {'end': 3777.56, 'text': 'This is what we have for fixed size.', 'start': 3776.179, 'duration': 1.381}, {'end': 3781.463, 'text': "Now I'm going to try to make eta proportional to the size of the gradient.", 'start': 3778.121, 'duration': 3.342}, {'end': 3783.244, 'text': "So it's bigger when the slope is bigger.", 'start': 3781.823, 'duration': 1.421}, {'end': 3787.568, 'text': "That's very convenient, because I have here the size of the gradient sitting there.", 'start': 3784.105, 'duration': 3.463}], 'summary': 'Initial large eta for steep slope, increase with slope to prevent overshooting.', 'duration': 60.626, 'max_score': 3726.942, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3726942.jpg'}, {'end': 3835.996, 'src': 'embed', 'start': 3808.657, 'weight': 1, 'content': [{'end': 3813.278, 'text': 'You just compute the gradient, and use that learning rate, and that will take care of the previous observation.', 'start': 3808.657, 'duration': 4.621}, {'end': 3816.007, 'text': "So that's what we have.", 'start': 3814.986, 'duration': 1.021}, {'end': 3820.252, 'text': "That's all I'm going to say about gradient descent for this case.", 'start': 3816.348, 'duration': 3.904}, {'end': 3825.178, 'text': 'And then we are going to go to the more complicated issues of it when we talk about neural networks next time.', 'start': 3820.613, 'duration': 4.565}, {'end': 3827.761, 'text': 'So this is how to minimize.', 'start': 3826.34, 'duration': 1.421}, {'end': 3830.585, 'text': 'And now we have the logistic regression algorithm.', 'start': 3828.082, 'duration': 2.503}, {'end': 3835.996, 'text': 'Written in language, you iterate.', 'start': 3832.433, 'duration': 3.563}], 'summary': 'Compute gradient, use learning rate, and iterate to minimize logistic regression algorithm.', 'duration': 27.339, 'max_score': 3808.657, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3808657.jpg'}, {'end': 3901.69, 'src': 'embed', 'start': 3869.633, 'weight': 3, 'content': [{'end': 3877.915, 'text': 'Now let me spend two minutes summarizing all the linear models in one slide, and then we will be done completely with that model.', 'start': 3869.633, 'duration': 8.282}, {'end': 3882.28, 'text': 'We had three models.', 'start': 3880.779, 'duration': 1.501}, {'end': 3886.302, 'text': 'We had the perceptron, linear classification.', 'start': 3882.9, 'duration': 3.402}, {'end': 3889.324, 'text': 'We had linear regression.', 'start': 3888.323, 'duration': 1.001}, {'end': 3892.445, 'text': 'And we today added logistic 
regression.', 'start': 3890.624, 'duration': 1.821}, {'end': 3899.049, 'text': "Let's take one application domain, which is credit, and see how each of them contributes.", 'start': 3893.726, 'duration': 5.323}, {'end': 3901.69, 'text': 'So we have credit analysis.', 'start': 3900.53, 'duration': 1.16}], 'summary': '3 linear models discussed, including perceptron, linear regression, and logistic regression in 2-minute summary.', 'duration': 32.057, 'max_score': 3869.633, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3869633.jpg'}, {'end': 3979.138, 'src': 'embed', 'start': 3950.43, 'weight': 2, 'content': [{'end': 3954.091, 'text': 'And finally, logistic regression had cross-entropy error.', 'start': 3950.43, 'duration': 3.661}, {'end': 3959.673, 'text': 'Different errors that had different plausibility motivations to them.', 'start': 3955.011, 'duration': 4.662}, {'end': 3961.573, 'text': 'And we tackled all three of them.', 'start': 3960.213, 'duration': 1.36}, {'end': 3967.115, 'text': 'And then there was the learning algorithm that goes with them, that is very dependent on the error measure you choose.', 'start': 3962.753, 'duration': 4.362}, {'end': 3971.296, 'text': 'So for the case of the classification error, it was a combinatorial quantity.', 'start': 3967.815, 'duration': 3.481}, {'end': 3976.537, 'text': 'And we went for something like the perceptron learning algorithm, or the pocket version, if the thing is non-separable.', 'start': 3971.796, 'duration': 4.741}, {'end': 3979.138, 'text': 'And there are other, more sophisticated methods to do that.', 'start': 3976.557, 'duration': 2.581}], 'summary': 'Logistic regression had cross-entropy error, tackled different errors, and used perceptron learning algorithm for non-separable cases.', 'duration': 28.708, 'max_score': 3950.43, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3950430.jpg'}], 'start': 3695.265, 'title': 'Optimizing gradient descent and linear models', 'summary': 'Discusses the optimization of gradient descent for minimization, emphasizing the importance of adjusting the learning rate (eta) based on the slope. it also covers three linear models: perceptron, linear regression, and logistic regression, and their applications in credit analysis, with logistic regression computing the probability of default. 
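The logistic regression algorithm summarized here (initialize the weights, then iterate: compute the gradient of the in-sample cross-entropy error and move opposite to it by the learning rate) can be sketched as follows. This is a hedged illustration, not the lecture's implementation; it assumes labels y in {-1, +1}, an input matrix whose first column is the constant coordinate x0 = 1, and plain batch gradient descent with a fixed eta. The function names are illustrative.

```python
import numpy as np

def train_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Minimize the cross-entropy error E_in(w) = mean ln(1 + exp(-y w.x))
    with batch gradient descent.  Assumes y in {-1, +1} and that X already
    contains the constant coordinate x0 = 1.
    """
    N, d = X.shape
    w = np.zeros(d)                               # initialization
    for _ in range(n_iters):
        s = y * (X @ w)                           # agreement y_n * (w . x_n)
        grad = -(y / (1.0 + np.exp(s))) @ X / N   # gradient of E_in(w)
        w -= eta * grad                           # move opposite to the gradient
    return w

def predict_probability(X, w):
    """Logistic output theta(w.x), interpreted as P[y = +1 | x]."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```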
different error measures for each model and their associated learning algorithms are also discussed, highlighting the use of combinatorial quantity, pseudo-inverse, and gradient descent.', 'chapters': [{'end': 3868.532, 'start': 3695.265, 'title': 'Optimizing gradient descent for minimization', 'summary': 'Discusses the optimization of gradient descent for minimization, emphasizing the importance of adjusting the learning rate (eta) based on the slope to prevent overshooting the minimum, ultimately leading to a more efficient approach for minimizing errors in gradient descent.', 'duration': 173.267, 'highlights': ['Adjusting learning rate (eta) based on slope to prevent overshooting the minimum is emphasized, leading to a more efficient approach for minimizing errors in gradient descent.', 'Initial suggestion to have a large eta to take advantage of steepness, but to be more careful near the minimum to avoid bouncing.', 'Proposing an implementation where eta is made proportional to the size of the gradient, leading to a more effective gradient descent approach.', 'Detailed explanation of the logistic regression algorithm, involving iterative computation of gradients and updating weights based on the learning rate.']}, {'end': 4072.08, 'start': 3869.633, 'title': 'Linear models summary', 'summary': 'Discusses three linear models: perceptron, linear regression, and logistic regression, and their applications in credit analysis, with logistic regression computing the probability of default. it also covers different error measures for each model and their associated learning algorithms, highlighting the use of combinatorial quantity, pseudo-inverse, and gradient descent.', 'duration': 202.447, 'highlights': ['The chapter discusses three linear models: perceptron, linear regression, and logistic regression, and their applications in credit analysis, with logistic regression computing the probability of default. The chapter emphasizes the application of three linear models (perceptron, linear regression, logistic regression) in credit analysis, with logistic regression specifically computing the probability of default.', 'Different error measures for each model and their associated learning algorithms are covered, highlighting the use of combinatorial quantity, pseudo-inverse, and gradient descent. The chapter delves into different error measures for each model, such as combinatorial quantity for classification error, pseudo-inverse for squared error, and gradient descent for cross-entropy error.', 'The chapter emphasizes the application of three linear models (perceptron, linear regression, logistic regression) in credit analysis. 
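Of the three error measures, the squared error of linear regression is the one with a one-step solution via the pseudo-inverse. A minimal sketch, assuming numpy's pinv as the implementation choice and a design matrix that already includes the constant coordinate; the helper names are illustrative.

```python
import numpy as np

def linear_regression_pseudo_inverse(X, y):
    """One-step minimization of the squared error:
    w = (X^T X)^(-1) X^T y, i.e. the pseudo-inverse of X applied to y.
    """
    return np.linalg.pinv(X) @ y

def classify(X, w):
    """The same linear signal can be thresholded for a +/-1 decision
    (e.g. approve/deny), while its raw value serves regression uses
    such as the amount of credit."""
    return np.sign(X @ w)
```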
The chapter highlights the application of three linear models (perceptron, linear regression, logistic regression) in credit analysis, showcasing their distinct contributions to credit assessment.']}], 'duration': 376.815, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs3695265.jpg', 'highlights': ['Adjusting learning rate (eta) based on slope to prevent overshooting the minimum is emphasized, leading to a more efficient approach for minimizing errors in gradient descent.', 'Detailed explanation of the logistic regression algorithm, involving iterative computation of gradients and updating weights based on the learning rate.', 'Different error measures for each model and their associated learning algorithms are covered, highlighting the use of combinatorial quantity, pseudo-inverse, and gradient descent.', 'The chapter emphasizes the application of three linear models (perceptron, linear regression, logistic regression) in credit analysis, showcasing their distinct contributions to credit assessment.', 'Initial suggestion to have a large eta to take advantage of steepness, but to be more careful near the minimum to avoid bouncing.', 'Proposing an implementation where eta is made proportional to the size of the gradient, leading to a more effective gradient descent approach.']}, {'end': 5217.342, 'segs': [{'end': 4112.055, 'src': 'embed', 'start': 4087.396, 'weight': 0, 'content': [{'end': 4093.841, 'text': "And the termination is an issue here, but it's less of an issue here than in other cases, because of reasons that I'm going to explain.", 'start': 4087.396, 'duration': 6.445}, {'end': 4097.724, 'text': 'But in general, the termination is tricky.', 'start': 4094.481, 'duration': 3.243}, {'end': 4100.666, 'text': 'And you have a combination of criteria.', 'start': 4098.404, 'duration': 2.262}, {'end': 4103.908, 'text': 'So one of them is, what do I want? 
I want to minimize the error.', 'start': 4100.706, 'duration': 3.202}, {'end': 4112.055, 'text': 'So one of them is to say if the thing gets flat and flat and flat to the level where I move from one point to another,', 'start': 4104.509, 'duration': 7.546}], 'summary': 'Termination is a tricky issue, with a focus on minimizing error.', 'duration': 24.659, 'max_score': 4087.396, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs4087396.jpg'}, {'end': 4349.547, 'src': 'embed', 'start': 4319.088, 'weight': 2, 'content': [{'end': 4326.274, 'text': 'In many applications, you just apply gradient descent in a very simple way, and you often get very, very good results.', 'start': 4319.088, 'duration': 7.186}, {'end': 4334.839, 'text': 'And the conjugate gradient, which is sort of the king of the derivative-based methods, is a very attractive one.', 'start': 4327.635, 'duration': 7.204}, {'end': 4340.002, 'text': 'And in some optimizations, it completely trumps the alternatives.', 'start': 4334.979, 'duration': 5.023}, {'end': 4349.547, 'text': 'On the other hand, in many ways, the stochastic version of this and the simplicity of it makes it the algorithm of choice in many applications.', 'start': 4340.422, 'duration': 9.125}], 'summary': 'Gradient descent and conjugate gradient are effective in many applications, with stochastic version being algorithm of choice.', 'duration': 30.459, 'max_score': 4319.088, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs4319088.jpg'}, {'end': 4414.639, 'src': 'embed', 'start': 4385.405, 'weight': 1, 'content': [{'end': 4391.189, 'text': "Let's say I'm using neural networks, and the local minima are abundant in neural networks.", 'start': 4385.405, 'duration': 5.784}, {'end': 4394.451, 'text': 'And therefore, it looks on face value like a serious problem.', 'start': 4391.229, 'duration': 3.222}, {'end': 4407.433, 'text': 'If all you do is do the learning a number of times, starting from different initial conditions, that is, do a session starting from this point,', 'start': 4396.423, 'duration': 11.01}, {'end': 4409.335, 'text': 'do another session starting from this point, et cetera.', 'start': 4407.433, 'duration': 1.902}, {'end': 4414.639, 'text': 'So each of them will go to its nearest local minimum.', 'start': 4409.955, 'duration': 4.684}], 'summary': 'Neural networks face abundant local minima, but multiple learning sessions can reach nearest minima.', 'duration': 29.234, 'max_score': 4385.405, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs4385405.jpg'}, {'end': 4559.928, 'src': 'embed', 'start': 4530.282, 'weight': 3, 'content': [{'end': 4536.945, 'text': "you get a function based on a probability and it's basically the expected value of log 1 over the probability.", 'start': 4530.282, 'duration': 6.663}, {'end': 4539.406, 'text': 'That would be your classical definition of an entropy.', 'start': 4536.985, 'duration': 2.421}, {'end': 4544.868, 'text': 'When you have two different probabilities, you can get a cross-entropy between them.', 'start': 4539.946, 'duration': 4.922}, {'end': 4552.798, 'text': 'by getting the expected value of, and you take a ratio of them one way or the other.', 'start': 4545.348, 'duration': 7.45}, {'end': 4559.928, 'text': 'And there are a number of them in the literature that have different definitions and different scopes.', 'start': 4552.879, 'duration': 
7.049}], 'summary': 'Entropy is the expected value of log 1 over probability; cross-entropy involves comparing two probabilities.', 'duration': 29.646, 'max_score': 4530.282, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs4530282.jpg'}, {'end': 4682.388, 'src': 'embed', 'start': 4658.897, 'weight': 4, 'content': [{'end': 4665.779, 'text': 'So if I have to evaluate a lot of stuff, I have to show for it that I got a much better value.', 'start': 4658.897, 'duration': 6.882}, {'end': 4670.041, 'text': 'If I evaluate many more values and get that much of a difference,', 'start': 4666.239, 'duration': 3.802}, {'end': 4676.983, 'text': "then I lose in the optimization game because I used CPU cycles and I didn't improve the error as much.", 'start': 4670.041, 'duration': 6.942}, {'end': 4682.388, 'text': "So whenever you are looking at a method, it's a very practical question whether it will work or not.", 'start': 4677.503, 'duration': 4.885}], 'summary': "Evaluating more values should show much better value, but it's practical to consider if it will work.", 'duration': 23.491, 'max_score': 4658.897, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs4658897.jpg'}, {'end': 4796.506, 'src': 'embed', 'start': 4756.299, 'weight': 5, 'content': [{'end': 4759.1, 'text': 'But in many cases, the other cases can be derived in terms of those.', 'start': 4756.299, 'duration': 2.801}, {'end': 4762.921, 'text': "Let's look at, for example, the multi-class, the thing being asked about.", 'start': 4759.14, 'duration': 3.781}, {'end': 4765.962, 'text': 'Remember recognizing the digits that we talked about.', 'start': 4763.402, 'duration': 2.56}, {'end': 4770.324, 'text': 'How many digits did we have? We had 10 digits, 0, 1, 2, up to 9 in the zip codes.', 'start': 4766.203, 'duration': 4.121}, {'end': 4773.325, 'text': 'And we wanted to be able to classify them.', 'start': 4770.824, 'duration': 2.501}, {'end': 4775.866, 'text': 'What did we do with that? We used perceptron.', 'start': 4773.805, 'duration': 2.061}, {'end': 4776.686, 'text': 'Wait a minute.', 'start': 4775.986, 'duration': 0.7}, {'end': 4778.047, 'text': 'Perceptron does a binary thing.', 'start': 4776.726, 'duration': 1.321}, {'end': 4783.211, 'text': 'How did we do that? We did what we usually do for multi-class problems.', 'start': 4778.087, 'duration': 5.124}, {'end': 4796.506, 'text': 'Instead of taking 1 versus 2 versus 3 versus 4, et cetera, we either take a class versus another, class like 1 versus 5, and 2 versus 3, et cetera,', 'start': 4783.871, 'duration': 12.635}], 'summary': 'Discussed using perceptron for 10-digit classification in zip codes.', 'duration': 40.207, 'max_score': 4756.299, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs4756299.jpg'}, {'end': 5086.315, 'src': 'embed', 'start': 5056.446, 'weight': 6, 'content': [{'end': 5059.607, 'text': 'If you want this as an input to the credit, you may not want it as a linear thing.', 'start': 5056.446, 'duration': 3.161}, {'end': 5063.268, 'text': 'But you say, am I bigger than five years or less than five years? 
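The reduction of the ten-digit problem to binary classification described above can be done either one class versus another (1 versus 5, 2 versus 3, and so on) or one class versus all the rest. The sketch below shows the one-versus-rest variant; train_binary is a hypothetical placeholder for any binary learner that accepts +/-1 labels and returns a weight vector (the perceptron, the pocket algorithm, or the logistic regression trainer sketched earlier).

```python
import numpy as np

def train_one_versus_rest(X, y, train_binary):
    """Reduce a multi-class problem (e.g. digits 0-9) to binary problems:
    for each class, train 'this class' vs. 'all the others' and keep the
    resulting weight vector."""
    classes = np.unique(y)
    return {c: train_binary(X, np.where(y == c, 1.0, -1.0)) for c in classes}

def predict_one_versus_rest(X, models):
    """For each point, pick the class whose binary model gives the largest signal."""
    classes = list(models.keys())
    signals = np.column_stack([X @ models[c] for c in classes])
    return np.array(classes)[np.argmax(signals, axis=1)]
```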
Those are meaningful features.', 'start': 5059.667, 'duration': 3.601}, {'end': 5077.853, 'text': 'So the key distinction you need to make in your mind is that Did I choose the feature by understanding the problem or did I choose the feature by looking at the specific data set that was given to me?', 'start': 5063.909, 'duration': 13.944}, {'end': 5080.194, 'text': 'The latter is the problem.', 'start': 5079.033, 'duration': 1.161}, {'end': 5086.315, 'text': 'If I look at the data and then choose features, then I am doing the learning myself.', 'start': 5081.514, 'duration': 4.801}], 'summary': 'Choosing features based on problem understanding improves learning.', 'duration': 29.869, 'max_score': 5056.446, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs5056446.jpg'}, {'end': 5188.395, 'src': 'embed', 'start': 5161.732, 'weight': 7, 'content': [{'end': 5166.374, 'text': "There's a question if it's possible to choose parameters automatically.", 'start': 5161.732, 'duration': 4.642}, {'end': 5169.455, 'text': "I'm guessing they're referring to the learning rate.", 'start': 5166.454, 'duration': 3.001}, {'end': 5174.03, 'text': 'No, wait.', 'start': 5172.53, 'duration': 1.5}, {'end': 5175.371, 'text': 'Sorry They corrected.', 'start': 5174.07, 'duration': 1.301}, {'end': 5184.354, 'text': "So how to select the features automatically, so it's back to the original? Automatically, that's what we are in business with.", 'start': 5175.411, 'duration': 8.943}, {'end': 5185.094, 'text': "It's machine learning.", 'start': 5184.374, 'duration': 0.72}, {'end': 5185.954, 'text': 'Things are automatic.', 'start': 5185.174, 'duration': 0.78}, {'end': 5188.395, 'text': 'But then, this becomes part of learning.', 'start': 5186.674, 'duration': 1.721}], 'summary': 'Discussion on automatic selection of parameters and features in machine learning.', 'duration': 26.663, 'max_score': 5161.732, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs5161732.jpg'}], 'start': 4072.16, 'title': 'Gradient descent and neural network optimization', 'summary': 'Delves into challenges of termination in gradient descent, emphasizing efficiency despite local minima issues and the trade-offs between computational complexity and optimization performance for neural networks. 
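The distinction drawn here, choosing a feature from an understanding of the problem rather than from the particular data set, can be made concrete with a small, hypothetical feature map for the credit example. The five-year residence threshold follows the example in the discussion; the function name and exact encoding are illustrative assumptions.

```python
import numpy as np

def credit_features(years_in_residence):
    """A hand-designed input for the credit example: the raw number of years
    is not meaningful as a linear quantity, but the indicator 'has the person
    lived there for at least five years?' is.  The feature is fixed from an
    understanding of the problem, before looking at the particular data set;
    choosing it by inspecting the data would amount to doing part of the
    learning by hand, outside the algorithm.
    """
    return np.array([
        1.0,                               # constant coordinate x0
        float(years_in_residence >= 5),    # stable-residence indicator
    ])
```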
it also covers logistic regression in a multi-class setting and its impact on feature selection and vc dimension.', 'chapters': [{'end': 4489.947, 'start': 4072.16, 'title': 'Termination and local minima in gradient descent', 'summary': 'Discusses the challenges of termination in gradient descent, including the criteria for stopping and the issue of local minima, and highlights the efficiency of gradient descent despite local minima challenges and the effectiveness of repeating learning sessions from different initial conditions.', 'duration': 417.787, 'highlights': ['The termination criteria for gradient descent involves minimizing the error, stopping when the changes are small or reaching a target error, or imposing a limit on the number of iterations.', 'Local minima in gradient descent can be addressed by repeating learning sessions from different initial conditions and picking the best minimum, as formally reaching the global minimum is NP-hard and not tractable in terms of computational time.', 'Gradient descent is a remarkably efficient algorithm, especially the stochastic version, and often yields very good results in many applications.']}, {'end': 4722.383, 'start': 4490.328, 'title': 'Optimizing neural networks', 'summary': 'Discusses stochastic gradient descent, cross-entropy, and optimization methods for neural networks, emphasizing the trade-offs between computational complexity and optimization performance.', 'duration': 232.055, 'highlights': ['Stochastic gradient descent involves processing one training example at a time rather than the whole set, with a focus on its applicability to neural networks.', 'Cross-entropy provides a relationship between two probability distributions using logarithmic and expected values, with different definitions and scopes in the literature.', 'Optimization methods like binary search and second-order approximations offer trade-offs between computational cost and speed, emphasizing the need to carefully evaluate their practical applicability.']}, {'end': 5217.342, 'start': 4725.065, 'title': 'Logistic regression and multi-class approach', 'summary': 'Discusses the application of logistic regression in a multi-class setting, including using perceptron for digit recognition and the use of different sigmoid functions such as tanh. it also covers the impact of feature selection on vc dimension and the automatic selection of features in machine learning.', 'duration': 492.277, 'highlights': ['The application of logistic regression in a multi-class setting, including using perceptron for digit recognition and the use of different sigmoid functions such as tanh. The chapter delves into the application of logistic regression in a multi-class setting, particularly using perceptron for digit recognition and exploring alternative sigmoid functions like tanh.', 'The impact of feature selection on VC dimension and the differentiation between meaningful features derived from an understanding of the problem versus features chosen based on the specific dataset. It discusses the impact of feature selection on VC dimension and the distinction between meaningful features derived from problem understanding versus features chosen based on a specific dataset.', 'The automatic selection of features in machine learning, particularly in the context of neural networks and hidden layers. 
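The stochastic version of gradient descent mentioned in these highlights, which processes one training example at a time instead of the whole batch, might look like the following sketch for the logistic regression error. It assumes labels y in {-1, +1} and a fixed learning rate; the random reshuffling per epoch is a common choice, not something prescribed by the lecture.

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.1, n_epochs=100, seed=0):
    """Update the weights one example at a time.  Each step uses the gradient
    of the single-point cross-entropy error ln(1 + exp(-y_n w.x_n)); on
    average it points the same way as the batch gradient, but each step is
    much cheaper to compute.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for n in rng.permutation(N):                       # random order over examples
            s = y[n] * (X[n] @ w)
            w += eta * y[n] * X[n] / (1.0 + np.exp(s))     # single-point step
    return w
```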
The chapter touches upon the automatic selection of features in machine learning, especially in the context of neural networks and hidden layers.']}], 'duration': 1145.182, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qSTHZvN8hzs/pics/qSTHZvN8hzs4072160.jpg', 'highlights': ['Gradient descent termination criteria: minimize error, small changes, target error, or iteration limit.', 'Address local minima by repeating learning sessions from different initial conditions.', 'Stochastic gradient descent is efficient, yields good results, applicable to neural networks.', 'Cross-entropy relates probability distributions using logarithmic and expected values.', 'Optimization methods offer trade-offs between computational cost and speed.', 'Application of logistic regression in a multi-class setting, using perceptron for digit recognition.', 'Impact of feature selection on VC dimension and meaningful features in problem understanding.', 'Automatic feature selection in machine learning, especially in neural networks and hidden layers.']}], 'highlights': ['The bias-variance decomposition illustrates a trade-off between bias and variance, where a small hypothesis set leads to significant bias, while a larger hypothesis set reduces bias but introduces variance.', 'Learning curves depict the relationship between in-sample and out-of-sample error as the number of examples increases, showing that increasing the sample size decreases out-of-sample error and leads to decreased in-sample error due to better fitting with fewer examples.', 'The number of examples required to achieve a certain performance is proportional to the VC dimension or the effective degrees of freedom, leading to better generalization performance with a larger number of examples.', 'Logistic regression applies a nonlinearity, the logistic function, to the linear signal, resulting in a real-valued output interpreted as a probability, with the nonlinearity serving as a probability function that ranges from 0 to 1.', "The logistic function's formula involves exponentials and ratios, demonstrating how the function maps large positive signals to probabilities close to 1, large negative signals to probabilities close to 0, and signals of 0 to a probability of 0.5.", 'The error measure is specifically tailored for binary classification tasks, where the target function generates binary outputs.', 'The likelihood of a hypothesis is evaluated based on the probability of generating the data under that assumption, allowing for a comparative way to determine the plausibility of different hypotheses.', 'The chapter emphasizes the iterative nature of minimizing error measures through learning algorithms, particularly in logistic regression, and introduces the concept of gradient descent as a general method for nonlinear optimization.', 'Adjusting learning rate (eta) based on slope to prevent overshooting the minimum is emphasized, leading to a more efficient approach for minimizing errors in gradient descent.', 'Detailed explanation of the logistic regression algorithm, involving iterative computation of gradients and updating weights based on the learning rate.', 'Different error measures for each model and their associated learning algorithms are covered, highlighting the use of combinatorial quantity, pseudo-inverse, and gradient descent.']}
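Finally, the logistic function theta(s) = e^s / (1 + e^s) and the cross-entropy in-sample error derived from the likelihood argument can be written down directly. A short sketch under the usual assumptions (labels in {-1, +1}, constant coordinate included in x); the numerically equivalent form 1 / (1 + e^(-s)) is used to avoid overflow for large positive signals.

```python
import numpy as np

def theta(s):
    """The logistic function theta(s) = e^s / (1 + e^s): large positive
    signals map near 1, large negative signals near 0, and s = 0 to 0.5.
    Implemented in the equivalent form 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy_error(w, X, y):
    """In-sample error of logistic regression, obtained from likelihood:
    maximizing prod_n theta(y_n w.x_n) over the data is the same as
    minimizing E_in(w) = (1/N) sum_n ln(1 + exp(-y_n w.x_n)).
    """
    return np.mean(np.log(1.0 + np.exp(-y * (X @ w))))
```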