title
Lecture 04 - Error and Noise
description
Error and Noise - The principled choice of error measures, and what happens when the target we want to learn is noisy. Lecture 4 of 18 of Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. View course materials in the iTunes U Course App - https://itunes.apple.com/us/course/machine-learning/id515364596 and on the course website - http://work.caltech.edu/telecourse.html
Produced in association with Caltech Academic Media Technologies under the Attribution-NonCommercial-NoDerivs Creative Commons License (CC BY-NC-ND). To learn more about this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/
This lecture was recorded on April 12, 2012, in Hameetman Auditorium at Caltech, Pasadena, CA, USA.
detail
{'title': 'Lecture 04 - Error and Noise', 'heatmap': [{'end': 1274.673, 'start': 1222.56, 'weight': 1}, {'end': 2495.416, 'start': 2251.736, 'weight': 0.735}, {'end': 2786.686, 'start': 2726.297, 'weight': 0.828}], 'summary': 'Lecture covers linear and nonlinear models in machine learning, emphasizing error measures, noise, and their impact on learning algorithms. It also discusses target distribution, theory of learning, probability distribution, and error quantification in supervised learning, providing practical examples and implications for machine learning.', 'chapters': [{'end': 116.666, 'segs': [{'end': 116.666, 'src': 'embed', 'start': 0.703, 'weight': 0, 'content': [{'end': 3.565, 'text': 'The following program is brought to you by Caltech.', 'start': 0.703, 'duration': 2.862}, {'end': 22.932, 'text': 'Welcome back. Last time, we talked about linear models.', 'start': 16.435, 'duration': 6.497}, {'end': 30.718, 'text': 'And linear models share what we would refer to as the signal, which is this formula.', 'start': 24.313, 'duration': 6.405}, {'end': 37.925, 'text': "It's a linear sum involving the input variables and weights that can be put in vector form.", 'start': 31.641, 'duration': 6.284}, {'end': 44.249, 'text': 'And all linear models, in one form or another, have that as their basic building block.', 'start': 38.825, 'duration': 5.424}, {'end': 49.752, 'text': 'And you can have a classification linear system like the perceptron,', 'start': 45.229, 'duration': 4.523}, {'end': 56.232, 'text': 'that uses that signal and takes the sign of it to make a decision plus or minus 1.', 'start': 49.752, 'duration': 6.48}, {'end': 63.139, 'text': 'Or you can take something like regression, which is real-valued, that takes the signal as it is, and has that as output.', 'start': 56.232, 'duration': 6.907}, {'end': 69.538, 'text': 'We looked at the linear regression algorithm, which was a particularly easy algorithm.', 'start': 65.316, 'duration': 4.222}, {'end': 75.14, 'text': 'All it does, it takes the inputs and puts them in a particular matrix form.', 'start': 70.498, 'duration': 4.642}, {'end': 78.982, 'text': "And so the outputs, that's the inputs and outputs of the data set.", 'start': 75.54, 'duration': 3.442}, {'end': 87.585, 'text': 'And then, by computing this very simple formula, in one shot, it can get you the optimal value of the weight vector.', 'start': 79.942, 'duration': 7.643}, {'end': 94.989, 'text': 'If you look at linear models, you can think of them as an economy car.', 'start': 89.924, 'duration': 5.065}, {'end': 101.154, 'text': "They get you where you want to go, and they don't consume a lot of gas.", 'start': 96.75, 'duration': 4.404}, {'end': 105.758, 'text': 'You may not be very proud of them, but they actually do the job.', 'start': 102.535, 'duration': 3.223}, {'end': 110.603, 'text': 'If you want a luxury car, wait until you get to support vector machines.', 'start': 106.799, 'duration': 3.804}, {'end': 113.105, 'text': "And you'll have to pay the price for that.", 'start': 111.764, 'duration': 1.341}, {'end': 116.666, 'text': 'However, for linear models,', 'start': 114.485, 'duration': 2.181}], 'summary': 'Linear models use a simple formula involving input variables and weights, with linear regression being an easy algorithm that computes the optimal weight vector in one shot.', 'duration': 115.963, 'max_score': 0.703, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc703.jpg'}], 'start': 0.703, 'title': 
'Linear models in machine learning', 'summary': 'Discusses linear models in machine learning, covering the signal concept, linear sum with input variables and weights, and application in classification and regression algorithms.', 'chapters': [{'end': 116.666, 'start': 0.703, 'title': 'Linear models in machine learning', 'summary': 'Discusses linear models in machine learning, including the concept of the signal, the use of linear sum involving input variables and weights, and the application of linear models in classification and regression algorithms.', 'duration': 115.963, 'highlights': ['Linear models share a common signal formula, involving input variables and weights, which serves as their basic building block.', 'Linear regression algorithm simplifies the computation of the optimal weight vector by using a simple matrix form of the inputs and outputs of the dataset.', 'Classification linear systems like the perceptron use the signal to make decisions, while regression uses the signal as its output, providing flexibility in handling both classification and real value scenarios.', 'Linear models are likened to economy cars that efficiently serve their purpose without consuming excessive resources, while luxury car equivalents in machine learning are represented by support vector machines, offering enhanced performance at a higher cost.']}], 'duration': 115.963, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc703.jpg', 'highlights': ['Linear regression algorithm simplifies computation of optimal weight vector using matrix form.', 'Classification linear systems like perceptron use signal to make decisions.', 'Linear models share common signal formula involving input variables and weights.', 'Linear models likened to economy cars efficiently serving purpose without excessive resources.']}, {'end': 590.297, 'segs': [{'end': 181.557, 'src': 'embed', 'start': 157.415, 'weight': 1, 'content': [{'end': 165.603, 'text': 'And the reason this is important is because learning actually modifies w in the learning process until it gets to the optimal one, while x,', 'start': 157.415, 'duration': 8.188}, {'end': 171.108, 'text': 'which you usually think of as a variable, is actually a bunch of constants, which are the data sets that are handed to you.', 'start': 165.603, 'duration': 5.505}, {'end': 174.531, 'text': 'So the linearity in w is the key point.', 'start': 171.648, 'duration': 2.883}, {'end': 181.557, 'text': 'And if you take x and transform it in any way, you form to another vector z in a very nonlinear way, if you want.', 'start': 175.492, 'duration': 6.065}], 'summary': 'Learning modifies w to optimal, x is constants, linearity in w is key.', 'duration': 24.142, 'max_score': 157.415, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc157415.jpg'}, {'end': 248.569, 'src': 'embed', 'start': 224.527, 'weight': 2, 'content': [{'end': 231.053, 'text': 'And these are practical considerations that we have to take when we consider real-life problems.', 'start': 224.527, 'duration': 6.526}, {'end': 238.72, 'text': 'And we are going to modify the learning diagram that we have by incorporating the notion of error and the notion of noise.', 'start': 231.833, 'duration': 6.887}, {'end': 243.264, 'text': 'And I will do that for the bulk of the lecture.', 'start': 240.181, 'duration': 3.083}, {'end': 248.569, 'text': 'However, my starting point will be to wrap up the nonlinear transformation 
that we started last time.', 'start': 243.664, 'duration': 4.905}], 'summary': 'Incorporating error and noise in learning diagram, starting with nonlinear transformation.', 'duration': 24.042, 'max_score': 224.527, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc224527.jpg'}, {'end': 334.94, 'src': 'embed', 'start': 310.635, 'weight': 0, 'content': [{'end': 319.037, 'text': 'The transformation is, you take every point in the sample space Xn, you put it through a transformation Φ, and you get the corresponding point Zn.', 'start': 310.635, 'duration': 8.402}, {'end': 324.038, 'text': 'And now we are working in the feature space, or the nonlinear space Z.', 'start': 319.397, 'duration': 4.641}, {'end': 329.599, 'text': 'When we did this, we realized that a data set like this can become linearly separable in the new space.', 'start': 324.038, 'duration': 5.561}, {'end': 334.94, 'text': 'And that allows us to apply the linear model algorithm here.', 'start': 330.999, 'duration': 3.941}], 'summary': 'By transforming the sample space xn to nonlinear space z, the data set becomes linearly separable, enabling application of the linear model algorithm.', 'duration': 24.305, 'max_score': 310.635, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc310635.jpg'}, {'end': 484.226, 'src': 'embed', 'start': 450.547, 'weight': 3, 'content': [{'end': 456.489, 'text': 'And, as you can see, although we are illustrating here, in a case where you are going from 2-dimensional to 2-dimensional,', 'start': 450.547, 'duration': 5.942}, {'end': 461.632, 'text': 'you could in principle go from 2-dimensional to 100-dimensional with highly nonlinear coordinates.', 'start': 456.489, 'duration': 5.143}, {'end': 463.192, 'text': 'And the same principle will apply.', 'start': 461.892, 'duration': 1.3}, {'end': 466.014, 'text': 'You would be classifying here with a hyperplane in that case.', 'start': 463.493, 'duration': 2.521}, {'end': 468.895, 'text': 'And then this surface would be very, very complicated.', 'start': 466.514, 'duration': 2.381}, {'end': 470.596, 'text': 'It could be completely jagged and whatnot.', 'start': 468.915, 'duration': 1.681}, {'end': 474.197, 'text': 'And that enables you to implement a lot of sophisticated surfaces.', 'start': 470.976, 'duration': 3.221}, {'end': 484.226, 'text': "So let's look at the nonlinear transformation and ask ourselves what transforms to what, to make sure that all the notions are clear.", 'start': 476.863, 'duration': 7.363}], 'summary': 'Nonlinear transformations can go from 2D to 100D, enabling sophisticated surfaces and classification with hyperplanes.', 'duration': 33.679, 'max_score': 450.547, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc450547.jpg'}, {'end': 601.287, 'src': 'embed', 'start': 570.702, 'weight': 4, 'content': [{'end': 573.244, 'text': 'In this case, generalization-wise, crashing.', 'start': 570.702, 'duration': 2.542}, {'end': 578.848, 'text': 'That is, although you did everything right and you did this transformation and this is a powerful machine,', 'start': 573.664, 'duration': 5.184}, {'end': 583.151, 'text': "you don't know how to drive the powerful machine and you end up with very poor generalization.", 'start': 578.848, 'duration': 4.303}, {'end': 586.454, 'text': "And we will need the theory in order to get our driver's license.", 'start': 583.632, 'duration': 2.822}, 
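
To make the preceding discussion concrete, here is a minimal Python sketch, assuming a hypothetical quadratic feature map phi and a toy circular data set, neither of which is prescribed by the lecture: the inputs are mapped into the feature space Z, the one-shot linear regression formula w = (Z^T Z)^{-1} Z^T y is applied there, and the sign of the signal gives the classification.

import numpy as np

# Hypothetical quadratic feature map: x in R^2 -> z in R^6.
# Any fixed, data-independent transform would do; this choice is an assumption.
def phi(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# Toy data set: +1 inside a circle, -1 outside. Not linearly separable in
# the input space X, but linearly separable in the transformed space Z.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
y = np.where((X ** 2).sum(axis=1) < 0.5, 1.0, -1.0)

Z = np.array([phi(x) for x in X])   # transform the inputs; the labels are untouched

# One-shot linear regression in Z space via the pseudo-inverse.
w_tilde = np.linalg.pinv(Z) @ y     # the weights live in Z space, not X space

g = np.sign(Z @ w_tilde)            # hypothesis g(x) = sign(w_tilde^T phi(x))
print("in-sample classification error:", np.mean(g != y))

A hyperplane in Z corresponds to a quadratic decision surface back in X, which is what makes the transformed linear model so much more expressive.
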
{'end': 590.297, 'text': 'That will tell us what to do in order to be able to drive this machine.', 'start': 587.054, 'duration': 3.243}, {'end': 595.022, 'text': 'So that is x.', 'start': 592.059, 'duration': 2.963}, {'end': 601.287, 'text': 'Now, what do x1 up to xn go to? Remember, this is the data set, the inputs of the data set.', 'start': 595.022, 'duration': 6.265}], 'summary': 'Understanding theory leads to better generalization and efficient use of powerful machines.', 'duration': 30.585, 'max_score': 570.702, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc570702.jpg'}], 'start': 116.666, 'title': 'Nonlinear transformation in machine learning', 'summary': 'Discusses the importance of nonlinear transformation in strengthening linear models, highlighting its role in achieving linear separability and the application of simple linear regression algorithm. it also emphasizes the practical considerations of error and noise in real-life problems. additionally, it explores the process of nonlinear transformation in machine learning, illustrating the potential for highly nonlinear coordinates and the implications for classification, while emphasizing the need for caution due to the risk of poor generalization.', 'chapters': [{'end': 402.487, 'start': 116.666, 'title': 'Nonlinear transformation and learning models', 'summary': 'Discusses the importance of nonlinear transformation in strengthening linear models, highlighting its role in achieving linear separability and the application of simple linear regression algorithm. it also emphasizes the practical considerations of error and noise in real-life problems.', 'duration': 285.821, 'highlights': ['Nonlinear transformation allows achieving linear separability in the new feature space Z, enabling the application of the simple linear model algorithm. By transforming the data set into a nonlinear space Z, the chapter demonstrates the possibility of achieving linear separability, which facilitates the application of simple linear model algorithms like linear regression or classification.', 'Emphasizing the importance of nonlinear transformation in modifying W during the learning process to reach the optimal solution. The chapter highlights the significance of nonlinear transformation in modifying W, the vector, during the learning process to attain the optimal solution, providing insights into the key role of linearity in W.', 'Addressing the practical considerations of error and noise in real-life problems, modifying the learning diagram to incorporate these notions. The chapter addresses the practical considerations of error and noise in real-life problems, showcasing the modification of the learning diagram to incorporate these notions, thereby acknowledging the relevance of addressing real-world complexities.']}, {'end': 590.297, 'start': 402.908, 'title': 'Nonlinear transformation in machine learning', 'summary': 'Discusses the process of nonlinear transformation in machine learning, illustrating the potential for highly nonlinear coordinates and the implications for classification, while emphasizing the need for caution due to the risk of poor generalization.', 'duration': 187.389, 'highlights': ['The process of nonlinear transformation in machine learning is illustrated, highlighting the potential for highly nonlinear coordinates and the implications for classification. 
Illustration of potential highly nonlinear coordinates, implications for classification', 'Emphasizing the need for caution due to the risk of poor generalization when implementing nonlinear transformations in machine learning. Cautionary note on potential for poor generalization']}], 'duration': 473.631, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc116666.jpg', 'highlights': ['Nonlinear transformation enables achieving linear separability in feature space Z, facilitating simple linear model algorithms.', 'Nonlinear transformation plays a significant role in modifying W during the learning process to reach the optimal solution.', 'Addressing practical considerations of error and noise in real-life problems, modifying the learning diagram to incorporate these notions.', 'Illustration of potential highly nonlinear coordinates and implications for classification in the process of nonlinear transformation in machine learning.', 'Emphasizing caution due to the risk of poor generalization when implementing nonlinear transformations in machine learning.']}, {'end': 1384.394, 'segs': [{'end': 671.145, 'src': 'embed', 'start': 617.878, 'weight': 0, 'content': [{'end': 621.199, 'text': 'And each vector can be very long, according to the transformation you chose.', 'start': 617.878, 'duration': 3.321}, {'end': 625.38, 'text': 'Next one, the labels.', 'start': 623.999, 'duration': 1.381}, {'end': 630.781, 'text': 'The data set comes with inputs and outputs, right? So the inputs, I did the transformation.', 'start': 625.54, 'duration': 5.241}, {'end': 638.143, 'text': 'What do y1 up to yn transform to? Well, they transform to y1 up to yn.', 'start': 630.821, 'duration': 7.322}, {'end': 639.932, 'text': 'These are untouched.', 'start': 639.051, 'duration': 0.881}, {'end': 641.273, 'text': 'These are the values.', 'start': 640.413, 'duration': 0.86}, {'end': 642.355, 'text': 'They are not touched.', 'start': 641.654, 'duration': 0.701}, {'end': 643.776, 'text': 'And these are the ones we learn.', 'start': 642.755, 'duration': 1.021}, {'end': 650.022, 'text': "If it's classification, they are plus 1 or minus 1, exactly the same way they were there before.", 'start': 643.816, 'duration': 6.206}, {'end': 657.17, 'text': 'How about the weights? 
When we use linear models, we have a weight vector.', 'start': 651.424, 'duration': 5.746}, {'end': 660.974, 'text': 'So when we are in the X space here, what are the weights?', 'start': 657.59, 'duration': 3.384}, {'end': 668.098, 'text': 'The answer is that there are no weights in the X space when you do a nonlinear transformation.', 'start': 662.526, 'duration': 5.572}, {'end': 671.145, 'text': 'The weights are done in the Z space.', 'start': 668.76, 'duration': 2.385}], 'summary': 'Data transformation and classification in machine learning.', 'duration': 53.267, 'max_score': 617.878, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc617878.jpg'}, {'end': 741.424, 'src': 'embed', 'start': 715.844, 'weight': 3, 'content': [{'end': 721.361, 'text': 'And it happens to be exactly the same way, except in z space.', 'start': 715.844, 'duration': 5.517}, {'end': 727.243, 'text': 'So you take the linear form here, and take the sign, and that would be your hypothesis.', 'start': 721.441, 'duration': 5.802}, {'end': 735.465, 'text': "Except it's a little bit annoying, because this is g of x, and you are telling me this is w tilde transpose times z.", 'start': 727.683, 'duration': 7.782}, {'end': 737.085, 'text': "Where is x? Don't worry.", 'start': 735.465, 'duration': 1.62}, {'end': 738.286, 'text': 'Here is x.', 'start': 738.026, 'duration': 0.26}, {'end': 741.424, 'text': 'What Z is, is the transformation of X.', 'start': 739.383, 'duration': 2.041}], 'summary': 'Linear transformation from x to z space, with hypothesis based on sign and g of x.', 'duration': 25.58, 'max_score': 715.844, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc715844.jpg'}, {'end': 932.983, 'src': 'embed', 'start': 901.398, 'weight': 4, 'content': [{'end': 903.98, 'text': 'And these will be h and f.', 'start': 901.398, 'duration': 2.582}, {'end': 908.805, 'text': 'So it returns a number for any two functions you plug in.', 'start': 903.98, 'duration': 4.825}, {'end': 910.726, 'text': 'One of them will be the target function.', 'start': 909.225, 'duration': 1.501}, {'end': 912.988, 'text': 'One of them will be a hypothesis of interest.', 'start': 911.027, 'duration': 1.961}, {'end': 919.033, 'text': 'And you ask yourself, how well, or how badly in this case, does h approximate f? 
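
A minimal sketch of the error measures being defined here, assuming NumPy and illustrative function names: squared error and binary error are the two standard pointwise choices, the in-sample error averages the pointwise error over the N training points, and the out-of-sample error is the expected value over the input distribution P(x), approximated below by Monte Carlo sampling. With binary error, the out-of-sample error is exactly the overall probability that h disagrees with f.

import numpy as np

# Pointwise error measures e(h(x), f(x)).
def squared_error(hx, fx):    # common for real-valued targets
    return (hx - fx) ** 2

def binary_error(hx, fx):     # common for classification
    return float(hx != fx)

# In-sample error: E_in(h) = (1/N) * sum over n of e(h(x_n), y_n).
def e_in(h, xs, ys, err=binary_error):
    return np.mean([err(h(x), y) for x, y in zip(xs, ys)])

# Out-of-sample error: E_out(h) = E_x[e(h(x), f(x))], approximated by
# averaging over fresh points drawn from the same input distribution P(x).
def e_out(h, f, draw_x, n=100_000, err=binary_error):
    xs = [draw_x() for _ in range(n)]
    return np.mean([err(h(x), f(x)) for x in xs])
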
And you get an error.', 'start': 913.349, 'duration': 5.684}, {'end': 922.857, 'text': 'If the error is 0, then h perfectly reflects f, and you are home free.', 'start': 919.113, 'duration': 3.744}, {'end': 927.621, 'text': 'If there is an error, then maybe you need to look for another h that has smaller error.', 'start': 923.397, 'duration': 4.224}, {'end': 932.983, 'text': 'So that formalizes the question of search of the learning algorithm into minimizing an error function.', 'start': 927.981, 'duration': 5.002}], 'summary': 'Learning algorithm minimizes error function to find best function approximation.', 'duration': 31.585, 'max_score': 901.398, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc901398.jpg'}, {'end': 969.926, 'src': 'embed', 'start': 943.368, 'weight': 5, 'content': [{'end': 947.31, 'text': 'We just take these objects and return a number, and we refer to it as a function.', 'start': 943.368, 'duration': 3.942}, {'end': 952.353, 'text': 'And we talk about error measure in the sense of the English word measure, not the mathematical measure.', 'start': 947.67, 'duration': 4.683}, {'end': 962.799, 'text': 'So the error function, in principle, returns a number for a pair of functions.', 'start': 956.092, 'duration': 6.707}, {'end': 969.926, 'text': 'But it is almost always defined in terms of difference on a particular point, and then you put these points together.', 'start': 962.999, 'duration': 6.927}], 'summary': 'Error function returns a number for a pair of functions, defined in terms of difference on a particular point.', 'duration': 26.558, 'max_score': 943.368, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc943368.jpg'}, {'end': 1163.24, 'src': 'embed', 'start': 1132.254, 'weight': 6, 'content': [{'end': 1134.776, 'text': "So let's look at the in-sample error.", 'start': 1132.254, 'duration': 2.522}, {'end': 1139.481, 'text': 'When we have the in-sample error, this is the formula for it.', 'start': 1136.418, 'duration': 3.063}, {'end': 1144.345, 'text': 'And now you think of in-sample error as the in-sample version of this.', 'start': 1139.581, 'duration': 4.764}, {'end': 1154.872, 'text': 'Because now we are going to use the pointwise error that goes with that error measure in defining the in-sample error.', 'start': 1146.164, 'duration': 8.708}, {'end': 1163.24, 'text': 'So if you take a single point from your training set, you would be having n going from 1 to N.', 'start': 1155.573, 'duration': 7.667}], 'summary': 'In-sample error formula defines pointwise error for training set.', 'duration': 30.986, 'max_score': 1132.254, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1132254.jpg'}, {'end': 1274.673, 'src': 'heatmap', 'start': 1222.56, 'weight': 1, 'content': [{'end': 1231.262, 'text': 'And in order to get an average in that case, what you do is you get the expected value, in this case, with respect to x.', 'start': 1222.56, 'duration': 8.702}, {'end': 1233.303, 'text': 'So that is the average for the out-of-sample case.', 'start': 1231.262, 'duration': 2.041}, {'end': 1241.908, 'text': 'And again, if you take the binary error, and you take the expected value of this, this will be identically the probability of error overall.', 'start': 1233.903, 'duration': 8.005}, {'end': 1248.233, 'text': 'And we are using the probability distribution over the input space X in order to compute this 
quantity.', 'start': 1242.709, 'duration': 5.524}, {'end': 1255.838, 'text': "So that's how we get from a definition that you invoke on a single point, to the in-sample and out-of-sample versions.", 'start': 1248.573, 'duration': 7.265}, {'end': 1263.726, 'text': "Now let's revise the learning diagram with this added component.", 'start': 1258.582, 'duration': 5.144}, {'end': 1266.948, 'text': 'Here is the learning diagram.', 'start': 1263.866, 'duration': 3.082}, {'end': 1273.272, 'text': 'There is nothing that changed here, except that now this is the standard color, because we already got used to it.', 'start': 1268.209, 'duration': 5.063}, {'end': 1274.673, 'text': 'The red stuff is the new stuff.', 'start': 1273.392, 'duration': 1.281}], 'summary': "Average for out-of-sample case is probability of error overall using input space x's probability distribution.", 'duration': 52.113, 'max_score': 1222.56, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1222560.jpg'}, {'end': 1314.13, 'src': 'embed', 'start': 1283.938, 'weight': 7, 'content': [{'end': 1288.339, 'text': 'The first one is to realize that we are defining the error measure on a point.', 'start': 1283.938, 'duration': 4.401}, {'end': 1289.699, 'text': "So here's the addition.", 'start': 1288.839, 'duration': 0.86}, {'end': 1301.742, 'text': 'The addition is that, in deciding whether g is close to f, which is the goal of learning, we test this with a point x.', 'start': 1290.72, 'duration': 11.022}, {'end': 1308.523, 'text': 'And the criteria for deciding whether g of x is approximately the same as f of x is our pointwise error measure.', 'start': 1301.742, 'duration': 6.781}, {'end': 1314.13, 'text': 'Furthermore, this X is created from the space using something very specific.', 'start': 1310.049, 'duration': 4.081}], 'summary': 'Defining error measure on a point to test closeness of g to f using pointwise error measure.', 'duration': 30.192, 'max_score': 1283.938, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1283938.jpg'}], 'start': 592.059, 'title': 'Nonlinear transformation in machine learning', 'summary': 'Discusses the transformation of input data, labels, and weights in a nonlinear space, resulting in a new hypothesis for machine learning. it also introduces the concept of nonlinear transformation, delves into error measures and noisy targets, and provides an in-depth explanation of error measures and their role in approximating functions, as well as the computation of in-sample and out-of-sample errors.', 'chapters': [{'end': 741.424, 'start': 592.059, 'title': 'Nonlinear transformation in machine learning', 'summary': 'Discusses the transformation of input data, labels, and weights in a nonlinear space, resulting in a new hypothesis for machine learning.', 'duration': 149.365, 'highlights': ['The input data set x1 up to xn undergoes a nonlinear transformation to z1 up to zn, resulting in the same number of points in a vector form. The input data set undergoes a transformation to a new vector space, maintaining the same number of points and vector format.', 'The labels y1 up to yn remain untouched during the transformation, and they are the values learned for classification. 
The labels remain unchanged during the transformation and are utilized as the learned values for classification tasks.', 'Weight vectors are not present in the X space after a nonlinear transformation; instead, they exist in the Z space and are denoted as w tilde. After the transformation, weight vectors are located in the Z space and are represented as w tilde, distinct from the original X space.', 'The hypothesis g of x is derived from the linear form in the z space, resulting in the final hypothesis for the learning process. The final hypothesis for the learning process is derived from the linear form in the z space, producing the hypothesis g of x.']}, {'end': 1384.394, 'start': 741.424, 'title': 'Nonlinear transformation and error measures', 'summary': 'Introduces the concept of nonlinear transformation and delves into error measures and noisy targets, providing an in-depth explanation of error measures and their role in approximating functions, as well as the computation of in-sample and out-of-sample errors.', 'duration': 642.97, 'highlights': ['The error measure quantitatively evaluates the approximation of a hypothesis to a target function, with an error of 0 indicating a perfect reflection, guiding the search for a hypothesis with minimal error.', 'Error measures are defined as a functional that returns a number for a pair of functions, often based on the pointwise difference between the two functions, such as the squared error and binary error.', 'The in-sample error is computed as the average of pointwise errors over the training set, while the out-of-sample error is determined by the expected value with respect to the input space X, reflecting the probability of error overall.', "The addition of error measures to the learning diagram involves testing the closeness of a hypothesis to a target function using a pointwise error measure, with the requirement to test using points drawn from the same distribution as the training examples to invoke Hoeffding's guarantees."]}], 'duration': 792.335, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc592059.jpg', 'highlights': ['The input data set undergoes a transformation to a new vector space, maintaining the same number of points and vector format.', 'The labels remain unchanged during the transformation and are utilized as the learned values for classification tasks.', 'After the transformation, weight vectors are located in the Z space and are represented as w tilde, distinct from the original X space.', 'The final hypothesis for the learning process is derived from the linear form in the z space, producing the hypothesis g of x.', 'The error measure quantitatively evaluates the approximation of a hypothesis to a target function, with an error of 0 indicating a perfect reflection, guiding the search for a hypothesis with minimal error.', 'Error measures are defined as a functional that returns a number for a pair of functions, often based on the pointwise difference between the two functions, such as the squared error and binary error.', 'The in-sample error is computed as the average of pointwise errors over the training set, while the out-of-sample error is determined by the expected value with respect to the input space X, reflecting the probability of error overall.', "The addition of error measures to the learning diagram involves testing the closeness of a hypothesis to a target function using a pointwise error measure, with the requirement to test using points drawn from 
the same distribution as the training examples to invoke Hoeffding's guarantees."]}, {'end': 1883.576, 'segs': [{'end': 1509.451, 'src': 'embed', 'start': 1463.373, 'weight': 0, 'content': [{'end': 1469.915, 'text': "Now, in defining an error measure, I'd like to get this case, because there is a great intuition about what is going on.", 'start': 1463.373, 'duration': 6.542}, {'end': 1477.059, 'text': 'So if we can come up with a meaningful error measure here that captures both the false accept and the false reject,', 'start': 1470.276, 'duration': 6.783}, {'end': 1479.72, 'text': 'we will have a handle on what the error measures are all about.', 'start': 1477.059, 'duration': 2.661}, {'end': 1485.181, 'text': "So how do we penalize each type? That's what you do.", 'start': 1482.238, 'duration': 2.943}, {'end': 1488.825, 'text': 'When you give an error, you penalize it, such that the error is large.', 'start': 1485.321, 'duration': 3.504}, {'end': 1492.529, 'text': "So you move away from that hypothesis to get a better hypothesis that doesn't penalize it as much.", 'start': 1488.845, 'duration': 3.684}, {'end': 1497.154, 'text': 'Now, we can put it in a matrix form.', 'start': 1494.19, 'duration': 2.964}, {'end': 1499.869, 'text': 'So this is the target.', 'start': 1498.848, 'duration': 1.021}, {'end': 1501.269, 'text': 'This is the perfect system.', 'start': 1499.909, 'duration': 1.36}, {'end': 1506.391, 'text': "This returns plus 1 whenever it's u, returns minus 1 whenever it's an intruder.", 'start': 1501.809, 'duration': 4.582}, {'end': 1508.251, 'text': "That's our dream system.", 'start': 1507.171, 'duration': 1.08}, {'end': 1509.451, 'text': "We don't have that.", 'start': 1508.771, 'duration': 0.68}], 'summary': 'Defining error measure to capture false accept and reject, penalizing errors and aiming for a perfect system.', 'duration': 46.078, 'max_score': 1463.373, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1463373.jpg'}, {'end': 1648.527, 'src': 'embed', 'start': 1625.559, 'weight': 3, 'content': [{'end': 1633.243, 'text': 'So on the checkout, you identify yourself, and then you put your finger, and then the system will verify you, or decide that you are an intruder.', 'start': 1625.559, 'duration': 7.684}, {'end': 1640.796, 'text': "Now given this application, let's try to see false accepts and false rejects, and how to penalize them.", 'start': 1634.785, 'duration': 6.011}, {'end': 1646.566, 'text': 'The false reject, in this case, actually is costly.', 'start': 1644.266, 'duration': 2.3}, {'end': 1647.827, 'text': 'Think of it this way.', 'start': 1647.127, 'duration': 0.7}, {'end': 1648.527, 'text': 'You are a customer.', 'start': 1647.867, 'duration': 0.66}], 'summary': 'Biometric system checks identity to prevent intruders, but false rejects are costly.', 'duration': 22.968, 'max_score': 1625.559, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1625559.jpg'}, {'end': 1837.288, 'src': 'embed', 'start': 1810.429, 'weight': 4, 'content': [{'end': 1814.112, 'text': 'You have to agree with me that false accept, in this case, is an unmitigated disaster.', 'start': 1810.429, 'duration': 3.683}, {'end': 1820.376, 'text': 'Someone got authority to something that they are not authorized in, and national security is at stake.', 'start': 1815.993, 'duration': 4.383}, {'end': 1821.377, 'text': "That's a no-no.", 'start': 1820.777, 'duration': 0.6}, {'end': 1827.561, 
'text': 'False reject, in this case, can be tolerated.', 'start': 1825.3, 'duration': 2.261}, {'end': 1831.444, 'text': 'Why? You are not a customer.', 'start': 1828.202, 'duration': 3.242}, {'end': 1832.385, 'text': 'You are an employee.', 'start': 1831.584, 'duration': 0.801}, {'end': 1834.926, 'text': "It's you, but the system rejected you.", 'start': 1833.285, 'duration': 1.641}, {'end': 1837.288, 'text': 'Just try again and again and again.', 'start': 1835.787, 'duration': 1.501}], 'summary': 'Unauthorized access poses a grave threat to national security, while false rejection can be tolerated for employees.', 'duration': 26.859, 'max_score': 1810.429, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1810429.jpg'}], 'start': 1384.434, 'title': 'Error measures and functions in ML', 'summary': 'Covers defining error measures in ML for fingerprint verification, using a matrix to represent the system and hypothesis, and choosing error functions with application-specific examples from supermarket and CIA scenarios.', 'chapters': [{'end': 1575.69, 'start': 1384.434, 'title': 'Defining error measures in machine learning', 'summary': 'Discusses how to define an error measure in machine learning, specifically in the context of fingerprint verification, focusing on capturing both false accept and false reject errors and penalizing them to improve the hypothesis, using a matrix to represent the target system and the hypothesis.', 'duration': 191.256, 'highlights': ['The chapter discusses the concept of false accept and false reject errors in the context of fingerprint verification, emphasizing the need to capture both types of errors when defining an error measure in machine learning. This highlights the key concept of capturing both false accept and false reject errors in defining an error measure.', 'The chapter explains the process of penalizing errors to improve the hypothesis in machine learning, aiming to reduce the error and move towards a better hypothesis. This emphasizes the importance of penalizing errors to improve the hypothesis and reduce the error in machine learning.', 'The discussion involves using a matrix to represent the target system and the hypothesis, highlighting four possibilities for errors, including zero error along the diagonal and the need to assign numbers to false accept and false reject errors. 
This underlines the use of a matrix to represent the target system and the hypothesis, as well as the need to assign numbers to false accept and false reject errors.']}, {'end': 1883.576, 'start': 1577.451, 'title': 'Choosing error functions', 'summary': 'Discusses the application-specific decision of choosing error functions with examples from the supermarket and CIA scenarios, emphasizing the importance of penalizing false accepts and rejects differently based on the application domain.', 'duration': 306.125, 'highlights': ['The importance of penalizing false accepts and rejects differently based on the application domain The chapter emphasizes the significant impact of penalizing false accepts and rejects differently in the supermarket and CIA scenarios, highlighting the potential customer loss in the former and the national security risk in the latter.', 'False accept as a significant concern in the CIA scenario In the CIA scenario, false accept is described as an unmitigated disaster due to the potential breach of national security, leading to the recommendation of putting heavier weights on false accepts compared to false rejects.', 'Different penalty weights for false accepts and false rejects The chapter explains the rationale for assigning different penalty weights for false accepts and false rejects in the supermarket and CIA scenarios, based on the varying consequences and priorities of each application domain.']}], 'duration': 499.142, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1384434.jpg', 'highlights': ['The chapter emphasizes the need to capture both false accept and false reject errors when defining an error measure in machine learning.', 'The discussion involves using a matrix to represent the target system and the hypothesis, highlighting four possibilities for errors.', 'The chapter explains the process of penalizing errors to improve the hypothesis in machine learning, aiming to reduce the error and move towards a better hypothesis.', 'The importance of penalizing false accepts and rejects differently based on the application domain, with specific examples from the supermarket and CIA scenarios.', 'In the CIA scenario, false accept is described as an unmitigated disaster due to the potential breach of national security, leading to the recommendation of putting heavier weights on false accepts compared to false rejects.', 'The chapter explains the rationale for assigning different penalty weights for false accepts and false rejects in the supermarket and CIA scenarios, based on the varying consequences and priorities of each application domain.']}, {'end': 2244.233, 'segs': [{'end': 1929.775, 'src': 'embed', 'start': 1904.616, 'weight': 0, 'content': [{'end': 1913.92, 'text': 'You should ask them, how much does it cost you to use my imperfect system in place of the perfect system? 
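
The application-specific weighting being described can be folded directly into the error computation. A minimal sketch, assuming NumPy; the penalty values below (10 for a supermarket false reject, 1000 for a CIA false accept) are illustrative numbers in the spirit of the example, not figures the transcript fixes:

import numpy as np

# cost[(f, h)]: penalty for outputting h when the true value is f,
# with +1 meaning "it's you" and -1 meaning "intruder".
# Correct decisions, the diagonal of the matrix, cost 0.
SUPERMARKET = {(1, 1): 0, (-1, -1): 0, (-1, 1): 1, (1, -1): 10}   # false reject is costly
CIA = {(1, 1): 0, (-1, -1): 0, (-1, 1): 1000, (1, -1): 1}         # false accept is a disaster

def weighted_error(f_vals, h_vals, cost):
    # Average the application-specific penalty over the data set.
    return np.mean([cost[(f, h)] for f, h in zip(f_vals, h_vals)])

The same hypothesis can score very differently under the two matrices, which is exactly why the error measure should come from the application whenever the user can articulate it.
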
That is their decision to make.', 'start': 1904.616, 'duration': 9.304}, {'end': 1918.743, 'text': 'And if they articulate that as a quantitative error function, this is the error function you should work with.', 'start': 1914.261, 'duration': 4.482}, {'end': 1923.188, 'text': 'However, this does not always happen.', 'start': 1919.864, 'duration': 3.324}, {'end': 1929.775, 'text': 'People may not have the formalization that will capture the error measure in reality.', 'start': 1923.868, 'duration': 5.907}], 'summary': 'Quantify the cost of using imperfect system as decision makers may not always formalize the error measure.', 'duration': 25.159, 'max_score': 1904.616, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1904616.jpg'}, {'end': 1979, 'src': 'embed', 'start': 1951.518, 'weight': 1, 'content': [{'end': 1954.34, 'text': 'But you always remember, this is a second choice.', 'start': 1951.518, 'duration': 2.822}, {'end': 1960.146, 'text': 'If we knew what the error measure that needs to be used by the user is, we would use that.', 'start': 1955.141, 'duration': 5.005}, {'end': 1962.728, 'text': 'So here are the two alternatives.', 'start': 1961.407, 'duration': 1.321}, {'end': 1964.869, 'text': "You don't have the user-specified error measure.", 'start': 1962.748, 'duration': 2.121}, {'end': 1973.075, 'text': 'Then you resort to plausible measures, measures that you can argue analytically that they have merit.', 'start': 1965.33, 'duration': 7.745}, {'end': 1979, 'text': 'Usually, the analytic argument starts with an assumption that is usually a loaded assumption.', 'start': 1974.336, 'duration': 4.664}], 'summary': 'In the absence of user-specified error measures, plausible measures are used, often based on loaded assumptions.', 'duration': 27.482, 'max_score': 1951.518, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1951518.jpg'}, {'end': 2198.612, 'src': 'embed', 'start': 2155.318, 'weight': 2, 'content': [{'end': 2161.32, 'text': "Because what does the learning algorithm do when you have an error measure? It minimizes the in-sample error, let's say in this case.", 'start': 2155.318, 'duration': 6.002}, {'end': 2163.921, 'text': 'And the in-sample error depends on your error measure.', 'start': 2161.68, 'duration': 2.241}, {'end': 2167.802, 'text': "If you are minimizing squared error, that's different from minimizing another type of error.", 'start': 2163.981, 'duration': 3.821}, {'end': 2170.103, 'text': 'So the error measure feeds into those two.', 'start': 2168.202, 'duration': 1.901}, {'end': 2176.211, 'text': 'Now we go for the next guy, which is the noisy targets.', 'start': 2172.648, 'duration': 3.563}, {'end': 2179.894, 'text': 'New topic, another addition to the learning diagram.', 'start': 2176.531, 'duration': 3.363}, {'end': 2187.66, 'text': 'So the noisy targets are actually very important, because in reality, these are the only types you are going to encounter in the problems in life.', 'start': 2180.914, 'duration': 6.746}, {'end': 2191.102, 'text': 'Very seldom, you get a very clean target function.', 'start': 2188.12, 'duration': 2.982}, {'end': 2198.612, 'text': 'So the first statement is? The target function is not always a function.', 'start': 2192.123, 'duration': 6.489}], 'summary': 'Learning algorithm minimizes in-sample error, influenced by error measure. 
Noisy targets are important in real-life problems.', 'duration': 43.294, 'max_score': 2155.318, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc2155318.jpg'}], 'start': 1886.322, 'title': 'Error measures in learning', 'summary': 'The chapter emphasizes the significance of user-specified error measures in practical learning problems and discusses the impact of different measures on learning algorithms and noisy targets. It highlights the need to consider the cost of using an imperfect system and the alternatives to this approach, as well as the importance of plausible and friendly measures.', 'chapters': [{'end': 1951.237, 'start': 1886.322, 'title': 'Error measure in practical learning', 'summary': 'The chapter emphasizes the importance of user-specified error measure in practical learning problems, highlighting the need to consider the cost of using an imperfect system and the alternatives to this approach.', 'duration': 64.915, 'highlights': ['The error measure should be specified by the user, considering the cost of using an imperfect system instead of the perfect system.', "It is important to consider the user's decision on the cost of using an imperfect system and articulate it as a quantitative error function.", 'The chapter discusses the compromises and alternatives to the user-specified error measure approach, acknowledging their popularity and some favorable properties.']}, {'end': 2244.233, 'start': 1951.518, 'title': 'Error measures in machine learning', 'summary': 'Discusses the importance of error measures in machine learning, highlighting the two alternatives of plausible measures and friendly measures, and their impact on learning algorithms and noisy targets.', 'duration': 292.715, 'highlights': ['The importance of error measures in machine learning is emphasized, with a focus on the two alternatives of plausible measures and friendly measures. The chapter explains the significance of error measures in machine learning, emphasizing the two alternatives of plausible measures and friendly measures for cases where a user-specified error measure is not available.', 'The impact of error measures on learning algorithms and their role in minimizing in-sample error is discussed, highlighting the difference in minimizing squared error compared to other types of error. It discusses the role of error measures in learning algorithms, specifically their impact on minimizing in-sample error, emphasizing the difference in minimizing squared error compared to other types of error.', 'The significance of noisy targets in machine learning problems is highlighted, emphasizing their prevalence and impact on real-life problem solving. 
The chapter emphasizes the significance of noisy targets in machine learning problems, highlighting their prevalence and impact on real-life problem-solving scenarios.']}], 'duration': 357.911, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc1886322.jpg', 'highlights': ['The error measure should be specified by the user, considering the cost of using an imperfect system instead of the perfect system.', 'The importance of error measures in machine learning is emphasized, with a focus on the two alternatives of plausible measures and friendly measures.', 'The impact of error measures on learning algorithms and their role in minimizing in-sample error is discussed, highlighting the difference in minimizing squared error compared to other types of error.', 'The significance of noisy targets in machine learning problems is highlighted, emphasizing their prevalence and impact on real-life problem solving.']}, {'end': 2812.464, 'segs': [{'end': 2495.416, 'src': 'heatmap', 'start': 2251.736, 'weight': 0.735, 'content': [{'end': 2259.418, 'text': 'So we come to realize that two identical customers, in the sense that their input representation is the same, can have two different behaviors.', 'start': 2251.736, 'duration': 7.682}, {'end': 2267.12, 'text': 'And having this This is one point mapping to two values, so it is not a function.', 'start': 2260.578, 'duration': 6.542}, {'end': 2273.625, 'text': 'What do we do about that? Well, we use a target distribution, as in probability distribution.', 'start': 2268.241, 'duration': 5.384}, {'end': 2282.411, 'text': "So instead of having y equals f of x, you tell me what x is, and I'm going to tell you what the value y is for sure.", 'start': 2273.965, 'duration': 8.446}, {'end': 2291.528, 'text': 'You use a target distribution, and the notation for that is Probability of Y given X.', 'start': 2283.492, 'duration': 8.036}, {'end': 2295.491, 'text': 'So again, it depends on X, but its dependence is probabilistic.', 'start': 2291.528, 'duration': 3.963}, {'end': 2298.694, 'text': "Some Y's are more likely than others in this case.", 'start': 2296.172, 'duration': 2.522}, {'end': 2301.697, 'text': 'Here, one Y was possible, and the rest were impossible.', 'start': 2298.834, 'duration': 2.863}, {'end': 2306.221, 'text': 'So now we make it a little bit more accommodating.', 'start': 2302.357, 'duration': 3.864}, {'end': 2310.444, 'text': 'So now we have a target distribution instead of a target function.', 'start': 2307.602, 'duration': 2.842}, {'end': 2311.866, 'text': "Let's follow it through.", 'start': 2311.125, 'duration': 0.741}, {'end': 2318.983, 'text': 'X used to be generated by the input probability distribution.', 'start': 2314.702, 'duration': 4.281}, {'end': 2321.543, 'text': 'It will still be generated by that distribution.', 'start': 2319.123, 'duration': 2.42}, {'end': 2326.164, 'text': 'This is an artifact that we introduced in order to get the benefit of the Hoeffding type inequalities.', 'start': 2321.583, 'duration': 4.581}, {'end': 2327.145, 'text': 'Nothing has changed.', 'start': 2326.344, 'duration': 0.801}, {'end': 2336.607, 'text': 'But what will change now is that instead of Y being deterministic of X once you generate X, Y is also probabilistic, generated by this fellow.', 'start': 2328.025, 'duration': 8.582}, {'end': 2349.388, 'text': 'So you can think now of X, Y as a pair being generated by the joint distribution, which is P of X times P of Y given X, assuming 
independence.', 'start': 2338.823, 'duration': 10.565}, {'end': 2356.051, 'text': 'So in this case, there is no assumption of independence once you put it this way.', 'start': 2351.769, 'duration': 4.282}, {'end': 2364.414, 'text': 'But the assumption here is that the P of Y you are given is actually conditional on X.', 'start': 2356.551, 'duration': 7.863}, {'end': 2365.475, 'text': 'Now you get noisy targets.', 'start': 2364.414, 'duration': 1.061}, {'end': 2377.29, 'text': 'What is a noisy target in this case? Well, a noisy target can be posed as a deterministic target, like the one we had before.', 'start': 2368.03, 'duration': 9.26}, {'end': 2378.757, 'text': 'plus noise.', 'start': 2378.137, 'duration': 0.62}, {'end': 2382.198, 'text': 'This applies to any numerical target function.', 'start': 2379.377, 'duration': 2.821}, {'end': 2386.82, 'text': 'So if y is a real number or binary or something numerical,', 'start': 2382.438, 'duration': 4.382}, {'end': 2393.982, 'text': 'you can always pose the question of a target distribution as if it was a deterministic target function proper plus noise.', 'start': 2386.82, 'duration': 7.162}, {'end': 2400.385, 'text': 'This is just a convenience to show you that this is not far from what we have already.', 'start': 2395.363, 'duration': 5.022}, {'end': 2412.03, 'text': 'And why is that? Because if you define now a target function, to be the expected value, the conditional expected value of y given x.', 'start': 2401.065, 'duration': 10.965}, {'end': 2412.57, 'text': "That's a function.", 'start': 2412.03, 'duration': 0.54}, {'end': 2424.378, 'text': "Although p of y given x gives you different values, you take the expected value that's a number, and you call this the value of the function f of x.", 'start': 2412.731, 'duration': 11.647}, {'end': 2426.88, 'text': 'Then whatever is left out, you call the noise.', 'start': 2424.378, 'duration': 2.502}, {'end': 2428.801, 'text': "It's a nice trick.", 'start': 2427.6, 'duration': 1.201}, {'end': 2430.042, 'text': "So you've got the bulk of it.", 'start': 2429.221, 'duration': 0.821}, {'end': 2435.427, 'text': 'And then you go here, and you call the rest the noise.', 'start': 2433.084, 'duration': 2.343}, {'end': 2437.95, 'text': 'And that is usually the form it is given.', 'start': 2436.208, 'duration': 1.742}, {'end': 2441.394, 'text': 'So you think that you are really trying to learn the target function still,', 'start': 2438.271, 'duration': 3.123}, {'end': 2447.582, 'text': "but there is this annoying noise and you're trying to make your algorithm pick this pattern and there is nothing it can do about the remaining noise,", 'start': 2441.394, 'duration': 6.188}, {'end': 2450.445, 'text': 'which averages to 0..', 'start': 2447.582, 'duration': 2.863}, {'end': 2457.714, 'text': 'Now, by the same token, there is no loss of generality when we talk about probability distributions.', 'start': 2450.445, 'duration': 7.269}, {'end': 2464.522, 'text': 'If you actually have a proper function, which happens once in a blue moon, you can still pose this as a probability distribution.', 'start': 2458.235, 'duration': 6.287}, {'end': 2478.411, 'text': 'How do you do that? 
You get here p of y given x, and you define it to be identically 0, unless y equals f of x that you have in mind.', 'start': 2464.722, 'duration': 13.689}, {'end': 2486.853, 'text': 'So if we were talking about finite domains, you put all the probability 1 on this value, and you put the probability 0 for all other values.', 'start': 2479.251, 'duration': 7.602}, {'end': 2491.535, 'text': 'If it happens to be continuous, which is almost all the case, you put all the mass on the point.', 'start': 2487.293, 'duration': 4.242}, {'end': 2495.416, 'text': 'You put a delta function there, and you let the other ones be identically 0.', 'start': 2491.615, 'duration': 3.801}], 'summary': 'Target distribution introduces probabilistic dependence, allowing for noisy targets and handling different customer behaviors.', 'duration': 243.68, 'max_score': 2251.736, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc2251736.jpg'}, {'end': 2441.394, 'src': 'embed', 'start': 2368.03, 'weight': 0, 'content': [{'end': 2377.29, 'text': 'What is a noisy target in this case? Well, a noisy target can be posed as a deterministic target, like the one we had before.', 'start': 2368.03, 'duration': 9.26}, {'end': 2378.757, 'text': 'plus noise.', 'start': 2378.137, 'duration': 0.62}, {'end': 2382.198, 'text': 'This applies to any numerical target function.', 'start': 2379.377, 'duration': 2.821}, {'end': 2386.82, 'text': 'So if y is a real number or binary or something numerical,', 'start': 2382.438, 'duration': 4.382}, {'end': 2393.982, 'text': 'you can always pose the question of a target distribution as if it was a deterministic target function proper plus noise.', 'start': 2386.82, 'duration': 7.162}, {'end': 2400.385, 'text': 'This is just a convenience to show you that this is not far from what we have already.', 'start': 2395.363, 'duration': 5.022}, {'end': 2412.03, 'text': 'And why is that? 
Because if you define now a target function, to be the expected value, the conditional expected value of y given x.', 'start': 2401.065, 'duration': 10.965}, {'end': 2412.57, 'text': "That's a function.", 'start': 2412.03, 'duration': 0.54}, {'end': 2424.378, 'text': "Although p of y given x gives you different values, you take the expected value that's a number, and you call this the value of the function f of x.", 'start': 2412.731, 'duration': 11.647}, {'end': 2426.88, 'text': 'Then whatever is left out, you call the noise.', 'start': 2424.378, 'duration': 2.502}, {'end': 2428.801, 'text': "It's a nice trick.", 'start': 2427.6, 'duration': 1.201}, {'end': 2430.042, 'text': "So you've got the bulk of it.", 'start': 2429.221, 'duration': 0.821}, {'end': 2435.427, 'text': 'And then you go here, and you call the rest the noise.', 'start': 2433.084, 'duration': 2.343}, {'end': 2437.95, 'text': 'And that is usually the form it is given.', 'start': 2436.208, 'duration': 1.742}, {'end': 2441.394, 'text': 'So you think that you are really trying to learn the target function still,', 'start': 2438.271, 'duration': 3.123}], 'summary': 'Noisy target is a deterministic function plus noise, useful for numerical target functions.', 'duration': 73.364, 'max_score': 2368.03, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc2368030.jpg'}, {'end': 2629.397, 'src': 'embed', 'start': 2601.838, 'weight': 4, 'content': [{'end': 2605.381, 'text': 'So that is the final diagram for supervised learning.', 'start': 2601.838, 'duration': 3.543}, {'end': 2613.246, 'text': "Now, I'd like to make one final point about noisy targets, which is the distinction between the two probabilities we have.", 'start': 2606.882, 'duration': 6.364}, {'end': 2618.089, 'text': 'We have probability of X, which we artificially introduced to accommodate Hoeffding.', 'start': 2613.846, 'duration': 4.243}, {'end': 2624.353, 'text': 'And then this was introduced in a completely different context, that is, to accommodate the fact that real,', 'start': 2618.53, 'duration': 5.823}, {'end': 2629.397, 'text': 'as is genuine functions that you encounter in practice are not functions, are actually noisy distributions.', 'start': 2624.353, 'duration': 5.044}], 'summary': 'Final diagram for supervised learning with emphasis on noisy targets and distinction between two probabilities.', 'duration': 27.559, 'max_score': 2601.838, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc2601838.jpg'}, {'end': 2786.686, 'src': 'heatmap', 'start': 2699.551, 'weight': 2, 'content': [{'end': 2700.151, 'text': 'We have seen that.', 'start': 2699.551, 'duration': 0.6}, {'end': 2703.532, 'text': 'Now the target distribution.', 'start': 2701.891, 'duration': 1.641}, {'end': 2707.602, 'text': 'is what you are trying to learn.', 'start': 2705.68, 'duration': 1.922}, {'end': 2713.487, 'text': 'You are not trying to learn the input distribution.', 'start': 2711.005, 'duration': 2.482}, {'end': 2717.35, 'text': 'As a matter of fact, when you are done, you will not know what the input distribution is.', 'start': 2713.887, 'duration': 3.463}, {'end': 2726.297, 'text': 'The input distribution is merely playing the role of quantifying the relative importance of the point x.', 'start': 2717.97, 'duration': 8.327}, {'end': 2727.338, 'text': 'Let me give you an example.', 'start': 2726.297, 'duration': 1.041}, {'end': 2729.9, 'text': "Let's say you are approving credit 
again.", 'start': 2728.399, 'duration': 1.501}, {'end': 2736.579, 'text': 'The target distribution is the probability of creditworthiness, given the input.', 'start': 2731.434, 'duration': 5.145}, {'end': 2738.701, 'text': "Let's simplify the input and say it's the salary.", 'start': 2736.659, 'duration': 2.042}, {'end': 2740.122, 'text': 'So I give you the salary.', 'start': 2739.241, 'duration': 0.881}, {'end': 2743.365, 'text': 'You decide what is the risk of this person defaulting.', 'start': 2740.502, 'duration': 2.863}, {'end': 2752.714, 'text': 'And then you decide that the output is plus 1, approve credit with probability 0.9, and disapprove credit with probability 0.1.', 'start': 2743.405, 'duration': 9.309}, {'end': 2753.895, 'text': 'That is the target distribution.', 'start': 2752.714, 'duration': 1.181}, {'end': 2755.897, 'text': 'And that is what you are trying to learn.', 'start': 2754.215, 'duration': 1.682}, {'end': 2758.36, 'text': 'You are going to approximate it to a hard decision, probably.', 'start': 2756.238, 'duration': 2.122}, {'end': 2761.524, 'text': 'Or you can actually learn the probability distribution, as we will see later on.', 'start': 2758.661, 'duration': 2.863}, {'end': 2768.352, 'text': 'The input distribution just tells you the distribution of salaries in the general population.', 'start': 2763.326, 'duration': 5.026}, {'end': 2773.923, 'text': 'How many people make 100, 000, how many people make 10, 000, et cetera.', 'start': 2770.142, 'duration': 3.781}, {'end': 2786.686, 'text': "So, in spite of the fact that the probability distribution over the input matters in the sense that let's say that you encounter a population where the salaries are very high.", 'start': 2775.023, 'duration': 11.663}], 'summary': 'Learning the target distribution in credit approval using input and target distribution examples.', 'duration': 53.163, 'max_score': 2699.551, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc2699551.jpg'}], 'start': 2244.833, 'title': 'Target distribution and learning diagram', 'summary': 'Covers the concept of target distribution in credit evaluation, emphasizing the shift to noisy targets and discusses noise in learning, transforming target functions into probability distributions. it also addresses the significance of target distribution in supervised learning, emphasizing the importance of understanding input and target distributions.', 'chapters': [{'end': 2424.378, 'start': 2244.833, 'title': 'Target distribution in credit evaluation', 'summary': 'Discusses the concept of target distribution in credit evaluation, highlighting the probabilistic nature of credit behavior and the shift from deterministic target functions to noisy targets. 
it emphasizes the transformation from deterministic targets to a function representing the expected value of y given x.', 'duration': 179.545, 'highlights': ['The concept of target distribution in credit evaluation is discussed, emphasizing the probabilistic nature of credit behavior and the variability in customer outcomes despite identical input representation.', 'The shift from deterministic target functions to noisy targets is explained, illustrating the notion of a deterministic target function plus noise as a representation of a numerical target function.', 'The transformation to a function representing the expected value of y given x is highlighted, emphasizing the use of the conditional expected value as the value of the function f of x.']}, {'end': 2600.677, 'start': 2424.378, 'title': 'Learning diagram for supervised learning', 'summary': 'The chapter discusses the concept of noise in learning, transforming target functions into probability distributions, and the final installment of the learning diagram for supervised learning, accommodating noisy targets.', 'duration': 176.299, 'highlights': ['The chapter discusses the concept of noise in learning. The concept of noise in learning is explained, where the remaining noise is considered after identifying the bulk of the data.', 'Transforming target functions into probability distributions. The transformation of target functions into probability distributions is described, showcasing how a proper function can be posed as a probability distribution.', 'The final installment of the learning diagram for supervised learning, accommodating noisy targets. The final installment of the learning diagram for supervised learning is detailed, including the incorporation of noisy targets and the transformation of the unknown target function into an unknown target distribution.']}, {'end': 2812.464, 'start': 2601.838, 'title': 'Supervised learning and noisy targets', 'summary': 'Discusses the distinctions and similarities between the two probabilities in supervised learning and the significance of target distribution, emphasizing the importance of understanding input and target distributions in learning.', 'duration': 210.626, 'highlights': ['The chapter emphasizes the distinction and similarities between the two probabilities in supervised learning, highlighting the importance of understanding input and target distributions in learning. Distinction and similarities between the two probabilities, importance of understanding input and target distributions', 'The target distribution is what is being learned in supervised learning, while the input distribution quantifies the relative importance of the input. Role of target and input distributions in supervised learning', 'An example of credit approval is used to illustrate the target distribution, where the output is based on the probability of creditworthiness given the input (e.g., salary).
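
A small sketch of the credit example above as a data-generating process. The 0.9/0.1 approval probabilities come from the passage; the lognormal salary distribution and the 50,000 threshold are illustrative assumptions, not figures from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Input distribution P(x): how salaries are distributed in the population.
# (Hypothetical lognormal choice; the lecture only says "how many people
# make 100,000, how many people make 10,000, et cetera.")
salary = rng.lognormal(mean=10.5, sigma=0.6, size=10_000)

# Target distribution P(y | x): probability of creditworthiness given salary.
# The 0.9 / 0.1 split mirrors the example; the threshold is made up.
def p_approve(s, threshold=50_000.0):
    return np.where(s > threshold, 0.9, 0.1)

y = np.where(rng.uniform(size=salary.size) < p_approve(salary), +1, -1)

# Both distributions generate the data, but only P(y | x) is the thing
# you are trying to learn; P(x) merely weights the points.
print("fraction of +1 (approve):", (y == +1).mean())
```
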
Example of credit approval to illustrate target distribution']}], 'duration': 567.631, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc2244833.jpg', 'highlights': ['The transformation to a function representing the expected value of y given x is highlighted, emphasizing the use of the conditional expected value as the value of the function f of x.', 'The shift from deterministic target functions to noisy targets is explained, illustrating the notion of a deterministic target function plus noise as a representation of a numerical target function.', 'The concept of target distribution in credit evaluation is discussed, emphasizing the probabilistic nature of credit behavior and the variability in customer outcomes despite identical input representation.', 'The chapter discusses the concept of noise in learning. The concept of noise in learning is explained, where the remaining noise is considered after identifying the bulk of the data.', 'The chapter emphasizes the distinction and similarities between the two probabilities in supervised learning, highlighting the importance of understanding input and target distributions in learning.', 'The target distribution is what is being learned in supervised learning, while the input distribution quantifies the relative importance of the input.']}, {'end': 3357.455, 'segs': [{'end': 2883.049, 'src': 'embed', 'start': 2813.784, 'weight': 0, 'content': [{'end': 2820.687, 'text': 'And if you go and put the mass of probability around the borderline cases, the cases where the decision is difficult,', 'start': 2813.784, 'duration': 6.903}, {'end': 2827.01, 'text': 'the same system that you learned will probably perform worse, just because there are so many points that are borderline.', 'start': 2820.687, 'duration': 6.323}, {'end': 2834.994, 'text': 'So it does give the weight that will finally grade your hypothesis, but you are not trying to learn that distribution.', 'start': 2827.931, 'duration': 7.063}, {'end': 2842.419, 'text': 'And when you put them together analytically, which you are allowed to do, you can merge them as P of x and y.', 'start': 2835.854, 'duration': 6.565}, {'end': 2843.88, 'text': "And that's what you will find in the literature.", 'start': 2842.419, 'duration': 1.461}, {'end': 2847.703, 'text': "It's very nice and pleasant, and you generate the example using that joint distribution.", 'start': 2844.18, 'duration': 3.523}, {'end': 2854.328, 'text': 'However, you just need to remember that this merging mixes two concepts that are inherently different.', 'start': 2848.143, 'duration': 6.185}, {'end': 2863.182, 'text': 'Definitely, P of x and y is not a target distribution for supervised learning.', 'start': 2855.68, 'duration': 7.502}, {'end': 2868.163, 'text': 'The target distribution, the one you are actually trying to learn, is this fellow.', 'start': 2863.902, 'duration': 4.261}, {'end': 2872.304, 'text': 'And the other component is just a catalyst in the process.', 'start': 2869.043, 'duration': 3.261}, {'end': 2883.049, 'text': 'OK? 
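
The point about borderline mass can be made concrete: keep the target and the hypothesis fixed and only move P(x). In this hedged sketch (target, hypothesis, and both input distributions invented for illustration), the same hypothesis grades much worse when the input distribution piles up near the decision boundary.

```python
import numpy as np

rng = np.random.default_rng(2)

f = lambda x: np.sign(x)          # target function (noiseless here, for clarity)
h = lambda x: np.sign(x - 0.1)    # a fixed hypothesis that is slightly off

def e_out(x_sample):
    """Out-of-sample error of h: P[h(x) != f(x)] under the given P(x)."""
    return np.mean(h(x_sample) != f(x_sample))

n = 200_000
x_spread = rng.uniform(-1.0, 1.0, size=n)   # mass spread over the input space
x_border = rng.normal(0.05, 0.05, size=n)   # mass piled on the borderline region

print("E_out, spread-out P(x):      ", round(float(e_out(x_spread)), 4))  # ~0.05
print("E_out, borderline-heavy P(x):", round(float(e_out(x_border)), 4))  # much larger

# Analytically the two can be merged as P(x, y) = P(x) * P(y | x), but only
# P(y | x) is the target; P(x) just graded the same hypothesis differently.
```
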
That covers the error and noise, and we have arrived at the final statement of the learning problem.', 'start': 2873.084, 'duration': 9.965}], 'summary': 'Supervised learning performance worsens with borderline cases, target distribution differs from joint distribution.', 'duration': 69.265, 'max_score': 2813.784, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc2813784.jpg'}, {'end': 3191.991, 'src': 'embed', 'start': 3162.379, 'weight': 3, 'content': [{'end': 3165.38, 'text': 'This was the subject of lecture 3.', 'start': 3162.379, 'duration': 3.001}, {'end': 3170.943, 'text': 'We had data, and we wanted to get the in-sample error to be small, and we looked for techniques to do that.', 'start': 3165.38, 'duration': 5.563}, {'end': 3174.805, 'text': "So now, because this is important, let's put it in a box.", 'start': 3172.464, 'duration': 2.341}, {'end': 3181.788, 'text': 'Learning reduces to two questions.', 'start': 3179.207, 'duration': 2.581}, {'end': 3183.765, 'text': 'First question.', 'start': 3183.004, 'duration': 0.761}, {'end': 3191.991, 'text': 'Can we make sure that the out-of-sample performance is close enough to the in-sample performance?', 'start': 3185.346, 'duration': 6.645}], 'summary': 'Lecture 3 focused on reducing in-sample error and ensuring close out-of-sample performance.', 'duration': 29.612, 'max_score': 3162.379, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc3162379.jpg'}], 'start': 2813.784, 'title': 'Supervised learning and theory of learning', 'summary': 'Discusses the impact of merging p of x and y with the target distribution on supervised learning and emphasizes the importance of understanding in-sample and out-of-sample errors in the theory of learning, focusing on ensuring close out-of-sample performance and reducing in-sample error.', 'chapters': [{'end': 2883.049, 'start': 2813.784, 'title': 'Supervised learning and target distribution', 'summary': 'Discusses how the merging of two concepts, p of x and y and the target distribution, affects the performance of supervised learning, emphasizing that p of x and y is not the target distribution for supervised learning.', 'duration': 69.265, 'highlights': ['The merging of P of x and y and the target distribution affects the performance of supervised learning as P of x and y is not the target distribution for supervised learning.', 'Putting the mass of probability around the borderline cases can result in the system performing worse due to the difficulty in decision-making.', 'The joint distribution P of x and y is often found in the literature and used for generating examples, but it mixes two inherently different concepts.']}, {'end': 3357.455, 'start': 2883.49, 'title': 'Theory of learning: overview and key concepts', 'summary': 'Presents the theory of learning, emphasizing the importance of understanding how in-sample and out-of-sample errors relate to the feasibility of learning, with a focus on two key questions: ensuring close out-of-sample performance to in-sample performance and reducing in-sample error.', 'duration': 473.965, 'highlights': ['The chapter emphasizes the importance of understanding the relationship between in-sample and out-of-sample errors in the feasibility of learning, with a focus on two key questions: ensuring close out-of-sample performance to in-sample performance and reducing in-sample error. 
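
The two questions can be read directly off a simulation: fit on a data set, then estimate E_out on fresh points drawn from the same distributions. A minimal sketch, assuming a noisy linear target and squared error; the particular numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    """x ~ P(x), y ~ P(y | x): the SAME distributions for training and testing."""
    x = rng.uniform(-1.0, 1.0, size=(n, 1))
    y = 1.5 * x[:, 0] - 0.5 + rng.normal(0.0, 0.2, size=n)  # assumed noisy target
    return x, y

def e_sq(w, x, y):
    X = np.hstack([np.ones((len(x), 1)), x])     # add the constant coordinate
    return np.mean((X @ w - y) ** 2)

x_tr, y_tr = make_data(20)
X = np.hstack([np.ones((len(x_tr), 1)), x_tr])
w = np.linalg.pinv(X) @ y_tr                     # question 2: make E_in small

x_te, y_te = make_data(100_000)                  # fresh sample estimates E_out
e_in, e_out = e_sq(w, x_tr, y_tr), e_sq(w, x_te, y_te)
print(f"E_in = {e_in:.4f}   E_out = {e_out:.4f}   gap = {e_out - e_in:+.4f}")
# Question 1 is about this gap staying small; question 2 is about E_in itself.
```
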
This highlights the central theme of the chapter, providing a clear understanding of the key concepts and objectives of the theory of learning.', 'The chapter stresses the practical and theoretical aspects of learning, presenting the conditions for learning as the in-sample error being close to the out-of-sample error and the in-sample error being small. This highlights the specific conditions necessary for learning, emphasizing the practical and theoretical considerations involved in the process.', 'The chapter discusses the challenges of achieving close to zero out-of-sample performance, particularly in financial forecasting, where the out-of-sample error may not be near zero but still indicate successful learning when consistently smaller than a half. This highlights the practical application of the theory, illustrating how achieving close to zero out-of-sample error may not always be feasible, especially in financial forecasting, and emphasizes the importance of understanding theoretical guarantees in such scenarios.']}], 'duration': 543.671, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc2813784.jpg', 'highlights': ['The merging of P of x and y and the target distribution affects the performance of supervised learning as P of x and y is not the target distribution for supervised learning.', 'Putting the mass of probability around the borderline cases can result in the system performing worse due to the difficulty in decision-making.', 'The joint distribution P of x and y is often found in the literature and used for generating examples, but it mixes two inherently different concepts.', 'The chapter emphasizes the importance of understanding the relationship between in-sample and out-of-sample errors in the feasibility of learning, with a focus on two key questions: ensuring close out-of-sample performance to in-sample performance and reducing in-sample error.', 'The chapter stresses the practical and theoretical aspects of learning, presenting the conditions for learning as the in-sample error being close to the out-of-sample error and the in-sample error being small.', 'The chapter discusses the challenges of achieving close to zero out-of-sample performance, particularly in financial forecasting, where the out-of-sample error may not be near zero but still indicate successful learning when consistently smaller than a half.']}, {'end': 3970.765, 'segs': [{'end': 3501.321, 'src': 'embed', 'start': 3455.639, 'weight': 0, 'content': [{'end': 3459.68, 'text': 'Perceptron, the linear regression model, all of these are infinite hypotheses.', 'start': 3455.639, 'duration': 4.041}, {'end': 3464.961, 'text': 'And we are going to try to find a way to deal with infinite hypotheses.', 'start': 3460.52, 'duration': 4.441}, {'end': 3467.657, 'text': 'This is the bulk of the development.', 'start': 3465.936, 'duration': 1.721}, {'end': 3478.146, 'text': 'And we are going to measure the model not by the number of hypotheses, but by a single parameter, which tells us the sophistication of the model.', 'start': 3467.978, 'duration': 10.168}, {'end': 3483.531, 'text': 'And that sophistication will reflect the out-of-sample performance as it relates to the in-sample performance.', 'start': 3478.527, 'duration': 5.004}, {'end': 3487.854, 'text': 'Once we do this, lots of doors open.', 'start': 3484.952, 'duration': 2.902}, {'end': 3492.739, 'text': 'So we are going to characterize a trade-off that we observed on and off as we went through the 
lectures.', 'start': 3488.215, 'duration': 4.524}, {'end': 3501.321, 'text': 'We realized that we would like our model, the hypothesis set, to be elaborate, in order to be able to fit the data.', 'start': 3493.757, 'duration': 7.564}], 'summary': 'Developing a sophisticated model to balance in-sample and out-of-sample performance and characterize a trade-off.', 'duration': 45.682, 'max_score': 3455.639, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc3455639.jpg'}, {'end': 3547.278, 'src': 'embed', 'start': 3524.983, 'weight': 4, 'content': [{'end': 3537.711, 'text': 'The good news from the theory is that this will be pinned down so concretely that we are going to derive techniques from this that will make a lot of difference in the practical learning.', 'start': 3524.983, 'duration': 12.728}, {'end': 3540.653, 'text': 'Regularization is a direct result of this.', 'start': 3537.971, 'duration': 2.682}, {'end': 3547.278, 'text': 'And without regularization, you basically cannot do machine learning other than extremely naively.', 'start': 3541.394, 'duration': 5.884}], 'summary': 'Theory will lead to practical techniques, essential for machine learning.', 'duration': 22.295, 'max_score': 3524.983, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc3524983.jpg'}, {'end': 3652.491, 'src': 'embed', 'start': 3608.038, 'weight': 3, 'content': [{'end': 3617.544, 'text': 'There is a correction to the theory that takes into consideration the difference between the two probability distributions,', 'start': 3608.038, 'duration': 9.506}, {'end': 3619.264, 'text': 'assuming that they are not extreme.', 'start': 3617.544, 'duration': 1.72}, {'end': 3623.587, 'text': "For example, if one probability distribution completely vanishes, then obviously there's a problem,", 'start': 3619.284, 'duration': 4.303}, {'end': 3628.369, 'text': "because the points in that part of the space will never happen, and you shouldn't be hoping to learn at all from that.", 'start': 3623.587, 'duration': 4.782}, {'end': 3635.913, 'text': 'But there are modifications to the theory, where you get a correction term based on the difference between the two probabilities.', 'start': 3628.85, 'duration': 7.063}, {'end': 3640.758, 'text': "The absolute version, I don't know whether this was asked, but let me address it anyway.", 'start': 3637.074, 'duration': 3.684}, {'end': 3645.404, 'text': 'How does P of X affect the learning algorithm?', 'start': 3642, 'duration': 3.404}, {'end': 3652.491, 'text': 'Well, the emphasis that P of X gives on certain parts of the space over others will affect the choice of the learning examples.', 'start': 3645.804, 'duration': 6.687}], 'summary': 'Theory correction accounts for difference in probability distributions, affecting learning algorithm choice.', 'duration': 44.453, 'max_score': 3608.038, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc3608038.jpg'}, {'end': 3782.173, 'src': 'embed', 'start': 3743.358, 'weight': 5, 'content': [{'end': 3749.463, 'text': 'On the other hand, P of X plays a technical role, and a technical role that is fairly negligible.', 'start': 3743.358, 'duration': 6.105}, {'end': 3755.027, 'text': "It's essential to exist for it, but it's not nearly as important as P of Y given X.", 'start': 3749.523, 'duration': 5.504}, {'end': 3770.264, 'text': 'In the case of considering the target function as a probability 
distribution, and then what is better to have?', 'start': 3762.156, 'duration': 8.108}, {'end': 3777.811, 'text': 'M pairs of X and Y or M Ys per X or something like that?', 'start': 3770.284, 'duration': 7.527}, {'end': 3782.173, 'text': "I don't have a theoretical proof for it.", 'start': 3780.732, 'duration': 1.441}], 'summary': 'P of x plays a fairly negligible technical role compared to p of y given x.', 'duration': 38.815, 'max_score': 3743.358, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc3743358.jpg'}, {'end': 3880.565, 'src': 'embed', 'start': 3843.835, 'weight': 7, 'content': [{'end': 3849.703, 'text': "Can you clarify what you mean by poor generalization? It's a common question.", 'start': 3843.835, 'duration': 5.868}, {'end': 3852.247, 'text': 'This will be part of the theory.', 'start': 3850.886, 'duration': 1.361}, {'end': 3856.292, 'text': 'There will be a very specific quantity we measure, which is a discrepancy between e out and e in.', 'start': 3852.287, 'duration': 4.005}, {'end': 3859.075, 'text': "And we're going to call this the generalization error.", 'start': 3856.672, 'duration': 2.403}, {'end': 3864.141, 'text': 'And that will quantify poor generalization or good generalization.', 'start': 3860.277, 'duration': 3.864}, {'end': 3879.304, 'text': 'Going back to slide 11 and 12.', 'start': 3867.325, 'duration': 11.979}, {'end': 3880.565, 'text': 'Ah, the supermarket and the CIA.', 'start': 3879.304, 'duration': 1.261}], 'summary': 'The generalization error quantifies poor/good generalization in measuring discrepancy between e out and e in.', 'duration': 36.73, 'max_score': 3843.835, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc3843835.jpg'}], 'start': 3362.916, 'title': 'Theory of learning and probability distribution impact on machine learning', 'summary': 'Delves into the theory of learning, particularly examining feasibility for infinite hypothesis sets and model sophistication. additionally, it explores the influence of probability distributions on machine learning, highlighting the practical implications of regularization techniques and the role of probability distributions in influencing the learning algorithm.', 'chapters': [{'end': 3523.054, 'start': 3362.916, 'title': 'Theory of learning: feasibility and model complexity', 'summary': 'Discusses the theory of learning, focusing on the characterization of feasibility for infinite hypothesis sets and the measurement of model sophistication, reflecting the trade-off between in-sample and out-of-sample performance.', 'duration': 160.138, 'highlights': ['The theory characterizes the feasibility of learning for infinite hypothesis sets, measuring model sophistication and reflecting the trade-off between in-sample and out-of-sample performance. Characterization of feasibility for infinite hypothesis sets, measurement of model sophistication, trade-off between in-sample and out-of-sample performance', 'The theory aims to find a way to deal with infinite hypotheses, measuring the model by a single parameter reflecting the sophistication of the model. Dealing with infinite hypotheses, measurement of the model by a single parameter reflecting sophistication', 'The theory discusses the trade-off between the complexity of the hypothesis set and the discrepancy between in-sample and out-of-sample performance. 
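
One way to see the trade-off named here is to fit increasingly elaborate models to a fixed small sample and watch E_in fall while the discrepancy between E_in and E_out grows. A hedged sketch with an assumed sinusoidal target and polynomial hypothesis sets (none of which appear in the lecture itself):

```python
import numpy as np

rng = np.random.default_rng(4)

def target(x):
    return np.sin(np.pi * x)                    # assumed target, for illustration

x_tr = rng.uniform(-1, 1, 15)
y_tr = target(x_tr) + rng.normal(0, 0.2, 15)    # small, noisy training set
x_te = rng.uniform(-1, 1, 50_000)
y_te = target(x_te) + rng.normal(0, 0.2, 50_000)

for degree in (1, 3, 9):                        # increasingly elaborate models
    coeffs = np.polyfit(x_tr, y_tr, degree)     # least-squares polynomial fit
    e_in = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    e_out = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree}:  E_in = {e_in:.4f}   E_out = {e_out:.4f}")
# Typically E_in keeps shrinking with degree while the E_in/E_out
# discrepancy grows: the fit-versus-generalization trade-off named above.
```
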
Trade-off between complexity of the hypothesis set and discrepancy between in-sample and out-of-sample performance']}, {'end': 3970.765, 'start': 3524.983, 'title': 'Probability distribution impact on machine learning', 'summary': 'Discusses the impact of probability distributions on machine learning, emphasizing the practical implications of regularization techniques and the foundational significance of understanding p of x and p of y given x. the chapter also addresses the role of probability distributions in influencing the learning algorithm and the concept of generalization error.', 'duration': 445.782, 'highlights': ['The significance of probability distributions in machine learning lies in the derivation of techniques that greatly impact practical learning, particularly through the implementation of regularization, which is essential for effective machine learning. (Relevance: 5)', 'Understanding the impact of P of X on the learning algorithm is crucial, as it influences the emphasis on certain parts of the space over others, affecting the choice of learning examples and the compromise on resource allocation within the hypothesis set. (Relevance: 4)', 'The chapter delves into the importance of P of X and P of Y given X, highlighting the technical role of P of X and its relative significance compared to P of Y given X in the learning problem. (Relevance: 3)', 'The correction to the theory accounts for the difference between two probability distributions, providing a correction term based on the disparity between the two probabilities, thus addressing the impact of varying probability distributions on the learning process. (Relevance: 2)', 'The concept of generalization error, quantifying the discrepancy between e out and e in, is introduced as a means of measuring the quality of generalization in machine learning, setting the stage for further theoretical exploration. 
(Relevance: 1)']}], 'duration': 607.849, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc3362916.jpg', 'highlights': ['Characterization of feasibility for infinite hypothesis sets, measurement of model sophistication, trade-off between in-sample and out-of-sample performance', 'Dealing with infinite hypotheses, measurement of the model by a single parameter reflecting sophistication', 'Trade-off between complexity of the hypothesis set and discrepancy between in-sample and out-of-sample performance', 'Understanding the impact of P of X on the learning algorithm is crucial, as it influences the emphasis on certain parts of the space over others, affecting the choice of learning examples and the compromise on resource allocation within the hypothesis set', 'The significance of probability distributions in machine learning lies in the derivation of techniques that greatly impact practical learning, particularly through the implementation of regularization, which is essential for effective machine learning', 'The chapter delves into the importance of P of X and P of Y given X, highlighting the technical role of P of X and its relative significance compared to P of Y given X in the learning problem', 'The correction to the theory accounts for the difference between two probability distributions, providing a correction term based on the disparity between the two probabilities, thus addressing the impact of varying probability distributions on the learning process', 'The concept of generalization error, quantifying the discrepancy between e out and e in, is introduced as a means of measuring the quality of generalization in machine learning, setting the stage for further theoretical exploration']}, {'end': 4687.855, 'segs': [{'end': 4032.906, 'src': 'embed', 'start': 4007.494, 'weight': 0, 'content': [{'end': 4013.599, 'text': 'As long as you pick the points from the same distribution to train as to test,', 'start': 4007.494, 'duration': 6.105}, {'end': 4017.422, 'text': 'everything that I said and I will say during the theory part will be valid.', 'start': 4013.599, 'duration': 3.823}, {'end': 4023.161, 'text': "If it's a long tail, it's a long tail for training and for testing.", 'start': 4018.499, 'duration': 4.662}, {'end': 4032.906, 'text': "The probability of getting something from, let's say, if it's a heavy tail and I get something that is outlier, I will get a certain error.", 'start': 4024.342, 'duration': 8.564}], 'summary': 'Training and testing from same distribution ensures validity of theory.', 'duration': 25.412, 'max_score': 4007.494, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc4007494.jpg'}, {'end': 4089.24, 'src': 'embed', 'start': 4060.75, 'weight': 1, 'content': [{'end': 4063.331, 'text': 'If you scale both of them up, it makes no difference whatsoever.', 'start': 4060.75, 'duration': 2.581}, {'end': 4066.372, 'text': "Then the error measure is scaled up, and you're minimizing it.", 'start': 4063.411, 'duration': 2.961}, {'end': 4067.973, 'text': "So it's just a constant multiplied by it.", 'start': 4066.412, 'duration': 1.561}, {'end': 4074.175, 'text': 'If they are scaled relative to each other, then obviously the emphasis on the system changes,', 'start': 4068.913, 'duration': 5.262}, {'end': 4078.076, 'text': 'trying to get more false positives and less false negatives, or vice versa.', 'start': 4074.175, 'duration': 3.901}, {'end': 4081.417, 'text': "And 
that's what happens between these two examples.', 'start': 4078.676, 'duration': 2.741}, {'end': 4085.479, 'text': "For the supermarket, here we're trying not to reject customers.", 'start': 4081.437, 'duration': 4.042}, {'end': 4089.24, 'text': 'And in the CIA case, we are trying not to accept people who are intruders.', 'start': 4086.039, 'duration': 3.201}], 'summary': 'Scaling affects error measure, emphasizing false positives or negatives.', 'duration': 28.49, 'max_score': 4060.75, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc4060750.jpg'}, {'end': 4175.729, 'src': 'embed', 'start': 4144.634, 'weight': 3, 'content': [{'end': 4147.654, 'text': 'And we know that the Hoeffding inequality is independent of the value.', 'start': 4144.634, 'duration': 3.02}, {'end': 4151.115, 'text': 'The bound on the right-hand side is independent of the value of mu.', 'start': 4147.895, 'duration': 3.22}, {'end': 4153.796, 'text': 'So any old probability will do.', 'start': 4151.515, 'duration': 2.281}, {'end': 4159.978, 'text': 'Will do what? Will do the legitimization of the learning problem.', 'start': 4154.015, 'duration': 5.963}, {'end': 4162.819, 'text': 'as far as the probabilistic approach is concerned.', 'start': 4160.658, 'duration': 2.161}, {'end': 4171.486, 'text': 'Obviously, we can enter a discussion about the probability of being concentrated, or spread out, or parts of the space being 0.', 'start': 4163.26, 'duration': 8.226}, {'end': 4175.729, 'text': "All of that is good and valid, except that it doesn't affect the basic question,", 'start': 4171.486, 'duration': 4.243}], 'summary': 'Hoeffding inequality is independent of the value, legitimizes learning problem.', 'duration': 31.095, 'max_score': 4144.634, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc4144634.jpg'}, {'end': 4215.417, 'src': 'embed', 'start': 4185.327, 'weight': 2, 'content': [{'end': 4191.25, 'text': 'So some people are asking to exemplify the case of a squared error measure and a closed-form solution.', 'start': 4185.327, 'duration': 5.923}, {'end': 4193.43, 'text': 'So linear regression.', 'start': 4191.45, 'duration': 1.98}, {'end': 4197.412, 'text': 'This actually goes to the review.', 'start': 4193.451, 'duration': 3.961}, {'end': 4202.654, 'text': 'Let me go to the review one, because this is from last lecture.', 'start': 4197.472, 'duration': 5.182}, {'end': 4210.474, 'text': 'There is an algorithm that we derived for linear regression.', 'start': 4206.551, 'duration': 3.923}, {'end': 4215.417, 'text': 'And the algorithm was based on minimizing squared error.', 'start': 4211.375, 'duration': 4.042}], 'summary': 'Linear regression algorithm minimizes squared error', 'duration': 30.09, 'max_score': 4185.327, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc4185327.jpg'}, {'end': 4321.223, 'src': 'embed', 'start': 4291.544, 'weight': 4, 'content': [{'end': 4295.188, 'text': 'Probability of Y given X, and probability of X.', 'start': 4291.544, 'duration': 3.644}, {'end': 4304.156, 'text': 'So if you put them together and you get an imbalanced probability of Y, this means that the building quantities, which is P of X and P of Y, given X,', 'start': 4295.188, 'duration': 8.968}, {'end': 4304.977, 'text': 'are what affected that.', 'start': 4304.156, 'duration': 0.821}, {'end': 4307.88, 'text': 'And those quantities will definitely affect the learning
process.', 'start': 4305.377, 'duration': 2.503}, {'end': 4314.531, 'text': 'If you want to answer it for what happens when y is not balanced, go back and see what gave rise to it,', 'start': 4310.204, 'duration': 4.327}, {'end': 4321.223, 'text': 'and then you will be able to find the answer more directly linked through the quantities that directly affect the learning process.', 'start': 4314.531, 'duration': 6.692}], 'summary': 'Imbalanced probability of y is affected by building quantities p of x and p of y, which directly affect the learning process.', 'duration': 29.679, 'max_score': 4291.544, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc4291544.jpg'}, {'end': 4672.108, 'src': 'embed', 'start': 4642.393, 'weight': 5, 'content': [{'end': 4646.976, 'text': 'But definitely, it does not touch at all on linear models or any other models.', 'start': 4642.393, 'duration': 4.583}, {'end': 4653.781, 'text': "It's a characterization of the target function versus target distribution.", 'start': 4647.616, 'duration': 6.165}, {'end': 4659.185, 'text': "There's a trade-off between complexity and the performance.", 'start': 4654.642, 'duration': 4.543}, {'end': 4664.809, 'text': 'So, is there a way to simultaneously improve the generalization as well as minimize error?', 'start': 4659.205, 'duration': 5.604}, {'end': 4672.108, 'text': "If you sit through the next four lectures very, very attentively, you'll get the answer to that at the end of the four lectures.", 'start': 4666.846, 'duration': 5.262}], 'summary': 'Lecture focuses on trade-off between complexity and performance, aiming to improve generalization and minimize error.', 'duration': 29.715, 'max_score': 4642.393, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc4642393.jpg'}], 'start': 3976.908, 'title': 'Supervised learning and error quantification', 'summary': 'Discusses error quantification in supervised learning, emphasizing the validity of theory with consistent data distribution, and explores the impact of scaling false positives and negatives on error measures. 
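
The scaling point can be shown with a tiny cost-weighted error function. The supermarket/CIA cost values below are illustrative placeholders (the lecture only says the penalties differ by application), but the two properties from the passage hold: a common scale factor changes nothing, while the ratio changes the emphasis.

```python
import numpy as np

rng = np.random.default_rng(5)

def weighted_error(y_true, y_pred, cost_fa, cost_fr):
    """Average error with separate penalties for the two mistake types."""
    fa = (y_pred == +1) & (y_true == -1)   # false accept (intruder let in)
    fr = (y_pred == -1) & (y_true == +1)   # false reject (customer turned away)
    return np.mean(cost_fa * fa + cost_fr * fr)

y_true = rng.choice([-1, +1], size=1_000)
y_pred = np.where(rng.uniform(size=1_000) < 0.9, y_true, -y_true)  # 10% mistakes

# Same predictions, two application-chosen error measures (costs illustrative):
print("supermarket-style (hates false rejects):", weighted_error(y_true, y_pred, 1, 10))
print("CIA-style (hates false accepts):        ", weighted_error(y_true, y_pred, 1000, 1))
# Scaling BOTH costs by one constant just rescales the error measure; only
# their ratio shifts which mistakes the learning algorithm works to avoid.
```
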
it also covers the application of Hoeffding inequality to probability distributions, the importance of error measures in linear regression, the impact of imbalanced probabilities on learning, and the trade-off between complexity and performance.', 'chapters': [{'end': 4081.417, 'start': 3976.908, 'title': 'Supervised learning and error quantification', 'summary': 'Discusses the quantification of errors in supervised learning, emphasizing that as long as training and testing data are from the same distribution, the theory will be valid, and it delves into the impact of scaling false positives and negatives on error measures.', 'duration': 104.509, 'highlights': ['The theory for quantifying errors in supervised learning is valid as long as training and testing data are from the same distribution, with no assumptions about the structure of P of X.', 'The impact of scaling false positives and negatives on error measures is discussed, emphasizing that scaling both up makes no difference, while scaling them relative to each other changes the emphasis on the system.']}, {'end': 4687.855, 'start': 4081.437, 'title': 'Learning probabilistic approaches & error measures', 'summary': 'Covers the application of Hoeffding inequality to probability distributions, the importance of error measures in linear regression, the impact of imbalanced probabilities on learning, and the trade-off between complexity and performance.', 'duration': 606.418, 'highlights': ['The importance of error measures in linear regression The algorithm for linear regression was based on minimizing squared error, resulting in a simple closed-form solution for the final hypothesis.', 'The application of Hoeffding inequality to probability distributions The Hoeffding inequality was discussed in relation to the input space, highlighting the need to put a probability distribution over the input space for invoking the probabilistic aspect in machine learning.', 'The impact of imbalanced probabilities on learning The imbalance in
the probability of y affects the learning process, with the building quantities P of X and P of Y given X playing a crucial role in the estimation of error and affecting the learning process.', 'Trade-off between complexity and performance The chapter hints at the trade-off between complexity and performance, suggesting that the following lectures will provide tools to simultaneously improve generalization and minimize error.']}], 'duration': 710.947, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/L_0efNkdGMc/pics/L_0efNkdGMc3976908.jpg', 'highlights': ['The theory for quantifying errors in supervised learning is valid as long as training and testing data are from the same distribution, with no assumptions about the structure of P of X.', 'The impact of scaling false positives and negatives on error measures is discussed, emphasizing that scaling both up makes no difference, while scaling them relative to each other changes the emphasis on the system.', 'The importance of error measures in linear regression The algorithm for linear regression was based on minimizing squared error, resulting in a simple closed-form solution for the final hypothesis.', 'The application of Hoeffding inequality to probability distributions The Hoeffding inequality was discussed in relation to the input space, highlighting the need to put a probability distribution over the input space for invoking the probabilistic aspect in machine learning.', 'The impact of imbalanced probabilities on learning The imbalance in the probability of y affects the learning process, with the building quantities P of X and P of Y given X playing a crucial role in the estimation of error and affecting the learning process.', 'Trade-off between complexity and performance The chapter hints at the trade-off between complexity and performance, suggesting that the following lectures will provide tools to simultaneously improve generalization and minimize error.']}], 'highlights': ['Linear regression algorithm simplifies computation of optimal weight vector using matrix form.', 'Nonlinear transformation enables achieving linear separability in feature space Z, facilitating simple linear model algorithms.', 'The error measure quantitatively evaluates the approximation of a hypothesis to a target function, with an error of 0 indicating a perfect reflection, guiding the search for a hypothesis with minimal error.', 'The importance of penalizing false accepts and rejects differently based on the application domain, with specific examples from the supermarket and CIA scenarios.', 'The error measure should be specified by the user, considering the cost of using an imperfect system instead of the perfect system.', 'The transformation to a function representing the expected value of y given x is highlighted, emphasizing the use of the conditional expected value as the value of the function f of x.', 'The significance of noisy targets in machine learning problems is highlighted, emphasizing their prevalence and impact on real-life problem solving.', 'The concept of target distribution in credit evaluation is discussed, emphasizing the probabilistic nature of credit behavior and the variability in customer outcomes despite identical input representation.', 'The merging of P of x and y and the target distribution affects the performance of supervised learning as P of x and y is not the target distribution for supervised learning.', 'Understanding the impact of P of X on the learning algorithm is crucial, as it influences the emphasis on certain parts of the space over others, affecting the choice of learning examples and the compromise on resource allocation within the hypothesis set.', 'The concept of generalization error, quantifying the discrepancy between e out and e in, is introduced as a means of measuring the quality of generalization in machine learning, setting the stage for further theoretical exploration.', 'The theory for quantifying errors in supervised learning is valid as long as training and testing data are from the same distribution, with no assumptions about the structure of P of X.']}
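
As a closing check on the Hoeffding discussion above, a short simulation assuming nothing beyond the standard form of the inequality, P[|nu - mu| > eps] <= 2*exp(-2*eps^2*N): the right-hand side contains no mu, so the same bound holds whatever the underlying probability is.

```python
import numpy as np

rng = np.random.default_rng(6)

N, eps, trials = 100, 0.1, 100_000
bound = 2 * np.exp(-2 * eps**2 * N)      # right-hand side: no mu appears in it

for mu in (0.05, 0.5, 0.9):              # "any old probability will do"
    nu = rng.binomial(N, mu, size=trials) / N      # in-sample frequencies
    p_bad = np.mean(np.abs(nu - mu) > eps)         # P[|nu - mu| > eps]
    print(f"mu = {mu}:  P[|nu - mu| > {eps}] = {p_bad:.4f}  <=  bound {bound:.4f}")
```
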