title

Lecture 03 - The Linear Model I

description

The Linear Model I - Linear classification and linear regression. Extending linear models through nonlinear transforms. Lecture 3 of 18 of Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. View course materials in iTunes U Course App - https://itunes.apple.com/us/course/machine-learning/id515364596 and on the course website - http://work.caltech.edu/telecourse.html
Produced in association with Caltech Academic Media Technologies under the Attribution-NonCommercial-NoDerivs Creative Commons License (CC BY-NC-ND). To learn more about this license, see http://creativecommons.org/licenses/by-nc-nd/3.0/
This lecture was recorded on April 10, 2012, in Hameetman Auditorium at Caltech, Pasadena, CA, USA.

detail

{'title': 'Lecture 03 -The Linear Model I', 'heatmap': [{'end': 1913.802, 'start': 1861.629, 'weight': 1}, {'end': 2201.758, 'start': 2147.641, 'weight': 0.783}, {'end': 2296.449, 'start': 2244.104, 'weight': 0.851}, {'end': 4357.324, 'start': 4308.027, 'weight': 0.774}], 'summary': 'Lecture covers topics including learning feasibility, model generalization, real data set challenges, feature extraction, symmetry, separability, pocket algorithm, linear regression, credit line decisions, pseudo-inverse, linear regression for classification, nonlinear features, and linear models in machine learning.', 'chapters': [{'end': 443.698, 'segs': [{'end': 69.675, 'src': 'embed', 'start': 44.259, 'weight': 0, 'content': [{'end': 53.404, 'text': 'And in order to be able to tell what E out of H is, H is the hypothesis that corresponds to that particular bin, we look at the in-sample.', 'start': 44.259, 'duration': 9.145}, {'end': 60.869, 'text': 'And we realize that the in-sample tracks the out-of-sample well through the mathematical relationship, which is the Hoeffding inequality.', 'start': 54.045, 'duration': 6.824}, {'end': 69.675, 'text': 'That tells us that the probability that E in deviates from E out by more than our specified tolerance is a small number.', 'start': 61.449, 'duration': 8.226}], 'summary': 'In-sample tracks out-of-sample well using hoeffding inequality, ensuring small deviation probability.', 'duration': 25.416, 'max_score': 44.259, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ44259.jpg'}, {'end': 127.853, 'src': 'embed', 'start': 98.401, 'weight': 1, 'content': [{'end': 101.102, 'text': 'And we ask ourselves what would apply in this case?', 'start': 98.401, 'duration': 2.701}, {'end': 108.425, 'text': 'We realize that the problem with having multiple hypotheses is that the probability of something bad happens could accumulate.', 'start': 101.722, 'duration': 6.703}, {'end': 117.449, 'text': 
'Because if there is a 0.5% chance that the first hypothesis is bad, in the sense of bad generalization, and 0.5% for the second one,', 'start': 109.125, 'duration': 8.324}, {'end': 125.572, 'text': 'we could be so unlucky as to have this 0.5% accumulate and end up with a significant probability that one of the hypotheses will be bad.', 'start': 117.449, 'duration': 8.123}, {'end': 127.853, 'text': 'And when one of the hypotheses will be bad.', 'start': 126.052, 'duration': 1.801}], 'summary': 'Multiple hypotheses increase probability of bad outcomes.', 'duration': 29.452, 'max_score': 98.401, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ98401.jpg'}, {'end': 173.017, 'src': 'embed', 'start': 141.119, 'weight': 2, 'content': [{'end': 142.76, 'text': 'And the argument was extremely simple.', 'start': 141.119, 'duration': 1.641}, {'end': 146.521, 'text': 'g is our notation for the final hypothesis.', 'start': 143.78, 'duration': 2.741}, {'end': 150.303, 'text': 'It is one of these guys that the algorithm will choose.', 'start': 147.302, 'duration': 3.001}, {'end': 155.406, 'text': "Well, the probability is that E in doesn't track E out.", 'start': 151.224, 'duration': 4.182}, {'end': 167.235, 'text': "will obviously be included in the fact that E in for H1 doesn't track the out-of-sample for that one.", 'start': 157.791, 'duration': 9.444}, {'end': 172.016, 'text': "or E in for H2 doesn't track, or E in of Hm doesn't track.", 'start': 167.235, 'duration': 4.781}, {'end': 173.017, 'text': 'The reason is very simple.', 'start': 172.117, 'duration': 0.9}], 'summary': 'The algorithm chooses a final hypothesis, with E in not tracking E out, leading to simple reasoning.', 'duration': 31.898, 'max_score': 141.119, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ141119.jpg'}, {'end': 272.126, 'src': 'embed', 'start': 237.011, 'weight': 3, 'content': [{'end': 240.893, 
'text': 'So it could be that the deviation here is related to the deviation here.', 'start': 237.011, 'duration': 3.882}, {'end': 242.734, 'text': "But the union bound doesn't care.", 'start': 241.394, 'duration': 1.34}, {'end': 247.537, 'text': 'Regardless of such correlations, you will be able to get a bound on the probability of this event.', 'start': 243.135, 'duration': 4.402}, {'end': 253.301, 'text': "And therefore, you'll be able to bound the probability that you care about, which has to do with the generalization.", 'start': 247.878, 'duration': 5.423}, {'end': 258.029, 'text': 'to the individual Hoeffding applied to each of those.', 'start': 253.781, 'duration': 4.248}, {'end': 262.136, 'text': 'And since you have M of them, you have an added M factor.', 'start': 258.41, 'duration': 3.726}, {'end': 272.126, 'text': 'So the final answer is that the probability of something bad happening after learning is less than or equal to this quantity,', 'start': 263.24, 'duration': 8.886}], 'summary': 'Union bound provides a probability bound despite correlations, with added M factor.', 'duration': 35.115, 'max_score': 237.011, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ237011.jpg'}, {'end': 377.106, 'src': 'embed', 'start': 351.607, 'weight': 4, 'content': [{'end': 357.372, 'text': 'The linear model is one of the most important models in machine learning.', 'start': 351.607, 'duration': 5.765}, {'end': 367.639, 'text': 'And what we are going to do in this lecture, we are going to start with a practical data set that we are going to use over and over in this class.', 'start': 358.112, 'duration': 9.527}, {'end': 374.464, 'text': 'And then, if you remember the perceptron that we introduced in the first lecture, the perceptron is a linear model.', 'start': 368.72, 'duration': 5.744}, {'end': 377.106, 'text': 'So here is the sequence of the lecture.', 'start': 375.344, 'duration': 1.762}], 'summary': 
'Introduction to the linear model in machine learning, with a practical data set and reference to the perceptron.', 'duration': 25.499, 'max_score': 351.607, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ351607.jpg'}], 'start': 2.941, 'title': 'Learning feasibility and model generalization', 'summary': 'Discusses the feasibility of learning in a probabilistic sense, emphasizing the Hoeffding inequality and its impact on in-sample performance deviation. It also covers the application of union bound in probability theory and the importance of linear models including perceptron, linear regression, and nonlinear transformations in machine learning.', 'chapters': [{'end': 173.017, 'start': 2.941, 'title': 'Feasibility of learning and the Hoeffding inequality', 'summary': 'The chapter discusses the feasibility of learning in a probabilistic sense, emphasizing the Hoeffding inequality as a key mathematical relationship that bounds the probability of in-sample performance deviating from out-of-sample performance, with a focus on the impact of sample size and the challenge of multiple hypotheses.', 'duration': 170.076, 'highlights': ['The Hoeffding inequality bounds the probability that in-sample performance deviates from out-of-sample performance by more than the specified tolerance, with the bound being a negative exponential in N, indicating that a larger sample size leads to more reliable tracking of E out by E in.', 'The discussion emphasizes the challenge of handling multiple hypotheses, highlighting the possible accumulation of the probability of bad generalization across multiple hypotheses and the need to accommodate this scenario in learning processes.', 'The chapter also addresses the selection of the final hypothesis g from the set of hypotheses, emphasizing the inclusion of the probabilities of in-sample performance not tracking out-of-sample performance for each individual hypothesis in the overall probability 
of the final hypothesis not tracking E out.']}, {'end': 443.698, 'start': 173.277, 'title': 'Union bound and linear models', 'summary': 'Explains the concept of union bound and its application in probability theory, as well as introduces the importance and generalization of linear models in machine learning, including the perceptron, non-separable data, real-valued functions, linear regression, and nonlinear transformations.', 'duration': 270.421, 'highlights': ['The chapter explains the concept of union bound and its application in probability theory, demonstrating that the probability of an event, or another event, or another event is at most the sum of the probabilities, regardless of the correlation between the events, and how it is useful in cases of independent and non-independent events. The probability of an event, or another event, or another event, is at most the sum of the probabilities, regardless of the correlation between these events, because it takes the worst-case scenario. It is useful in cases of independent and non-independent events, providing a bound on the probability of the event.', 'The chapter discusses the generalization of linear models in machine learning, including the perceptron, non-separable data, real-valued functions, linear regression, and nonlinear transformations, highlighting their importance and practical applications in statistics, economics, and machine learning. 
The chapter introduces the importance and generalization of linear models in machine learning, including the perceptron, non-separable data, real-valued functions, linear regression, and nonlinear transformations, emphasizing their significance in statistics, economics, and machine learning.']}], 'duration': 440.757, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ2941.jpg', 'highlights': ['The Hoeffding inequality determines the probability of in-sample performance deviation from out-of-sample performance, with larger sample size leading to more reliable tracking of E in to E out.', 'The challenge of handling multiple hypotheses is emphasized, highlighting the accumulation of the probability of bad generalization across multiple hypotheses and the need to accommodate this scenario in learning processes.', 'The selection of the final hypothesis g from the set of hypotheses is addressed, emphasizing the inclusion of the probabilities of in-sample performance not tracking out-of-sample performance for each individual hypothesis in the overall probability of the final hypothesis not tracking E out.', 'The chapter explains the concept of union bound and its application in probability theory, demonstrating that the probability of an event, or another event, or another event is at most the sum of the probabilities, regardless of the correlation between the events, and how it is useful in cases of independent and non-independent events.', 'The chapter discusses the generalization of linear models in machine learning, including the perceptron, non-separable data, real-valued functions, linear regression, and nonlinear transformations, highlighting their importance and practical applications in statistics, economics, and machine learning.']}, {'end': 844.807, 'segs': [{'end': 504.433, 'src': 'embed', 'start': 470.914, 'weight': 2, 'content': [{'end': 471.974, 'text': 'So here is the data set.', 'start': 470.914, 
'duration': 1.06}, {'end': 475.155, 'text': 'It comes from zip codes in the postal office.', 'start': 473.014, 'duration': 2.141}, {'end': 479.136, 'text': 'So people write the zip code, and you extract individual characters, individual digits.', 'start': 475.195, 'duration': 3.941}, {'end': 487.918, 'text': 'And you would like to take the image, which happens to be 16 by 16 gray-level pixels, and be able to decipher what is the number in it.', 'start': 479.796, 'duration': 8.122}, {'end': 493.839, 'text': 'Well, that looks easy, except that people write digits in so many different ways.', 'start': 489.378, 'duration': 4.461}, {'end': 496.739, 'text': 'And if you look at it, there will be some cases like this fellow.', 'start': 494.279, 'duration': 2.46}, {'end': 504.433, 'text': 'Is this a 1 or a 7? Is this a 0 or an 8? So you can see that there is a problem.', 'start': 497.3, 'duration': 7.133}], 'summary': 'Deciphering handwritten digits from 16x16 pixel images extracted from zip codes poses a challenge due to variations in writing styles.', 'duration': 33.519, 'max_score': 470.914, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ470914.jpg'}, {'end': 568.364, 'src': 'embed', 'start': 532.096, 'weight': 0, 'content': [{'end': 535.657, 'text': "We are going to try on this, and then we're going to generalize it a little bit.", 'start': 532.096, 'duration': 3.561}, {'end': 540.578, 'text': 'The first item is the question of input representation.', 'start': 537.737, 'duration': 2.841}, {'end': 546.439, 'text': 'What do I mean? 
This is your input, the raw input, if you will.', 'start': 541.038, 'duration': 5.401}, {'end': 549.8, 'text': 'Now, this is 16 pixels.', 'start': 547.979, 'duration': 1.821}, {'end': 557.539, 'text': 'by 16 pixels, so there are 256 real numbers in that input.', 'start': 551.676, 'duration': 5.863}, {'end': 568.364, 'text': 'So if you look at the raw input x, this would be x1, x2, x3, ..., and x256.', 'start': 559.38, 'duration': 8.984}], 'summary': 'Discussing input representation with 16x16 pixels, totaling 256 real numbers.', 'duration': 36.268, 'max_score': 532.096, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ532096.jpg'}, {'end': 760.03, 'src': 'embed', 'start': 724.013, 'weight': 1, 'content': [{'end': 726.034, 'text': 'So you get another guy, which is the symmetry.', 'start': 724.013, 'duration': 2.021}, {'end': 729.035, 'text': 'So now, x1 is the intensity variable.', 'start': 726.054, 'duration': 2.981}, {'end': 731.136, 'text': 'x2 is the symmetry variable.', 'start': 729.375, 'duration': 1.761}, {'end': 735.517, 'text': 'Now, admittedly, you have lost information in that process.', 'start': 732.336, 'duration': 3.181}, {'end': 741.239, 'text': 'But the chances are, you lost as much irrelevant information as relevant information.', 'start': 736.538, 'duration': 4.701}, {'end': 745.641, 'text': 'So this is a pretty good representation of the input, as far as the learning algorithm is concerned.', 'start': 741.579, 'duration': 4.062}, {'end': 747.321, 'text': 'And you went from 257-dimensional to 3-dimensional.', 'start': 746.061, 'duration': 1.26}, {'end': 751.923, 'text': "That's a pretty good situation.", 'start': 750.602, 'duration': 1.321}, {'end': 760.03, 'text': 'And you probably realize that having 257 parameters is bad news for generalization, if you extrapolate from what we said.', 'start': 752.264, 'duration': 7.766}], 'summary': 'Reduced input from 257 to 
3 dimensions, a good representation for learning algorithms.', 'duration': 36.017, 'max_score': 724.013, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ724013.jpg'}, {'end': 801.25, 'src': 'embed', 'start': 771.562, 'weight': 3, 'content': [{'end': 778.084, 'text': "And that's what the perceptron algorithm, for example, needs to use, to determine.", 'start': 771.562, 'duration': 6.522}, {'end': 782.005, 'text': "Now let's look at the illustration of these features.", 'start': 779.305, 'duration': 2.7}, {'end': 785.806, 'text': 'You have these as your inputs.', 'start': 784.046, 'duration': 1.76}, {'end': 788.447, 'text': 'And x1 is the intensity.', 'start': 787.247, 'duration': 1.2}, {'end': 790.328, 'text': 'x2 is the symmetry.', 'start': 789.427, 'duration': 0.901}, {'end': 793.727, 'text': 'What do they look like? They look like this.', 'start': 790.468, 'duration': 3.259}, {'end': 796.508, 'text': 'So this is a scatter diagram.', 'start': 794.608, 'duration': 1.9}, {'end': 799.009, 'text': 'Every point here is a data point.', 'start': 797.149, 'duration': 1.86}, {'end': 801.25, 'text': "It's one of the digits, one of the images you have.", 'start': 799.13, 'duration': 2.12}], 'summary': 'Perceptron algorithm uses intensity and symmetry as inputs to classify data points.', 'duration': 29.688, 'max_score': 771.562, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ771562.jpg'}], 'start': 444.199, 'title': 'Real data set challenges and feature extraction', 'summary': 'Discusses challenges of using a real data set from postal office zip codes to train a machine learning model aiming to match at least the 2.5% error rate of human operators. 
It also covers input representation and feature extraction, focusing on transforming raw input data to higher-level features and reducing dimensionality from 257 to 3, with visualization in a scatter diagram for distinguishing between digits 1 and 5.', 'chapters': [{'end': 616.593, 'start': 444.199, 'title': 'Real data set for machine learning', 'summary': 'The chapter discusses the challenges of using a real data set from zip codes in the postal office to train a machine learning model that can automate the process of deciphering handwritten digits, aiming to at least match the 2.5% error rate of human operators.', 'duration': 172.394, 'highlights': ['The data set comes from zip codes in the postal office, involving deciphering handwritten digits, where human operators have an error rate of about 2.5%.', 'The input for the algorithm consists of 256 real numbers (16x16 pixels) plus an additional constant coordinate, resulting in a 257-dimensional space for the perceptron learning algorithm to work with.', 'The perceptron learning algorithm faces challenges due to the huge number of parameters (257) in the linear model, making it difficult to simultaneously determine the values of all these parameters based on the given set.']}, {'end': 844.807, 'start': 618.694, 'title': 'Input representation and feature extraction', 'summary': 'The chapter discusses the concept of input representation and feature extraction, emphasizing the transformation of raw input data to higher-level features like intensity and symmetry, reducing the dimensionality from 257 to 3, and demonstrating their visualization in a scatter diagram for distinguishing between digits 1 and 5.', 'duration': 226.113, 'highlights': ['The transformation of raw input data to higher-level features like intensity and symmetry is emphasized, reducing the dimensionality from 257 to 3.', 'Visualization of the features in a scatter diagram for distinguishing between digits 1 and 5 is demonstrated, revealing the distinction in intensity and 
the corresponding tilt of the data points.']}], 'duration': 400.608, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ444199.jpg', 'highlights': ['The input for the algorithm consists of 256 real numbers (16x16 pixels) plus an additional constant coordinate, resulting in a 257-dimensional space for the perceptron learning algorithm to work with.', 'The transformation of raw input data to higher-level features like intensity and symmetry is emphasized, reducing the dimensionality from 257 to 3.', 'The data set comes from zip codes in the postal office, involving deciphering handwritten digits, where human operators have an error rate of about 2.5%.', 'Visualization of the features in a scatter diagram for distinguishing between digits 1 and 5 is demonstrated, revealing the distinction in intensity and the corresponding tilt of the data points.']}, {'end': 1224.394, 'segs': [{'end': 876.053, 'src': 'embed', 'start': 845.527, 'weight': 0, 'content': [{'end': 850.308, 'text': 'If you look at the other coordinate, which is symmetry, the one is often more symmetric than the five.', 'start': 845.527, 'duration': 4.781}, {'end': 856.83, 'text': 'Therefore, the guys that happen to be the ones, that are the blue, tend to be higher on the vertical coordinate.', 'start': 850.688, 'duration': 6.142}, {'end': 862.491, 'text': 'And just by these two coordinates, you already see that this is almost linearly separable.', 'start': 857.79, 'duration': 4.701}, {'end': 869.133, 'text': "Not quite, but it's separable enough that if you pass a boundary here, you'll be getting most of them right.", 'start': 862.631, 'duration': 6.502}, {'end': 876.053, 'text': "Now you realize that it's impossible really to ask to get all of them right, because, believe it or not, this fellow is a 5,", 'start': 870.051, 'duration': 6.002}], 'summary': 'Symmetric ones tend to be higher on the vertical coordinate, making it almost linearly separable.', 
'duration': 30.526, 'max_score': 845.527, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ845527.jpg'}, {'end': 988.851, 'src': 'embed', 'start': 961.099, 'weight': 4, 'content': [{'end': 965.921, 'text': 'It can go from something pretty good to something pretty bad in just one iteration.', 'start': 961.099, 'duration': 4.822}, {'end': 970.624, 'text': 'So this is a very typical behavior of the perceptron learning algorithm.', 'start': 966.882, 'duration': 3.742}, {'end': 976.206, 'text': 'Because the data is not linearly separable, the perceptron learning algorithm will never converge.', 'start': 972.044, 'duration': 4.162}, {'end': 980.348, 'text': 'So what do we do? We force it to terminate at iteration 1,000.', 'start': 976.366, 'duration': 3.982}, {'end': 988.851, 'text': 'That is, we stop at 1,000 and take whatever weight vector we have, and we call this the final hypothesis of the perceptron learning algorithm.', 'start': 980.348, 'duration': 8.503}], 'summary': 'Perceptron learning algorithm stops at 1,000 iterations, resulting in final hypothesis.', 'duration': 27.752, 'max_score': 961.099, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ961099.jpg'}, {'end': 1183.261, 'src': 'embed', 'start': 1155.33, 'weight': 1, 'content': [{'end': 1159.972, 'text': "So you're going to continue as if it's really the perceptron learning algorithm.", 'start': 1155.33, 'duration': 4.642}, {'end': 1165.395, 'text': 'But when you are at the end, you keep this guy and report it as the final hypothesis.', 'start': 1160.372, 'duration': 5.023}, {'end': 1167.096, 'text': 'What an ingenious idea.', 'start': 1165.955, 'duration': 1.141}, {'end': 1175.06, 'text': 'Now, the reason the algorithm is called the pocket algorithm is because the whole idea is to put the best solution so far in your pocket.', 'start': 1168.196, 'duration': 6.864}, {'end': 1180.681, 'text': 
'And when you get a better one, you take the better one, put it in your pocket, and throw the old one.', 'start': 1176.54, 'duration': 4.141}, {'end': 1183.261, 'text': 'And when you are done, report the guy in your pocket.', 'start': 1181.061, 'duration': 2.2}], 'summary': 'Pocket algorithm: keep best solution, replace with better, report final hypothesis.', 'duration': 27.931, 'max_score': 1155.33, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1155330.jpg'}], 'start': 845.527, 'title': 'Symmetry and separability', 'summary': 'Discusses the relationship between symmetry and separability in data, showing that based on two coordinates, the data is almost linearly separable, allowing for a margin of error in classification.', 'chapters': [{'end': 885.555, 'start': 845.527, 'title': 'Symmetry and separability', 'summary': 'Discusses the relationship between symmetry and separability in data, showing that based on two coordinates, the data is almost linearly separable, allowing for a margin of error in classification.', 'duration': 40.028, 'highlights': ['The blue ones tend to be higher on the vertical coordinate. The blue data points (ones) are higher on the vertical coordinate, indicating a correlation between the color and position.', 'Data is almost linearly separable, allowing for a margin of error in classification. Based on two coordinates, the data is almost linearly separable, indicating that a boundary can be drawn to classify most of the data correctly, while accepting a margin of error.', "It's impossible really to ask to get all of them right, because, believe it or not, this fellow is a 5, at least meant to be a 5 by the guy who wrote it. 
It is acknowledged that it's impossible to classify all data correctly, as there will be instances like the 5 written to resemble a 1, leading to acceptance of errors in classification."]}, {'end': 1224.394, 'start': 885.615, 'title': 'Perceptron learning algorithm', 'summary': 'Explains the behavior of the perceptron learning algorithm, which iterates through examples to minimize in-sample error, demonstrating that it behaves poorly when data is not linearly separable and introducing the pocket algorithm as a modification to improve performance.', 'duration': 338.779, 'highlights': ["The perceptron learning algorithm behaves poorly when the data is not linearly separable, as it may go from something pretty good to something pretty bad in just one iteration, and will never converge in such cases. The perceptron learning algorithm's behavior deteriorates when dealing with non-linearly separable data, potentially going from good to bad performance in one iteration and failing to converge.", 'The pocket algorithm is introduced as a modification to the perceptron learning algorithm, aiming to keep track of the best solution so far and report it as the final hypothesis, resulting in improved performance. The pocket algorithm is presented as a modification to the perceptron learning algorithm, aiming to maintain the best solution encountered so far, leading to improved performance by reporting the best hypothesis.', "The pocket algorithm outperforms the perceptron learning algorithm by consistently keeping the best solution in its 'pocket', resulting in decreased in-sample error as it iterates through examples. 
The pocket algorithm demonstrates improved performance by consistently retaining the best solution, leading to decreased in-sample error as it iterates through examples."]}], 'duration': 378.867, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ845527.jpg', 'highlights': ['Data is almost linearly separable, allowing for a margin of error in classification. Based on two coordinates, the data is almost linearly separable, indicating that a boundary can be drawn to classify most of the data correctly, while accepting a margin of error.', "The pocket algorithm outperforms the perceptron learning algorithm by consistently keeping the best solution in its 'pocket', resulting in decreased in-sample error as it iterates through examples.", 'The blue data points (ones) are higher on the vertical coordinate, indicating a correlation between the color and position.', 'The pocket algorithm is introduced as a modification to the perceptron learning algorithm, aiming to keep track of the best solution so far and report it as the final hypothesis, resulting in improved performance.', "The perceptron learning algorithm's behavior deteriorates when dealing with non-linearly separable data, potentially going from good to bad performance in one iteration and failing to converge."]}, {'end': 1510.807, 'segs': [{'end': 1298.57, 'src': 'embed', 'start': 1270.014, 'weight': 1, 'content': [{'end': 1274.098, 'text': 'With this very simple algorithm, you can actually deal with general inseparable data.', 'start': 1270.014, 'duration': 4.084}, {'end': 1278.321, 'text': "But inseparable data in the sense that it's basically separable.", 'start': 1274.638, 'duration': 3.683}, {'end': 1284.526, 'text': 'However, it really has some, this guy is bad, and this guy is bad.', 'start': 1279.002, 'duration': 5.524}, {'end': 1285.627, 'text': "There's nothing we can do about them.", 'start': 1284.566, 'duration': 1.061}, {'end': 1287.689, 'text': 
'But there are few, so we will just settle for this.', 'start': 1285.647, 'duration': 2.042}, {'end': 1294.314, 'text': "We'll see that there are other cases of inseparable data that is truly inseparable, in which we have to do something a little bit more drastic.", 'start': 1288.089, 'duration': 6.225}, {'end': 1298.57, 'text': "That's as far as the classification is concerned.", 'start': 1296.686, 'duration': 1.884}], 'summary': 'Algorithm can handle general inseparable data for classification.', 'duration': 28.556, 'max_score': 1270.014, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1270014.jpg'}, {'end': 1347.306, 'src': 'embed', 'start': 1314.412, 'weight': 0, 'content': [{'end': 1316.554, 'text': 'And it comes from earlier work in statistics.', 'start': 1314.412, 'duration': 2.142}, {'end': 1320.438, 'text': 'And there is so much work on it that people could not get rid of that term.', 'start': 1317.095, 'duration': 3.343}, {'end': 1322.12, 'text': 'And it is now the standard term.', 'start': 1320.819, 'duration': 1.301}, {'end': 1324.984, 'text': 'Whenever you have a real-valued function, you call it a regression problem.', 'start': 1322.14, 'duration': 2.844}, {'end': 1327.748, 'text': 'With that out of the way.', 'start': 1326.387, 'duration': 1.361}, {'end': 1332.813, 'text': 'Now, linear regression is used incredibly often in statistics and economics.', 'start': 1328.128, 'duration': 4.685}, {'end': 1339.239, 'text': 'Every time you say, are these variables related to that variable? 
The first thing that comes to mind is linear regression.', 'start': 1333.253, 'duration': 5.986}, {'end': 1340.88, 'text': 'So let me give you an example.', 'start': 1339.699, 'duration': 1.181}, {'end': 1347.306, 'text': "Let's say that you would like to relate your performance in different types of courses to your future earnings.", 'start': 1341.3, 'duration': 6.006}], 'summary': 'Linear regression is a standard term in statistics, used frequently in economics.', 'duration': 32.894, 'max_score': 1314.412, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1314412.jpg'}, {'end': 1407.802, 'src': 'embed', 'start': 1381.169, 'weight': 2, 'content': [{'end': 1384.953, 'text': "I'm going to look 10 years after graduation and see their annual income.", 'start': 1381.169, 'duration': 3.784}, {'end': 1389.859, 'text': 'So the input are the GPAs in the courses at the time they graduated.', 'start': 1386.495, 'duration': 3.364}, {'end': 1394.744, 'text': 'The output is how much money they make per year 10 years away from graduation.', 'start': 1390.62, 'duration': 4.124}, {'end': 1402.819, 'text': 'Now you ask yourself, how do these things affect the output? 
So apply linear regression, as you will see it in detail.', 'start': 1395.734, 'duration': 7.085}, {'end': 1407.802, 'text': 'And you finally find, oh, OK, maybe the math and sciences are more important.', 'start': 1403.299, 'duration': 4.503}], 'summary': 'Using gpas to predict income 10 years after graduation through linear regression, revealing the importance of math and sciences.', 'duration': 26.633, 'max_score': 1381.169, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1381169.jpg'}], 'start': 1226.212, 'title': 'Pocket algorithm and linear regression', 'summary': 'Discusses the application of pocket algorithm for general inseparable data and the frequent use of linear regression in statistics and economics, with a practical example linking academic performance to future earnings.', 'chapters': [{'end': 1510.807, 'start': 1226.212, 'title': 'Pocket algorithm and linear regression', 'summary': 'Covers the application of pocket algorithm to deal with general inseparable data and the frequent use of linear regression in statistics and economics, with a practical example of relating academic performance to future earnings through linear regression.', 'duration': 284.595, 'highlights': ['The pocket algorithm allows dealing with general inseparable data, providing a good hypothesis report for such data.', 'Linear regression is used frequently in statistics and economics, especially in relating variables and predicting real-valued outputs.', 'Linear regression can be applied practically, such as in relating academic performance to future earnings, by using GPAs as input to predict future income.']}], 'duration': 284.595, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1226212.jpg', 'highlights': ['Linear regression is used frequently in statistics and economics, especially in relating variables and predicting real-valued outputs.', 'The pocket algorithm 
allows dealing with general inseparable data, providing a good hypothesis report for such data.', 'Linear regression can be applied practically, such as in relating academic performance to future earnings, by using GPAs as input to predict future income.']}, {'end': 2041.865, 'segs': [{'end': 1628.439, 'src': 'embed', 'start': 1589.176, 'weight': 1, 'content': [{'end': 1595.017, 'text': 'Now, the signal here will play a very important role in all the linear algorithms.', 'start': 1589.176, 'duration': 5.841}, {'end': 1596.978, 'text': 'This is what makes the algorithm linear.', 'start': 1595.037, 'duration': 1.941}, {'end': 1603.74, 'text': 'And whether you leave it alone, as in linear regression, you take a hard threshold, as in classification, or, as we will see later,', 'start': 1597.758, 'duration': 5.982}, {'end': 1608.281, 'text': 'you can take a soft threshold and you get a probability and all of that, all of these are considered linear models.', 'start': 1603.74, 'duration': 4.541}, {'end': 1612.922, 'text': 'And the algorithm depends on this particular part, which is the signal being linear.', 'start': 1608.741, 'duration': 4.181}, {'end': 1616.197, 'text': 'We also took the trouble to put it in vector form.', 'start': 1613.956, 'duration': 2.241}, {'end': 1622.678, 'text': 'And the vector form will simplify the calculus that we do in this lecture, in order to derive the linear regression algorithm.', 'start': 1616.597, 'duration': 6.081}, {'end': 1626.979, 'text': 'But if you hate the vector form, you can always go back to this.', 'start': 1623.018, 'duration': 3.961}, {'end': 1628.439, 'text': 'There is nothing mysterious about this.', 'start': 1627.079, 'duration': 1.36}], 'summary': 'Signal plays crucial role in linear algorithms, including linear regression and classification.', 'duration': 39.263, 'max_score': 1589.176, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1589176.jpg'}, {'end': 
1679.011, 'src': 'embed', 'start': 1652.767, 'weight': 3, 'content': [{'end': 1658.49, 'text': "What is the data set in this case? Well, it's historical data, but it's a different set of historical data.", 'start': 1652.767, 'duration': 5.723}, {'end': 1661.58, 'text': 'The credit line is decided by different officers.', 'start': 1659.338, 'duration': 2.242}, {'end': 1668.004, 'text': 'Someone sits down and evaluates your application, and decides that this person gets 1,000 limit, this person gets 5,000 limit, and whatnot.', 'start': 1661.62, 'duration': 6.384}, {'end': 1673.127, 'text': 'All we are trying to do in this particular example is to replicate what they are doing.', 'start': 1668.904, 'duration': 4.223}, {'end': 1676.169, 'text': "So we don't want the credit officer to do that.", 'start': 1673.588, 'duration': 2.581}, {'end': 1679.011, 'text': 'The credit officers sometimes are inconsistent from one another.', 'start': 1676.55, 'duration': 2.461}], 'summary': "Historical data used to replicate credit officers' decisions, aiming to eliminate inconsistency.", 'duration': 26.244, 'max_score': 1652.767, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1652767.jpg'}, {'end': 1754.657, 'src': 'embed', 'start': 1720.695, 'weight': 0, 'content': [{'end': 1722.776, 'text': 'And that real number will likely be a positive integer.', 'start': 1720.695, 'duration': 2.081}, {'end': 1723.476, 'text': "It's a credit line.", 'start': 1722.796, 'duration': 0.68}, {'end': 1726.637, 'text': "It's a dollar amount.", 'start': 1723.496, 'duration': 3.141}, {'end': 1729.218, 'text': 'And what we are doing is trying to replicate that.', 'start': 1727.017, 'duration': 2.201}, {'end': 1730.699, 'text': "That's the statement of the problem.", 'start': 1729.358, 'duration': 1.341}, {'end': 1736.765, 'text': 'So what does linear regression do?
First, we have to measure the error.', 'start': 1732.139, 'duration': 4.626}, {'end': 1740.708, 'text': "We didn't talk about that in the case of classification, because it was so simple.", 'start': 1737.346, 'duration': 3.362}, {'end': 1744.01, 'text': "Here, it's a little bit less simple.", 'start': 1740.768, 'duration': 3.242}, {'end': 1748.393, 'text': "And then we'll be able to discuss the error function for classification as well.", 'start': 1744.41, 'duration': 3.983}, {'end': 1754.657, 'text': 'What do we mean by that? You will have an algorithm that tries to find the optimal weights.', 'start': 1748.973, 'duration': 5.684}], 'summary': 'The problem statement involves replicating a positive integer credit line using linear regression to find optimal weights.', 'duration': 33.962, 'max_score': 1720.695, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1720695.jpg'}, {'end': 1798.235, 'src': 'embed', 'start': 1768.375, 'weight': 4, 'content': [{'end': 1774.282, 'text': 'We would like to quantify that to give a guidance to the algorithm in order to move from one hypothesis to another.', 'start': 1768.375, 'duration': 5.907}, {'end': 1776.305, 'text': "So we'll define an error measure.", 'start': 1774.663, 'duration': 1.642}, {'end': 1781.752, 'text': 'And the algorithm will try to minimize the error measure by moving from one hypothesis to the next.', 'start': 1777.146, 'duration': 4.606}, {'end': 1790.488, 'text': 'So if you take linear regression, the standard error function used there is the squared error.', 'start': 1784.403, 'duration': 6.085}, {'end': 1791.889, 'text': 'So let me write it down.', 'start': 1790.849, 'duration': 1.04}, {'end': 1798.235, 'text': 'Well, if you had a classification, there is only a simple agreement on a particular example.', 'start': 1792.57, 'duration': 5.665}], 'summary': 'Algorithm minimizes error measure to move between hypotheses, e.g. 
linear regression uses squared error.', 'duration': 29.86, 'max_score': 1768.375, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1768375.jpg'}, {'end': 1917.265, 'src': 'heatmap', 'start': 1858.168, 'weight': 5, 'content': [{'end': 1860.769, 'text': 'When you look at the in-sample error, you use the error measure.', 'start': 1858.168, 'duration': 2.601}, {'end': 1869.112, 'text': 'So on the particular example, small n, small n from 1 to N, for each example, this is the contribution of the error.', 'start': 1861.629, 'duration': 7.483}, {'end': 1873.434, 'text': 'Each of these is affected by the same w, because h depends on w.', 'start': 1869.553, 'duration': 3.881}, {'end': 1877.095, 'text': 'So as you change w, this value will change for every example.', 'start': 1873.434, 'duration': 3.661}, {'end': 1878.916, 'text': 'And this is the error in that example.', 'start': 1877.416, 'duration': 1.5}, {'end': 1883.358, 'text': 'And if you want to get all the in-sample error, you simply take the average of those.', 'start': 1878.996, 'duration': 4.362}, {'end': 1889.217, 'text': 'So that will give me a snapshot of how my hypothesis is doing on the data set.', 'start': 1884.935, 'duration': 4.282}, {'end': 1896.061, 'text': 'And now we are going to ask our algorithm to take this error and minimize it.', 'start': 1890.238, 'duration': 5.823}, {'end': 1901.863, 'text': "So let's actually just look at what happens as an illustration.", 'start': 1897.001, 'duration': 4.862}, {'end': 1904.905, 'text': 'This is the simplest case for linear regression.', 'start': 1902.864, 'duration': 2.041}, {'end': 1906.665, 'text': 'The input is one-dimensional.', 'start': 1905.045, 'duration': 1.62}, {'end': 1908.406, 'text': 'I have only one relevant variable.', 'start': 1906.685, 'duration': 1.721}, {'end': 1911.408, 'text': 'I want to relate your overall GPA.', 'start': 1908.907, 'duration': 2.501}, {'end': 1913.802, 'text': 'to 
your earnings 10 years from now.', 'start': 1912.281, 'duration': 1.521}, {'end': 1917.265, 'text': 'Your overall GPA is x.', 'start': 1914.683, 'duration': 2.582}], 'summary': 'In-sample error measures how hypothesis performs on data set, algorithm minimizes error.', 'duration': 59.097, 'max_score': 1858.168, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1858168.jpg'}], 'start': 1510.847, 'title': 'Linear regression in credit line decisions', 'summary': 'Discusses the form and role of linear regression output, emphasizing its impact on credit approval. it also covers using linear regression to replicate credit line decisions, aiming to minimize errors and automate the process, with a focus on in-sample error analysis.', 'chapters': [{'end': 1652.427, 'start': 1510.847, 'title': 'Linear regression and its algorithm', 'summary': 'Discusses the form and role of linear regression output, emphasizing the importance of the signal in linear algorithms and its impact on credit approval.', 'duration': 141.58, 'highlights': ['The output of linear regression is a real number, representing the dollar amount for credit approval, without a threshold for plus or minus values.', "The signal's linearity is crucial for all linear algorithms, whether for linear regression, classification with a hard threshold, or obtaining probabilities with a soft threshold.", 'The vector form simplifies the calculus for deriving the linear regression algorithm, offering an easier approach than using scalar variables.']}, {'end': 2041.865, 'start': 1652.767, 'title': 'Credit line linear regression', 'summary': 'Discusses using linear regression to replicate the credit line decisions made by officers based on historical data, aiming to minimize errors and automate the process, with a focus on the squared error function and its application to in-sample error analysis.', 'duration': 389.098, 'highlights': ['The historical data consists of 
credit line decisions made by different officers for previous customers, with the goal of replicating their decision-making process using an automated system. The historical data used in this case is comprised of credit line decisions made by different officers for previous customers, with the aim of replicating this decision-making process using an automated system.', 'The chapter emphasizes the use of the squared error function in linear regression to measure and minimize errors when estimating credit lines, providing a simple analytic solution. Emphasizing the use of the squared error function in linear regression, the chapter focuses on measuring and minimizing errors when estimating credit lines, citing its simplicity in providing an analytic solution.', 'The concept of in-sample error is introduced, with a focus on using the error measure to evaluate the performance of the hypothesis on the data set and subsequently minimizing it. The concept of in-sample error is introduced, highlighting the use of the error measure to evaluate the performance of the hypothesis on the data set and the subsequent minimization of errors.']}], 'duration': 531.018, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ1510847.jpg', 'highlights': ['The output of linear regression is a real number, representing the dollar amount for credit approval, without a threshold for plus or minus values.', "The signal's linearity is crucial for all linear algorithms, whether for linear regression, classification with a hard threshold, or obtaining probabilities with a soft threshold.", 'The vector form simplifies the calculus for deriving the linear regression algorithm, offering an easier approach than using scalar variables.', 'The historical data consists of credit line decisions made by different officers for previous customers, with the goal of replicating their decision-making process using an automated system.', 'The chapter 
emphasizes the use of the squared error function in linear regression to measure and minimize errors when estimating credit lines, providing a simple analytic solution.', 'The concept of in-sample error is introduced, with a focus on using the error measure to evaluate the performance of the hypothesis on the data set and subsequently minimizing it.']}, {'end': 2749.976, 'segs': [{'end': 2215.888, 'src': 'heatmap', 'start': 2147.641, 'weight': 0, 'content': [{'end': 2153.346, 'text': 'We reduce them to 3, for example, in the case of the classification of the digits.', 'start': 2147.641, 'duration': 5.705}, {'end': 2157.329, 'text': 'But you usually have many, many examples in the thousands.', 'start': 2153.846, 'duration': 3.483}, {'end': 2159.491, 'text': 'So this would be a very, very long matrix.', 'start': 2157.529, 'duration': 1.962}, {'end': 2168.842, 'text': 'Now, the way you take this, well, the norm squared will be simply this vector transposed times itself.', 'start': 2160.72, 'duration': 8.122}, {'end': 2174.604, 'text': 'And when you do it, you realize that what you are doing is summing up contributions from the different components.', 'start': 2169.523, 'duration': 5.081}, {'end': 2178.325, 'text': 'And each component happens to be exactly what you are having here.', 'start': 2175.064, 'duration': 3.261}, {'end': 2181.766, 'text': 'So this becomes a shorthand for writing this expression.', 'start': 2179.005, 'duration': 2.761}, {'end': 2186.928, 'text': "Now let's look at minimizing E in.", 'start': 2185.247, 'duration': 1.681}, {'end': 2197.896, 'text': 'When you look at minimizing, you realize that the matrix X, which has the inputs of the data, and Y, which has the outputs of the data, are,', 'start': 2189.71, 'duration': 8.186}, {'end': 2199.657, 'text': 'as far as we are concerned, constants.', 'start': 2197.896, 'duration': 1.761}, {'end': 2201.758, 'text': 'This is the data set someone gave me.', 'start': 2200.057, 'duration': 1.701}, {'end':
2207.282, 'text': "The parameter I'm actually playing with in order to get a good hypothesis is w.", 'start': 2202.399, 'duration': 4.883}, {'end': 2211.165, 'text': 'So E in is of w, and w appears here, and the rest are constants.', 'start': 2207.282, 'duration': 3.883}, {'end': 2215.888, 'text': 'If I do any calculus of minimization, it is with respect to w.', 'start': 2211.445, 'duration': 4.443}], 'summary': 'The data matrix X is tall: few features (3 in the digits example) but thousands of examples. E in is written as a norm squared and minimized with respect to w, with X and y held constant.', 'duration': 68.247, 'max_score': 2147.641, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ2147641.jpg'}, {'end': 2296.449, 'src': 'heatmap', 'start': 2244.104, 'weight': 0.851, 'content': [{'end': 2249.826, 'text': 'Get partial E by partial every w, partial w0, partial w1, partial wd.', 'start': 2244.104, 'duration': 5.722}, {'end': 2253.008, 'text': 'Get a formula that is a pretty hairy one, and then try to reduce it.', 'start': 2250.306, 'duration': 2.702}, {'end': 2257.309, 'text': 'And surprise, surprise, you will get the solution here that we have in matrix form in two steps.', 'start': 2253.488, 'duration': 3.821}, {'end': 2263.552, 'text': 'Now, if you look at this, deal with it in terms of calculus as if it was just a simple square.', 'start': 2258.65, 'duration': 4.902}, {'end': 2271.215, 'text': 'If this was a simple square, and w was the variable, what would the derivative be?
You will get 2 sitting outside.', 'start': 2264.192, 'duration': 7.023}, {'end': 2272.036, 'text': 'Well, you got it here.', 'start': 2271.355, 'duration': 0.681}, {'end': 2275.277, 'text': 'And then you will get the same thing in a linear form.', 'start': 2273.196, 'duration': 2.081}, {'end': 2275.877, 'text': 'You got it here.', 'start': 2275.317, 'duration': 0.56}, {'end': 2280.259, 'text': 'And then you will get whatever constant was multiplied by w to sit outside, which you got it here.', 'start': 2276.417, 'duration': 3.842}, {'end': 2283.2, 'text': 'You just got here with a transpose, because this is really not a square.', 'start': 2280.659, 'duration': 2.541}, {'end': 2285.701, 'text': 'This is the transpose of this times itself.', 'start': 2283.44, 'duration': 2.261}, {'end': 2286.962, 'text': "That's where you get the transpose.", 'start': 2285.801, 'duration': 1.161}, {'end': 2290.905, 'text': 'Pretty straightforward, and standard matrix calculus.', 'start': 2287.822, 'duration': 3.083}, {'end': 2292.226, 'text': "So that's what you have.", 'start': 2291.365, 'duration': 0.861}, {'end': 2295.288, 'text': "And then you equate this to 0, but it's a fat 0.", 'start': 2292.566, 'duration': 2.722}, {'end': 2296.449, 'text': "It's a vector of 0's.", 'start': 2295.288, 'duration': 1.161}], 'summary': "Differentiate E in with respect to w in matrix form and set the gradient equal to a vector of 0's.", 'duration': 52.345, 'max_score': 2244.104, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ2244104.jpg'}, {'end': 2356.261, 'src': 'embed', 'start': 2330.35, 'weight': 2, 'content': [{'end': 2337.312, 'text': 'The interesting thing is that, in spite of the fact that X, the matrix X, is a very tall matrix, definitely not square,', 'start': 2330.35, 'duration': 6.962}, {'end': 2345.835, 'text': 'hence not invertible, X transpose X is actually a square matrix, because X transpose is this way and X is this way.', 'start':
2337.312, 'duration': 8.523}, {'end': 2349.257, 'text': 'Multiply them, and you get a pretty small square matrix.', 'start': 2346.036, 'duration': 3.221}, {'end': 2352.618, 'text': 'And as we will see, the chances are overwhelming that it will be invertible.', 'start': 2349.677, 'duration': 2.941}, {'end': 2356.261, 'text': 'So you can actually solve this very simply by inverting this.', 'start': 2353.058, 'duration': 3.203}], 'summary': 'X transpose times X is a small square matrix that is overwhelmingly likely to be invertible.', 'duration': 25.911, 'max_score': 2330.35, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ2330350.jpg'}, {'end': 2672.823, 'src': 'embed', 'start': 2644.965, 'weight': 1, 'content': [{'end': 2649.627, 'text': 'What is the proper form? You construct the matrix X and the vector Y.', 'start': 2644.965, 'duration': 4.662}, {'end': 2651.068, 'text': 'And these are what we introduced before.', 'start': 2649.627, 'duration': 1.441}, {'end': 2654.911, 'text': 'This will be the input data matrix, and this would be the target vector.', 'start': 2651.148, 'duration': 3.763}, {'end': 2658.513, 'text': 'And once you construct them, you are basically done.', 'start': 2656.352, 'duration': 2.161}, {'end': 2665.157, 'text': "Because all you're going to do, you plug this into a formula which is the pseudo-inverse, and then you will return the value w,", 'start': 2658.913, 'duration': 6.244}, {'end': 2667.919, 'text': 'that is the multiplication of that pseudo-inverse with Y.', 'start': 2665.157, 'duration': 2.762}, {'end': 2668.399, 'text': 'And you are done.', 'start': 2667.919, 'duration': 0.48}, {'end': 2672.823, 'text': 'Now, you can call this one-step learning, if you want.', 'start': 2669.72, 'duration': 3.103}], 'summary': 'Construct the matrix X and the vector y, then return w as the pseudo-inverse of X times y: one-step learning.', 'duration': 27.858, 'max_score': 2644.965, 'thumbnail':
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ2644965.jpg'}], 'start': 2042.685, 'title': 'Linear regression and pseudo-inverse', 'summary': 'Covers the derivation of the linear regression algorithm: writing E in in terms of the matrix X and vector y, minimizing it with respect to w, and solving with the pseudo-inverse, along with its computational implications, practical applications, and the likelihood that X^TX is invertible in real-world scenarios.', 'chapters': [{'end': 2311.199, 'start': 2042.685, 'title': 'Linear regression algorithm', 'summary': 'Explains the derivation of the linear regression algorithm by writing E in in terms of the matrix X and vector y, taking the derivative with respect to w, and equating it to 0.', 'duration': 268.514, 'highlights': ['E in is written in terms of the matrix X and vector y, which are constants given by the data; the minimization is with respect to w, done by taking the derivative and equating it to 0.', 'The norm squared of Xw - y is this vector transposed times itself, which sums up the squared-error contributions from the individual examples.', "The derivative in matrix form is obtained by treating the expression like a simple square, and it is then equated to a vector of 0's."]},
{'end': 2749.976, 'start': 2311.939, 'title': 'Pseudo-inverse in linear regression', 'summary': 'Explains the concept of pseudo-inverse in linear regression, highlighting its computational implications and practical applications, emphasizing the simplicity of using pseudo-inverse for solving linear regression problems, and the likelihood of the matrix X^TX being invertible in real-world applications due to the large number of examples.', 'duration': 438.037, 'highlights': ['The pseudo-inverse gives a simple solution to linear regression: inverting the matrix X^TX and multiplying by X^Ty yields an explicit formula for w.', 'The pseudo-inverse of X is denoted X dagger; multiplying X dagger by X gives the identity matrix, so it behaves like an inverse.', 'The matrix X^TX is a small square matrix and is highly likely to be invertible, which keeps the computation simple and practical.', 'In real-world applications, the large number of examples and few parameters make the matrix X^TX overwhelmingly likely to be invertible, ensuring the feasibility of using the pseudo-inverse for linear regression.']}],
'duration': 707.291, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ2042685.jpg', 'highlights': ['E in is minimized with respect to the parameter w by taking the derivative and equating it to 0, with the matrix X and vector y held constant.', 'The pseudo-inverse gives a simple solution to linear regression: inverting the matrix X^TX and multiplying by X^Ty yields an explicit formula for w.', 'The matrix X^TX is a small square matrix and is highly likely to be invertible, which keeps the computation simple and practical.', 'In real-world applications, the large number of examples and few parameters make the matrix X^TX overwhelmingly likely to be invertible, ensuring the feasibility of using the pseudo-inverse for linear regression.']}, {'end': 3203.032, 'segs': [{'end': 2816.463, 'src': 'embed', 'start': 2772.192, 'weight': 0, 'content': [{'end': 2774.754, 'text': 'but you are also going to be able to use it for classification.', 'start': 2772.192, 'duration': 2.562}, {'end': 2778.857, 'text': 'Maybe the perceptron is now going out of business.', 'start': 2776.975, 'duration': 1.882}, {'end': 2779.977, 'text': 'It has a competitor now.', 'start': 2778.897, 'duration': 1.08}, {'end': 2782.359, 'text': 'And
the competitor has a very simple algorithm.', 'start': 2780.538, 'duration': 1.821}, {'end': 2784.02, 'text': "So let's see how this works.", 'start': 2783.019, 'duration': 1.001}, {'end': 2786.521, 'text': 'The idea is incredibly simple.', 'start': 2784.88, 'duration': 1.641}, {'end': 2790.764, 'text': 'Linear regression learns a real-valued function.', 'start': 2788.463, 'duration': 2.301}, {'end': 2791.825, 'text': 'We know that.', 'start': 2791.405, 'duration': 0.42}, {'end': 2796.757, 'text': 'That is the real-valued function.', 'start': 2795.436, 'duration': 1.321}, {'end': 2798.497, 'text': 'The value belongs to the real numbers.', 'start': 2796.797, 'duration': 1.7}, {'end': 2807.8, 'text': 'Fine. Now the main observation, the ingenious observation, is that binary-valued functions, which are the classification functions, are also real-valued.', 'start': 2798.917, 'duration': 8.883}, {'end': 2811.601, 'text': 'Plus 1 and minus 1, among other things, happen to be real numbers.', 'start': 2808.44, 'duration': 3.161}, {'end': 2816.463, 'text': 'So linear regression is not going to refuse to learn them as real numbers.', 'start': 2812.581, 'duration': 3.882}], 'summary': 'Linear regression can be used for classification with real-valued functions.', 'duration': 44.271, 'max_score': 2772.192, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ2772192.jpg'}, {'end': 2922.252, 'src': 'embed', 'start': 2890.267, 'weight': 2, 'content': [{'end': 2891.948, 'text': 'All you need to do is have a classification problem.', 'start': 2890.267, 'duration': 1.681}, {'end': 2893.209, 'text': "Let's run linear regression.", 'start': 2892.148, 'duration': 1.061}, {'end': 2894.09, 'text': "It's almost for free.", 'start': 2893.249, 'duration': 0.841}, {'end': 2897.993, 'text': 'Do this one-step learning, get a solution, and use it for classification.', 'start': 2894.43, 'duration': 3.563}, {'end': 2901.195, 'text': "Now let's 
see if this is as good as it sounds.", 'start': 2899.034, 'duration': 2.161}, {'end': 2906.179, 'text': 'Well, the weights are good for classification, so to speak, just by conjecture.', 'start': 2901.836, 'duration': 4.343}, {'end': 2911.543, 'text': 'But they also may serve as good initial weights for classification.', 'start': 2906.64, 'duration': 4.903}, {'end': 2916.788, 'text': 'Remember that the perceptron algorithm, or the pocket algorithm, are really very slow to get there.', 'start': 2912.444, 'duration': 4.344}, {'end': 2922.252, 'text': 'You start with a random guy, half the guys are misclassified and it just goes around, tries to correct one,', 'start': 2917.008, 'duration': 5.244}], 'summary': 'Using linear regression for classification, with potential benefits as initial weights. perceptron algorithm is slow.', 'duration': 31.985, 'max_score': 2890.267, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ2890267.jpg'}, {'end': 3068.769, 'src': 'embed', 'start': 3043.955, 'weight': 3, 'content': [{'end': 3050.158, 'text': 'And then you take the classification now that forgets about the values and try to adjust it according to the classification and you will get a good boundary.', 'start': 3043.955, 'duration': 6.203}, {'end': 3057.422, 'text': "That's the contrast between applying linear classification, linear regression for classification, and linear classification outright.", 'start': 3050.538, 'duration': 6.884}, {'end': 3061.224, 'text': 'Now we are done.', 'start': 3060.724, 'duration': 0.5}, {'end': 3068.769, 'text': "I'm going to start on nonlinear transformation, and I'm going to give you a very interesting tool to play with.", 'start': 3061.244, 'duration': 7.525}], 'summary': 'Contrast linear regression and classification for better boundaries. 
now moving on to nonlinear transformation.', 'duration': 24.814, 'max_score': 3043.955, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ3043955.jpg'}, {'end': 3165.005, 'src': 'embed', 'start': 3137.994, 'weight': 4, 'content': [{'end': 3146.543, 'text': "Wouldn't it be nice if, in two viewgraphs, you can use linear regression and linear classification, the perceptron or the pocket,", 'start': 3137.994, 'duration': 8.549}, {'end': 3147.404, 'text': 'to apply it to this guy?', 'start': 3146.543, 'duration': 0.861}, {'end': 3149.006, 'text': "That's what will happen.", 'start': 3148.345, 'duration': 0.661}, {'end': 3151.409, 'text': 'I told you this is a practical lecture.', 'start': 3149.967, 'duration': 1.442}, {'end': 3157.679, 'text': 'So we take another example of nonlinearity.', 'start': 3155.016, 'duration': 2.663}, {'end': 3159.38, 'text': 'We take the credit line.', 'start': 3158.499, 'duration': 0.881}, {'end': 3165.005, 'text': 'Now, if you look at the credit line, the credit line is affected by years in residence.', 'start': 3160.821, 'duration': 4.184}], 'summary': 'Lecture covers linear regression, classification, and nonlinearity in credit line analysis.', 'duration': 27.011, 'max_score': 3137.994, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ3137994.jpg'}], 'start': 2750.396, 'title': 'Linear regression for classification', 'summary': 'Introduces the concept of using linear regression for classification, highlighting its effectiveness in representing binary-valued functions as real-valued, and its role as an initial step for classification algorithms like perceptron or pocket algorithm.', 'chapters': [{'end': 2816.463, 'start': 2750.396, 'title': 'Linear regression for classification', 'summary': 'Introduces the concept of using linear regression for classification, offering a simple yet effective algorithm and highlighting that
binary-valued functions can also be represented as real-valued, thus making them learnable by linear regression.', 'duration': 66.067, 'highlights': ['Linear regression can be used for both real-valued function regression problems and classification, which offers a simple yet effective algorithm (relevance score: 5)', 'Binary-valued functions, which are the classification functions, can also be represented as real-valued, allowing linear regression to learn them as real numbers (relevance score: 4)', 'The pseudo-inverse can be defined if the linear regression is not invertible, but it will not be unique and has some elaborate features, which is not commonly encountered in practice (relevance score: 3)']}, {'end': 3203.032, 'start': 2817.503, 'title': 'Linear regression for classification', 'summary': 'Discusses using linear regression for classification, highlighting how it can be used as an initial step for classification algorithms such as perceptron or pocket algorithm, and the contrast between applying linear regression and linear classification outright.', 'duration': 385.529, 'highlights': ['Linear regression as an initial step for classification algorithms Running linear regression as an initial step for classification algorithms such as perceptron or pocket algorithm, to provide a jump start and good initial weights for classification.', 'Challenges of linear regression for classification The challenges with linear regression for classification, where the algorithm attempts to make all values equal to a specific number, leading to incorrect boundary placement and the need for subsequent adjustments for a good boundary.', 'Application of linear regression and classification to non-separable data Exploring the practical application of linear regression and linear classification for non-separable data, emphasizing the need for nonlinear transformation and discussing examples such as credit line determination based on years in residence.']}], 'duration': 
452.636, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ2750396.jpg', 'highlights': ['Linear regression can be used for both real-valued function regression problems and classification, offering a simple yet effective algorithm (relevance score: 5)', 'Binary-valued functions, which are the classification functions, can also be represented as real-valued, allowing linear regression to learn them as real numbers (relevance score: 4)', 'Linear regression as an initial step for classification algorithms such as perceptron or pocket algorithm, providing a jump start and good initial weights for classification (relevance score: 3)', 'Challenges with linear regression for classification, where the algorithm attempts to make all values equal to a specific number, leading to incorrect boundary placement and the need for subsequent adjustments for a good boundary (relevance score: 2)', 'Application of linear regression and classification to non-separable data, emphasizing the need for nonlinear transformation and discussing examples such as credit line determination based on years in residence (relevance score: 1)']}, {'end': 3874.716, 'segs': [{'end': 3228.778, 'src': 'embed', 'start': 3204.172, 'weight': 0, 'content': [{'end': 3214.035, 'text': 'So it would be very nice if I can, instead of using the linear one, define nonlinear features, which is the following.', 'start': 3204.172, 'duration': 9.863}, {'end': 3220.476, 'text': "Let's take the logical condition that the years in residence are less than 1.", 'start': 3214.795, 'duration': 5.681}, {'end': 3224.037, 'text': "And in my mind, I'm considering that this is not very stable.", 'start': 3220.476, 'duration': 3.561}, {'end': 3225.498, 'text': "You haven't been there for very long.", 'start': 3224.257, 'duration': 1.241}, {'end': 3228.778, 'text': 'And another guy, which is xi greater than 5.', 'start': 3226.238, 'duration': 2.54}], 'summary': 'Desire to use 
nonlinear features, such as years in residence and xi greater than 5, for better stability.', 'duration': 24.606, 'max_score': 3204.172, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ3204172.jpg'}, {'end': 3395.217, 'src': 'embed', 'start': 3346.363, 'weight': 1, 'content': [{'end': 3354.126, 'text': "The fact that it's linear in the parameters is what matters in deriving the perceptron learning algorithm and the linear regression algorithm.", 'start': 3346.363, 'duration': 7.763}, {'end': 3357.867, 'text': "If you go back to the derivation, it didn't matter what the x's were.", 'start': 3354.286, 'duration': 3.581}, {'end': 3363.169, 'text': "The x's were sitting there as constants, and their linearity in w is what enabled the derivation.", 'start': 3357.927, 'duration': 5.242}, {'end': 3369.771, 'text': 'So that results in the algorithm work because of linearity in the weights.', 'start': 3364.449, 'duration': 5.322}, {'end': 3373.618, 'text': 'Now, that opens a fantastic possibility.', 'start': 3371.496, 'duration': 2.122}, {'end': 3377.32, 'text': 'Because now I can take the inputs which are just constants.', 'start': 3374.078, 'duration': 3.242}, {'end': 3383.785, 'text': 'Someone gives me data, and I can do incredible nonlinear transformations to that data.', 'start': 3377.621, 'duration': 6.164}, {'end': 3388.028, 'text': 'And it will just remain more elaborate data, but constant.', 'start': 3384.126, 'duration': 3.902}, {'end': 3395.217, 'text': "When I get to learn using the nonlinearly transformed data, I'm still in the realm of linear models,", 'start': 3388.849, 'duration': 6.368}], 'summary': 'Linearity in the parameters enables perceptron and linear regression algorithms, allowing for nonlinear transformations while remaining in the realm of linear models.', 'duration': 48.854, 'max_score': 3346.363, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ3346363.jpg'}], 'start': 3204.172, 'title': 'Nonlinear features and transformation in regression analysis', 'summary': 'Delves into defining nonlinear features in regression using logical conditions, introducing nonlinearity to the model, and discussing the importance of linearity in parameters and the potential for using nonlinear transformations while staying within the realm of linear models.', 'chapters': [{'end': 3248.846, 'start': 3204.172, 'title': 'Nonlinear features in regression', 'summary': 'Discusses the concept of defining nonlinear features in regression by using logical conditions to represent stability, with the example of years in residence less than 1 and greater than 5, returning 1 for true and 0 for false, thus introducing nonlinearity to the regression model.', 'duration': 44.674, 'highlights': ['The concept of defining nonlinear features in regression by using logical conditions to represent stability, with the example of years in residence less than 1 and greater than 5, returning 1 for true and 0 for false, thus introducing nonlinearity to the regression model.', 'Illustration of using logical conditions to represent stability in regression, such as years in residence less than 1 and greater than 5, which returns 1 for true and 0 for false, allowing for nonlinear representation in the model.']}, {'end': 3874.716, 'start': 3248.846, 'title': 'Nonlinear transformation in linear models', 'summary': 'Discusses the concept of nonlinear transformation in linear models, emphasizing the importance of linearity in parameters and the potential for using nonlinear transformations while staying within the realm of linear models, and ends with a mention of the upcoming discussion on guidelines for choosing nonlinear transformations and their sensitivity to the issue of generalization.', 'duration': 625.87, 'highlights': ['The importance of linearity in parameters 
is crucial in deriving the perceptron learning algorithm and the linear regression algorithm. The linearity in parameters enables the derivation of the perceptron learning algorithm and the linear regression algorithm, emphasizing the significance of linearity in the learning process.', 'The potential to apply nonlinear transformations to data while remaining within the realm of linear models, by ensuring that the weights given to the nonlinear features have a linear dependency. The ability to perform nonlinear transformations on data while still working within the scope of linear models is highlighted, emphasizing the linear dependency of weights given to nonlinear features.', 'Discussion about upcoming guidelines for choosing nonlinear transformations and their sensitivity to the issue of generalization. The upcoming discussion on guidelines for choosing nonlinear transformations and their sensitivity to the issue of generalization is mentioned, indicating the importance of understanding the implications of nonlinear transformations in linear models.']}], 'duration': 670.544, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ3204172.jpg', 'highlights': ['Illustration of using logical conditions to represent stability in regression, such as years in residence less than 1 and greater than 5, which returns 1 for true and 0 for false, allowing for nonlinear representation in the model.', 'The potential to apply nonlinear transformations to data while remaining within the realm of linear models, by ensuring that the weights given to the nonlinear features have a linear dependency.', 'The concept of defining nonlinear features in regression by using logical conditions to represent stability, with the example of years in residence less than 1 and greater than 5, returning 1 for true and 0 for false, thus introducing nonlinearity to the regression model.', 'The importance of linearity in parameters is crucial in 
deriving the perceptron learning algorithm and the linear regression algorithm.']}, {'end': 4765.615, 'segs': [{'end': 4008.975, 'src': 'embed', 'start': 3955.496, 'weight': 3, 'content': [{'end': 3960.12, 'text': 'And in general, there is an offset, depending on the values of these variables.', 'start': 3955.496, 'duration': 4.624}, {'end': 3962.481, 'text': 'And the offset is compensated for by the threshold.', 'start': 3960.36, 'duration': 2.121}, {'end': 3965.704, 'text': "So that's why we need the threshold for linear regression.", 'start': 3962.541, 'duration': 3.163}, {'end': 3967.265, 'text': 'What is the second question?', 'start': 3966.404, 'duration': 0.861}, {'end': 3975.383, 'text': 'So, in the binary case, when you use y as plus 1 or minus 1, why does that just work?', 'start': 3967.856, 'duration': 7.527}, {'end': 3982.489, 'text': 'Well, if you apply linear regression, you have the following guarantee at the end', 'start': 3976.544, 'duration': 5.945}, {'end': 3990.116, 'text': 'The hypothesis you have has the least squared error from the targets on the examples.', 'start': 3983.41, 'duration': 6.706}, {'end': 3993.719, 'text': "That's what has been achieved by the linear regression algorithm.", 'start': 3990.336, 'duration': 3.383}, {'end': 3999.325, 'text': 'Now, the outputs of the examples being plus or minus 1, we can put that together with the first statement.', 'start': 3994.52, 'duration': 4.805}, {'end': 4008.975, 'text': 'And then we realize that the output of my hypothesis is closest to the value plus 1 or minus 1, with a mean square error.', 'start': 3999.445, 'duration': 9.53}], 'summary': 'Linear regression ensures least squared error for binary outputs like plus or minus 1.', 'duration': 53.479, 'max_score': 3955.496, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ3955496.jpg'}, {'end': 4143.421, 'src': 'embed', 'start': 4115.823, 'weight': 5, 'content': [{'end': 4126.131, 
'text': 'The best approach is to look at the raw input and look at the problem statement and then try to infer what would be a meaningful feature for this problem.', 'start': 4115.823, 'duration': 10.308}, {'end': 4130.594, 'text': 'For example, the case where I talked about the years in residence.', 'start': 4126.171, 'duration': 4.423}, {'end': 4138.339, 'text': 'It does make sense to derive some features that are closer to the linear dependency.', 'start': 4131.234, 'duration': 7.105}, {'end': 4143.421, 'text': 'There is no general algorithm for getting features.', 'start': 4138.999, 'duration': 4.422}], 'summary': 'Derive meaningful features from input for linear dependency.', 'duration': 27.598, 'max_score': 4115.823, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ4115823.jpg'}, {'end': 4370.409, 'src': 'heatmap', 'start': 4308.027, 'weight': 0, 'content': [{'end': 4317.579, 'text': 'So another question is, are there methods that use different hyperplanes and intersections of them to separate data? 
Correct.', 'start': 4308.027, 'duration': 9.552}, {'end': 4325.829, 'text': 'The linear model that we have described is the building block of so many models in machine learning.', 'start': 4318.079, 'duration': 7.75}, {'end': 4333.773, 'text': 'You will find that if you take a linear model with a soft threshold not the hard threshold version and you put a bunch of them together,', 'start': 4326.289, 'duration': 7.484}, {'end': 4335.254, 'text': 'you will get a neural network.', 'start': 4333.773, 'duration': 1.481}, {'end': 4344.54, 'text': 'If you take the linear model, and you try to pick the separating boundary in a principled way, you get support vector machines.', 'start': 4335.815, 'duration': 8.725}, {'end': 4352.803, 'text': 'If you take the nonlinear transformation, and you try to find a computationally efficient way of doing it, you get kernel methods.', 'start': 4345.2, 'duration': 7.603}, {'end': 4357.324, 'text': 'So there are lots of methods within machine learning that build on the linear model.', 'start': 4353.083, 'duration': 4.241}, {'end': 4360.285, 'text': 'The linear model is somewhat underutilized.', 'start': 4357.845, 'duration': 2.44}, {'end': 4364.987, 'text': "It's not glorious, but it does the job.", 'start': 4360.746, 'duration': 4.241}, {'end': 4370.409, 'text': 'The interesting thing is that if you have a problem, there is a very good chance that if you take a simple linear model,', 'start': 4365.107, 'duration': 5.302}], 'summary': 'Linear model is foundational in machine learning, leading to neural networks, support vector machines, and kernel methods.', 'duration': 62.382, 'max_score': 4308.027, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ4308027.jpg'}, {'end': 4438.907, 'src': 'embed', 'start': 4419.638, 'weight': 1, 'content': [{'end': 4434.325, 'text': 'I rely on the theory that guarantees that the in-sample error tracks the out-of-sample error in order to go all out 
for the in-sample error and hope that the out-of-sample error follows which we have seen in the graph when we were looking at the evolution of the perceptron.', 'start': 4419.638, 'duration': 14.687}, {'end': 4436.486, 'text': 'And the in-sample error was going down and up.', 'start': 4434.665, 'duration': 1.821}, {'end': 4438.907, 'text': 'And the out-of-sample error was also going down and up.', 'start': 4436.826, 'duration': 2.081}], 'summary': "Relying on theory to minimize in-sample error, perceptron's in and out-of-sample errors fluctuated.", 'duration': 19.269, 'max_score': 4419.638, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ4419638.jpg'}], 'start': 3876.357, 'title': 'Linear models in machine learning', 'summary': 'Covers the necessity of constant term in linear regression, use of plus 1s and minus 1s in binary classification, deriving meaningful features, importance and applications of linear models in ml, role as building blocks for various methods, relationship with maximum likelihood estimation, assessment of in-sample and out-of-sample errors, and use of nonlinear transformations.', 'chapters': [{'end': 4240.659, 'start': 3876.357, 'title': 'Linear regression and binary classification', 'summary': 'Covers the necessity of the constant term in linear regression, the use of plus 1s and minus 1s in binary classification, and the approach to deriving meaningful features for problems.', 'duration': 364.302, 'highlights': ['The necessity of the constant term in linear regression is explained, emphasizing that it allows the model to capture an offset and compensate for the values of variables, ultimately contributing to a proper model. 
The constant term in linear regression is necessary to capture an offset and compensate for the values of variables, contributing to a proper model.', 'The application of linear regression in the binary case is discussed, highlighting that the hypothesis has the least squared error from the targets, and the leap of faith that the output of the hypothesis is closest to the value of plus 1 or minus 1, supporting the classification process. In the binary case, the hypothesis from linear regression has the least squared error from the targets and is closest to the value of plus 1 or minus 1, supporting the classification process.', 'The approach to deriving meaningful features for problems is outlined, indicating that it involves looking at the raw input and inferring features based on the problem statement, with the caveat that deriving too many features may lead to generalization issues. Deriving meaningful features involves looking at the raw input and inferring features based on the problem statement, with the caveat that deriving too many features may lead to generalization issues.']}, {'end': 4765.615, 'start': 4241.099, 'title': 'Linear models in machine learning', 'summary': 'Covers the importance and applications of linear models in machine learning, highlighting their role as building blocks for various methods such as neural networks, support vector machines, and kernel methods, and the relationship between linear regression and maximum likelihood estimation, while also addressing the assessment of in-sample and out-of-sample errors and the use of nonlinear transformations.', 'duration': 524.516, 'highlights': ['Linear models as building blocks for various methods Linear models serve as the foundation for numerous models in machine learning, including neural networks, support vector machines, and kernel methods, showcasing their versatility and crucial role in the field.', "Assessment of in-sample and out-of-sample errors The algorithm evaluates in-sample 
error to pick the best hypothesis, while relying on theory to track and minimize the out-of-sample error, as evidenced by the perceptron's in-sample and out-of-sample errors tracking each other.", 'Role of nonlinear transformations Nonlinear transformations play a significant role in machine learning, allowing the transformation of data to tackle non-separable and real-valued situations, providing a toolbox for diverse applications and datasets.']}], 'duration': 889.258, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FIbVs5GbBlQ/pics/FIbVs5GbBlQ3876357.jpg', 'highlights': ['Linear models serve as the foundation for numerous models in machine learning, including neural networks, support vector machines, and kernel methods, showcasing their versatility and crucial role in the field.', "The algorithm evaluates in-sample error to pick the best hypothesis, while relying on theory to track and minimize the out-of-sample error, as evidenced by the perceptron's in-sample and out-of-sample errors tracking each other.", 'Nonlinear transformations play a significant role in machine learning, allowing the transformation of data to tackle non-separable and real-valued situations, providing a toolbox for diverse applications and datasets.', 'The application of linear regression in the binary case is discussed, highlighting that the hypothesis from linear regression has the least squared error from the targets and is closest to the value of plus 1 or minus 1, supporting the classification process.', 'The necessity of the constant term in linear regression is explained, emphasizing that it allows the model to capture an offset and compensate for the values of variables, ultimately contributing to a proper model.', 'Deriving meaningful features involves looking at the raw input and inferring features based on the problem statement, with the caveat that deriving too many features may lead to generalization issues.']}], 'highlights': ['The pseudo-inverse 
provides a simple solution for solving linear regression problems, as it can be obtained by inverting the matrix X^TX and multiplying it by X^Ty, yielding an explicit formula for w.', 'Linear regression is used frequently in statistics and economics, especially in relating variables and predicting real-valued outputs.', 'The output of linear regression is a real number, representing the dollar amount for credit approval, without a threshold for plus or minus values.', 'The concept of in-sample error is introduced, with a focus on using the error measure to evaluate the performance of the hypothesis on the data set and subsequently minimizing it.', "The algorithm evaluates in-sample error to pick the best hypothesis, while relying on theory to track and minimize the out-of-sample error, as evidenced by the perceptron's in-sample and out-of-sample errors tracking each other.", 'Linear models serve as the foundation for numerous models in machine learning, including neural networks, support vector machines, and kernel methods, showcasing their versatility and crucial role in the field.', 'Nonlinear transformations play a significant role in machine learning, allowing the transformation of data to tackle non-separable and real-valued situations, providing a toolbox for diverse applications and datasets.', 'The selection of the final hypothesis g from the set of hypotheses is addressed, emphasizing the inclusion of the probabilities of in-sample performance not tracking out-of-sample performance for each individual hypothesis in the overall probability of the final hypothesis not tracking E out.', 'The challenge of handling multiple hypotheses is emphasized, highlighting the accumulation of the probability of bad generalization across multiple hypotheses and the need to accommodate this scenario in learning processes.', 'The chapter explains the concept of union bound and its application in probability theory, demonstrating that the probability of an event, or another 
event, or another event is at most the sum of the probabilities, regardless of the correlation between the events, and how it is useful in cases of independent and non-independent events.']}
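The highlights above describe two ideas that fit together: nonlinear features built from logical conditions on years in residence (less than 1, greater than 5), and the one-step linear regression solution w = (X^TX)^-1 X^Ty. The following is a minimal sketch of that combination, not the lecture's own code. The function names (transform, solve, linear_regression, predict) and the data values are made up for illustration, and a small Gauss-Jordan solver stands in for a library matrix inverse, assuming X^TX is invertible.

```python
def transform(years):
    """Nonlinear transform: years in residence -> (1, [[years < 1]], [[years > 5]])."""
    return [1.0, float(years < 1), float(years > 5)]


def solve(a, b):
    """Solve the square system a w = b by Gauss-Jordan elimination (assumes a is invertible)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def linear_regression(zs, ys):
    """Return the weights w that solve the normal equations (Z^T Z) w = Z^T y."""
    d = len(zs[0])
    ztz = [[sum(z[i] * z[j] for z in zs) for j in range(d)] for i in range(d)]
    zty = [sum(z[i] * y for z, y in zip(zs, ys)) for i in range(d)]
    return solve(ztz, zty)


# Hypothetical, made-up data: years in residence and an assigned credit line.
years = [0.5, 0.8, 2.0, 3.0, 4.0, 6.0, 10.0]
credit = [1.0, 1.0, 3.0, 3.0, 3.0, 4.0, 4.0]

# The transformed inputs are just constants, so the fit is still linear in w.
w = linear_regression([transform(x) for x in years], credit)


def predict(x):
    """Evaluate the learned hypothesis, linear in w, on the transformed input."""
    return sum(wi * zi for wi, zi in zip(w, transform(x)))
```

On this made-up data the weights come out close to (3, -2, 1): a base credit line, a penalty for under one year in residence, and a bonus for more than five. The credit line is not linear in the raw input, but the regression algorithm is unchanged because the hypothesis remains linear in the weights.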