title

Lecture 4 - Perceptron & Generalized Linear Model | Stanford CS229: Machine Learning (Autumn 2018)

description

For more information about Stanford's Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3GnSw3o
Anand Avati
PhD Candidate and CS229 Head TA
To follow along with the course schedule and syllabus, visit:
http://cs229.stanford.edu/syllabus-autumn2018.html

detail

{'title': 'Lecture 4 - Perceptron & Generalized Linear Model | Stanford CS229: Machine Learning (Autumn 2018)', 'heatmap': [{'end': 2118.342, 'start': 1968.467, 'weight': 0.845}, {'end': 2417.665, 'start': 2360.74, 'weight': 0.814}, {'end': 2762.837, 'start': 2575.303, 'weight': 0.728}, {'end': 2865.699, 'start': 2793.716, 'weight': 0.734}, {'end': 4384.373, 'start': 4322.332, 'weight': 0.722}, {'end': 4781.959, 'start': 4725.594, 'weight': 0.787}, {'end': 4916.732, 'start': 4869.51, 'weight': 0.708}], 'summary': 'The lecture series covers various topics including perceptron algorithm, logistic regression, exponential families, distribution derivatives, update rules in statistical models, glm in regression and classification, and softmax regression, providing insights into practical usage and historical context, and emphasizing the relationship between glms and exponential family distributions.', 'chapters': [{'end': 76.731, 'segs': [{'end': 37.391, 'src': 'embed', 'start': 4.554, 'weight': 0, 'content': [{'end': 7.115, 'text': 'Couple of announcements before we get started.', 'start': 4.554, 'duration': 2.561}, {'end': 10.977, 'text': 'So first of all, PS1 is out, problem set one.', 'start': 7.175, 'duration': 3.802}, {'end': 18.041, 'text': "It is due on 17th, that's two weeks from today.", 'start': 12.898, 'duration': 5.143}, {'end': 20.442, 'text': 'You have exactly two weeks to work on it.', 'start': 18.081, 'duration': 2.361}, {'end': 23.684, 'text': 'You can take up to two or three late days.', 'start': 20.742, 'duration': 2.942}, {'end': 26.805, 'text': 'I think you can take up to three late days.', 'start': 23.724, 'duration': 3.081}, {'end': 27.766, 'text': 'There is.', 'start': 27.405, 'duration': 0.361}, {'end': 35.25, 'text': "There's a good amount of programming and a good amount of math you need to do so.", 'start': 30.387, 'duration': 4.863}, {'end': 37.391, 'text': 'PS1 needs to be uploaded.', 'start': 35.89, 'duration': 1.501}], 'summary': 'Ps1 is 
out, due on 17th, with 2-3 late days allowed.', 'duration': 32.837, 'max_score': 4.554, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ4554.jpg'}], 'start': 4.554, 'title': 'Ps1 submission and deadlines', 'summary': 'Discusses the submission deadline for ps1, due in two weeks, with the option of up to three late days, and outlines the requirements for the written and programming parts of the assignment.', 'chapters': [{'end': 76.731, 'start': 4.554, 'title': 'Ps1 submission and deadlines', 'summary': 'Discusses the submission deadline for ps1, which is due in two weeks, with the option of up to three late days. it also outlines the requirements for the written and programming parts of the assignment.', 'duration': 72.177, 'highlights': ['PS1 is due on 17th, two weeks from today, with up to three late days allowed.', 'Two submissions required for PS1, one for the written part and one for the programming part.']}], 'duration': 72.177, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ4554.jpg', 'highlights': ['PS1 is due on 17th, two weeks from today, with up to three late days allowed.', 'Two submissions required for PS1, one for the written part and one for the programming part.']}, {'end': 858.529, 'segs': [{'end': 154.657, 'src': 'embed', 'start': 102.502, 'weight': 2, 'content': [{'end': 108.543, 'text': 'So first of all, the perceptron algorithm I should mention is not something that is widely used in practice.', 'start': 102.502, 'duration': 6.041}, {'end': 120.446, 'text': "We study it mostly for historical reasons and also because it's nice and simple and it's easy to analyze and you also have homework questions on it.", 'start': 109.364, 'duration': 11.082}, {'end': 123.752, 'text': 'So logistic regression.', 'start': 121.69, 'duration': 2.062}, {'end': 154.657, 'text': 'We saw logistic regression uses the sigmoid function Right?', 'start': 
124.492, 'duration': 30.165}], 'summary': 'The perceptron algorithm is not widely used in practice; logistic regression uses the sigmoid function.', 'duration': 52.155, 'max_score': 102.502, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ102502.jpg'}, {'end': 343.855, 'src': 'embed', 'start': 301.207, 'weight': 0, 'content': [{'end': 307.091, 'text': 'Yeah, essentially, um, g of- g of z where g is- is, um, the sigma- sigmoid function.', 'start': 301.207, 'duration': 5.884}, {'end': 315.637, 'text': 'Um, both of them have a common update rule, um, which, you know, on the surface looks similar.', 'start': 308.552, 'duration': 7.085}, {'end': 327.251, 'text': 'So theta j equal to theta j plus alpha times y i minus h.', 'start': 315.717, 'duration': 11.534}, {'end': 336.994, 'text': 'theta of of xi times xi j, right?', 'start': 327.251, 'duration': 9.743}, {'end': 343.855, 'text': 'So the update rules for um, the perceptron and logistic regression, they look the same, except h.', 'start': 338.334, 'duration': 5.521}], 'summary': 'Both perceptron and logistic regression have similar update rules, except for h.', 'duration': 42.648, 'max_score': 301.207, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ301207.jpg'}, {'end': 732.375, 'src': 'embed', 'start': 700.274, 'weight': 1, 'content': [{'end': 705.698, 'text': 'And so this- this is a very common technique used in lots of algorithms where if you add a vector to another vector,', 'start': 700.274, 'duration': 5.424}, {'end': 708.301, 'text': 'you make the second one kind of closer to the first one.', 'start': 705.698, 'duration': 2.603}, {'end': 712.547, 'text': 'essentially So this is the perceptron algorithm.', 'start': 709.246, 'duration': 3.301}, {'end': 716.449, 'text': 'You go example by example in an online manner.', 'start': 713.588, 'duration': 2.861}, {'end': 721.03, 'text': 'And if the example 
is already classified, you do nothing.', 'start': 716.909, 'duration': 4.121}, {'end': 722.211, 'text': 'You get a zero over here.', 'start': 721.271, 'duration': 0.94}, {'end': 732.375, 'text': 'If it is misclassified, you either add a small component of, you add the vector itself, the example itself, to your theta,', 'start': 722.591, 'duration': 9.784}], 'summary': 'Perceptron algorithm updates vectors to minimize misclassifications.', 'duration': 32.101, 'max_score': 700.274, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ700274.jpg'}], 'start': 76.951, 'title': 'Perceptron algorithm and logistic regression', 'summary': 'Covers the perceptron algorithm, logistic regression, and the exponential family, emphasizing historical context and practical usage, with a focus on understanding hypothesis functions and update rules.', 'chapters': [{'end': 858.529, 'start': 76.951, 'title': 'Perceptron algorithm and logistic regression', 'summary': 'Covers the perceptron algorithm, logistic regression, and the exponential family, emphasizing the historical context and the practical usage of the techniques, with a focus on understanding the hypothesis functions and update rules.', 'duration': 781.578, 'highlights': ['The perceptron algorithm is studied for historical reasons and simplicity, with logistic regression being a softer version of the perceptron.', 'Detailed comparison of hypothesis functions and update rules between perceptron and logistic regression.', "Explanation of the perceptron algorithm's decision boundary and update process using vector manipulation.", 'Discussion on the practical usage and limitations of the perceptron algorithm and logistic regression.']}], 'duration': 781.578, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ76951.jpg', 'highlights': ['Detailed comparison of hypothesis functions and update rules between perceptron and 
logistic regression.', "Explanation of the perceptron algorithm's decision boundary and update process using vector manipulation.", 'The perceptron algorithm is studied for historical reasons and simplicity, with logistic regression being a softer version of the perceptron.', 'Discussion on the practical usage and limitations of the perceptron algorithm and logistic regression.']}, {'end': 1646.16, 'segs': [{'end': 887.295, 'src': 'embed', 'start': 858.549, 'weight': 0, 'content': [{'end': 864.49, 'text': 'Uh, a common thing is to just decrease the learning rate, uh, with every time step until you stop making changes.', 'start': 858.549, 'duration': 5.941}, {'end': 870.411, 'text': "All right, let's move on to exponential families.", 'start': 868.891, 'duration': 1.52}, {'end': 879.028, 'text': 'So, uh, exponential families is, uh is a class of probability distributions which are somewhat nice mathematically, right?', 'start': 870.751, 'duration': 8.277}, {'end': 887.295, 'text': "Um, they're also very closely related to GLMs, uh, which we will be going over next, right?", 'start': 879.609, 'duration': 7.686}], 'summary': 'Decrease learning rate with time, move to exponential families related to glms.', 'duration': 28.746, 'max_score': 858.549, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ858549.jpg'}, {'end': 1057.233, 'src': 'embed', 'start': 1031.064, 'weight': 3, 'content': [{'end': 1037.391, 'text': "If you have a statistics background and you know if you come across the word sufficient statistic before, it's the exact same thing,", 'start': 1031.064, 'duration': 6.327}, {'end': 1041.156, 'text': "but you don't need to know much about this because for.", 'start': 1037.391, 'duration': 3.765}, {'end': 1049.568, 'text': "All the distributions that we're going to be seeing today or in this class, t of y will be equal to just y.", 'start': 1042.324, 'duration': 7.244}, {'end': 1050.329, 'text': 'so you 
can.', 'start': 1049.568, 'duration': 0.761}, {'end': 1053.091, 'text': 'you can just replace t of y with y for.', 'start': 1050.329, 'duration': 2.762}, {'end': 1057.233, 'text': 'For all the examples today and in the rest of the class.', 'start': 1054.411, 'duration': 2.822}], 'summary': 'Sufficient statistic equals y for all distributions in this class.', 'duration': 26.169, 'max_score': 1031.064, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ1031064.jpg'}, {'end': 1212.153, 'src': 'embed', 'start': 1177.671, 'weight': 5, 'content': [{'end': 1178.651, 'text': 'These were exactly the same.', 'start': 1177.671, 'duration': 0.98}, {'end': 1184.453, 'text': "Oh yeah, you're right.", 'start': 1183.832, 'duration': 0.621}, {'end': 1185.233, 'text': 'This should be positive.', 'start': 1184.513, 'duration': 0.72}, {'end': 1189.944, 'text': 'Thank you.', 'start': 1189.684, 'duration': 0.26}, {'end': 1200.408, 'text': 'So, um, this is um, you can think of this as a normalizing constant of the distribution, such that the uh, the whole thing integrates to one right?', 'start': 1190.404, 'duration': 10.004}, {'end': 1202.429, 'text': 'Um and uh.', 'start': 1201.289, 'duration': 1.14}, {'end': 1204.55, 'text': 'therefore, the log of this will be a of eta.', 'start': 1202.429, 'duration': 2.121}, {'end': 1206.491, 'text': "that's why it's just called the log of the partition function.", 'start': 1204.55, 'duration': 1.941}, {'end': 1212.153, 'text': 'So the partition function is a technical term to indicate the normalizing constant of, um, probability distributions.', 'start': 1206.511, 'duration': 5.642}], 'summary': 'Discussion on normalizing constant and partition function in probability distributions.', 'duration': 34.482, 'max_score': 1177.671, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ1177671.jpg'}, {'end': 1276.412, 'src': 'embed', 'start': 
1246.077, 'weight': 2, 'content': [{'end': 1247.898, 'text': 'T of y has to match.', 'start': 1246.077, 'duration': 1.821}, {'end': 1250.259, 'text': 'So these- the dimension of these two has to match.', 'start': 1248.078, 'duration': 2.181}, {'end': 1269.047, 'text': 'And these are scalars, right? So for any choice of, A, B, and T that can be your choice completely.', 'start': 1255.74, 'duration': 13.307}, {'end': 1276.412, 'text': 'As long as the expression integrates to one, you have a family in the exponential family.', 'start': 1269.667, 'duration': 6.745}], 'summary': 'For any choice of a, b, and t, as long as the expression integrates to one, there is a family in the exponential family.', 'duration': 30.335, 'max_score': 1246.077, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ1246077.jpg'}, {'end': 1323.083, 'src': 'embed', 'start': 1297.002, 'weight': 6, 'content': [{'end': 1303.227, 'text': 'a family of Gaussian distribution, such that for any value of the parameter, you get a member of the Gaussian family.', 'start': 1297.002, 'duration': 6.225}, {'end': 1308.497, 'text': 'Right?. 
And this is mostly uh.', 'start': 1304.315, 'duration': 4.182}, {'end': 1311.338, 'text': 'to show that, uh, a distribution is in the exponential family.', 'start': 1308.497, 'duration': 2.841}, {'end': 1323.083, 'text': 'um, the most straightforward way to do it is to write out the PDF of the distribution in the form that you know and just do some algebraic massaging to bring it into this form right?', 'start': 1311.338, 'duration': 11.745}], 'summary': 'Demonstrating how to identify a distribution in the exponential family using algebraic manipulation.', 'duration': 26.081, 'max_score': 1297.002, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ1297002.jpg'}], 'start': 858.549, 'title': 'Exponential families and their parameters', 'summary': 'Covers exponential families, learning rates, and their characteristics, emphasizing decreasing learning rates and components of the exponential family such as y, eta, t of y, b of y, and a of eta, with examples and criteria for distribution inclusion.', 'chapters': [{'end': 948.408, 'start': 858.549, 'title': 'Exponential families and learning rates', 'summary': 'Discusses the concept of exponential families and learning rates, emphasizing the process of decreasing the learning rate with every time step and the characteristics of exponential families and their relation to glms.', 'duration': 89.859, 'highlights': ['Exponential families are a class of probability distributions closely related to GLMs, and their PDF can be written in a specific form.', 'Decreasing the learning rate with every time step is a common practice to stop making changes in the learning process.']}, {'end': 1646.16, 'start': 949.189, 'title': 'Exponential family and its parameters', 'summary': 'Discusses the components of the exponential family, including y as the data, eta as the natural parameter, and the functions t of y, b of y, and a of eta, with examples of their applications and 
conversions, culminating in understanding the criteria for a distribution to be part of the exponential family.', 'duration': 696.971, 'highlights': ['The chapter explains the components of the exponential family, including y as the data, eta as the natural parameter, and the functions T of y, B of y, and A of eta, with a focus on their role in modeling the output of data in a supervised learning setting.', 'It details the specific properties and functions associated with the exponential family, such as the sufficient statistic being mostly equal to y for the distributions discussed, and the base measure being solely a function of y.', 'The discussion includes the significance of the log partition function, A of eta, as a technical term indicating the normalizing constant of probability distributions within the exponential family, with a focus on its mathematical representation and implications for modeling.', 'The chapter presents the process of converting a distribution, such as the Bernoulli distribution, into the form of the exponential family, demonstrating the application of algebraic manipulation and pattern matching to determine the components T of y, B of y, and A of eta, showcasing the criteria for a distribution to be part of the exponential family.']}], 'duration': 787.611, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ858549.jpg', 'highlights': ['Exponential families are closely related to GLMs, with a specific PDF form.', 'Decreasing learning rate with every time step is a common practice.', 'Components of exponential family: y, eta, T of y, B of y, A of eta.', 'Sufficient statistic mostly equal to y for the discussed distributions.', 'Base measure is solely a function of y for the exponential family.', 'Significance of the log partition function, A of eta, as a normalizing constant.', 'Process of converting a distribution into the exponential family form.']}, {'end': 2067.367, 'segs': 
[{'end': 1680.993, 'src': 'embed', 'start': 1647.141, 'weight': 1, 'content': [{'end': 1650.001, 'text': 'The reason is because we want an expression in terms of eta.', 'start': 1647.141, 'duration': 2.86}, {'end': 1656.082, 'text': 'Here we got it in terms of phi, but we need to um plug in, um plug in eta over here.', 'start': 1650.081, 'duration': 6.001}, {'end': 1665.324, 'text': 'uh, eta, and this will just be um log of 1 plus e to the eta, right?', 'start': 1656.082, 'duration': 9.242}, {'end': 1667.348, 'text': 'So there you go.', 'start': 1666.388, 'duration': 0.96}, {'end': 1673.271, 'text': 'So, uh, this- this kind of, uh, verifies that the Bernoulli distribution is a member of the exponential family.', 'start': 1667.428, 'duration': 5.843}, {'end': 1680.993, 'text': 'Any questions here? So note that this might look familiar.', 'start': 1674.671, 'duration': 6.322}], 'summary': 'Derivation shows bernoulli distribution in exponential family.', 'duration': 33.852, 'max_score': 1647.141, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ1647141.jpg'}, {'end': 1773.724, 'src': 'embed', 'start': 1740.872, 'weight': 2, 'content': [{'end': 1748.514, 'text': "Uh, but for- for, uh, our course we're-, we're only interested in um Gaussians with fixed variance and we are going to assume,", 'start': 1740.872, 'duration': 7.642}, {'end': 1756.905, 'text': 'assume that variance is equal to 1..', 'start': 1752.259, 'duration': 4.646}, {'end': 1765.858, 'text': 'So this gives the PDF of a Gaussian to look like this, P of y, parameterized as mu.', 'start': 1756.905, 'duration': 8.953}, {'end': 1773.724, 'text': 'So note here when we start writing out, we start with the uh parameters that we are uh commonly used to.', 'start': 1766.158, 'duration': 7.566}], 'summary': 'Course focuses on gaussians with fixed variance and assumes variance equals 1.', 'duration': 32.852, 'max_score': 1740.872, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ1740872.jpg'}, {'end': 1903.255, 'src': 'embed', 'start': 1865.605, 'weight': 4, 'content': [{'end': 1877.734, 'text': 'So we have b of y equals 1 over root 2 pi times e to the minus y squared by 2.', 'start': 1865.605, 'duration': 12.129}, {'end': 1879.936, 'text': 'Note that this is a function of only y.', 'start': 1877.734, 'duration': 2.202}, {'end': 1880.737, 'text': "There's no eta here.", 'start': 1879.936, 'duration': 0.801}, {'end': 1884.507, 'text': 'T of y is just y.', 'start': 1882.506, 'duration': 2.001}, {'end': 1890.87, 'text': 'And in this case, natural parameter is mu, eta is mu.', 'start': 1884.507, 'duration': 6.363}, {'end': 1898.033, 'text': 'And the log partition function is equal to mu squared by 2.', 'start': 1890.89, 'duration': 7.143}, {'end': 1903.255, 'text': 'And when we- and we repeat the same exercise we did here.', 'start': 1898.033, 'duration': 5.222}], 'summary': 'For the unit-variance Gaussian, b of y equals 1 over root 2 pi times e to the minus y squared by 2, T of y is y, eta is mu, and the log partition function is mu squared by 2.', 'duration': 37.65, 'max_score': 1865.605, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ1865605.jpg'}, {'end': 2030.421, 'src': 'embed', 'start': 1991.125, 'weight': 0, 'content': [{'end': 2005.944, 'text': 'So, um, So one property is- now, um, if we perform maximum likelihood on um, on the exponential family, um as- as, uh,', 'start': 1991.125, 'duration': 14.819}, {'end': 2014.091, 'text': 'when- when the- when the exponential family is parameterized in the natural parameters, then um, the optimization problem is concave.', 'start': 2005.944, 'duration': 8.147}, {'end': 2022.898, 'text': 'So MLE with respect to eta is concave.', 'start': 2014.431, 'duration': 8.467}, {'end': 2030.421, 'text': "Similarly, if you, uh, flip the sign and use the- the, uh, what's called the negative log likelihood.", 'start': 2025, 
'duration': 5.421}], 'summary': 'Performing maximum likelihood on exponential family results in concave optimization problem.', 'duration': 39.296, 'max_score': 1991.125, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ1991125.jpg'}], 'start': 1647.141, 'title': 'Exponential family distributions and properties', 'summary': 'Covers the bernoulli distribution as a member of the exponential family, similarity to the sigmoid function, and the pdf of a gaussian with fixed variance equal to 1. it also discusses exponential families, their properties, parameterization, concavity of the optimization problem, and the relationship between canonical and natural parameters.', 'chapters': [{'end': 1809.939, 'start': 1647.141, 'title': 'Exponential family distributions', 'summary': 'Discusses the bernoulli distribution as a member of the exponential family, the similarity to the sigmoid function, and the pdf of a gaussian with fixed variance equal to 1.', 'duration': 162.798, 'highlights': ['The Bernoulli distribution is verified as a member of the exponential family by expressing it in terms of eta and identifying its relation to the sigmoid function.', 'The PDF of a Gaussian with fixed variance equal to 1 is derived and parameterized as mu, showcasing the link between canonical and natural parameters.', 'The transcript also mentions the possibility of considering Gaussians with variable variance, but the focus of the course is on Gaussians with fixed variance.']}, {'end': 2067.367, 'start': 1810.359, 'title': 'Exponential families and properties', 'summary': 'Discusses exponential families, their properties, and parameterization, including the concavity of the optimization problem and the relationship between canonical and natural parameters.', 'duration': 257.008, 'highlights': ['Exponential families have nice mathematical properties, such as concave optimization problem for MLE with respect to natural parameters and 
convex optimization problem for negative log likelihood.', 'Parameterizing exponential families with natural parameters leads to concave optimization problem for maximum likelihood estimation (MLE) and convex optimization problem for negative log likelihood (NLL).', 'When variance is unknown, exponential families can be represented using a vector for eta, and there exists a mapping between canonical and natural parameters.', 'The log partition function for a specific exponential family is equal to mu squared by 2 when the natural parameter is mu.', 'The log partition function is parameterized by the canonical parameters and is linked to the natural parameters through inversion.']}], 'duration': 420.226, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ1647141.jpg', 'highlights': ['Exponential families have nice mathematical properties, such as concave optimization problem for MLE with respect to natural parameters and convex optimization problem for negative log likelihood.', 'The Bernoulli distribution is verified as a member of the exponential family by expressing it in terms of eta and identifying its relation to the sigmoid function.', 'The PDF of a Gaussian with fixed variance equal to 1 is derived and parameterized as mu, showcasing the link between canonical and natural parameters.', 'Parameterizing exponential families with natural parameters leads to concave optimization problem for maximum likelihood estimation (MLE) and convex optimization problem for negative log likelihood (NLL).', 'The log partition function for a specific exponential family is equal to mu squared by 2 when the natural parameter is mu.']}, {'end': 3094.26, 'segs': [{'end': 2123.283, 'src': 'embed', 'start': 2067.387, 'weight': 0, 'content': [{'end': 2071.976, 'text': 'Um, each of the distribution.', 'start': 2067.387, 'duration': 4.589}, {'end': 2075.199, 'text': 'uh, we start with um A of eta.', 'start': 2071.976, 'duration': 
3.223}, {'end': 2083.425, 'text': 'differentiate this with respect to eta, the log partition function with respect to eta, and you get another function with respect to eta,', 'start': 2075.199, 'duration': 8.226}, {'end': 2085.446, 'text': 'and that function will-, is-, is-.', 'start': 2083.425, 'duration': 2.021}, {'end': 2088.728, 'text': 'is the mean of the distribution as parameterized by eta, right?', 'start': 2085.446, 'duration': 3.282}, {'end': 2095.454, 'text': 'And similarly, the variance of y.', 'start': 2089.389, 'duration': 6.065}, {'end': 2097.956, 'text': "I'm gonna trace my eta.", 'start': 2097.056, 'duration': 0.9}, {'end': 2099.757, 'text': "It's just the second derivative.", 'start': 2098.356, 'duration': 1.401}, {'end': 2101.717, 'text': 'This was the first derivative, this is the second derivative.', 'start': 2099.777, 'duration': 1.94}, {'end': 2108.039, 'text': 'This is eta.', 'start': 2107.399, 'duration': 0.64}, {'end': 2118.342, 'text': 'So um, the reason why this is nice is because, in general, for probability distributions to calculate the mean and the variance,', 'start': 2109.5, 'duration': 8.842}, {'end': 2119.942, 'text': 'you generally need to integrate something.', 'start': 2118.342, 'duration': 1.6}, {'end': 2123.283, 'text': 'But over here, you just need to differentiate, which is a lot easier operation.', 'start': 2120.103, 'duration': 3.18}], 'summary': 'Deriving mean and variance from distribution parameterized by eta.', 'duration': 55.896, 'max_score': 2067.387, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ2067387.jpg'}, {'end': 2291.42, 'src': 'embed', 'start': 2241.456, 'weight': 3, 'content': [{'end': 2245.417, 'text': "assumptions we're gonna make for GLM is that one?", 'start': 2241.456, 'duration': 3.961}, {'end': 2261.977, 'text': 'um, so these are the assumptions or design choices that are gonna take us from exponential families to, uh, generalized linear 
models.', 'start': 2245.417, 'duration': 16.56}, {'end': 2273.886, 'text': 'So the most important assumption is that, uh well, yeah, assumption is that y, given x, parameterized by a Theta,', 'start': 2262.597, 'duration': 11.289}, {'end': 2277.189, 'text': 'is a member of an exponential family.', 'start': 2273.886, 'duration': 3.303}, {'end': 2291.42, 'text': 'By exponential family of eta, I mean that form.', 'start': 2287.838, 'duration': 3.582}], 'summary': 'A GLM assumes that y given x, parameterized by theta, is a member of an exponential family.', 'duration': 49.964, 'max_score': 2241.456, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ2241456.jpg'}, {'end': 2417.665, 'src': 'heatmap', 'start': 2360.74, 'weight': 0.814, 'content': [{'end': 2366.504, 'text': 'So here you can use um like Gamma or Exponential.', 'start': 2360.74, 'duration': 5.764}, {'end': 2377.067, 'text': 'So um, so there is the exponential family and there is also a distribution called the exponential distribution, which are, you know,', 'start': 2370.342, 'duration': 6.725}, {'end': 2377.748, 'text': 'two distinct things.', 'start': 2377.067, 'duration': 0.681}, {'end': 2383.072, 'text': "The exponential distribution happens to be a member of the exponential family as well, but they're not the same thing.", 'start': 2378.008, 'duration': 5.064}, {'end': 2392.979, 'text': 'Um, exponential and, um, yeah, and you can also have, um, you can also have probability distributions over probability distributions.', 'start': 2383.852, 'duration': 9.127}, {'end': 2405.274, 'text': 'Like, uh, beta, Dirichlet, these mostly show up in Bayesian machine learning or Bayesian statistics.', 'start': 2395.981, 'duration': 9.293}, {'end': 2417.665, 'text': 'So, depending on the kind of data that you have, if your y variable is-, is-, is-.', 'start': 2411.798, 'duration': 5.867}], 'summary': 'Gamma or exponential distributions can be used here; the exponential distribution is a member of the exponential family but is not the same thing; beta and Dirichlet, which are distributions over probability distributions, mostly show up in Bayesian statistics.', 
'duration': 56.925, 'max_score': 2360.74, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ2360740.jpg'}, {'end': 2405.274, 'src': 'embed', 'start': 2378.008, 'weight': 4, 'content': [{'end': 2383.072, 'text': "The exponential distribution happens to be a member of the exponential family as well, but they're not the same thing.", 'start': 2378.008, 'duration': 5.064}, {'end': 2392.979, 'text': 'Um, exponential and, um, yeah, and you can also have, um, you can also have probability distributions over probability distributions.', 'start': 2383.852, 'duration': 9.127}, {'end': 2405.274, 'text': 'Like, uh, beta, Dirichlet, these mostly show up in Bayesian machine learning or Bayesian statistics.', 'start': 2395.981, 'duration': 9.293}], 'summary': 'Exponential distribution is a member of the exponential family. beta and dirichlet distributions are used in bayesian machine learning or statistics.', 'duration': 27.266, 'max_score': 2378.008, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ2378008.jpg'}, {'end': 2621.534, 'src': 'embed', 'start': 2552.932, 'weight': 6, 'content': [{'end': 2554.713, 'text': 'Right? 
This is our hypothesis function.', 'start': 2552.932, 'duration': 1.781}, {'end': 2557.234, 'text': "And we'll see that you know what we do over here.", 'start': 2554.933, 'duration': 2.301}, {'end': 2563.637, 'text': 'if you plug in the uh um exponential family, uh as- as Gaussian, then the hypothesis will be the same.', 'start': 2557.234, 'duration': 6.403}, {'end': 2566.619, 'text': 'you know Gaussian hypothesis that we saw in linear regression.', 'start': 2563.637, 'duration': 2.982}, {'end': 2573.262, 'text': 'If we plug in, uh, Bernoulli, then this will turn out to be the same hypothesis that we saw in logistic regression and so on.', 'start': 2566.699, 'duration': 6.563}, {'end': 2621.534, 'text': 'So, uh, one way to kind of, um visualize, this is Right?', 'start': 2575.303, 'duration': 46.231}], 'summary': 'The hypothesis function remains the same for different exponential families, like gaussian and bernoulli.', 'duration': 68.602, 'max_score': 2552.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ2552932.jpg'}, {'end': 2767.38, 'src': 'heatmap', 'start': 2575.303, 'weight': 7, 'content': [{'end': 2621.534, 'text': 'So, uh, one way to kind of, um visualize, this is Right?', 'start': 2575.303, 'duration': 46.231}, {'end': 2627.277, 'text': 'So one way to think of is- of- of this is there is a model and there is a distribution right?', 'start': 2621.914, 'duration': 5.363}, {'end': 2628.137, 'text': 'So the model.', 'start': 2627.497, 'duration': 0.64}, {'end': 2630.538, 'text': 'we are assuming it to be a linear model, right?', 'start': 2628.137, 'duration': 2.401}, {'end': 2636.981, 'text': 'Given x, there is a learnable parameter theta and theta transpose x will give you a parameter right?', 'start': 2630.598, 'duration': 6.383}, {'end': 2639.642, 'text': 'This is the model and here is the distribution.', 'start': 2637.021, 'duration': 2.621}, {'end': 2649.167, 'text': 'Now the distribution, um, 
is a member of the exponential family and the parameter for this distribution is the output of the linear model.', 'start': 2639.662, 'duration': 9.505}, {'end': 2652.147, 'text': 'Right? This- this is the picture you want to have in your mind.', 'start': 2649.946, 'duration': 2.201}, {'end': 2655.729, 'text': 'And the exponential family we make.', 'start': 2652.747, 'duration': 2.982}, {'end': 2660.911, 'text': "uh, depending on the data that we have whether it's you know whether it's a classification problem or a regression problem,", 'start': 2655.729, 'duration': 5.182}, {'end': 2669.755, 'text': 'or a time-to-event problem you would choose an appropriate b, a and t uh based on the distribution of your choice.', 'start': 2660.911, 'duration': 8.844}, {'end': 2679.3, 'text': 'Right? So this entire thing, uh, and- and from this, you can say, uh, get the, uh, expectation.', 'start': 2671.136, 'duration': 8.164}, {'end': 2697.294, 'text': 'of y given Theta and this is the same as expectation of y given Theta transpose x, right? 
And this is essentially our hypothesis function.', 'start': 2680.169, 'duration': 17.125}, {'end': 2714.404, 'text': "That's exactly right.", 'start': 2713.643, 'duration': 0.761}, {'end': 2719.007, 'text': 'Uh, so, uh, so the question is um.', 'start': 2714.664, 'duration': 4.343}, {'end': 2730.136, 'text': 'are we training Theta to uh, uh, um to predict the parameter of the um exponential family distribution, whose mean is um, the uh,', 'start': 2719.007, 'duration': 11.129}, {'end': 2731.637, 'text': "uh prediction that we're gonna make for y?", 'start': 2730.136, 'duration': 1.501}, {'end': 2734.059, 'text': "That's- that's correct, right?", 'start': 2731.637, 'duration': 2.422}, {'end': 2739.963, 'text': 'And um so, this is what we do at test time, right?', 'start': 2734.679, 'duration': 5.284}, {'end': 2741.905, 'text': 'And during train time.', 'start': 2740.724, 'duration': 1.181}, {'end': 2746.448, 'text': 'How do we train this model?', 'start': 2745.387, 'duration': 1.061}, {'end': 2752.771, 'text': 'So in this model, the parameter that we are learning by doing gradient descent, are these parameters right?', 'start': 2746.768, 'duration': 6.003}, {'end': 2759.095, 'text': "So you're not learning any of the parameters in the uh, in the uh exponential family.", 'start': 2753.092, 'duration': 6.003}, {'end': 2762.837, 'text': "We're not learning mu or sigma square or, or eta.", 'start': 2759.135, 'duration': 3.702}, {'end': 2763.538, 'text': "We're not learning this.", 'start': 2762.877, 'duration': 0.661}, {'end': 2767.38, 'text': "We are learning theta, that's part of the model and not part of the distribution.", 'start': 2763.578, 'duration': 3.802}], 'summary': 'Explains the relationship between model, distribution, and parameter learning through gradient descent.', 'duration': 20.612, 'max_score': 2575.303, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ2575303.jpg'}, {'end': 2905.795, 'src': 
'heatmap', 'start': 2793.716, 'weight': 5, 'content': [{'end': 2814.046, 'text': 'So, during learning, what we do is maximum likelihood maximize with respect to Theta log p of y i given right?', 'start': 2793.716, 'duration': 20.33}, {'end': 2828.831, 'text': "So you're doing gradient ascent on the log probability of of y, where um the the um natural parameter was re-parameterized with a linear model.", 'start': 2814.466, 'duration': 14.365}, {'end': 2833.732, 'text': 'Right? And we are doing gradient ascent by taking gradients on Theta.', 'start': 2830.25, 'duration': 3.482}, {'end': 2841.477, 'text': "This is like the big picture of what's happening with GLMs and how they kind of are an extension of exponential families.", 'start': 2834.873, 'duration': 6.604}, {'end': 2845.519, 'text': 'You re-parameterize the parameters with a linear model and you get a GLM.', 'start': 2841.537, 'duration': 3.982}, {'end': 2865.699, 'text': "So let's- let's look at, uh, some more detail on what happens at train time.", 'start': 2860.365, 'duration': 5.334}, {'end': 2905.795, 'text': 'So another um kind of incidental benefit of using um,', 'start': 2900.612, 'duration': 5.183}], 'summary': 'Learning involves maximizing log probability with gradient ascent on theta.', 'duration': 70.922, 'max_score': 2793.716, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ2793716.jpg'}, {'end': 2983.193, 'src': 'embed', 'start': 2940.951, 'weight': 8, 'content': [{'end': 2946.433, 'text': 'transpose x and take the derivatives, and you know, uh, come up with a gradient update rule and so on.', 'start': 2940.951, 'duration': 5.482}, {'end': 2957.278, 'text': "But it turns out that, um, no matter which, uh what kind of GLM you're doing, no matter which choice of distribution that you make,", 'start': 2946.934, 'duration': 10.344}, {'end': 2961.703, 'text': 'the Learning update rule is the same.', 'start': 2957.278, 'duration': 4.425}, 
{'end': 2983.193, 'text': 'The learning update rule is theta j equals theta j plus alpha times y i minus h theta of x i, times x j i.', 'start': 2967.726, 'duration': 15.467}], 'summary': 'Learning update rule is theta_j := theta_j + alpha * (y^(i) - h_theta(x^(i))) * x_j^(i)', 'duration': 42.242, 'max_score': 2940.951, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ2940951.jpg'}], 'start': 2067.387, 'title': 'Distribution derivatives and generalized linear models', 'summary': 'Discusses differentiation of log partition function for obtaining mean and variance of a distribution and introduces generalized linear models, emphasizing the relationship between GLMs and exponential family distributions.', 'chapters': [{'end': 2123.283, 'start': 2067.387, 'title': 'Distribution derivatives and parameters', 'summary': 'Discusses the differentiation of log partition function with respect to a parameter to obtain the mean and variance of a distribution, providing a simpler alternative to integrating for calculating these statistics.', 'duration': 55.896, 'highlights': ['The differentiation of the log partition function with respect to a parameter yields the mean of the distribution as parameterized by that parameter, providing a simpler method for calculating the mean.', 'The second derivative of the log partition function with respect to a parameter yields the variance of the distribution, offering a straightforward alternative to integrating for variance calculation.', 'This method simplifies the calculation of mean and variance for probability distributions as it replaces the need for integration with the easier operation of differentiation.']}, {'end': 2524.438, 'start': 2124.484, 'title': 'Generalized linear models', 'summary': 'Introduces generalized linear models, discussing the assumptions and design choices that take us from exponential families to GLMs, including the types of distributions based on data types and the parameterization of eta.', 'duration': 
399.954, 'highlights': ['The assumptions and design choices for GLMs involve y given x being a member of the exponential family, eta equalling theta transpose x, and the selection of distributions based on the type of data (e.g., Gaussian for regression, Bernoulli for binary classification).', 'The types of distributions in the exponential family include Bernoulli, Gaussian, Poisson, Gamma, Exponential, and others, each suitable for different types of data such as binary, real-valued, or counts.', 'The parameterization of eta, the design choice of eta being equal to theta transpose x, and the output generation at test time are key aspects of GLMs and their application to various types of data.']}, {'end': 3094.26, 'start': 2524.719, 'title': 'Generalized linear models and exponential families', 'summary': 'Discusses the relationship between generalized linear models and exponential family distributions, emphasizing that the natural parameter is re-parameterized with a linear model, and the learning update rule remains the same regardless of the choice of distribution for GLMs.', 'duration': 569.541, 'highlights': ['The natural parameter is re-parameterized with a linear model; plugging in a Gaussian recovers the linear regression hypothesis and plugging in a Bernoulli recovers the logistic regression hypothesis.', 'The parameter being learned during training, theta, is part of the model rather than the distribution, and the output of the linear model, theta transpose x, becomes the natural parameter of the distribution.', 'The learning update rule, theta_j := theta_j + alpha * (y^(i) - h_theta(x^(i))) * x_j^(i), remains the same for any specific type of GLM, allowing for straightforward application without further algebraic calculations.']}], 'duration': 1026.873, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ2067387.jpg', 'highlights': ['The differentiation of the log partition function yields the mean of the distribution, simplifying mean 
calculation', 'The second derivative yields the variance, offering a straightforward alternative for variance calculation', 'The method simplifies the calculation of mean and variance for probability distributions', 'The assumptions and design choices for GLMs involve y given x being a member of the exponential family', 'The types of distributions in the exponential family include Bernoulli, Gaussian, Poisson, Gamma, Exponential, and others', 'The parameterization of eta and the output generation at test time are key aspects of GLMs', 'The hypothesis function is re-parameterized with a linear model, resulting in the same hypothesis for various distributions', 'The parameter being learned during training is not part of the distribution, but rather the model', 'The learning update rule remains the same for any specific type of GLM']}, {'end': 3585.984, 'segs': [{'end': 3177.494, 'src': 'embed', 'start': 3094.26, 'weight': 0, 'content': [{'end': 3099.785, 'text': "um, you're doing classification whether you're doing regression, whether you're doing, you know, a Poisson regression.", 'start': 3094.26, 'duration': 5.525}, {'end': 3100.926, 'text': 'the update rule is the same.', 'start': 3099.785, 'duration': 1.141}, {'end': 3104.368, 'text': 'You just plug in a different h theta of x and you get your learning rule.', 'start': 3101.406, 'duration': 2.962}, {'end': 3111.654, 'text': 'Another, um, some more terminology.', 'start': 3107.791, 'duration': 3.863}, {'end': 3123.892, 'text': 'So eta is what we call the natural parameter.', 'start': 3119.747, 'duration': 4.145}, {'end': 3148.873, 'text': 'So eta is the natural parameter and the function that links the natural parameter to the mean of the distribution.', 'start': 3132.263, 'duration': 16.61}, {'end': 3152.476, 'text': "And this has a name, it's called the canonical response function.", 'start': 3149.413, 'duration': 3.063}, {'end': 3168.107, 'text': "Right? 
And, um, similarly, you can also, let's call it mu.", 'start': 3163.104, 'duration': 5.003}, {'end': 3170.249, 'text': "It's like the mean of the distribution.", 'start': 3168.468, 'duration': 1.781}, {'end': 3177.494, 'text': 'Uh, similarly, you can go from, mu back to eta with the inverse of this.', 'start': 3170.829, 'duration': 6.665}], 'summary': 'In machine learning, the update rule remains the same for classification and regression, and eta represents the natural parameter in the canonical response function.', 'duration': 83.234, 'max_score': 3094.26, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3094260.jpg'}, {'end': 3281.767, 'src': 'embed', 'start': 3229.778, 'weight': 2, 'content': [{'end': 3231.478, 'text': 'So we have three parameterizations.', 'start': 3229.778, 'duration': 1.7}, {'end': 3256.736, 'text': "So we have the model parameters, that's Theta, the natural parameters, that's Eta, and we have the, uh, canonical parameters.", 'start': 3237.642, 'duration': 19.094}, {'end': 3273.083, 'text': 'And this is say Phi for Bernoulli Mu and Sigma square for Gaussian, Lambda for Poisson right?', 'start': 3261.018, 'duration': 12.065}, {'end': 3281.767, 'text': 'So these are three different ways we are- we can parameterize um either the exponential family or- or- or- or the uh GLM.', 'start': 3273.343, 'duration': 8.424}], 'summary': 'Three parameterizations: model parameters (theta), natural parameters (eta), and canonical parameters (e.g. 
phi, sigma square, lambda).', 'duration': 51.989, 'max_score': 3229.778, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3229778.jpg'}, {'end': 3542.801, 'src': 'embed', 'start': 3521.093, 'weight': 4, 'content': [{'end': 3529.434, 'text': "Yeah So the, uh, the choice of what distribution you're going to choose is really dependent on the task that you have.", 'start': 3521.093, 'duration': 8.341}, {'end': 3536.757, 'text': 'So if your task is regression, where you want to output real, valued numbers like price of the house or something, uh,', 'start': 3529.754, 'duration': 7.003}, {'end': 3542.801, 'text': 'then you choose a distribution over the real va- real, uh, real numbers like a Gaussian.', 'start': 3536.757, 'duration': 6.044}], 'summary': 'Choice of distribution depends on the task; for regression, use gaussian distribution for real-valued numbers.', 'duration': 21.708, 'max_score': 3521.093, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3521093.jpg'}], 'start': 3094.26, 'title': 'Update rules and parameterization in statistical models', 'summary': 'Delves into the update rule in statistical models, showcasing its consistency across model types and emphasizing the concept of natural parameter (eta) and its connection with the distribution mean. 
it also explores the three parameterizations in generalized linear models and the impact of task in distribution selection for regression or classification.', 'chapters': [{'end': 3229.457, 'start': 3094.26, 'title': 'Update rules in statistical models', 'summary': 'Discusses the update rule in statistical models, emphasizing that regardless of the type of model (classification, regression, poisson regression), the update rule remains the same, highlighting the concept of natural parameter (eta) and its relationship with the mean of the distribution, referred to as the canonical response function, and the distinction between different parameterizations.', 'duration': 135.197, 'highlights': ['The update rule remains the same across various types of models (classification, regression, Poisson regression), involving different h theta of x for each, illustrating the flexibility and consistency in the learning process.', 'The concept of natural parameter (eta) and its linkage to the mean of the distribution, known as the canonical response function, is explained, shedding light on the fundamental relationship between these elements in statistical models.', 'The discussion on the distinction between three different kinds of parameterizations clarifies the various approaches and considerations in statistical modeling.']}, {'end': 3585.984, 'start': 3229.778, 'title': 'Parameterization in generalized linear models', 'summary': 'Discusses the three parameterizations in generalized linear models, the connection between model parameters, natural parameters, and canonical parameters, and the influence of task in selecting the distribution for regression or classification.', 'duration': 356.206, 'highlights': ['The chapter explains the three parameterizations in generalized linear models: model parameters (Theta), natural parameters (Eta), and canonical parameters (Phi for Bernoulli, Mu and Sigma square for Gaussian, Lambda for Poisson).', 'The connection between model 
parameters and natural parameters is linear, where Theta transpose x yields the natural parameter, representing a design choice in re-parameterizing Eta by a linear model.', 'The chapter emphasizes the influence of the task in selecting the distribution for regression or classification, with the choice of distribution being dependent on the nature of the output (real-valued numbers for regression, binary data for classification, and Poisson distribution for count data).']}], 'duration': 491.724, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3094260.jpg', 'highlights': ['The update rule remains consistent across model types, showcasing flexibility and consistency in learning process.', 'The concept of natural parameter (eta) and its linkage to the mean of the distribution is explained, shedding light on the fundamental relationship in statistical models.', 'The discussion on the three different kinds of parameterizations clarifies various approaches and considerations in statistical modeling.', 'The chapter explains the three parameterizations in generalized linear models: model parameters (Theta), natural parameters (Eta), and canonical parameters (Phi for Bernoulli, Mu and Sigma square for Gaussian, Lambda for Poisson).', 'The chapter emphasizes the influence of the task in selecting the distribution for regression or classification, with the choice of distribution being dependent on the nature of the output.']}, {'end': 4051.578, 'segs': [{'end': 3690.869, 'src': 'embed', 'start': 3653.156, 'weight': 0, 'content': [{'end': 3659.2, 'text': 'You know, GLMs are just a general way to model data and that data could be, you know, uh, binary, it could be real valued.', 'start': 3653.156, 'duration': 6.044}, {'end': 3666.305, 'text': 'And- and uh, as long as you have a distribution that can model uh that kind of data and falls in the exponential family,', 'start': 3659.52, 'duration': 6.785}, {'end': 3669.567, 
'text': 'it can be just plugged into a GLM and everything just, uh.', 'start': 3666.305, 'duration': 3.262}, {'end': 3670.668, 'text': 'uh, works out nicely.', 'start': 3669.567, 'duration': 1.101}, {'end': 3690.869, 'text': "So, uh, so the assumptions that we made, uh, let's start with regression, right? So for regression, we assume there is some x.", 'start': 3672.819, 'duration': 18.05}], 'summary': 'Glms can model binary or real-valued data using exponential family distributions, making the assumptions work out nicely.', 'duration': 37.713, 'max_score': 3653.156, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3653156.jpg'}, {'end': 3754.14, 'src': 'embed', 'start': 3713.557, 'weight': 1, 'content': [{'end': 3727.107, 'text': 'And this we assume is eta, right? And in case of regression, eta was also mu.', 'start': 3713.557, 'duration': 13.55}, {'end': 3741.634, 'text': 'So eta was also mu, right? Um, and then we are assuming that the y for any given x is distributed as a Gaussian with mu as the mean.', 'start': 3730.689, 'duration': 10.945}, {'end': 3749.498, 'text': 'So which means for every x, every possible x, you have the appropriate, uh, um, eta.', 'start': 3741.955, 'duration': 7.543}, {'end': 3754.14, 'text': "And with this as the mean, let's- let's think of this as y.", 'start': 3750.098, 'duration': 4.042}], 'summary': 'In regression, eta and mu are assumed to represent gaussian distribution with mu as the mean.', 'duration': 40.583, 'max_score': 3713.557, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3713557.jpg'}, {'end': 3874.021, 'src': 'embed', 'start': 3845.433, 'weight': 2, 'content': [{'end': 3852.113, 'text': 'We make an assumption that there is some, linear model from which the data was- was- was- was generated in this format.', 'start': 3845.433, 'duration': 6.68}, {'end': 3860.256, 'text': 'And we want to work backwards right 
to find Theta that will give us this line right?', 'start': 3852.653, 'duration': 7.603}, {'end': 3864.738, 'text': 'So for a different choice of Theta, we get a different line right?', 'start': 3860.576, 'duration': 4.162}, {'end': 3874.021, 'text': "We assume that you know if- if that line represents the- the mus or the means of the y's for that particular x, uh, from which it's sampled from.", 'start': 3864.958, 'duration': 9.063}], 'summary': "Working backwards to find theta for different lines representing means of y's for particular x", 'duration': 28.588, 'max_score': 3845.433, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3845433.jpg'}, {'end': 3964.944, 'src': 'embed', 'start': 3929.799, 'weight': 3, 'content': [{'end': 3941.27, 'text': 'from this eta we- we run this through the sigmoid function uh 1 over 1 plus e to the minus eta to get phi right?', 'start': 3929.799, 'duration': 11.471}, {'end': 3953.018, 'text': 'So if these are the etas for each um, for each eta, we run it through the sigmoid and we get something like this right?', 'start': 3941.49, 'duration': 11.528}, {'end': 3956.841, 'text': 'So this tends to uh 1, this tends to 0..', 'start': 3953.058, 'duration': 3.783}, {'end': 3964.944, 'text': 'And, um, when- at this point when eta is 0, the sigmoid is- is 0.5.', 'start': 3956.841, 'duration': 8.103}], 'summary': 'Using sigmoid function to get phi from etas, tends to 1, 0, and 0.5 at eta 0.', 'duration': 35.145, 'max_score': 3929.799, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3929799.jpg'}, {'end': 4039.633, 'src': 'embed', 'start': 3996.607, 'weight': 5, 'content': [{'end': 4002.989, 'text': 'where you know the- the probability of y is the- is the height to the um um sigmoid through uh, the natural parameter.', 'start': 3996.607, 'duration': 6.382}, {'end': 4008.211, 'text': 'And from this, you have a data generating 
distribution that would look like this.', 'start': 4003.469, 'duration': 4.742}, {'end': 4013.293, 'text': "So x and um, you have a few x's in your training set,", 'start': 4008.572, 'duration': 4.721}, {'end': 4019.529, 'text': "and for those x's you calc- you- you figure out what your you know y distribution is and sample from it.", 'start': 4013.293, 'duration': 6.236}, {'end': 4039.633, 'text': "So let's say, right? And now, um, again, our goal is to stop- given- given this data, so- so over here, this is the x and this is y.", 'start': 4019.889, 'duration': 19.744}], 'summary': 'Modeling data distribution with sigmoid function for training set.', 'duration': 43.026, 'max_score': 3996.607, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3996607.jpg'}], 'start': 3585.984, 'title': 'Glm in regression and classification', 'summary': 'Covers glm assumptions for regression including distribution and mean assumptions, and the process of finding optimal theta. it also discusses maximum likelihood estimation and classification in glm, including the transformation of theta to eta, the sigmoid function, and the generation of data distribution from the probability of y. 
it also covers the goal of working backwards to find theta given the data.', 'chapters': [{'end': 3885.632, 'start': 3585.984, 'title': 'Glm assumptions and model visualization', 'summary': 'Discusses the assumptions of glms for regression, including the distribution and mean assumptions, and the process of working backward to find the optimal theta to generate the linear model.', 'duration': 299.648, 'highlights': ['GLMs are a general way to model data, applicable to binary or real-valued data that falls in the exponential family, offering flexibility in data representation.', 'The assumption for regression involves a linear model with x and theta, where eta represents the mean, and the y is distributed as a Gaussian with mu as the mean and a variance of 1.', "The process involves working backward to find the optimal Theta that gives the line representing the means of the y's for a particular x, aiming to find the line from which the y's are most likely to have been sampled."]}, {'end': 4051.578, 'start': 3885.672, 'title': 'Maximum likelihood and classification in glm', 'summary': 'Discusses the process of maximum likelihood estimation and classification in generalized linear models (glm), including the transformation of theta to eta, the sigmoid function, and the generation of data distribution from the probability of y. 
it also covers the goal of working backwards to find theta given the data.', 'duration': 165.906, 'highlights': ['The process of transforming theta to eta and running it through the sigmoid function to obtain phi is discussed, leading to the generation of a data distribution from the probability of y.', 'The goal of working backwards to find the theta given the data is emphasized as a key aspect of the discussion.', 'Explanation of how the probability of y is correlated to the sigmoid line and the generation of different Bernoulli distributions for each x in the training set.', "Detailed explanation of the probability distribution and sampling process for y given the x's in the training set."]}], 'duration': 465.594, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ3585984.jpg', 'highlights': ['GLMs are a general way to model data, applicable to binary or real-valued data in the exponential family.', 'The assumption for regression involves a linear model with x and theta, where y is distributed as a Gaussian with mu as the mean and a variance of 1.', "The process involves working backward to find the optimal Theta that gives the line representing the means of the y's for a particular x.", 'The process of transforming theta to eta and running it through the sigmoid function to obtain phi is discussed, leading to the generation of a data distribution from the probability of y.', 'The goal of working backwards to find the theta given the data is emphasized as a key aspect of the discussion.', 'Explanation of how the probability of y is correlated to the sigmoid line and the generation of different Bernoulli distributions for each x in the training set.']}, {'end': 4916.954, 'segs': [{'end': 4141.229, 'src': 'embed', 'start': 4111.738, 'weight': 0, 'content': [{'end': 4122.548, 'text': 'So softmax regression is, so in the lecture notes, softmax regression is explained as yet another member of the GLM family.', 
'start': 4111.738, 'duration': 10.81}, {'end': 4135.38, 'text': "However, in today's lecture we'll be taking a non-GLM approach and kind of seeing and see how softmax is essentially doing what's also called as cross entropy minimization.", 'start': 4123.788, 'duration': 11.592}, {'end': 4141.229, 'text': "we'll end up with the same- same formulas and equations.", 'start': 4138.769, 'duration': 2.46}], 'summary': 'Softmax regression explained as cross entropy minimization, part of non-glm approach', 'duration': 29.491, 'max_score': 4111.738, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ4111738.jpg'}, {'end': 4384.373, 'src': 'heatmap', 'start': 4322.332, 'weight': 0.722, 'content': [{'end': 4334.217, 'text': 'And there are k such things where class is in your triangle, circle, square, etc.', 'start': 4322.332, 'duration': 11.885}, {'end': 4341.735, 'text': 'Right? So in logistic regression, we had just one Theta, which would do a binary, you know, yes versus no.', 'start': 4335.814, 'duration': 5.921}, {'end': 4349.057, 'text': 'Um, in Softmax we have one such vector of Theta per class, right?', 'start': 4342.376, 'duration': 6.681}, {'end': 4360.98, 'text': 'So you could also optionally represent them as a matrix, which is an N by K matrix, where you know you have a Theta class, Theta class right?', 'start': 4349.257, 'duration': 11.723}, {'end': 4373.083, 'text': "Um, so, in Softmax, uh regression um it's- it's a generalization of logistic regression, where you have a set of parameters per class, right?", 'start': 4361.66, 'duration': 11.423}, {'end': 4384.373, 'text': "And we're gonna do something um, something similar to um.", 'start': 4373.824, 'duration': 10.549}], 'summary': 'Softmax regression generalizes logistic regression to have a set of parameters per class.', 'duration': 62.041, 'max_score': 4322.332, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ4322332.jpg'}, {'end': 4384.373, 'src': 'embed', 'start': 4349.257, 'weight': 3, 'content': [{'end': 4360.98, 'text': 'So you could also optionally represent them as a matrix, which is an N by K matrix, where you know you have a Theta class, Theta class right?', 'start': 4349.257, 'duration': 11.723}, {'end': 4373.083, 'text': "Um, so, in Softmax, uh regression um it's- it's a generalization of logistic regression, where you have a set of parameters per class, right?", 'start': 4361.66, 'duration': 11.423}, {'end': 4384.373, 'text': "And we're gonna do something um, something similar to um.", 'start': 4373.824, 'duration': 10.549}], 'summary': 'Softmax regression is a generalization of logistic regression with a set of parameters per class.', 'duration': 35.116, 'max_score': 4349.257, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ4349257.jpg'}, {'end': 4541.985, 'src': 'embed', 'start': 4498.944, 'weight': 4, 'content': [{'end': 4512.41, 'text': "Um, and now, um, our goal is to take these parameters and let's see what happens when we feed a new example.", 'start': 4498.944, 'duration': 13.466}, {'end': 4527.598, 'text': 'So given an example x, we get a set of- given x, um, and over here we have classes.', 'start': 4513.231, 'duration': 14.367}, {'end': 4540.385, 'text': 'Right? So we have the circle class, the triangle class, the square class, right? 
So, um, over here we plot Theta class transpose x.', 'start': 4529.241, 'duration': 11.144}, {'end': 4541.985, 'text': 'So we may get something that looks like this.', 'start': 4540.385, 'duration': 1.6}], 'summary': 'Goal is to use parameters to classify new examples into classes.', 'duration': 43.041, 'max_score': 4498.944, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ4498944.jpg'}, {'end': 4832.601, 'src': 'heatmap', 'start': 4725.594, 'weight': 5, 'content': [{'end': 4731.737, 'text': "Let's say the point over there was- let- let's say it was a triangle for- for whatever reason.", 'start': 4725.594, 'duration': 6.143}, {'end': 4740.516, 'text': 'Right? If that was the triangle, then the P of y, which is also called the label.', 'start': 4732.093, 'duration': 8.423}, {'end': 4749.819, 'text': 'you can think of that as a probability distribution, which is 1 on the um, um correct class, and 0 elsewhere right?', 'start': 4740.516, 'duration': 9.303}, {'end': 4751.62, 'text': 'So P of y.', 'start': 4749.859, 'duration': 1.761}, {'end': 4755.841, 'text': 'This is essentially representing the one-hot representation as a probability distribution right?', 'start': 4751.62, 'duration': 4.221}, {'end': 4768.153, 'text': "Now the goal or- or um, the learning approach that we're gonna do is, in a way, minimize the distance between these two distributions right?", 'start': 4756.141, 'duration': 12.012}, {'end': 4769.414, 'text': 'This is one distribution.', 'start': 4768.433, 'duration': 0.981}, {'end': 4770.394, 'text': 'this is another distribution.', 'start': 4769.414, 'duration': 0.98}, {'end': 4774.436, 'text': 'We wanna change this distribution to look like that distribution right?', 'start': 4770.774, 'duration': 3.662}, {'end': 4781.959, 'text': 'Uh and- and uh, technically that- the term for that is minimize the cross entropy between the two distributions.', 'start': 4775.096, 'duration': 6.863}, {'end': 
4825.628, 'text': 'So the cross entropy between P and P hat is equal to minus the sum, for y in circle, triangle, square, of P of y times log P hat of y.', 'start': 4796.465, 'duration': 29.163}, {'end': 4832.601, 'text': "I don't think we'll have time to go over the interpretation of cross entropy, but you can pick that up.", 'start': 4828.359, 'duration': 4.242}], 'summary': 'The goal is to minimize cross entropy between two distributions in the learning approach.', 'duration': 57.505, 'max_score': 4725.594, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ4725594.jpg'}, {'end': 4916.732, 'src': 'heatmap', 'start': 4869.51, 'weight': 0.708, 'content': [{'end': 4880.676, 'text': 'transpose x over sum of class in triangle square circle e to the angle.', 'start': 4869.51, 'duration': 11.166}, {'end': 4889.729, 'text': 'And on this you-, you-, you- you treat this as the loss and do gradient descent.', 'start': 4884.305, 'duration': 5.424}, {'end': 4896.095, 'text': 'Gradient descent.', 'start': 4895.194, 'duration': 0.901}, {'end': 4898.577, 'text': 'with respect to the parameters, right?', 'start': 4896.095, 'duration': 2.482}, {'end': 4904.763, 'text': 'Um yeah, with- with- with that, I think.', 'start': 4899.458, 'duration': 5.305}, {'end': 4907.205, 'text': 'uh any- any questions on Softmax?', 'start': 4904.763, 'duration': 2.442}, {'end': 4916.732, 'text': "So we'll break for today in that case.", 'start': 4914.695, 'duration': 2.037}], 'summary': 'Discussing gradient descent and softmax in parameter optimization.', 'duration': 47.222, 'max_score': 4869.51, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ4869510.jpg'}], 'start': 4052.219, 'title': 'Softmax regression', 'summary': 'Covers the non-GLM approach to softmax regression, emphasizing cross entropy minimization and its interpretation in multiclass classification. 
It also explains softmax regression for multi-class classification, with the goal of predicting classes given new data points, using a set of parameters per class and minimizing cross entropy between predicted and true distributions.', 'chapters': [{'end': 4160.062, 'start': 4052.219, 'title': 'Softmax regression overview', 'summary': 'Covers the non-GLM approach to softmax regression, emphasizing cross entropy minimization and its interpretation in multiclass classification.', 'duration': 107.843, 'highlights': ['Softmax regression is explained as a non-GLM approach and is essentially doing cross entropy minimization, providing a clearer interpretation of multiclass classification.', 'The lecture focuses on a more intuitive understanding of softmax regression, emphasizing its cross entropy interpretation and its relevance in multiclass classification.', 'The chapter concludes with a discussion on multiclass classification and the application of softmax regression in this context.']}, {'end': 4916.954, 'start': 4161.042, 'title': 'Softmax regression classification', 'summary': 'Explains softmax regression for multi-class classification, with the goal of predicting classes given new data points, using a set of parameters per class and minimizing cross entropy between predicted and true distributions.', 'duration': 755.912, 'highlights': ['Softmax regression for multi-class classification', 'Use of a set of parameters per class', 'Minimizing cross entropy between predicted and true distributions']}], 'duration': 864.735, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iZTeva0WSTQ/pics/iZTeva0WSTQ4052219.jpg', 'highlights': ['Softmax regression is explained as a non-GLM approach and is essentially doing cross entropy minimization, providing a clearer interpretation of multiclass classification.', 'The lecture focuses on a more intuitive understanding of softmax regression, emphasizing its cross entropy interpretation and its relevance in 
multiclass classification.', 'The chapter concludes with a discussion on multiclass classification and the application of softmax regression in this context.', 'Softmax regression for multi-class classification', 'Use of a set of parameters per class', 'Minimizing cross entropy between predicted and true distributions']}], 'highlights': ['Detailed comparison of hypothesis functions and update rules between perceptron and logistic regression.', 'The differentiation of the log partition function yields the mean of the distribution, simplifying mean calculation', 'The second derivative yields the variance, offering a straightforward alternative for variance calculation', 'Softmax regression for multi-class classification', 'The update rule remains consistent across model types, showcasing flexibility and consistency in learning process.']}
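
The softmax segments in this transcript describe one parameter vector per class, a predicted distribution obtained by exponentiating and normalizing theta class transpose x, and gradient descent on the cross entropy against the one-hot label. A minimal sketch of that recipe in plain Python follows. The feature vector, learning rate, step count, and all function names are assumptions made for illustration and do not come from the lecture; only the three class names are taken from its example.

```python
import math

# Class names follow the lecture's example; everything else is illustrative.
CLASSES = ["circle", "triangle", "square"]


def softmax_probs(theta, x):
    """Return p_hat(y) for every class, given per-class parameters theta[c]."""
    logits = {c: sum(t * xi for t, xi in zip(theta[c], x)) for c in CLASSES}
    m = max(logits.values())  # subtract the max logit for numerical stability
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def cross_entropy(p, p_hat):
    """Compute -sum_y p(y) * log p_hat(y), skipping classes where p(y) == 0."""
    return -sum(p[c] * math.log(p_hat[c]) for c in CLASSES if p[c] > 0)


def gradient_step(theta, x, label, lr=0.1):
    """Take one gradient descent step on the cross entropy for one example."""
    p_hat = softmax_probs(theta, x)
    new_theta = {}
    for c in CLASSES:
        target = 1.0 if c == label else 0.0
        # Gradient of the loss with respect to theta_c is (p_hat(c) - p(c)) * x.
        new_theta[c] = [t - lr * (p_hat[c] - target) * xi
                        for t, xi in zip(theta[c], x)]
    return new_theta


# Hypothetical training example whose true class is "triangle".
x = [1.0, 2.0, -1.0]
theta = {c: [0.0, 0.0, 0.0] for c in CLASSES}
one_hot = {c: 1.0 if c == "triangle" else 0.0 for c in CLASSES}

loss_before = cross_entropy(one_hot, softmax_probs(theta, x))
for _ in range(50):
    theta = gradient_step(theta, x, "triangle")
loss_after = cross_entropy(one_hot, softmax_probs(theta, x))
```

Because the label distribution is one-hot, the cross entropy reduces to minus the log of the probability assigned to the correct class, which is the expression the lecture treats as the loss. With all parameters at zero the prediction is uniform over the three classes, so the starting loss is log 3.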