title
Lecture 14 - Expectation-Maximization Algorithms | Stanford CS229: Machine Learning (Autumn 2018)
description
For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3G6tSE6
Andrew Ng
Adjunct Professor of Computer Science
https://www.andrewng.org/
To follow along with the course schedule and syllabus, visit:
http://cs229.stanford.edu/syllabus-autumn2018.html
0:00 Introduction
1:15 Unsupervised learning
1:38 First unsupervised learning algorithm
1:54 Market Segmentation
5:33 Clustering algorithm
5:37 K-means clustering
5:52 Initialize the cluster centroids
12:10 Cost function
16:32 Density Estimation
18:01 Anomaly Detection
20:40 Mixture of Gaussians
29:27 Maximum Likelihood Estimates
31:44 Bayes Rule
48:12 Jensen's Inequality
57:57 Density Estimation Problem
59:32 Maximum Likelihood Estimation
1:07:16 Concave form of Jensen's Inequality
detail
{'title': 'Lecture 14 - Expectation-Maximization Algorithms | Stanford CS229: Machine Learning (Autumn 2018)', 'heatmap': [{'end': 1643.912, 'start': 1540.061, 'weight': 1}, {'end': 3629.141, 'start': 3570.47, 'weight': 0.776}], 'summary': "The lecture covers logistical details for a 48-hour take-home midterm and includes a discussion on EM. It introduces unsupervised learning and k-means clustering, discussing convergence, practical applications, and anomaly detection in aircraft engines. It also explains the EM algorithm for latent variable estimation, mixtures of Gaussians, the Gaussian mixture model, Jensen's inequality and its concave form, density estimation, and variational inference.", 'chapters': [{'end': 47.234, 'segs': [{'end': 47.234, 'src': 'embed', 'start': 4.608, 'weight': 0, 'content': [{'end': 4.948, 'text': 'All right.', 'start': 4.608, 'duration': 0.34}, {'end': 7.45, 'text': "Um, let's get started.", 'start': 6.029, 'duration': 1.421}, {'end': 12.034, 'text': "So um, let's see, logistical reminder.", 'start': 8.251, 'duration': 3.783}, {'end': 16.477, 'text': "uh, the class midterm um is this Wednesday and it's 48 hour.", 'start': 12.034, 'duration': 4.443}, {'end': 23.102, 'text': 'take home midterm um and the logistical details you can find, uh, at this Piazza post right?', 'start': 16.477, 'duration': 6.625}, {'end': 24.743, 'text': 'So the midterm will start Wednesday evening.', 'start': 23.122, 'duration': 1.621}, {'end': 33.35, 'text': "You have 48 hours to do it and then submit it online through Gradescope. uh and uh. because of the midterm, there won't be a section uh this Friday.", 'start': 25.164, 'duration': 8.186}, {'end': 40.642, 'text': "Okay? Oh, and the midterm will cover everything up to and including EM, uh, which we'll spend most of today talking about.", 'start': 33.61, 'duration': 7.032}, {'end': 43.126, 'text': "Um, okay? Susie, don't look so stressed.", 'start': 41.062, 'duration': 2.064}, {'end': 43.566, 'text': "It'll be fun.", 'start': 43.166, 'duration': 0.4}, {'end': 45.309, 'text': 'All right.', 'start': 43.586, 'duration': 1.723}, {'end': 46.01, 'text': 'Well, maybe.', 'start': 45.389, 'duration': 0.621}, {'end': 47.234, 'text': 'All right.', 'start': 47.014, 'duration': 0.22}], 'summary': 'Class midterm this Wednesday, 48-hour take-home, covers content up to and including EM.', 'duration': 42.626, 'max_score': 4.608, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA4608.jpg'}], 'start': 4.608, 'title': 'Midterm logistics and EM discussion', 'summary': 'Covers logistical details for a 48-hour take-home midterm happening this Wednesday and includes a discussion on EM. 
It reassures that it will be fun.', 'chapters': [{'end': 47.234, 'start': 4.608, 'title': 'Class midterm logistics and EM discussion', 'summary': 'Covers logistical details for a 48-hour take-home midterm happening this Wednesday and includes a discussion on EM, with the reassurance that it will be fun.', 'duration': 42.626, 'highlights': ['Logistical reminder for a 48-hour take-home midterm this Wednesday, with details on submission through Gradescope and coverage up to and including EM.', 'No section this Friday due to the midterm.', 'Reassurance that the midterm will be fun and discussion on EM.']}], 'duration': 42.626, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA4608.jpg', 'highlights': ['Logistical reminder for a 48-hour take-home midterm this Wednesday, with details on submission through Gradescope and coverage up to and including EM.', 'Reassurance that the midterm will be fun and discussion on EM.', 'No section this Friday due to the midterm.']}, {'end': 674.498, 'segs': [{'end': 133.837, 'src': 'embed', 'start': 71.599, 'weight': 0, 'content': [{'end': 75.2, 'text': 'or an SVM or something to find the line, find the decision boundary between them.', 'start': 71.599, 'duration': 3.601}, {'end': 80.057, 'text': "Um, in unsupervised learning, you're given unlabeled data.", 'start': 75.896, 'duration': 4.161}, {'end': 86.559, 'text': "So rather than given data with x and y, you're given only x.", 'start': 80.518, 'duration': 6.041}, {'end': 97.466, 'text': "And so your training set now looks like x1, x2, up through xm, and you're asked to find something interesting about the data.", 'start': 86.559, 'duration': 10.907}, {'end': 103.711, 'text': "Um. so the first unsupervised learning algorithm we'll talk about is clustering, in which, given a dataset like this,", 'start': 98.087, 'duration': 5.624}, {'end': 110.716, 'text': 'hopefully we can have an algorithm that can figure out that, um, this dataset has two separate clusters.', 'start': 103.711, 'duration': 7.005}, {'end': 116.58, 'text': 'Um, and so one of the most common uses of clustering is, uh, market segmentation.', 'start': 111.276, 'duration': 5.304}, {'end': 118.822, 'text': 'If you have a website, you know, selling things online.', 'start': 116.82, 'duration': 2.002}, {'end': 125.128, 'text': 'We have a huge database of many different users and run clustering to decide what are the different market segments right?', 'start': 119.162, 'duration': 5.966}, {'end': 131.014, 'text': 'So there may be, you know, people of a certain age range, of a certain gender, people of different age range, different level of education,', 'start': 125.168, 'duration': 5.846}, {'end': 133.837, 'text': 'people that live in the East Coast versus West Coast versus elsewhere in the country.', 'start': 131.014, 'duration': 2.823}], 'summary': 'Unsupervised learning involves finding patterns in unlabeled data, such as clustering for market segmentation.', 'duration': 62.238, 'max_score': 71.599, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA71599.jpg'}, {'end': 205.792, 'src': 'embed', 'start': 175.846, 'weight': 2, 'content': [{'end': 181.35, 'text': "And, uh, the cluster centroids are your best guess for where the centers of the two clusters you're trying to find.", 'start': 175.846, 'duration': 5.504}, {'end': 183.952, 'text': 'And then k-means is an iterative algorithm.', 'start': 181.93, 'duration': 2.022}, {'end': 186.115, 'text': 
'And repeatedly, you do two things.', 'start': 184.173, 'duration': 1.942}, {'end': 189.198, 'text': 'The first thing is go through each of your training examples.', 'start': 186.155, 'duration': 3.043}, {'end': 189.818, 'text': "Oh, I'm sorry.", 'start': 189.318, 'duration': 0.5}, {'end': 190.899, 'text': 'Oh, okay.', 'start': 190.259, 'duration': 0.64}, {'end': 191.28, 'text': 'Thank you.', 'start': 190.979, 'duration': 0.301}, {'end': 191.7, 'text': 'All right.', 'start': 191.3, 'duration': 0.4}, {'end': 193.642, 'text': 'Okay Uh, let me know if it happens again.', 'start': 192.281, 'duration': 1.361}, {'end': 194.523, 'text': 'Okay Right.', 'start': 193.782, 'duration': 0.741}, {'end': 197.266, 'text': 'So, um, you guys saw that, right? So, right.', 'start': 194.703, 'duration': 2.563}, {'end': 198.267, 'text': 'You have two cluster centroids.', 'start': 197.286, 'duration': 0.981}, {'end': 205.792, 'text': 'So the first thing you do is go through each of your training examples, the green dots, and for each of them you color them either red or blue,', 'start': 198.787, 'duration': 7.005}], 'summary': 'K-means algorithm iteratively finds cluster centroids, assigning points to the closest centroid.', 'duration': 29.946, 'max_score': 175.846, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA175846.jpg'}, {'end': 319.845, 'src': 'embed', 'start': 296.448, 'weight': 5, 'content': [{'end': 304.111, 'text': 'So if you look at this picture and you repeatedly color each point red or blue depending on which cluster centroid is closer, nothing changes.', 'start': 296.448, 'duration': 7.663}, {'end': 311.275, 'text': 'And if you repeatedly look at each of the two clusters of colored dots and compute the mean and move the cluster centroids there, nothing changes.', 'start': 304.532, 'duration': 6.743}, {'end': 314.756, 'text': 'So this algorithm has converged even if you keep on running these two steps.', 'start': 311.315, 'duration': 3.441}, {'end': 319.845, 'text': "Okay? So, um, Let's see.", 'start': 315.156, 'duration': 4.689}], 'summary': 'Convergent algorithm for clustering with two steps.', 'duration': 23.397, 'max_score': 296.448, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA296448.jpg'}], 'start': 47.514, 'title': 'Unsupervised learning and k-means clustering', 'summary': 'Introduces unsupervised learning, focusing on the k-means clustering algorithm and its application in market segmentation to identify user segments based on demographics and locations. 
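As a concrete companion to the two-step loop described above (color each point by its closest centroid, then move each centroid to the mean of its points), here is a minimal NumPy sketch of k-means; the function name, the random initialization scheme, and the convergence test are illustrative assumptions rather than anything specified in the lecture.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (m, n) array of m examples."""
    rng = np.random.default_rng(seed)
    # Initialize the cluster centroids to k randomly chosen examples.
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1 ("color the points"): assign each example to the
        # closest cluster centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Step 2 ("move the crosses"): move each centroid to the mean
        # of the points currently assigned to it.
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        # If neither step changes anything, the algorithm has converged.
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return c, mu
```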
It also explains the iterative process of the k-means algorithm and how it converges to stable clusters.', 'chapters': [{'end': 133.837, 'start': 47.514, 'title': 'Introduction to unsupervised learning', 'summary': 'Introduces unsupervised learning, focusing on clustering as the first algorithm, highlighting its application in market segmentation to identify different user segments based on various demographics and locations.', 'duration': 86.323, 'highlights': ['Clustering is an unsupervised learning algorithm that can identify separate clusters within a dataset, commonly used for market segmentation.', 'Market segmentation can be done using clustering to identify different user segments based on demographics and locations.', 'Unsupervised learning involves working with unlabeled data, unlike supervised learning which uses labeled data with x and y values.']}, {'end': 674.498, 'start': 134.258, 'title': 'K-means clustering algorithm', 'summary': 'Introduces the k-means clustering algorithm, which iteratively assigns data points to clusters and moves cluster centroids based on the mean of the points, ultimately converging to stable clusters.', 'duration': 540.24, 'highlights': ['K-means clustering algorithm iteratively assigns data points to clusters and moves cluster centroids based on the mean of the points, ultimately converging to stable clusters.', 'The first step of k-means is to pick two points denoted by the two crosses called cluster centroids.', 'The algorithm has converged when repeatedly coloring each point and computing the mean of the colored clusters results in no changes.', 'The k-means algorithm involves initializing cluster centroids and repeating the process until convergence.', 'The k-means algorithm involves assigning data points to clusters based on proximity to cluster centroids.']}], 'duration': 626.984, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA47514.jpg', 'highlights': ['Clustering is an unsupervised learning algorithm for market segmentation.', 'Market segmentation can be done using clustering to identify user segments.', 'K-means algorithm iteratively assigns data points to clusters and moves centroids.', 'Unsupervised learning involves working with unlabeled data.', 'K-means algorithm involves initializing centroids and repeating the process.', 'K-means algorithm converges when computing the mean of clusters results in no changes.', 'K-means algorithm involves assigning data points to clusters based on proximity.']}, {'end': 994.631, 'segs': [{'end': 757.935, 'src': 'embed', 'start': 674.598, 'weight': 1, 'content': [{'end': 674.818, 'text': 'All right.', 'start': 674.598, 'duration': 0.22}, {'end': 676.6, 'text': 'Is it okay? Thank you.', 'start': 675.098, 'duration': 1.502}, {'end': 688.5, 'text': "Right And this wasn't part of it.", 'start': 687.203, 'duration': 1.297}, {'end': 705.769, 'text': 'All right.', 'start': 705.509, 'duration': 0.26}, {'end': 712.078, 'text': 'Um, so it turns out that, uh, um, this algorithm can be proven to converge.', 'start': 706.05, 'duration': 6.028}, {'end': 716.123, 'text': 'Um, the- the exactly why is written out in the lecture notes.', 'start': 712.498, 'duration': 3.625}, {'end': 728.511, 'text': 'But it turns out if you write this as a cost function, Right.', 'start': 716.163, 'duration': 12.348}, {'end': 739.143, 'text': 'Um. 
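The cost function about to be written out here is the distortion function from the lecture notes: for assignments c and centroids mu,

```latex
J(c, \mu) = \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2}
```

Each of the two k-means steps can only decrease J, and J is non-negative, which is the convergence argument made in the following segment.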
so the cost function for a certain set of assignments of uh points of examples to cluster centroids and for a certain set of positions of the cluster centroids.', 'start': 729.112, 'duration': 10.031}, {'end': 746.572, 'text': 'So- so C, these are the assignments and these are the centroids.', 'start': 739.183, 'duration': 7.389}, {'end': 757.935, 'text': "Right?. So- so this cost here is sum over your training set of what's the squared distance between each point and the cluster centroid that it's assigned to.", 'start': 748.834, 'duration': 9.101}], 'summary': 'Algorithm can be proven to converge with a certain cost function.', 'duration': 83.337, 'max_score': 674.598, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA674598.jpg'}, {'end': 879.805, 'src': 'embed', 'start': 777.881, 'weight': 0, 'content': [{'end': 783.391, 'text': "And so this shows that k-means must converge, or at least this function must converge, because it's, uh, uh,", 'start': 777.881, 'duration': 5.51}, {'end': 786.313, 'text': "strictly non-negative function that's going down on every iteration.", 'start': 783.391, 'duration': 2.922}, {'end': 790.676, 'text': 'So at some point, it has to stop going down and then you could declare K-means to have converged.', 'start': 786.333, 'duration': 4.343}, {'end': 800.442, 'text': 'Um. in practice, if you run K-means in a very, very large dataset, then, as you plot the number of iterations, uh, J may go down and you know,', 'start': 791.016, 'duration': 9.426}, {'end': 807.146, 'text': "and and just because of um lack of compute or lack of patience, you might just stop this running after a while if it's going down too slowly.", 'start': 800.442, 'duration': 6.704}, {'end': 812.369, 'text': "So that's sort of K-means in practice where maybe it hasn't totally converged, it's just cut it off and call it good enough.", 'start': 807.266, 'duration': 5.103}, {'end': 820.422, 'text': 'Um now, uh, uh, the most frequently asked question I get for k-means is uh, how do you choose k?', 'start': 813.616, 'duration': 6.806}, {'end': 822.523, 'text': 'It turns out that, um.', 'start': 820.902, 'duration': 1.621}, {'end': 825.726, 'text': 'uh, when I use k-means, I still usually choose k by hand.', 'start': 822.523, 'duration': 3.203}, {'end': 830.47, 'text': 'Uh and so- and-, and this is why, which is in unsupervised learning.', 'start': 825.746, 'duration': 4.724}, {'end': 834.853, 'text': "um, sometimes it's just ambiguous, right?", 'start': 830.47, 'duration': 4.383}, {'end': 841.639, 'text': 'how many clusters there are right?', 'start': 834.853, 'duration': 6.786}, {'end': 848.963, 'text': 'Um, with this dataset, some of you will see two clusters and some of you will see four clusters.', 'start': 842.378, 'duration': 6.585}, {'end': 853.826, 'text': "And it's just inherently ambiguous what is the right number of clusters.", 'start': 850.183, 'duration': 3.643}, {'end': 859.99, 'text': 'So there are some formulas you can find online, the criteria like AIC and BIC for automatically choosing number of clusters.', 'start': 853.886, 'duration': 6.104}, {'end': 866.955, 'text': 'In practice, I tend not to use them because, uh um, I usually look at the downstream application of what.', 'start': 860.45, 'duration': 6.505}, {'end': 870.858, 'text': 'do you actually want to use k-means for in order to make a decision on the number of clusters?', 'start': 866.955, 'duration': 3.903}, {'end': 874.941, 'text': "So, for example, if you're 
doing a market segmentation, um, you know,', 'start': 870.898, 'duration': 4.043}, {'end': 879.805, 'text': 'because your marketers want to design different marketing campaigns right for different groups of users,', 'start': 874.941, 'duration': 4.864}], 'summary': 'K-means must converge, but in practice, manual k selection is common due to ambiguous cluster count and downstream application requirements.', 'duration': 101.924, 'max_score': 777.881, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA777881.jpg'}, {'end': 971.134, 'src': 'embed', 'start': 941.877, 'weight': 4, 'content': [{'end': 944.619, 'text': 'Uh, will K-means get stuck in local minima? Uh, or, or, yes.', 'start': 941.877, 'duration': 2.742}, {'end': 947.061, 'text': 'Uh, K-means is just susceptible to local minima sometimes.', 'start': 944.839, 'duration': 2.222}, {'end': 954.555, 'text': "And so, if you're worried about local minima, the thing you can do is, uh, run K-means, say 10 times or 100 times, or 1,", 'start': 947.221, 'duration': 7.334}, {'end': 959.229, 'text': '000 times from different random initializations of the cluster centroids and then run it.', 'start': 954.555, 'duration': 4.674}, {'end': 965.673, 'text': 'you know, say 100 times, uh, and then pick whichever run resulted in the lowest value for this cost function.', 'start': 959.229, 'duration': 6.444}, {'end': 971.134, 'text': 'All right.', 'start': 970.794, 'duration': 0.34}], 'summary': 'K-means may encounter local minima. Run it multiple times and choose the lowest cost function result.', 'duration': 29.257, 'max_score': 941.877, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA941877.jpg'}], 'start': 674.598, 'title': 'Convergence and practical application of k-means', 'summary': 'Discusses the convergence of the k-means algorithm, showing that its cost function decreases on each iteration, and addresses practical aspects such as choosing the number of clusters using AIC and BIC criteria.', 'chapters': [{'end': 757.935, 'start': 674.598, 'title': 'Convergence of algorithm and cost function', 'summary': "Discusses the convergence of an algorithm, proven to converge, and the cost function defined as the sum of squared distances between each point and the cluster centroid it's assigned to.", 'duration': 83.337, 'highlights': ['The algorithm can be proven to converge, as explained in the lecture notes.', "The cost function is defined as the sum of squared distances between each point and the cluster centroid it's assigned to."]}, {'end': 859.99, 'start': 759.275, 'title': 'Convergence and practical application of k-means', 'summary': 'Discusses the convergence of the k-means algorithm, showing its cost function goes down on every iteration, proving its convergence, and also addresses the practical aspect of k-means where the choice of k is ambiguous and suggests using AIC and BIC criteria for automatically choosing the number of clusters.', 'duration': 100.715, 'highlights': ["K-means algorithm's cost function goes down on every iteration, proving its convergence, and at some point, it has to stop going down, leading to the declaration of convergence.", 'In practice, when running K-means on a very large dataset, the cost function J may go down slowly due to lack of compute or patience, leading to early termination.', 'Choosing the right number of clusters, k, in unsupervised learning, particularly with ambiguous datasets, remains a challenge, and manually 
choosing k is still a common practice.', 'AIC and BIC criteria are suggested for automatically choosing the number of clusters in k-means, providing a solution to the ambiguity in selecting k.']}, {'end': 994.631, 'start': 860.45, 'title': 'Choosing clusters for k-means', 'summary': 'Explains the decision-making process to choose the number of clusters for k-means, including examples such as market segmentation and image compression, and advises on dealing with local minima by running k-means multiple times and selecting the lowest cost function value.', 'duration': 134.181, 'highlights': ['The chapter discusses the decision-making process for choosing the number of clusters for K-means based on downstream application, such as market segmentation and image compression.', 'The chapter advises on dealing with local minima in K-means by running it multiple times and selecting the lowest cost function value.', 'The chapter briefly mentions density estimation as a related problem but emphasizes its differences when deriving the algorithms.']}], 'duration': 320.033, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA674598.jpg', 'highlights': ['AIC and BIC criteria help automatically choose the number of clusters in k-means', 'The algorithm can be proven to converge, as explained in the lecture notes', "The cost function is defined as the sum of squared distances between each point and the cluster centroid it's assigned to", 'Choosing the right number of clusters, k, in unsupervised learning remains a challenge', 'Dealing with local minima in K-means by running it multiple times and selecting the lowest cost function value', 'The chapter discusses the decision-making process for choosing the number of clusters for K-means based on downstream application', 'In practice, when running K-means on a very large dataset, the cost function J may go down slowly due to lack of compute or patience, leading to early termination', "K-means algorithm's cost function goes down on every iteration, proving its convergence, and at some point, it has to stop going down, leading to the declaration of convergence", 'Manually choosing k is still a common practice in unsupervised learning']}, {'end': 1643.912, 'segs': [{'end': 1100.544, 'src': 'embed', 'start': 1072.41, 'weight': 0, 'content': [{'end': 1080.294, 'text': 'before you ship the airplane- before you ship the engine to- to an airplane maker and then something goes wrong in the air and there's a- there's a major accident or major disaster,', 'start': 1072.41, 'duration': 7.884}, {'end': 1082.495, 'text': 'right?. 
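A sketch of the random-restart recipe from the k-means discussion above: run k-means many times from different random initializations and keep the run with the lowest distortion. The helper names below, and the reuse of the kmeans sketch given earlier, are illustrative assumptions.

```python
def distortion(X, c, mu):
    # J(c, mu): sum of squared distances from each point to its centroid.
    return float(((X - mu[c]) ** 2).sum())

def kmeans_with_restarts(X, k, n_restarts=100):
    best_J, best = float('inf'), None
    for seed in range(n_restarts):
        c, mu = kmeans(X, k, seed=seed)   # kmeans as sketched earlier
        J = distortion(X, c, mu)
        if J < best_J:                    # keep the lowest-cost run
            best_J, best = J, (c, mu)
    return best
```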
And so anomaly detection.', 'start': 1080.294, 'duration': 2.201}, {'end': 1093.196, 'text': 'uh, uh, is most commonly done, or one of the common ways to um implement anomaly detection is to model P of x, which is.', 'start': 1082.495, 'duration': 10.701}, {'end': 1100.544, 'text': 'given all of these blue examples, given all of these dots, can you model what is the density from which x was drawn?', 'start': 1093.196, 'duration': 7.348}], 'summary': 'Anomaly detection is commonly implemented by modeling P of x, the density from which the examples were drawn.', 'duration': 28.134, 'max_score': 1072.41, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA1072410.jpg'}, {'end': 1169.898, 'src': 'embed', 'start': 1138.11, 'weight': 3, 'content': [{'end': 1144.912, 'text': "So if one day one of the cell towers start throwing off network patterns that seem very unusual, then maybe something's wrong with that cell tower,", 'start': 1138.11, 'duration': 6.802}, {'end': 1146.052, 'text': "like something's gone wrong.", 'start': 1144.912, 'duration': 1.14}, {'end': 1148.453, 'text': 'We could send out a technician to fix it.', 'start': 1146.072, 'duration': 2.381}, {'end': 1150.773, 'text': "It's also used for computer security.", 'start': 1149.293, 'duration': 1.48}, {'end': 1157.475, 'text': 'If a computer say if a computer at Stanford starts sending out very strange, you know, uh uh, network traffic.', 'start': 1150.853, 'duration': 6.622}, {'end': 1162.936, 'text': "that's very unusual relative to everything that's done before, relative to what, what is this very anomalous network traffic?", 'start': 1157.475, 'duration': 5.461}, {'end': 1167.477, 'text': 'then maybe IT staff should have a look to see if, uh, that particular computer has been hacked right?', 'start': 1162.936, 'duration': 4.541}, {'end': 1169.898, 'text': 'So these are some of the applications of anomaly detection.', 'start': 1167.497, 'duration': 2.401}], 'summary': 'Anomaly detection can help detect unusual network patterns in cell towers and computers, prompting necessary action.', 'duration': 31.788, 'max_score': 1138.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA1138110.jpg'}, {'end': 1267.604, 'src': 'embed', 'start': 1248.551, 'weight': 1, 'content': [{'end': 1259.516, 'text': "maybe there's one type of aircraft engine that that you know is drawn from a Gaussian like the one below and a separate aircraft type of aircraft engine that's drawn from a Gaussian like that above.", 'start': 1248.551, 'duration': 10.965}, {'end': 1267.604, 'text': "And this is why there's a lot of probability mass in this L-shaped region, but very low probability outside that L-shaped region, right?", 'start': 1259.796, 'duration': 7.808}], 'summary': 'Two types of aircraft engines are drawn from different Gaussian distributions, resulting in high probability mass in an L-shaped region.', 'duration': 19.053, 'max_score': 1248.551, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA1248551.jpg'}, {'end': 1435.92, 'src': 'embed', 'start': 1411.616, 'weight': 2, 'content': [{'end': 1419.36, 'text': 'Uh, the problem with this density estimation problem is you just see this data and maybe the data came from two different Gaussians,', 'start': 1411.616, 'duration': 7.744}, {'end': 1421.782, 'text': "but you don't know which example actually came from which Gaussian.", 'start': 1419.36, 
'duration': 2.422}, {'end': 1431.013, 'text': 'Okay?. So the EM algorithm or the expectation maximization algorithm will allow us to, uh, fit a model,', 'start': 1422.242, 'duration': 8.771}, {'end': 1435.92, 'text': 'despite not knowing which Gaussian each example had come from.', 'start': 1431.013, 'duration': 4.907}], 'summary': 'The EM algorithm solves the density estimation problem with unknown Gaussian origins.', 'duration': 24.304, 'max_score': 1411.616, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA1411616.jpg'}, {'end': 1643.912, 'src': 'heatmap', 'start': 1540.061, 'weight': 1, 'content': [{'end': 1549.492, 'text': "So- so let's imagine that, um, there's some hidden random variable Z and- and the term latent just means hidden or unobserved, right?", 'start': 1540.061, 'duration': 9.431}, {'end': 1552.035, 'text': "It means that it exists, but you don't get to see the value directly.", 'start': 1549.512, 'duration': 2.523}, {'end': 1555.438, 'text': 'So when I say latent, it just means hidden or unobserved.', 'start': 1552.595, 'duration': 2.843}, {'end': 1562.982, 'text': "So let's imagine that there's a hidden or latent random variable z, and, uh, xi and zi have this joint distribution.", 'start': 1555.999, 'duration': 6.983}, {'end': 1567.384, 'text': 'And this- this- this is very, very similar to the model you saw in Gaussian discriminant analysis.', 'start': 1563.062, 'duration': 4.322}, {'end': 1571.806, 'text': 'Uh, but zi is multinomial with some set of parameters phi.', 'start': 1567.724, 'duration': 4.082}, {'end': 1576.188, 'text': 'Uh, for a mixture of two Gaussians, this would just be Bernoulli with two values.', 'start': 1572.546, 'duration': 3.642}, {'end': 1581.59, 'text': 'But if you have a mixture of k Gaussians, then z, you know, can take on values from 1 through k.', 'start': 1576.248, 'duration': 5.342}, {'end': 1586.536, 'text': "Right? Um, and if it was two Gaussians, it'd just be Bernoulli.", 'start': 1583.575, 'duration': 2.961}, {'end': 1592.778, 'text': 'And then, once you know that one example comes from uh, Gaussian number j,', 'start': 1587.376, 'duration': 5.402}, {'end': 1601.822, 'text': 'then x conditioned on z i being equal to j is drawn from a Gaussian distribution with some mean and some uh covariance sigma.', 'start': 1592.778, 'duration': 9.044}, {'end': 1612.695, 'text': "Okay? 
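The generative story just described, z drawn from a multinomial with parameters phi and then x | z = j drawn from N(mu_j, Sigma_j), can be sketched as follows; the particular parameter values and the two-dimensional setting are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters for a mixture of k = 2 Gaussians in 2-D.
phi = np.array([0.6, 0.4])                      # P(z = j), multinomial
mu = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigma = [np.eye(2), np.eye(2)]                  # one covariance per Gaussian

def sample(m):
    """Generate m examples; z(i) is produced but never observed (latent)."""
    z = rng.choice(len(phi), size=m, p=phi)     # z(i) ~ Multinomial(phi)
    x = np.stack([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])
    return x                                    # only x is returned
```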
So the two unimportant ways this is different than GDA, um, one, well, I've set z to be one of k values instead of one of two values.', 'start': 1602.262, 'duration': 10.433}, {'end': 1619.323, 'text': 'In GDA, Gaussian discriminant analysis, we had, um, z, you know, uh, y, the labels y took on one of two values.', 'start': 1612.795, 'duration': 6.528}, {'end': 1623.645, 'text': 'Uh, and then second is, uh, I have Sigma j instead of Sigma.', 'start': 1620.104, 'duration': 3.541}, {'end': 1629.827, 'text': 'So by- by convention, when we fit mixture of Gaussians models, we let each Gaussian have its own covariance matrix Sigma,', 'start': 1623.785, 'duration': 6.042}, {'end': 1631.628, 'text': 'but you could actually force them to be the same.', 'start': 1629.827, 'duration': 1.801}, {'end': 1633.448, 'text': 'So- but these are the trivial differences.', 'start': 1631.648, 'duration': 1.8}, {'end': 1643.912, 'text': 'Uh, the most significant difference is that in Gaussian discriminant analysis, we had labeled examples x i, y i, whereas y was observed.', 'start': 1634.129, 'duration': 9.783}], 'summary': 'The transcript discusses latent variables and their joint distributions, emphasizing differences from Gaussian discriminant analysis.', 'duration': 103.851, 'max_score': 1540.061, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA1540061.jpg'}], 'start': 994.731, 'title': 'Anomaly detection in aircraft engines', 'summary': 'Explores anomaly detection for aircraft engines using vibration and heat features, highlighting its role in preventing accidents. It also covers its applications in telecom and computer security, along with the use of the mixture of Gaussians model and the EM algorithm for identifying anomalies in aircraft engine performance.', 'chapters': [{'end': 1169.898, 'start': 994.731, 'title': 'Anomaly detection in aircraft engines', 'summary': 'Discusses the application of anomaly detection in identifying unusual aircraft engine signatures based on vibration and heat features to prevent accidents, as well as its applications in telecom and computer security.', 'duration': 175.167, 'highlights': ['Anomaly detection is crucial in identifying unusual aircraft engine signatures based on vibration and heat features to prevent accidents, with the aim of inspecting or testing engines before they are shipped to airplane makers to avoid potential disasters.', 'The model P of x is commonly used to implement anomaly detection by modeling the density from which x was drawn, flagging an anomaly when P of x is very small, thus indicating the need for further inspection or testing of the aircraft engine.', 'Anomaly detection is also applied in telecoms to detect network issues in cell towers, enabling the timely dispatch of technicians for repairs, and in computer security to identify anomalous network traffic, potentially indicating a security breach.']}, {'end': 1643.912, 'start': 1170.658, 'title': 'Anomaly detection with mixture of Gaussians', 'summary': 'Discusses the application of the mixture of Gaussians model for anomaly detection, using latent variables and the EM algorithm, to model complex data distributions and identify anomalies in aircraft engine performance.', 'duration': 473.254, 'highlights': ['The mixture of Gaussians model is used for anomaly detection by identifying anomalies in aircraft engine performance based on the combination of vibration and heat signatures, where neither feature alone is unusual, but the 
combination of the two is considered unusual.', 'The EM algorithm is described as a method to fit the mixture of Gaussians model, allowing the model to be developed despite not knowing which Gaussian each example came from, addressing the problem of not having labels for the data.', 'The use of latent variables is discussed, where a hidden random variable z and xi have a joint distribution, similar to the model in Gaussian discriminant analysis, but with z taking on k values instead of two, and each Gaussian having its own covariance matrix.']}], 'duration': 649.181, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA994731.jpg', 'highlights': ['Anomaly detection crucial for identifying unusual aircraft engine signatures to prevent accidents', 'Mixture of Gaussians model used for identifying anomalies in aircraft engine performance', 'EM algorithm used to fit the mixture of Gaussians model for anomaly detection', 'Anomaly detection applied in telecoms to detect network issues in cell towers', 'Anomaly detection applied in computer security to identify anomalous network traffic']}, {'end': 2230.316, 'segs': [{'end': 1724.464, 'src': 'embed', 'start': 1685.948, 'weight': 2, 'content': [{'end': 1699.464, 'text': "So if we knew the z i's right, then we can use um maximum likelihood estimation, right?", 'start': 1685.948, 'duration': 13.516}, {'end': 1703.185, 'text': "So if only we knew the value of the z i's, which we don't, but if only we did,", 'start': 1699.504, 'duration': 3.681}, {'end': 1707.607, 'text': 'then we could use maximum likelihood estimation or MLE to estimate everything.', 'start': 1703.185, 'duration': 4.422}, {'end': 1720.572, 'text': 'You know, so we will write the log likelihood of the parameters right equals, sum, um log p of x, i, z, i.', 'start': 1707.767, 'duration': 12.805}, {'end': 1724.464, 'text': 'you know, given the parameters Right?', 'start': 1720.572, 'duration': 3.892}], 'summary': "Using mle, we could estimate parameters if we knew z i's.", 'duration': 38.516, 'max_score': 1685.948, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA1685948.jpg'}, {'end': 1811.089, 'src': 'embed', 'start': 1773.758, 'weight': 3, 'content': [{'end': 1776.959, 'text': 'Actually, these two are exactly the formulas.', 'start': 1773.758, 'duration': 3.201}, {'end': 1783.243, 'text': "uh, we had for uh, Gaussian discriminant analysis, except we've replaced y with z right?", 'start': 1776.959, 'duration': 6.284}, {'end': 1785.324, 'text': "And then there's some other formula for sigma.", 'start': 1783.303, 'duration': 2.021}, {'end': 1787.825, 'text': "that's written in the lecture notes, but I won't- but I won't write down here.", 'start': 1785.324, 'duration': 2.501}, {'end': 1797.779, 'text': "Okay Um, but the reason we can't use this- use these formulas is we don't actually know what are the values of Z.", 'start': 1788.386, 'duration': 9.393}, {'end': 1811.089, 'text': 'So what we will do in the EM algorithm is, um, two steps.', 'start': 1797.779, 'duration': 13.31}], 'summary': 'Comparison of formulas for gaussian discriminant analysis and the limitation in using them due to the unknown values of z in the em algorithm.', 'duration': 37.331, 'max_score': 1773.758, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA1773758.jpg'}, {'end': 1898.126, 'src': 'embed', 'start': 1835.316, 'weight': 0, 'content': [{'end': 
1838.659, 'text': 'using your guesses, and then you update your guesses and then run the algorithm again.', 'start': 1835.316, 'duration': 3.343}, {'end': 1841.401, 'text': 'Let me- let me make that concrete by writing this down.', 'start': 1838.739, 'duration': 2.662}, {'end': 1857.138, 'text': 'So the EM algorithm has two steps.', 'start': 1855.256, 'duration': 1.882}, {'end': 1896.325, 'text': 'The E step, um, also called the expectation step, is: set w ij. So w i, j um is going to be the probability that, uh, z i is equal to j, okay?', 'start': 1858.9, 'duration': 37.425}, {'end': 1898.126, 'text': 'Um, given all the parameters.', 'start': 1896.945, 'duration': 1.181}], 'summary': 'The EM algorithm has two steps: the E-step and the M-step.', 'duration': 62.81, 'max_score': 1835.316, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA1835316.jpg'}, {'end': 2034.633, 'src': 'embed', 'start': 2010.137, 'weight': 1, 'content': [{'end': 2018.183, 'text': "It's just read it off one of the parameters in your multinomial probability for um, for the odds of Z being different values, okay?", 'start': 2010.137, 'duration': 8.046}, {'end': 2022.405, 'text': 'And so um, and similarly the terms in the denominator.', 'start': 2018.863, 'duration': 3.542}, {'end': 2030.73, 'text': 'this term here is from a Gaussian, and that second term is from the um multinomial probability that you have for Z.', 'start': 2022.405, 'duration': 8.325}, {'end': 2034.633, 'text': "Um, and so that's how you plug in all of these numbers and use Bayes' rule.", 'start': 2030.73, 'duration': 3.903}], 'summary': "Using Bayes' rule, calculate the multinomial probabilities for different values of z.", 'duration': 24.496, 'max_score': 2010.137, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2010137.jpg'}, {'end': 2170.155, 'src': 'embed', 'start': 2136.693, 'weight': 4, 'content': [{'end': 2137.113, 'text': 'Oh, this one.', 'start': 2136.693, 'duration': 0.42}, {'end': 2137.954, 'text': 'Sorry Yes.', 'start': 2137.393, 'duration': 0.561}, {'end': 2138.574, 'text': 'Thank you.', 'start': 2137.994, 'duration': 0.58}, {'end': 2139.715, 'text': 'Yeah Thank you.', 'start': 2138.894, 'duration': 0.821}, {'end': 2142.336, 'text': 'Yes This is a Yes.', 'start': 2140.435, 'duration': 1.901}, {'end': 2144.838, 'text': 'Um, okay.', 'start': 2143.837, 'duration': 1.001}, {'end': 2151.282, 'text': "So in the- so the E-step tells us you know we're trying to guess the values of the z's right?", 'start': 2146.179, 'duration': 5.103}, {'end': 2155.424, 'text': "We figure out what's the probability of z being 1, 2, 3, 4 up to k, and we'll store it here.", 'start': 2151.342, 'duration': 4.082}, {'end': 2163.849, 'text': 'And then, in the M-step, what we're going to do is use the formulas we have for maximum likelihood estimation,', 'start': 2156.324, 'duration': 7.525}, {'end': 2170.155, 'text': 'And I want you to compare these with the equations I had above, right?', 'start': 2165.512, 'duration': 4.643}], 'summary': 'Discussion of the E-step and M-step in maximum likelihood estimation.', 'duration': 33.462, 'max_score': 2136.693, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2136693.jpg'}], 'start': 1645.242, 'title': 'EM algorithm and Gaussian density', 'summary': "Covers the EM algorithm for latent variable estimation with a focus on maximum likelihood estimation and a two-step process, and also 
explains the computation of Gaussian density and multinomial probability, using Bayes' rule and the E-step and M-step for maximum likelihood estimation in the context of machine learning.", 'chapters': [{'end': 1949.859, 'start': 1645.242, 'title': 'EM algorithm for latent variable estimation', 'summary': 'Introduces the EM algorithm for latent variable estimation, emphasizing the use of maximum likelihood estimation and the two-step process involving guessing the values of zs and then using these values in the equations.', 'duration': 304.617, 'highlights': ["The EM algorithm involves two steps: the E step, which computes the probability of Z i being equal to j using Bayes' rule, and the M step, which updates the parameters using the values of the Zs guessed in the E step.", "Maximum likelihood estimation (MLE) can be used to estimate everything if the values of the Z i's were known, involving the calculation of the log likelihood of the parameters and taking the derivative to find phi j.", 'Replacing y with z in the formulas for Gaussian discriminant analysis, and using a similar formula for sigma in the EM algorithm.']}, {'end': 2230.316, 'start': 1949.859, 'title': 'Gaussian density and multinomial probability', 'summary': "Explains the computation of Gaussian density and multinomial probability, using Bayes' rule to compute the chance of w ij taking on a certain value, and the E-step and M-step for maximum likelihood estimation in the context of machine learning.", 'duration': 280.457, 'highlights': ["The chapter emphasizes the computation of Gaussian density and multinomial probability, including the use of Bayes' rule to calculate the chance of W ij taking on a certain value, essential in machine learning (relevance: 5)", "Explains the E-step, involving guessing the values of z's by determining the probability of z being 1, 2, 3, 4 up to k, and storing it for further use in machine learning (relevance: 4)", 'Discusses the M-step, which involves using formulas for maximum likelihood estimation and comparing them with previous equations, replacing the indicator z i equals j with w ij, the expected value of the indicator function, in the context of machine learning (relevance: 3)']}], 'duration': 585.074, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA1645242.jpg', 'highlights': ['The EM algorithm involves the E step and M step for latent variable estimation (relevance: 5)', "Computation of Gaussian density and multinomial probability using Bayes' rule (relevance: 4)", 'Maximum likelihood estimation (MLE) involves calculating the log likelihood and taking derivatives (relevance: 3)', 'Replacing y with z in Gaussian discriminant analysis and the EM algorithm (relevance: 2)', "E-step involves guessing values of z's and storing for further use (relevance: 1)"]}, {'end': 2628.005, 'segs': [{'end': 2251.663, 'src': 'embed', 'start': 2230.336, 'weight': 0, 'content': [{'end': 2244.135, 'text': "Okay? 
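Putting the E-step and M-step described above side by side, here is a minimal NumPy/SciPy sketch of one EM iteration for a mixture of Gaussians. Variable names mirror the lecture's w_ij, phi, mu_j, and Sigma_j; the function itself is an illustrative sketch, not code from the course.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, Sigma):
    m, _ = X.shape
    k = len(phi)
    # E-step: w[i, j] = P(z(i) = j | x(i); phi, mu, Sigma) via Bayes' rule:
    # a Gaussian density term times the multinomial term, normalized.
    w = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                  for j in range(k)], axis=1)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: the maximum likelihood formulas, with the indicator
    # 1{z(i) = j} replaced by its expected value w[i, j].
    phi = w.mean(axis=0)
    mu = [(w[:, j, None] * X).sum(axis=0) / w[:, j].sum() for j in range(k)]
    Sigma = [((w[:, j, None] * (X - mu[j])).T @ (X - mu[j])) / w[:, j].sum()
             for j in range(k)]
    return phi, mu, Sigma
```

Repeating em_step until the parameters stop changing gives the soft-assignment analogue of the k-means loop sketched earlier.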
So, um, one intuition, of, um, this mixture of Gaussians algorithm is that it's a little bit like K-means, but with soft assignment.", 'start': 2230.336, 'duration': 13.799}, {'end': 2251.663, 'text': 'So in K-means, uh, in the first step, we will take each point and just assign it to one of the K-K cluster centroids.', 'start': 2244.315, 'duration': 7.348}], 'summary': 'The mixture of Gaussians algorithm is like k-means but with soft assignment.', 'duration': 21.327, 'max_score': 2230.336, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2230336.jpg'}, {'end': 2362.743, 'src': 'embed', 'start': 2306.783, 'weight': 1, 'content': [{'end': 2312.546, 'text': "Okay? So- so- so that's one intuition be- between, um, EM and, uh, k-means.", 'start': 2306.783, 'duration': 5.763}, {'end': 2323.452, 'text': 'Um, and then the second uh, uh, but- but um, when you run this algorithm, it turns out that, um, this algorithm, uh, will converge.', 'start': 2313.227, 'duration': 10.225}, {'end': 2325.113, 'text': "uh, with some caveats, I'll get to later.", 'start': 2323.452, 'duration': 1.661}, {'end': 2335.579, 'text': 'And this will find a pretty decent estimate, um, of, uh, of the parameters, you know, of say fitting a mixture of two Gaussians model.', 'start': 2325.693, 'duration': 9.886}, {'end': 2342.309, 'text': 'Okay, So this is um the- oh.', 'start': 2337.056, 'duration': 5.253}, {'end': 2348.953, 'text': 'and so, if you are given a dataset of, say, airplane engines, you can run this algorithm, fit the mixture of two Gaussians, and then,', 'start': 2342.309, 'duration': 6.644}, {'end': 2354.017, 'text': 'when the new airplane engine rolls off the assembly line, um uh, you- oh so-.', 'start': 2348.953, 'duration': 5.064}, {'end': 2358.16, 'text': "so after you're fitting the K-means algorithm, you now have a-.", 'start': 2354.017, 'duration': 4.143}, {'end': 2362.743, 'text': 'after fitting the EM algorithm, you now have a joint density for P of x, z.', 'start': 2358.16, 'duration': 4.583}], 'summary': 'The EM algorithm converges to find a decent estimate of the parameters for fitting a mixture of two Gaussians model.', 'duration': 55.96, 'max_score': 2306.783, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2306783.jpg'}, {'end': 2439.604, 'src': 'embed', 'start': 2415.666, 'weight': 2, 'content': [{'end': 2422.931, 'text': 'And so by doing this, you can now model P of x for many complicated densities including this one.', 'start': 2415.666, 'duration': 7.265}, {'end': 2425.973, 'text': 'this example I had just now.', 'start': 2424.192, 'duration': 1.781}, {'end': 2431.538, 'text': 'this would allow you to fit a probability density function that puts a lot of probability mass on a region that looks like this.', 'start': 2425.973, 'duration': 5.565}, {'end': 2439.604, 'text': 'And so when you have a new example, you can evaluate P of x, and if P of x is large, then you can say, nope, this looks okay.', 'start': 2432.319, 'duration': 7.285}], 'summary': 'Model P of x for complicated densities, fit a probability density function, and evaluate P of x for new examples.', 'duration': 23.938, 'max_score': 2415.666, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2415666.jpg'}, {'end': 2499.573, 'src': 'embed', 'start': 2463.914, 'weight': 3, 'content': [{'end': 2468.258, 'text': "So let's guess the values of Z and then plug that into the formulas of maximum 
likelihood estimation.", 'start': 2463.914, 'duration': 4.344}, {'end': 2474.901, 'text': 'It turns out that hand wavy explanation works in the particular case of um EM for mixtures of Gaussians,', 'start': 2468.898, 'duration': 6.003}, {'end': 2480.504, 'text': 'but that there is a more formal way of deriving the EM algorithm.', 'start': 2474.901, 'duration': 5.603}, {'end': 2486.807, 'text': 'uh, that shows that this is a maximum likelihood estimation algorithm and that it converges at least the local optimum.', 'start': 2480.504, 'duration': 6.303}, {'end': 2499.573, 'text': "Um and in particular there- what we'll do is show that if your goal is um, uh, given a model p of x, z parameterized by Theta,", 'start': 2487.488, 'duration': 12.085}], 'summary': 'Em algorithm is a maximum likelihood estimation that converges to a local optimum.', 'duration': 35.659, 'max_score': 2463.914, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2463914.jpg'}], 'start': 2230.336, 'title': 'Em algorithm for mixtures of gaussians', 'summary': "Explains mixture of gaussians algorithm and its soft assignment approach, compares it to k-means, discusses em algorithm's convergence and parameter estimation, and emphasizes its ability to fit a rich family of models for complex distributions.", 'chapters': [{'end': 2305.938, 'start': 2230.336, 'title': 'Mixture of gaussians algorithm', 'summary': 'Explains the mixture of gaussians algorithm, comparing it to k-means and highlighting its soft assignment approach, using probabilities for assigning points to different cluster centroids, and updating means accordingly.', 'duration': 75.602, 'highlights': ['EM implements a softer way of assigning points to the different cluster centroids, using probabilities and giving it a weighting in terms of how much is assigned to each Gaussian center.', 'K-means makes a hard assignment of points to the closest cluster centroid, while EM uses a soft assignment approach based on probabilities.', "EM updates the means according to the sum of the Xi's assigned to each cluster centroid, divided by the number of examples assigned to that cluster centroid."]}, {'end': 2439.604, 'start': 2306.783, 'title': 'Em algorithm for gaussian mixture model', 'summary': "Discusses the em algorithm's convergence, parameter estimation, and its ability to fit a rich family of models, such as a mixture of gaussians, for complex distributions, enabling the evaluation of probability density functions for various examples.", 'duration': 132.821, 'highlights': ['The EM algorithm will converge, providing a decent estimate of the parameters for fitting a mixture of two Gaussians model.', 'A mixture of two Gaussians can fit distributions of different shapes, offering a rich family of models for complex distributions.', 'The EM algorithm enables the modeling of probability density functions for complicated distributions, allowing the evaluation of new examples based on the probability of x.']}, {'end': 2628.005, 'start': 2440.065, 'title': 'Understanding em algorithm for mixtures of gaussians', 'summary': 'Discusses the em algorithm for mixtures of gaussians, presenting a hand-wavy explanation and a more formal derivation of the algorithm, emphasizing its convergence to a local optimum and the goal of maximizing p of x.', 'duration': 187.94, 'highlights': ['The EM algorithm is a maximum likelihood estimation algorithm for mixtures of Gaussians, with a more formal derivation showing its convergence to at 
least a local optimum.', 'The goal of the EM algorithm is to maximize p of x, aligning with the objective of maximum likelihood estimation.', 'The Wij term represents the strength of assigning a training example to a particular Gaussian, with values ranging between 0 and 1 and summing up to 1 for all assignments.']}], 'duration': 397.669, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2230336.jpg', 'highlights': ['EM algorithm uses soft assignment approach based on probabilities', 'EM algorithm fits a rich family of models for complex distributions', 'EM algorithm enables modeling of probability density functions for complicated distributions', 'EM algorithm is a maximum likelihood estimation algorithm for mixtures of Gaussians', 'EM algorithm converges to at least a local optimum']}, {'end': 3386.006, 'segs': [{'end': 2731.561, 'src': 'embed', 'start': 2705.398, 'weight': 2, 'content': [{'end': 2709.962, 'text': "You know, for this one, I think there's 8% chance it came from process one and 20% chance it came from process two.", 'start': 2705.398, 'duration': 4.564}, {'end': 2711.424, 'text': "So that's the E-step.", 'start': 2710.463, 'duration': 0.961}, {'end': 2713.786, 'text': 'And then, in the M-step,', 'start': 2712.285, 'duration': 1.501}, {'end': 2725.397, 'text': "you look at all the engines that you're kind of guessing were generated by process one and you update your Gaussian to be a better model for all of the things that were- that you kind of think were generated by process one.", 'start': 2713.786, 'duration': 11.611}, {'end': 2731.561, 'text': "And if there's something that you're absolutely sure came from process 1, then it has a weight of 1, close to 1 in this.", 'start': 2726.218, 'duration': 5.343}], 'summary': '8% chance from process one, 20% chance from process two; update the Gaussian model in the E-step and M-step.', 'duration': 26.163, 'max_score': 2705.398, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2705398.jpg'}, {'end': 2902.954, 'src': 'embed', 'start': 2852.604, 'weight': 0, 'content': [{'end': 2861.528, 'text': 'what we'll see on Wednesday is that this view of EM, uh, allows us to derive EM in a in a more correct way for other models as well,', 'start': 2852.604, 'duration': 8.924}, {'end': 2862.568, 'text': 'the mixtures of Gaussians.', 'start': 2861.528, 'duration': 1.04}, {'end': 2870.752, 'text': 'On Wednesday we'll talk about, um uh, a model called factor analysis that lets you model Gaussians in extremely high dimensional spaces where,', 'start': 2862.789, 'duration': 7.963}, {'end': 2873.173, 'text': 'if you have 1, 000 dimensional data, but only 30 examples.', 'start': 2870.752, 'duration': 2.421}, {'end': 2875.935, 'text': "How do you fit the Gaussian to that? So we'll talk about that on Wednesday.", 'start': 2873.573, 'duration': 2.362}, {'end': 2885.482, 'text': "And it turns out this derivation of EM we're gonna go about through now is crucial for, um, applying EM accurately in, in, in problems like that.", 'start': 2876.335, 'duration': 9.147}, {'end': 2895.028, 'text': "Okay? 
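Once the mixture has been fit, the marginal density used for anomaly detection above is P(x), the sum over j of phi_j times N(x; mu_j, Sigma_j), and a new engine is flagged when P(x) falls below a threshold. A sketch follows; the threshold value is an arbitrary placeholder, and multivariate_normal is the SciPy density used in the earlier sketch.

```python
def p_x(x, phi, mu, Sigma):
    # Marginal density: P(x) = sum_j P(x | z = j) P(z = j).
    return sum(phi[j] * multivariate_normal.pdf(x, mu[j], Sigma[j])
               for j in range(len(phi)))

EPSILON = 1e-3  # placeholder threshold, tuned per application

def flag_for_inspection(x, phi, mu, Sigma):
    # Small P(x) means the example is unlikely under the fitted model.
    return p_x(x, phi, mu, Sigma) < EPSILON
```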
So, um, in order to lead up to that derivation, uh, let me describe, um, Jensen's inequality.", 'start': 2885.502, 'duration': 9.526}, {'end': 2902.954, 'text': 'So let f be a, uh, convex function.', 'start': 2896.589, 'duration': 6.365}], 'summary': 'EM allows deriving models like mixtures of Gaussians and factor analysis for high-dimensional spaces with 1,000-dimensional data and 30 examples, crucial for accurate EM application.', 'duration': 50.35, 'max_score': 2852.604, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2852604.jpg'}], 'start': 2628.005, 'title': "EM algorithm in the Gaussian mixture model and Jensen's inequality", 'summary': "Delves into the EM algorithm in the Gaussian mixture model, assigning probabilities to Gaussian distributions and hypothesizing multiple data sources. It also explores the EM algorithm's rigorous derivation, its application in modeling high-dimensional spaces, and the crucial role of Jensen's inequality.", 'chapters': [{'end': 2779.416, 'start': 2628.005, 'title': 'EM algorithm in the Gaussian mixture model', 'summary': 'Discusses the EM algorithm in the Gaussian mixture model, where data points are assigned probabilities to different Gaussian distributions, hypothesizing multiple sources of data and updating the parameters iteratively.', 'duration': 151.411, 'highlights': ['The EM algorithm assigns probabilities to data points for different Gaussian distributions', 'Hypothesizing multiple sources of data in Gaussian Mixture Model', 'Updating parameters iteratively in the M-step of the EM algorithm']}, {'end': 3386.006, 'start': 2792.534, 'title': "Understanding the EM algorithm and Jensen's inequality", 'summary': "Discusses the EM algorithm's rigorous derivation, its application in modeling high-dimensional spaces, and the explanation of Jensen's inequality, which is crucial for applying EM accurately.", 'duration': 593.472, 'highlights': ["The EM algorithm's rigorous derivation and its application in modeling high-dimensional spaces", "Explanation of Jensen's inequality and its crucial role in applying the EM algorithm accurately"]}], 'duration': 758.001, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA2628005.jpg', 'highlights': ['The EM algorithm assigns probabilities to data points for different Gaussian distributions', 'Hypothesizing multiple sources of data in Gaussian Mixture Model', 'Updating parameters iteratively in the M-step of the EM algorithm', "The EM algorithm's rigorous derivation and its application in modeling high-dimensional spaces", "Explanation of Jensen's inequality and its crucial role in applying the EM algorithm accurately"]}, {'end': 3856.303, 'segs': [{'end': 3505.147, 'src': 'embed', 'start': 3386.006, 'weight': 0, 'content': [{'end': 3390.809, 'text': "um, the form of Jensen's inequality we're gonna use is actually a form for a concave function.", 'start': 3386.006, 'duration': 4.803}, {'end': 3395.071, 'text': "So instead of convex, um, I'm gonna say concave.", 'start': 3391.329, 'duration': 3.742}, {'end': 3400.194, 'text': 'And so you know, a concave function is just a negative of a convex function.', 'start': 3395.371, 'duration': 4.823}, {'end': 3403.956, 'text': 'right?. 
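For reference, the statement of Jensen's inequality being set up here, in both directions:

```latex
f \text{ convex: } \quad f(\mathbb{E}[X]) \le \mathbb{E}[f(X)],
\qquad
f \text{ concave (e.g. } f = \log \text{): } \quad f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)].
```

The concave (log) form is the direction used in the EM derivation that follows.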
If you take a convex function and take the negative of that, it becomes concave.', 'start': 3400.194, 'duration': 3.762}, {'end': 3411.921, 'text': 'And so, uh, the whole thing works with the- with everything flipped around the other way.', 'start': 3404.557, 'duration': 7.364}, {'end': 3416.111, 'text': 'Yeah Yep.', 'start': 3415.731, 'duration': 0.38}, {'end': 3419.392, 'text': 'And so, uh, F is strictly concave.', 'start': 3416.291, 'duration': 3.101}, {'end': 3429.774, 'text': "Okay? So the form of Jensen's inequality we're going to use is actually the, um, uh, concave form of Jensen's inequality.", 'start': 3419.492, 'duration': 10.282}, {'end': 3433.135, 'text': "And we're actually going to apply it to the log function.", 'start': 3430.374, 'duration': 2.761}, {'end': 3435.555, 'text': 'So the log function, right, log X looks like this.', 'start': 3433.275, 'duration': 2.28}, {'end': 3436.835, 'text': "And so that's a concave function.", 'start': 3435.615, 'duration': 1.22}, {'end': 3441.196, 'text': "And so the inequality we'll use will be in this direction that I have in orange.", 'start': 3437.856, 'duration': 3.34}, {'end': 3444.757, 'text': 'All right.', 'start': 3444.437, 'duration': 0.32}, {'end': 3479.312, 'text': "So here's the density estimation problem.", 'start': 3475.149, 'duration': 4.163}, {'end': 3484.856, 'text': 'Uh, meaning density estimation means you want to estimate P of X, right? So we have a model.', 'start': 3480.493, 'duration': 4.363}, {'end': 3494.439, 'text': 'for P of x comma z, um, with parameters Theta.', 'start': 3489.755, 'duration': 4.684}, {'end': 3503.326, 'text': 'And so you know, instead of uh, writing out mu sigma, uh, mu sigma and phi, like we did for the mixture of Gaussians,', 'start': 3495.279, 'duration': 8.047}, {'end': 3505.147, 'text': "I'm just gonna capture all the parameters you have.", 'start': 3503.326, 'duration': 1.821}], 'summary': "Using the concave form of Jensen's inequality for the log function in the density estimation problem.", 'duration': 119.141, 'max_score': 3386.006, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA3386006.jpg'}, {'end': 3629.141, 'src': 'heatmap', 'start': 3570.47, 'weight': 0.776, 'content': [{'end': 3585.381, 'text': 'And so what we want is um maximum likelihood estimation, which is to find the value of Theta that maximizes its log likelihood, right?', 'start': 3570.47, 'duration': 14.911}, {'end': 3592.066, 'text': "And what we'll like- what we'd like to do is derive an EM-, derive an algorithm which will turn out to be an EM algorithm, uh,", 'start': 3585.401, 'duration': 6.665}, {'end': 3598.45, 'text': 'as an iterative algorithm for finding the maximum likelihood estimates of the parameters Theta.', 'start': 3592.066, 'duration': 6.384}, {'end': 3601.252, 'text': 'Um, okay.', 'start': 3600.432, 'duration': 0.82}, {'end': 3617.534, 'text': "So, Um, let me draw a picture that I'd like you to keep in mind as we go through the math, which is, you know,", 'start': 3606.136, 'duration': 11.398}, {'end': 3621.296, 'text': 'the horizontal axis is a space of possible values of parameters theta.', 'start': 3617.534, 'duration': 3.762}, {'end': 3629.141, 'text': "And so there's some function L of theta that you try to maximize.", 'start': 3622.077, 'duration': 7.064}], 'summary': 'Derive an EM algorithm for maximum likelihood estimation of the parameters Theta.', 'duration': 58.671, 'max_score': 3570.47, 'thumbnail': 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA3570470.jpg'}, {'end': 3629.141, 'src': 'embed', 'start': 3592.066, 'weight': 5, 'content': [{'end': 3598.45, 'text': 'as an iterative algorithm for finding the maximum likelihood estimates of the parameters Theta.', 'start': 3592.066, 'duration': 6.384}, {'end': 3601.252, 'text': 'Okay.', 'start': 3600.432, 'duration': 0.82}, {'end': 3617.534, 'text': "So let me draw a picture that I'd like you to keep in mind as we go through the math, which is, you know,", 'start': 3606.136, 'duration': 11.398}, {'end': 3621.296, 'text': 'the horizontal axis is the space of possible values of the parameters theta.', 'start': 3617.534, 'duration': 3.762}, {'end': 3629.141, 'text': "And so there's some function L of theta that you try to maximize.", 'start': 3622.077, 'duration': 7.064}], 'summary': 'Iterative algorithm for finding maximum likelihood estimates of parameters using the function L of theta.', 'duration': 37.075, 'max_score': 3592.066, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA3592066.jpg'}, {'end': 3815.885, 'src': 'embed', 'start': 3779.649, 'weight': 2, 'content': [{'end': 3780.47, 'text': "That's the E-step.", 'start': 3779.649, 'duration': 0.821}, {'end': 3788.39, 'text': 'And then the M-step will maximize this red curve, and so on.', 'start': 3780.89, 'duration': 7.5}, {'end': 3793.015, 'text': "Now you're here; construct another lower bound.", 'start': 3788.59, 'duration': 4.425}, {'end': 3794.456, 'text': 'Maximize that, right?', 'start': 3793.015, 'duration': 1.441}, {'end': 3801.124, 'text': 'And you can kind of tell that as you keep running EM, this is constantly trying to increase L of Theta,', 'start': 3794.997, 'duration': 6.127}, {'end': 3804.388, 'text': 'trying to increase the log likelihood, until it converges to a local optimum.', 'start': 3801.124, 'duration': 3.264}, {'end': 3808.564, 'text': 'The EM algorithm does converge, but only to a local optimum.', 'start': 3805.623, 'duration': 2.941}, {'end': 3815.885, 'text': 'So if, you know, there was another even bigger peak over there, then it may never find its way over to that other, better optimum.', 'start': 3808.704, 'duration': 7.181}], 'summary': 'The EM algorithm maximizes the log likelihood until it converges to a local optimum.', 'duration': 36.236, 'max_score': 3779.649, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA3779649.jpg'}], 'start': 3386.006, 'title': "Concave form of Jensen's inequality and density estimation", 'summary': "Discusses the concave form of Jensen's inequality and its application to the log function, along with the density estimation problem and the EM algorithm for finding maximum likelihood estimates of the parameters theta, involving E-step and M-step iterations until convergence to a local optimum.", 'chapters': [{'end': 3441.196, 'start': 3386.006, 'title': "Concave form of Jensen's inequality", 'summary': "Discusses the concave form of Jensen's inequality and its application to the log function, highlighting the concept of concave functions and the application of Jensen's inequality to the log function.", 'duration': 55.19, 'highlights': ["The form of Jensen's inequality used is the concave form, applied to the log function, which is a concave function.", 'A concave function is the negative of a convex function, providing a clear understanding of concave functions.', "The application of Jensen's inequality to the log function is a key concept discussed in the chapter."]}
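In symbols, the picture just described corresponds to the two steps the lecture derives below (stated here as a preview; q_i is the distribution over z^{(i)} introduced in the derivation):

```latex
\text{E-step:}\quad q_i\!\left(z^{(i)}\right) := p\!\left(z^{(i)} \mid x^{(i)}; \theta\right),
\qquad
\text{M-step:}\quad \theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}} q_i\!\left(z^{(i)}\right)
\log \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{q_i\!\left(z^{(i)}\right)}.
```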
, {'end': 3856.303, 'start': 3444.437, 'title': 'Density estimation and the EM algorithm', 'summary': 'Covers the density estimation problem of estimating P of x using a model with parameters theta, focusing on deriving an EM algorithm for finding the maximum likelihood estimates of the parameters theta, and explaining the E-step and M-step iterations of the EM algorithm for increasing the log likelihood until convergence to a local optimum.', 'duration': 411.866, 'highlights': ['The chapter covers the density estimation problem to estimate P of X using a model with parameters Theta.', 'Deriving an EM algorithm for finding the maximum likelihood estimates of the parameters Theta.', 'Explaining the E-step and M-step iterations of the EM algorithm for increasing the log likelihood until convergence to a local optimum.']}], 'duration': 470.297, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA3386006.jpg', 'highlights': ["The application of Jensen's inequality to the log function is a key concept discussed in the chapter.", 'The chapter covers the density estimation problem to estimate P of X using a model with parameters Theta.', 'Explaining the E-step and M-step iterations of the EM algorithm for increasing the log likelihood until convergence to a local optimum.', "The form of Jensen's inequality used is the concave form, applied to the log function, which is a concave function.", 'A concave function is the negative of a convex function, providing a clear understanding of concave functions.', 'Deriving an EM algorithm for finding the maximum likelihood estimates of the parameters Theta.']}, {'end': 4826.387, 'segs': [{'end': 4096.307, 'src': 'embed', 'start': 4031.542, 'weight': 2, 'content': [{'end': 4046.021, 'text': "Now, using the concave form of Jensen's inequality, we have that this is greater than or equal to.", 'start': 4031.542, 'duration': 14.479}, {'end': 4087.361, 'text': "So this is the form of Jensen's inequality where f of E of x is greater than or equal to E of f of x,", 'start': 4074.751, 'duration': 12.61}, {'end': 4087.921, 'text': 'where here,', 'start': 4087.361, 'duration': 0.56}, {'end': 4093.866, 'text': 'this is the logarithmic function, right?', 'start': 4087.921, 'duration': 5.945}, {'end': 4095.727, 'text': 'So the log function is a concave function.', 'start': 4093.886, 'duration': 1.841}, {'end': 4096.307, 'text': 'It looks like that.', 'start': 4095.727, 'duration': 0.58}], 'summary': "Using Jensen's inequality, f of E of x is greater than or equal to E of f of x.", 'duration': 64.765, 'max_score': 4031.542, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA4031542.jpg'}, {'end': 4200.851, 'src': 'embed', 'start': 4173.071, 'weight': 1, 'content': [{'end': 4177.814, 'text': 'So your x, x is just your data, and z is a variable', 'start': 4173.071, 'duration': 4.743}, {'end': 4178.555, 'text': 'you sum over.', 'start': 4177.814, 'duration': 0.741}, {'end': 4183.801, 'text': 'So this whole thing is a function of Theta, right? Because x is fixed; z is just something you sum over.', 'start': 4178.676, 'duration': 5.125},
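The chain of steps being verbalized here, reconstructed as a single display: multiply and divide by q_i(z^{(i)}) inside the sum, recognize the inner sum as an expectation under q_i, then apply the concave form of Jensen's inequality, log E[·] ≥ E[log ·]:

```latex
\sum_{i} \log \sum_{z^{(i)}} p\!\left(x^{(i)}, z^{(i)}; \theta\right)
= \sum_{i} \log \sum_{z^{(i)}} q_i\!\left(z^{(i)}\right)
  \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{q_i\!\left(z^{(i)}\right)}
= \sum_{i} \log\, \mathbb{E}_{z^{(i)} \sim q_i}\!\left[
  \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{q_i\!\left(z^{(i)}\right)}\right]
\;\ge\; \sum_{i} \mathbb{E}_{z^{(i)} \sim q_i}\!\left[
  \log \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{q_i\!\left(z^{(i)}\right)}\right].
```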
{'end': 4188.024, 'text': 'So this whole formula here, this is a function of the parameters Theta.', 'start': 4183.881, 'duration': 4.143}, {'end': 4197.449, 'text': "And what we've shown is that this formula here is a lower bound for the log likelihood.", 'start': 4189.066, 'duration': 8.383}, {'end': 4200.851, 'text': 'For this thing, right?', 'start': 4197.449, 'duration': 3.402}], 'summary': 'The formula is a lower bound for the log likelihood.', 'duration': 27.78, 'max_score': 4173.071, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA4173071.jpg'}, {'end': 4364.197, 'src': 'embed', 'start': 4336.833, 'weight': 0, 'content': [{'end': 4342.496, 'text': 'And this is actually how you guarantee that when you optimize the green function, by improving on the green function,', 'start': 4336.833, 'duration': 5.663}, {'end': 4343.877, 'text': "you're improving on the blue function.", 'start': 4342.496, 'duration': 1.381}, {'end': 4348.86, 'text': 'So we want this lower bound to be tight, right? That is, we want the two functions to be equal, tangent to each other.', 'start': 4343.917, 'duration': 4.943}, {'end': 4353.445, 'text': 'So in other words, we want this inequality to hold with equality.', 'start': 4349.74, 'duration': 3.705}, {'end': 4354.125, 'text': 'So we want,', 'start': 4353.545, 'duration': 0.58}, {'end': 4364.197, 'text': 'yeah, we want the left-hand side and the right-hand side to be equal for the current value of Theta, right?', 'start': 4354.125, 'duration': 10.072}], 'summary': 'Optimizing the green function improves the blue function, aiming for equality.', 'duration': 27.364, 'max_score': 4336.833, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA4336833.jpg'}], 'start': 3861.895, 'title': 'Variational inference and the EM algorithm', 'summary': "Introduces the variational inference equation, maximizing the log likelihood, Jensen's inequality, log-likelihood bounds, and the EM algorithm for a mixture of Gaussians, with a focus on establishing lower bounds and optimizing them for convergence to a local optimum.", 'chapters': [{'end': 4024.997, 'start': 3861.895, 'title': 'Variational inference equation', 'summary': 'Introduces the variational inference equation, which involves maximizing the log likelihood of a probability distribution given a set of parameters, and discusses the concept of expected value with respect to a probability distribution.', 'duration': 163.102, 'highlights': ['The equation for variational inference is sum over i of log sum over z i of p of x i comma z i given Theta, and involves maximizing the log likelihood of a probability distribution given a set of parameters.', 'The concept of expected value with respect to a probability distribution is illustrated through the equation sum over i of the log of an expected value over z i drawn from the q i distribution, computed as the sum over all possible values of z i of the probability of z i times the function of z i.', 'The process involves multiplying and dividing by a probability distribution q i of z i, where the sum over z i of q i of z i equals 1, to construct a probability distribution and perform the required operations.']}, {'end': 4294.196, 'start': 4031.542, 'title': "Jensen's inequality and the log-likelihood bound", 'summary': "Discusses Jensen's inequality and log-likelihood bounds, demonstrating the application of Jensen's inequality to establish a lower bound for the log likelihood, with a focus on the concave nature of the log function and the calculation of expected values using probability distributions.",
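A one-line justification, consistent with the argument here, for why the posterior is the choice that makes the bound tight: picking q_i(z^{(i)}) proportional to p(x^{(i)}, z^{(i)}; θ) makes the ratio inside the expectation constant in z^{(i)}, and Jensen's inequality holds with equality for a constant:

```latex
q_i\!\left(z^{(i)}\right)
= \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{\sum_{z} p\!\left(x^{(i)}, z; \theta\right)}
= p\!\left(z^{(i)} \mid x^{(i)}; \theta\right)
\quad\Longrightarrow\quad
\frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{q_i\!\left(z^{(i)}\right)}
= p\!\left(x^{(i)}; \theta\right)\ \text{(constant in } z^{(i)}\text{)}.
```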
'duration': 262.654, 'highlights': ["The chapter emphasizes the application of Jensen's inequality to establish a lower bound for the log likelihood, with a focus on the concave nature of the log function.", 'The speaker illustrates the calculation of expected values using probability distributions, emphasizing the concept of expected value as the sum of the probability of a value multiplied by the value itself.', 'The speaker explains the notation for the probability of a random variable taking on different values, denoted as q i of z, and its relevance to the formula for expected values.']}, {'end': 4826.387, 'start': 4294.236, 'title': 'EM algorithm for a mixture of Gaussians', 'summary': 'Discusses the construction of a tight green lower bound equal to the blue function, the derivation of the EM algorithm, and the use of the EM algorithm as a maximum likelihood estimation algorithm whose optimization is solved by constructing lower bounds and optimizing those lower bounds, leading to convergence to a local optimum.', 'duration': 532.151, 'highlights': ['The EM algorithm constructs a tight green lower bound equal to the blue function to ensure improvement of the blue function when optimizing the green function.', 'The distribution for z i is set as q i proportional to p(x i, z i) parameterized by Theta, resulting in q i of z i being equal to the posterior probability p(z i | x i; Theta).', 'In the E-step of the EM algorithm, q i of z i is set to the posterior probability, and in the M-step, the lower bound constructed is maximized with respect to Theta, leading to convergence to a local optimum.']}], 'duration': 964.492, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rVfZHWTwXSA/pics/rVfZHWTwXSA3861895.jpg', 'highlights': ['The EM algorithm constructs a tight green lower bound equal to the blue function to ensure improvement of the blue function when optimizing the green function.', 'The equation for variational inference is sum over i of log sum over z i of p of x i comma z i given Theta, and involves maximizing the log likelihood of a probability distribution given a set of parameters.', "The chapter emphasizes the application of Jensen's inequality to establish a lower bound for the log likelihood, with a focus on the concave nature of the log function."]}], 'highlights': ['Logistical reminder for a 48-hour take-home midterm this Wednesday, with details on submission through Gradescope and coverage up to and including EM.', 'Anomaly detection is crucial for identifying unusual aircraft engine signatures to prevent accidents.', 'Clustering is an unsupervised learning algorithm for market segmentation.', 'The lecture covers logistical details for a 48-hour take-home midterm and includes a discussion of EM.', 'The EM algorithm involves an E-step and an M-step for latent variable estimation.', 'Market segmentation can be done using clustering to identify user segments.', 'The EM algorithm uses a soft assignment approach based on probabilities.', 'The algorithm can be proven to converge, as explained in the lecture notes.', 'The K-means cost function goes down on every iteration and is bounded below, so at some point it stops decreasing, at which point convergence is declared.', 'EM fits a rich family of models for complex distributions.']}
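To connect the E-step and M-step to the mixture-of-Gaussians example the lecture keeps returning to, here is a minimal NumPy sketch of EM for a one-dimensional mixture of k Gaussians (my own illustrative implementation, not the course's reference code; function and variable names are hypothetical):

```python
# Minimal EM for a 1-D mixture of k Gaussians (illustrative sketch, not
# the course's reference code). E-step: soft-assign each point to each
# Gaussian via the posterior p(z|x; theta). M-step: re-estimate phi, mu,
# and sigma^2 by weighted maximum likelihood. Converges to a local optimum.
import numpy as np

def em_gmm_1d(x, k=2, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    phi = np.full(k, 1.0 / k)                  # mixing proportions
    mu = rng.choice(x, size=k, replace=False)  # initialize means from the data
    var = np.full(k, x.var())                  # initialize variances

    for _ in range(n_iters):
        # E-step: w[i, j] = p(z_i = j | x_i; theta)  (soft assignments)
        log_pdf = (-0.5 * (x[:, None] - mu) ** 2 / var
                   - 0.5 * np.log(2 * np.pi * var))
        log_w = np.log(phi) + log_pdf
        log_w -= log_w.max(axis=1, keepdims=True)  # for numerical stability
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)

        # M-step: maximize the lower bound with respect to phi, mu, var
        nj = w.sum(axis=0)                         # effective counts per Gaussian
        phi = nj / n
        mu = (w * x[:, None]).sum(axis=0) / nj
        var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return phi, mu, var

# Toy data from two hypothetical clusters
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(data, k=2))
```

The E-step computes the posterior responsibilities in log space for numerical stability; the M-step updates are the weighted maximum-likelihood estimates, and each full iteration does not decrease the log likelihood, so the procedure converges to a local optimum, as discussed above.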