title
Machine Learning Tutorial Python - 13: K Means Clustering Algorithm

description
K Means clustering algorithm is an unsupervised machine learning technique used to cluster data points. In this tutorial we will go over some theory behind how k means works and then solve an income group clustering problem using sklearn, kmeans and python. The elbow method is a technique used to determine the optimal value of k; we will review that method as well. #MachineLearning #PythonMachineLearning #MachineLearningTutorial #Python #PythonTutorial #PythonTraining #MachineLearningCourse #kmeans #MachineLearningTechnique #sklearn #sklearntutorials #scikitlearntutorials Code: https://github.com/codebasics/py/blob/master/ML/13_kmeans/13_kmeans_tutorial.ipynb data link: https://github.com/codebasics/py/tree/master/ML/13_kmeans Exercise solution: https://github.com/codebasics/py/blob/master/ML/13_kmeans/Exercise/13_kmeans_exercise.ipynb Topics that are covered in this Video: 0:00 introduction 0:08 Theory - Explanation of Supervised vs Unsupervised learning and how kmeans clustering works. kmeans is unsupervised learning 5:00 Elbow method 7:33 Coding (start) (Cluster people's income based on age) 9:38 sklearn.cluster KMeans model creation and training 14:56 Use MinMaxScaler from sklearn 24:07 Exercise (Cluster iris flowers using their petal width and length) Do you want to learn technology from me? Check https://codebasics.io/?utm_source=description&utm_medium=yt&utm_campaign=description&utm_id=description for my affordable video courses. 
Next Video: Machine Learning Tutorial Python - 14: Naive Bayes Part 1: https://www.youtube.com/watch?v=PPeaRc-r1OI&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=15 Popular Playlist: Data Science Full Course: https://www.youtube.com/playlist?list=PLeo1K3hjS3us_ELKYSj_Fth2tIEkdKXvV Data Science Project: https://www.youtube.com/watch?v=rdfbcdP75KI&list=PLeo1K3hjS3uu7clOTtwsp94PcHbzqpAdg Machine learning tutorials: https://www.youtube.com/watch?v=gmvvaobm7eQ&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw Pandas: https://www.youtube.com/watch?v=CmorAWRsCAw&list=PLeo1K3hjS3uuASpe-1LjfG5f14Bnozjwy matplotlib: https://www.youtube.com/watch?v=qqwf4Vuj8oM&list=PLeo1K3hjS3uu4Lr8_kro2AqaO6CFYgKOl Python: https://www.youtube.com/watch?v=eykoKxsYtow&list=PLeo1K3hjS3uv5U-Lmlnucd7gqF-3ehIh0&index=1 Jupyter Notebook: https://www.youtube.com/watch?v=q_BzsPxwLOE&list=PLeo1K3hjS3uuZPwzACannnFSn9qHn8to8 Tools and Libraries: Scikit learn tutorials Sklearn tutorials Machine learning with scikit learn tutorials Machine learning with sklearn tutorials To download csv and code for all tutorials: go to https://github.com/codebasics/py, click on a green button to clone or download the entire repository and then go to relevant folder to get access to that specific file. 🌎 My Website For Video Courses: https://codebasics.io/?utm_source=description&utm_medium=yt&utm_campaign=description&utm_id=description Need help building software or data analytics and AI solutions? My company https://www.atliq.com/ can help. Click on the Contact button on that website. #️⃣ Social Media #️⃣ 🔗 Discord: https://discord.gg/r42Kbuk 📸 Dhaval's Personal Instagram: https://www.instagram.com/dhavalsays/ 📸 Instagram: https://www.instagram.com/codebasicshub/ 🔊 Facebook: https://www.facebook.com/codebasicshub 📱 Twitter: https://twitter.com/codebasicshub 📝 Linkedin: https://www.linkedin.com/company/codebasics/
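The assign-and-recompute loop described in the theory section (assign each point to its nearest centroid, move each centroid to the center of gravity of its points, repeat until nothing moves) can be sketched from scratch. This is a minimal NumPy illustration with made-up blob data, not the sklearn KMeans class the video itself uses:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Toy k-means: pick k random data points as initial centroids, then repeat
    (1) assign every point to its nearest centroid and
    (2) move every centroid to the mean of its assigned points,
    until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every centroid; the nearest centroid wins
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members):               # keep the old centroid if a cluster empties
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated blobs of made-up data -> k=2 should recover them
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centroids = kmeans(pts, k=2)
```

The empty-cluster guard is one of several details real implementations handle more carefully (sklearn also uses the smarter k-means++ initialization by default).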

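The elbow method from the 5:00 section — run k-means for a range of k, record the total sum of squared error for each, and look for the bend — might be sketched as below. The two-blob data and the 1–9 range of k are invented for the demo, and sklearn's `inertia_` attribute serves as the SSE:

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up data: two well-separated blobs, so the "elbow" should appear at k = 2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

k_range = range(1, 10)
sse = []
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    sse.append(km.inertia_)  # sum of squared distances to the nearest centroid

# plotting the elbow chart as in the video:
# import matplotlib.pyplot as plt
# plt.plot(list(k_range), sse); plt.xlabel('k'); plt.ylabel('sum of squared error')
```

Past the true number of clusters the SSE keeps shrinking only slowly, which is why the bend (rather than the minimum) is the value to pick.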
detail
{'title': 'Machine Learning Tutorial Python - 13: K Means Clustering Algorithm', 'heatmap': [{'end': 217.791, 'start': 190.324, 'weight': 0.779}, {'end': 371.742, 'start': 330.318, 'weight': 0.794}, {'end': 396.06, 'start': 375.004, 'weight': 0.829}, {'end': 607.287, 'start': 530.358, 'weight': 0.815}, {'end': 836.525, 'start': 816.758, 'weight': 0.702}, {'end': 985.75, 'start': 974.707, 'weight': 0.714}, {'end': 1461.862, 'start': 1408.323, 'weight': 0.739}], 'summary': 'Tutorial on k means clustering in python covers the k-means clustering algorithm, optimal cluster analysis, k means clustering in python, scaling and visualizing clustering results, and the elbow plot method for determining the optimal number of clusters using practical examples and guidance on selecting k values.', 'chapters': [{'end': 270.091, 'segs': [{'end': 98.789, 'src': 'embed', 'start': 25.306, 'weight': 0, 'content': [{'end': 26.827, 'text': 'Using this dataset,', 'start': 25.306, 'duration': 1.521}, {'end': 36.43, 'text': 'we try to identify the underlying structure in that data or we sometimes try to find the clusters in that data and we can make useful predictions out of it.', 'start': 26.827, 'duration': 9.603}, {'end': 42.352, 'text': "K-Means is a very popular clustering algorithm and that's what we are going to look into today.", 'start': 37.15, 'duration': 5.202}, {'end': 46.553, 'text': 'As usual, the tutorial will be in three parts.', 'start': 43.272, 'duration': 3.281}, {'end': 49.734, 'text': 'The first part is theory, then coding and then exercise.', 'start': 46.653, 'duration': 3.081}, {'end': 56.71, 'text': "Let's say you have a data set like this where x and y axis represent the two different features.", 'start': 50.526, 'duration': 6.184}, {'end': 60.053, 'text': 'And you want to identify clusters in this data set.', 'start': 57.271, 'duration': 2.782}, {'end': 65.096, 'text': "Now, when the data set is given to you, you don't have any information on target variables.", 
'start': 60.613, 'duration': 4.483}, {'end': 67.158, 'text': "So you don't know what you're looking for.", 'start': 65.135, 'duration': 2.023}, {'end': 70.62, 'text': "All you're trying to do is identify some structure into it.", 'start': 67.378, 'duration': 3.242}, {'end': 74.703, 'text': 'And one way of looking into this is these two clusters.', 'start': 71.241, 'duration': 3.462}, {'end': 79.867, 'text': 'Just by visual examination, we can say that this data set has these two clusters.', 'start': 75.103, 'duration': 4.764}, {'end': 85.213, 'text': 'And k-means helps you identify these clusters.', 'start': 80.387, 'duration': 4.826}, {'end': 91.26, 'text': 'Now, k in k-means is a free parameter wherein, before you start the algorithm,', 'start': 85.313, 'duration': 5.947}, {'end': 94.885, 'text': 'you have to tell the algorithm what is the value of k that you are looking for.', 'start': 91.26, 'duration': 3.625}, {'end': 98.789, 'text': 'Here, k is equal to 2.', 'start': 95.285, 'duration': 3.504}], 'summary': 'Using k-means clustering algorithm, we identify 2 clusters in the dataset to make useful predictions.', 'duration': 73.483, 'max_score': 25.306, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM25306.jpg'}, {'end': 225.894, 'src': 'heatmap', 'start': 190.324, 'weight': 2, 'content': [{'end': 199.076, 'text': 'for example, for this red cluster, which is these four data points, you will try to find the center of gravity almost,', 'start': 190.324, 'duration': 8.752}, {'end': 204.305, 'text': "and you'll put the red centroid there and you do the same thing for green one.", 'start': 199.076, 'duration': 5.229}, {'end': 208.808, 'text': 'So you get this when you make the adjustment.', 'start': 206.127, 'duration': 2.681}, {'end': 212.049, 'text': 'And now you repeat the same process again.', 'start': 209.948, 'duration': 2.101}, {'end': 217.791, 'text': 'Again, you recompute the distance of each of these 
points from these centroids.', 'start': 212.469, 'duration': 5.322}, {'end': 223.573, 'text': 'And then if the point is more near to red, you put them in a red cluster.', 'start': 218.371, 'duration': 5.202}, {'end': 225.894, 'text': 'Otherwise, you put it in a green cluster.', 'start': 223.613, 'duration': 2.281}], 'summary': 'Data points are classified into red or green clusters based on distance from centroids.', 'duration': 26.818, 'max_score': 190.324, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM190324.jpg'}], 'start': 0.369, 'title': 'K-means clustering', 'summary': 'Provides an overview of the k-means clustering algorithm, a popular unsupervised learning method used to identify clusters in a dataset. it covers the process of identifying clusters, selecting the number of clusters, and iteratively refining the clusters until convergence.', 'chapters': [{'end': 270.091, 'start': 0.369, 'title': 'K-means clustering overview', 'summary': 'Explains the k-means clustering algorithm, which is a popular unsupervised learning method used to identify clusters in a dataset. 
it covers the process of identifying clusters, selecting the number of clusters (k), and iteratively refining the clusters until convergence.', 'duration': 269.722, 'highlights': ["The K-Means algorithm categorizes data into clusters, with a key parameter 'k' determining the number of clusters to identify.", 'The algorithm iteratively adjusts centroids to improve cluster accuracy, with data points reassigned to the nearest centroid based on distance calculations.', 'The tutorial is structured into three parts: theory, coding, and exercises, providing a comprehensive understanding of the K-Means clustering process.', 'In unsupervised learning, the focus is on identifying underlying structures and clusters within a dataset without the presence of target variables or class labels.', 'K-Means helps in identifying clusters in the absence of target variables, making it a valuable tool for data exploration and pattern recognition.']}], 'duration': 269.722, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM369.jpg', 'highlights': ['The tutorial is structured into three parts: theory, coding, and exercises, providing a comprehensive understanding of the K-Means clustering process.', "The K-Means algorithm categorizes data into clusters, with a key parameter 'k' determining the number of clusters to identify.", 'The algorithm iteratively adjusts centroids to improve cluster accuracy, with data points reassigned to the nearest centroid based on distance calculations.', 'In unsupervised learning, the focus is on identifying underlying structures and clusters within a dataset without the presence of target variables or class labels.', 'K-Means helps in identifying clusters in the absence of target variables, making it a valuable tool for data exploration and pattern recognition.']}, {'end': 554.681, 'segs': [{'end': 371.742, 'src': 'heatmap', 'start': 295.112, 'weight': 0, 'content': [{'end': 300.956, 'text': 'So which K 
should you start with? Well, there is a technique called elbow method.', 'start': 295.112, 'duration': 5.844}, {'end': 303.518, 'text': "OK, and we'll look into it.", 'start': 301.537, 'duration': 1.981}, {'end': 312.365, 'text': 'But just to look at our data set, we started with two clusters, but someone might say, no, these are actually four clusters.', 'start': 303.919, 'duration': 8.446}, {'end': 315.828, 'text': 'Third person might say, oh, they are actually six clusters.', 'start': 312.865, 'duration': 2.963}, {'end': 321.232, 'text': 'So you can see like different people might interpret these things in a different way.', 'start': 316.168, 'duration': 5.064}, {'end': 326.256, 'text': 'And your job is to find out the best possible K number.', 'start': 321.752, 'duration': 4.504}, {'end': 329.938, 'text': 'OK, and that technique is called elbow method.', 'start': 326.656, 'duration': 3.282}, {'end': 334.881, 'text': 'And the way that method works is you start with some K.', 'start': 330.318, 'duration': 4.563}, {'end': 340.745, 'text': "OK, so let's say we start with K is equal to two and we try to compute sum of squared error.", 'start': 334.881, 'duration': 5.864}, {'end': 348.69, 'text': 'What it means is for each of the clusters, you try to compute the distance of individual data points from the centroid.', 'start': 341.405, 'duration': 7.285}, {'end': 351.272, 'text': 'You square it and then you sum it up.', 'start': 349.25, 'duration': 2.022}, {'end': 356.975, 'text': 'so for this cluster we got sum of squared error one.', 'start': 351.972, 'duration': 5.003}, {'end': 362.598, 'text': 'similarly, for the second cluster, you will get, uh, the error number two,', 'start': 356.975, 'duration': 5.623}, {'end': 369.061, 'text': 'and you do that for all your clusters and in the end you get the total sum of squared errors.', 'start': 362.598, 'duration': 6.463}, {'end': 371.742, 'text': 'now we do square just to handle negative values.', 'start': 369.061, 'duration': 2.681}], 'summary': 'Using the elbow method to determine the best k number for cluster analysis by computing sum of squared errors.', 'duration': 53.578, 'max_score': 295.112, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM295112.jpg'}, {'end': 402.942, 'src': 'heatmap', 'start': 375.004, 'weight': 0.829, 'content': [{'end': 382.997, 'text': 'so now we computed SSE for k equal to 2, you repeat the same process for k equal to 3, 4 and so on.', 'start': 375.004, 'duration': 7.993}, {'end': 396.06, 'text': 'okay, and once you have that number, you draw a plot like this. Here I have k going from 1 to 11 and then on the y-axis I have sum of squared error.', 'start': 382.997, 'duration': 13.063}, {'end': 402.942, 'text': "You'll realize that as you increase number of clusters, it will decrease the error.", 'start': 396.6, 'duration': 6.342}], 'summary': 'Compute sum of squared error for k=2,3,4, etc. Plot shows error decreases with more clusters.', 'duration': 27.938, 'max_score': 375.004, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM375004.jpg'}, {'end': 436.102, 'src': 'embed', 'start': 406.383, 'weight': 1, 'content': [{'end': 415.468, 'text': 'At some point you can consider each of your data points as an individual cluster, where your sum of square error becomes almost zero.', 'start': 406.383, 'duration': 9.085}, {'end': 423.051, 'text': "Okay, so let's assume we have only 11 data points; at a value of k of 11 the error will become zero.", 'start': 415.828, 'duration': 7.223}, {'end': 430.135, 'text': 'Okay, so error will keep on reducing and the general guideline is to find out an elbow.', 'start': 423.312, 'duration': 6.823}, {'end': 434.061, 'text': 'so the elbow is this chart.', 'start': 430.135, 'duration': 3.926}, {'end': 436.102, 'text': 'this point is sort of like an elbow.', 'start': 434.061, 'duration': 2.041}], 'summary': 'Data points can form a 
cluster where sum of square error becomes almost zero, with an elbow point indicating a significant reduction in error.', 'duration': 29.719, 'max_score': 406.383, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM406383.jpg'}, {'end': 554.681, 'src': 'embed', 'start': 494.48, 'weight': 2, 'content': [{'end': 497.802, 'text': 'So right now we have just name, age and income.', 'start': 494.48, 'duration': 3.322}, {'end': 505.125, 'text': "And first thing I'm going to do is import that data set into Pandas data frame.", 'start': 498.642, 'duration': 6.483}, {'end': 511.168, 'text': 'So here you can see that I imported essential libraries and then I have my data frame ready with that.', 'start': 505.145, 'duration': 6.023}, {'end': 518.772, 'text': 'And since the data set is simple enough, I will first try to plot it on a scatter plot.', 'start': 512.049, 'duration': 6.723}, {'end': 525.395, 'text': "OK, so when you plot it on a scatter plot, of course, I don't want to include name.", 'start': 519.332, 'duration': 6.063}, {'end': 529.678, 'text': 'I just want to plot the age against the income.', 'start': 525.656, 'duration': 4.022}, {'end': 537.462, 'text': 'So DF dot age, DF income in dollar.', 'start': 530.358, 'duration': 7.104}, {'end': 540.283, 'text': "I'll just use the same convention.", 'start': 538.122, 'duration': 2.161}, {'end': 541.464, 'text': 'You can use dot also.', 'start': 540.344, 'duration': 1.12}, {'end': 544.746, 'text': "But since there is a bracket here, I'll use the same convention.", 'start': 541.884, 'duration': 2.862}, {'end': 554.681, 'text': 'Okay, when you plot this on scatter chart, you can kind of see three clusters, one, two, and three.', 'start': 548.519, 'duration': 6.162}], 'summary': 'Data set includes name, age, and income. 
plotted on scatter plot, reveals three clusters.', 'duration': 60.201, 'max_score': 494.48, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM494480.jpg'}], 'start': 270.191, 'title': 'Optimal cluster analysis', 'summary': "discusses the significance of the correct k value in multi-dimensional space for clustering algorithms, introduces the elbow method to determine the optimal k, explains the elbow technique for k-means clustering, demonstrates the decrease in sum of squared errors with increasing clusters, provides guidance on identifying the 'elbow' point, and covers data analysis with pandas including importing, plotting a scatter plot, and identifying clusters in a dataset.", 'chapters': [{'end': 348.69, 'start': 270.191, 'title': 'Finding optimal clusters with the elbow method', 'summary': 'Discusses the importance of supplying the correct k value for clustering algorithms in multi-dimensional space, emphasizing the challenge of determining an optimal k and introducing the elbow method for this purpose.', 'duration': 78.499, 'highlights': ['The elbow method is introduced as a technique to determine the optimal number of clusters, addressing the challenge of selecting a suitable K value for clustering algorithms.', 'The difficulty of visualizing data in multi-dimensional space is highlighted, emphasizing the challenge of determining an optimal K value for clustering algorithms.', 'Different interpretations of clustering may lead to varying opinions on the ideal number of clusters, illustrating the challenge of determining an optimal K value for clustering algorithms.']}, {'end': 493.62, 'start': 349.25, 'title': 'Elbow technique for cluster analysis', 'summary': "explains the elbow technique for determining the optimal number of clusters in k-means clustering, showcasing how the sum of squared errors decreases as the number of clusters increases, and provides guidance on identifying the 'elbow' point to determine the optimal number of clusters, followed by a practical example of applying cluster analysis to a dataset of age and income for identifying characteristics of different groups.", 'duration': 144.37, 'highlights': ["The chapter explains the elbow technique for determining the optimal number of clusters in K-means clustering, showcasing how the sum of squared errors decreases as the number of clusters increases, and provides guidance on identifying the 'elbow' point to determine the optimal number of clusters.", 'As the number of clusters increases, the sum of squared errors decreases, and it becomes intuitive to think that at some point each data point can be considered its own cluster with almost zero sum of square error.', "The 'elbow' point on the plot represents a good cluster number, with the example suggesting a good K number of 4 for the dataset being discussed.", 'The chapter concludes with a practical example of applying cluster analysis to a dataset of age and income, aiming to identify characteristics of different groups, such as regional salary differences or differences based on profession.']}, {'end': 554.681, 'start': 494.48, 'title': 'Data analysis with pandas', 'summary': 'Covers importing a data set into a pandas data frame, plotting a scatter plot of age against income, and identifying three clusters in the data set.', 'duration': 60.201, 'highlights': ['The data set is imported into a Pandas data frame, consisting of name, age, and income.', 'A scatter plot is created to visualize the relationship between age and income, revealing three distinct clusters.', 'The process involves importing essential libraries, preparing the data frame, and plotting the scatter chart.']}], 'duration': 284.49, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM270191.jpg', 'highlights': ['The elbow method is introduced as a technique to determine the optimal number of clusters, addressing the challenge of selecting a 
suitable K value for clustering algorithms.', "The chapter explains the Elbow Technique for determining the optimal number of clusters in K-means clustering, showcasing how the sum of squared errors decreases as the number of clusters increases, and provides guidance on identifying the 'elbow' point to determine the optimal number of clusters.", 'A scatter plot is created to visualize the relationship between age and income, revealing three distinct clusters.', 'Different interpretations of clustering may lead to varying opinions on the ideal number of clusters, illustrating the challenge of determining an optimal K value for clustering algorithms.', 'The data set is imported into a Pandas data frame, consisting of name, age, and income.']}, {'end': 858.579, 'segs': [{'end': 587.058, 'src': 'embed', 'start': 556.982, 'weight': 0, 'content': [{'end': 562.624, 'text': 'So for this particular case, choosing K is pretty straightforward.', 'start': 556.982, 'duration': 5.642}, {'end': 567.986, 'text': 'So I will use K means.', 'start': 564.045, 'duration': 3.941}, {'end': 571.627, 'text': 'So K means is something we imported here.', 'start': 568.226, 'duration': 3.401}, {'end': 581.616, 'text': 'Okay, and of course you need to specify your k, which is n underscore clusters and, by the way, in jupyter notebook,', 'start': 572.988, 'duration': 8.628}, {'end': 587.058, 'text': 'when you type something and when you hit tab, it will auto complete.', 'start': 581.616, 'duration': 5.442}], 'summary': 'Using k means with n clusters for this case.', 'duration': 30.076, 'max_score': 556.982, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM556982.jpg'}, {'end': 686.047, 'src': 'embed', 'start': 620.241, 'weight': 1, 'content': [{'end': 622.843, 'text': 'So fit and predict what? 
Okay.', 'start': 620.241, 'duration': 2.602}, {'end': 630.095, 'text': "I'm going to fit and predict the data frame, excluding the name column,", 'start': 624.153, 'duration': 5.942}, {'end': 635.758, 'text': "because name column is string and it's not going to be useful in our numeric computation.", 'start': 630.095, 'duration': 5.663}, {'end': 637.138, 'text': 'So I want to ignore it.', 'start': 635.818, 'duration': 1.32}, {'end': 644.161, 'text': 'All right.', 'start': 643.701, 'duration': 0.46}, {'end': 651.384, 'text': 'So you do fit and predict and what you get back is y predicted.', 'start': 645.141, 'duration': 6.243}, {'end': 665.017, 'text': 'So now what this statement did is it ran K-means algorithm on age and income, which is this scatterplot,', 'start': 656.414, 'duration': 8.603}, {'end': 671.899, 'text': 'and it computed the cluster as per our criteria, where we told algorithm to identify three clusters somehow.', 'start': 665.017, 'duration': 6.882}, {'end': 673.94, 'text': 'OK, and it did it.', 'start': 672.219, 'duration': 1.721}, {'end': 676.881, 'text': 'It just assigned them different labels.', 'start': 674.56, 'duration': 2.321}, {'end': 679.782, 'text': 'So you can see three clusters, zero, one and two.', 'start': 676.901, 'duration': 2.881}, {'end': 686.047, 'text': 'now, visualizing this array is not very, very much fun.', 'start': 680.402, 'duration': 5.645}], 'summary': 'Ran k-means algorithm on age and income to identify three clusters.', 'duration': 65.806, 'max_score': 620.241, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM620241.jpg'}, {'end': 790.922, 'src': 'embed', 'start': 760.079, 'weight': 4, 'content': [{'end': 768.023, 'text': 'and the second one will be this and the third one will be this.', 'start': 760.079, 'duration': 7.944}, {'end': 782.73, 'text': 'so now we have three different data frames, each belonging to one cluster, and i want to plot these three data 
frames onto, uh, one scatter plot.', 'start': 768.023, 'duration': 14.707}, {'end': 790.922, 'text': 'OK, now, just to save some time, let me just copy paste the code here.', 'start': 786.3, 'duration': 4.622}], 'summary': 'Plot three data frames onto one scatter plot.', 'duration': 30.843, 'max_score': 760.079, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM760079.jpg'}, {'end': 858.579, 'src': 'heatmap', 'start': 816.758, 'weight': 6, 'content': [{'end': 827.08, 'text': "okay so df oh i'm made a mistake here i had a typo Good.", 'start': 816.758, 'duration': 10.322}, {'end': 827.781, 'text': 'All right.', 'start': 827.5, 'duration': 0.281}, {'end': 831.763, 'text': 'So I see a scatterplot here, but there is a little problem.', 'start': 828.061, 'duration': 3.702}, {'end': 836.525, 'text': 'So this red cluster looks okay, but there is a problem with these two clusters.', 'start': 832.023, 'duration': 4.502}, {'end': 838.907, 'text': 'You know, they are not grouped correctly.', 'start': 836.585, 'duration': 2.322}, {'end': 843.109, 'text': 'So this problem happened because our scaling is not right.', 'start': 839.587, 'duration': 3.522}, {'end': 852.054, 'text': "Our y-axis is scaled from, let's say, 40, 000 to 160, 000, and the range of x-axis is pretty narrow.", 'start': 843.429, 'duration': 8.625}, {'end': 853.515, 'text': "See, it's like hardly.", 'start': 852.374, 'duration': 1.141}, {'end': 858.579, 'text': '120 versus here is hundred and twenty thousand.', 'start': 854.375, 'duration': 4.204}], 'summary': 'Data visualization issue with incorrect clustering due to scaling problem.', 'duration': 26.556, 'max_score': 816.758, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM816758.jpg'}], 'start': 556.982, 'title': 'K means clustering in python', 'summary': "Introduces k means clustering in python, covering the process of choosing k, importing the k 
means library, fitting and predicting the data frame, and excluding non-numeric columns, with an emphasis on leveraging default parameters. it also discusses visualizing k-means clustering results on a scatter plot, identifying three clusters and addressing the scaling issue affecting the accuracy of the clusters' grouping.", 'chapters': [{'end': 651.384, 'start': 556.982, 'title': 'K means clustering in python', 'summary': 'Introduces the use of k means clustering in python for data analysis, including the process of choosing k, importing the k means library, fitting and predicting the data frame, and excluding non-numeric columns, with an emphasis on leveraging default parameters.', 'duration': 94.402, 'highlights': ['The process of using K Means clustering in Python involves choosing the appropriate value for K, importing the K means library, and fitting and predicting the data frame, with an emphasis on leveraging default parameters.', 'In Python, when using Jupyter notebook, the tab key can be used for auto-completion, simplifying the process of specifying parameters and completing code.', 'The K means object created has default parameters that can be later adjusted based on the specific requirements of the analysis.', 'In K Means clustering, the fit and predict method is used directly, excluding non-numeric columns from the data frame to enhance the accuracy of the prediction.']}, {'end': 858.579, 'start': 656.414, 'title': 'Visualizing k-means clustering results', 'summary': "Discusses the process of visualizing k-means clustering results on a scatter plot, identifying three clusters and addressing the scaling issue affecting the accuracy of the clusters' grouping.", 'duration': 202.165, 'highlights': ['The chapter demonstrates running the K-means algorithm on age and income to compute three clusters, visually represented as cluster 0, 1, and 2, on a scatter plot.', 'The process involves separating the clusters into three different data frames based on their 
assigned labels, followed by plotting these data frames onto a single scatter plot using different colors for each cluster.', 'The chapter highlights the issue of inaccurate grouping in two clusters due to scaling problems, specifically in the y-axis, with a range from 40,000 to 160,000, and a narrow x-axis range of 120, resulting in incorrect cluster visualization.']}], 'duration': 301.597, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM556982.jpg', 'highlights': ['The process of using K Means clustering in Python involves choosing the appropriate value for K, importing the K means library, and fitting and predicting the data frame, with an emphasis on leveraging default parameters.', 'The chapter demonstrates running the K-means algorithm on age and income to compute three clusters, visually represented as cluster 0, 1, and 2, on a scatter plot.', 'In K Means clustering, the fit and predict method is used directly, excluding non-numeric columns from the data frame to enhance the accuracy of the prediction.', 'The K means object created has default parameters that can be later adjusted based on the specific requirements of the analysis.', 'The process involves separating the clusters into three different data frames based on their assigned labels, followed by plotting these data frames onto a single scatter plot using different colors for each cluster.', 'In Python, when using Jupyter notebook, the tab key can be used for auto-completion, simplifying the process of specifying parameters and completing code.', 'The chapter highlights the issue of inaccurate grouping in two clusters due to scaling problems, specifically in the y-axis, with a range from 40,000 to 160,000, and a narrow x-axis range of 120, resulting in incorrect cluster visualization.']}, {'end': 1236.058, 'segs': [{'end': 889.765, 'src': 'embed', 'start': 858.579, 'weight': 0, 'content': [{'end': 864.464, 'text': "So when you don't scale your 
features properly, you might get into this problem.", 'start': 858.579, 'duration': 5.885}, {'end': 873.953, 'text': "That's why we need to do some pre-processing and use MinMaxScaler to scale these two features, and then only we can run our algorithm.", 'start': 864.464, 'duration': 9.489}, {'end': 874.814, 'text': 'All right.', 'start': 874.053, 'duration': 0.761}, {'end': 878.738, 'text': 'so we are going to use MinMaxScaler.', 'start': 874.814, 'duration': 3.924}, {'end': 881.339, 'text': 'so the way you do it is.', 'start': 879.378, 'duration': 1.961}, {'end': 889.765, 'text': 'you will say scaler is MinMaxScaler, and this is something, if you already noticed, we imported here.', 'start': 881.339, 'duration': 8.426}], 'summary': 'Scaling features using MinMaxScaler is essential for algorithm efficiency.', 'duration': 31.186, 'max_score': 858.579, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM858579.jpg'}, {'end': 988.171, 'src': 'heatmap', 'start': 974.707, 'weight': 1, 'content': [{'end': 979.068, 'text': 'So you can see that the income is scaled, right?', 'start': 974.707, 'duration': 4.361}, {'end': 983.249, 'text': "It's like see 0.21, 0.38 and so on.", 'start': 979.188, 'duration': 4.061}, {'end': 985.75, 'text': 'So it is in a range of zero to one.', 'start': 983.309, 'duration': 2.441}, {'end': 988.171, 'text': 'You will not see any value outside zero to one range.', 'start': 985.79, 'duration': 2.381}], 'summary': 'Income is represented on a scale from 0 to 1, with no values outside this range.', 'duration': 48.623, 'max_score': 974.707, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM974707.jpg'}, {'end': 1200.865, 'src': 'embed', 'start': 1032.31, 'weight': 3, 'content': [{'end': 1039.156, 'text': 'OK And even if you plot these on to scatter plot they will look structure wise at least they will look like this.', 'start': 
1032.31, 'duration': 6.846}, {'end': 1042.103, 'text': 'OK, all right.', 'start': 1039.518, 'duration': 2.585}, {'end': 1046.885, 'text': 'so the next step is to use the k-means algorithm once again.', 'start': 1042.103, 'duration': 4.782}, {'end': 1051.766, 'text': 'to train our scaled data set.', 'start': 1046.885, 'duration': 4.881}, {'end': 1054.087, 'text': 'this is gonna be fun.', 'start': 1051.766, 'duration': 2.321}, {'end': 1058.528, 'text': "now let's see what scaling can give us.", 'start': 1054.087, 'duration': 4.441}, {'end': 1067.21, 'text': 'as usual, y predicted is equal to km.fit_predict.', 'start': 1062.627, 'duration': 4.583}, {'end': 1090.952, 'text': "so again, I started with three clusters and I'm just fitting my scaled data, age and income, all right, and let's see my y predicted.", 'start': 1067.21, 'duration': 23.742}, {'end': 1096.277, 'text': "so it predicted some values which we don't yet know how good they are.", 'start': 1090.952, 'duration': 5.325}, {'end': 1103.043, 'text': 'so I will just do cluster is equal to y predicted.', 'start': 1096.277, 'duration': 6.766}, {'end': 1110.289, 'text': 'I will also just drop the column that we typoed.', 'start': 1103.043, 'duration': 7.246}, {'end': 1125.945, 'text': "then let's look at df. in place is equal to true, okay.", 'start': 1115.434, 'duration': 10.511}, {'end': 1130.729, 'text': 'so now this is my new clustering result.', 'start': 1125.945, 'duration': 4.784}, {'end': 1136.575, 'text': "let's plot this onto our scatter plot.", 'start': 1130.729, 'duration': 5.846}, {'end': 1146.943, 'text': "I'm just going to remove this for now.", 'start': 1145.362, 'duration': 1.581}, {'end': 1151.404, 'text': 'Now you can see that I have a pretty good cluster.', 'start': 1148.603, 'duration': 2.801}, {'end': 1152.865, 'text': 'See black, green, and red.', 'start': 1151.444, 'duration': 1.421}, {'end': 1154.606, 'text': 'They look very nicely formed.', 'start': 1153.125, 'duration': 
1.481}, {'end': 1159.888, 'text': 'One of the things we studied in our theory section was centroids.', 'start': 1155.866, 'duration': 4.022}, {'end': 1169.411, 'text': 'So, if you look at KM, which is your trained K-means model,', 'start': 1160.408, 'duration': 9.003}, {'end': 1177.618, 'text': 'that has a variable called cluster centers, and these centers are basically your centroids.', 'start': 1169.411, 'duration': 8.207}, {'end': 1180, 'text': 'okay, so this is x, this is y.', 'start': 1177.618, 'duration': 2.382}, {'end': 1186.865, 'text': 'so this is the centroid of your first cluster, the second centroid and the third centroid.', 'start': 1180, 'duration': 6.865}, {'end': 1193.53, 'text': 'and if you plot this onto a scatter plot, it can give us a nice visualization, right?', 'start': 1186.865, 'duration': 6.665}, {'end': 1195.292, 'text': 'so plt.scatter.', 'start': 1193.53, 'duration': 1.762}, {'end': 1200.865, 'text': "So first let's plot the X axis.", 'start': 1197.423, 'duration': 3.442}], 'summary': 'Using the k-means algorithm to cluster data and visualize centroids on a scatter plot.', 'duration': 168.555, 'max_score': 1032.31, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM1032310.jpg'}], 'start': 858.579, 'title': 'Scaling and visualizing clustering results', 'summary': 'Covers scaling features using min max scaler to prepare data for algorithmic processing, applying the k-means algorithm to train on a scaled data set and make predictions, and visualizing clustering results on a scatter plot with a focus on centroids and distinct clusters.', 'chapters': [{'end': 1031.99, 'start': 858.579, 'title': 'Scaling features with min max scaler', 'summary': 'Explains the use of min max scaler to scale features such as income and age, transforming their values to a range of 0 to 1, to prepare the data for algorithmic processing.', 'duration': 173.411, 'highlights': ['Using min max scaler to scale features 
like income and age is crucial for preparing the data for algorithmic processing, ensuring that the values are transformed to a range of 0 to 1.', 'Demonstrates the process of fitting and transforming features using min max scaler, resulting in scaled values for both income and age within the range of 0 to 1.', 'Emphasizes the importance of proper feature scaling to avoid problems when running algorithms, highlighting the necessity of pre-processing for effective data analysis.']}, {'end': 1125.945, 'start': 1032.31, 'title': 'K-means algorithm on scaled data', 'summary': 'Demonstrates the use of the k-means algorithm to train on a scaled data set and make predictions, starting with three clusters and evaluating the y predicted values.', 'duration': 93.635, 'highlights': ['Using the k-means algorithm to train on a scaled data set and evaluate y predicted values', 'Starting with three clusters for the k-means algorithm', 'Dropping a column from the data set and examining the result']}, {'end': 1236.058, 'start': 1125.945, 'title': 'Clustering results visualization', 'summary': 'Discusses visualizing clustering results on a scatter plot, highlighting the formation of distinct clusters and the use of centroids to represent cluster centers in the visualization, with a focus on plotting the centroids and differentiating them from regular data points.', 'duration': 110.113, 'highlights': ['The chapter discusses the visualization of clustering results on a scatter plot, highlighting the distinct formation of clusters, including black, green, and red clusters, and emphasizes the effectiveness of this visualization for analysis.', "It explains the concept of centroids in the context of the K-means model, where the 'cluster centers' variable represents centroids, and illustrates their use in plotting the centroids on a scatter plot for visualization and analysis.", 'The chapter provides guidance on differentiating centroids from regular data points in the scatter plot, emphasizing the 
significance of this differentiation for effective visualization and analysis.']}], 'duration': 377.479, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM858579.jpg', 'highlights': ['Using min max scaler to scale features like income and age is crucial for preparing the data for algorithmic processing, ensuring that the values are transformed to a range of 0 to 1.', 'Demonstrates the process of fitting and transforming features using min max scaler, resulting in scaled values for both income and age within the range of 0 to 1.', 'Emphasizes the importance of proper feature scaling to avoid problems when running algorithms, highlighting the necessity of pre-processing for effective data analysis.', 'The chapter discusses the visualization of clustering results on a scatter plot, highlighting the distinct formation of clusters, including black, green, and red clusters, and emphasizes the effectiveness of this visualization for analysis.', "It explains the concept of centroids in the context of the K-means model, where the 'cluster centers' variable represents centroids, and illustrates their use in plotting the centroids on a scatter plot for visualization and analysis.", 'Using the k-means algorithm to train on a scaled data set and evaluate y predicted values', 'Starting with three clusters for the k-means algorithm', 'Dropping a column from the data set and examining the result', 'The chapter provides guidance on differentiating centroids from regular data points in the scatter plot, emphasizing the significance of this differentiation for effective visualization and analysis.']}, {'end': 1514.097, 'segs': [{'end': 1297.832, 'src': 'embed', 'start': 1267.469, 'weight': 0, 'content': [{'end': 1271.951, 'text': "And you'll be like, what do I do now? 
Well, you use your elbow plot method.", 'start': 1267.469, 'duration': 4.482}, {'end': 1273.452, 'text': 'So in the elbow plot.', 'start': 1272.471, 'duration': 0.981}, {'end': 1281.278, 'text': 'as we saw in theory, we go through a number of k values.', 'start': 1274.472, 'duration': 6.806}, {'end': 1287.163, 'text': "okay. so let's say we'll go from k equal to 1 to 10 in our case, okay.", 'start': 1281.278, 'duration': 5.885}, {'end': 1293.308, 'text': 'and then we try to calculate SSE, which is the sum of squared errors, and then plot them and try to find this elbow.', 'start': 1287.163, 'duration': 6.145}, {'end': 1297.832, 'text': "so let's define our k range.", 'start': 1293.308, 'duration': 4.524}], 'summary': 'Use the elbow plot method to find the k at the bend in the SSE curve.', 'duration': 30.363, 'max_score': 1267.469, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM1267469.jpg'}, {'end': 1509.593, 'src': 'heatmap', 'start': 1408.323, 'weight': 1, 'content': [{'end': 1414.789, 'text': 'So SSE, you can see that the sum of squared errors was very high initially, then it kept on reducing.', 'start': 1408.323, 'duration': 6.466}, {'end': 1422.397, 'text': "And now let's plot this guy into a nice chart.", 'start': 1415.57, 'duration': 6.827}, {'end': 1431.706, 'text': 'okay, when you do that, you get our elbow plot.', 'start': 1426.602, 'duration': 5.104}, {'end': 1433.708, 'text': 'remember elbow plot, elbow.', 'start': 1431.706, 'duration': 2.002}, {'end': 1435.69, 'text': 'all right, where is my elbow?', 'start': 1433.708, 'duration': 1.982}, {'end': 1436.991, 'text': 'where is my elbow?', 'start': 1435.69, 'duration': 1.301}, {'end': 1438.452, 'text': 'okay, here is my elbow.', 'start': 1436.991, 'duration': 1.461}, {'end': 1443.216, 'text': "you can see that k is equal to three for my elbow, and that's what happened.", 'start': 1438.452, 'duration': 4.764}, {'end': 1447.657, 'text': 'see, I have three clusters. For the exercise,', 
'start': 1443.216, 'duration': 4.441}, {'end': 1452.139, 'text': 'we are going to use our iris flower dataset from the sklearn library.', 'start': 1447.657, 'duration': 4.482}, {'end': 1456.6, 'text': 'And what you have to do is use the petal length and width features.', 'start': 1452.479, 'duration': 4.121}, {'end': 1461.862, 'text': 'Just drop sepal length and width because they make your clustering a little bit difficult.', 'start': 1456.84, 'duration': 5.022}, {'end': 1464.263, 'text': 'So just drop these two features for simplicity.', 'start': 1462.142, 'duration': 2.121}, {'end': 1469.965, 'text': 'Use the petal length and width features and try to form clusters in that dataset.', 'start': 1464.863, 'duration': 5.102}, {'end': 1476.348, 'text': 'Now, that dataset has a class label in the target variable, but you should just ignore it.', 'start': 1470.445, 'duration': 5.903}, {'end': 1480.51, 'text': 'You can use that just to confirm your results.', 'start': 1476.929, 'duration': 3.581}, {'end': 1487.234, 'text': 'And in the end, you will draw an elbow plot to find out the optimal value of k.', 'start': 1481.191, 'duration': 6.043}, {'end': 1489.135, 'text': 'So just do the exercise.', 'start': 1487.234, 'duration': 1.901}, {'end': 1493.819, 'text': 'Post your results into the video comments below.', 'start': 1489.135, 'duration': 4.684}, {'end': 1500.104, 'text': 'Also, I have provided a link to the Jupyter notebook used in this tutorial in the video description.', 'start': 1493.819, 'duration': 6.285}, {'end': 1501.966, 'text': 'So look at it.', 'start': 1500.104, 'duration': 1.862}, {'end': 1506.67, 'text': 'When you go towards the end, you will find the exercise section.', 'start': 1501.966, 'duration': 4.704}, {'end': 1509.593, 'text': "Also, don't forget to give it a thumbs up.", 'start': 1506.67, 'duration': 2.923}], 'summary': 'Plot SSE against k in an elbow plot to find the optimal k; the exercise clusters the iris flower dataset.', 'duration': 52.753, 'max_score': 1408.323, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM1408323.jpg'}], 'start': 1236.058, 'title': 'Elbow plot method for clustering', 'summary': 'Introduces the elbow plot method for determining the optimal number of clusters in k-means clustering, demonstrated through a practical example using the iris flower dataset, resulting in an optimal k value of 3 and excluding the sepal length and width features.', 'chapters': [{'end': 1514.097, 'start': 1236.058, 'title': 'Elbow plot method for clustering', 'summary': 'Introduces the elbow plot method for determining the optimal number of clusters in k-means clustering, demonstrated through a practical example using the iris flower dataset, resulting in an optimal k value of 3 and excluding the sepal length and width features.', 'duration': 278.039, 'highlights': ["The elbow plot method involves iterating through a range of k values, calculating the sum of squared errors (SSE) for each, and plotting them to identify the 'elbow' point, resulting in an optimal k value of 3.", 'In the practical example using the iris flower dataset, the exercise involves forming clusters using the petal length and width features, excluding the sepal length and width to simplify clustering, and drawing an elbow plot to determine the optimal k value.', 'The tutorial provides a link to the Jupyter notebook used in the demonstration and encourages viewers to post their results and engage with the content.']}], 'duration': 278.039, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/EItlUEPCIzM/pics/EItlUEPCIzM1236058.jpg', 'highlights': ['The elbow plot method identifies the optimal k value of 3 through SSE calculation.', 'The practical example uses the petal length and width features, excluding sepal length and width.', 'The tutorial provides a link to the Jupyter notebook and encourages viewer engagement.']}], 'highlights': ['The tutorial is structured into three parts: theory, coding, 
and exercises, providing a comprehensive understanding of the K-Means clustering process.', 'The elbow method is introduced as a technique to determine the optimal number of clusters, addressing the challenge of selecting a suitable K value for clustering algorithms.', 'Using min max scaler to scale features like income and age is crucial for preparing the data for algorithmic processing, ensuring that the values are transformed to a range of 0 to 1.', 'The process of using K-Means clustering in Python involves choosing the appropriate value for K, importing the KMeans class, and fitting and predicting on the data frame, with an emphasis on leveraging default parameters.', "The elbow technique for determining the optimal number of clusters in K-means clustering showcases how the sum of squared errors decreases as the number of clusters increases, and provides guidance on identifying the 'elbow' point to determine the optimal number of clusters."]}