title
Handling imbalanced dataset in machine learning | Deep Learning Tutorial 21 (Tensorflow2.0 & Python)

description
Credit card fraud detection, cancer prediction, and customer churn prediction are some examples where you might get an imbalanced dataset. Training a model on an imbalanced dataset requires making certain adjustments, otherwise the model will not perform as expected. In this video I discuss various techniques to handle an imbalanced dataset in machine learning. I also have Python code that demonstrates these different techniques (illustrative code sketches of each technique are also included at the end of these notes). At the end there is an exercise for you to solve, along with a solution link. Code: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced/handling_imbalanced_data.ipynb Path for csv file: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced Exercise: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced/handling_imbalanced_data_exercise.md Focal loss article: https://medium.com/analytics-vidhya/how-focal-loss-fixes-the-class-imbalance-problem-in-object-detection-3d2e1c4da8d7#:~:text=Focal%20loss%20is%20very%20useful,is%20simple%20and%20highly%20effective. #imbalanceddataset #imbalanceddatasetinmachinelearning #smotetechnique #deeplearning #imbalanceddatamachinelearning Topics 00:00 Overview 00:01 Handle imbalance using under sampling 02:05 Oversampling (blind copy) 02:35 Oversampling (SMOTE) 03:00 Ensemble 03:39 Focal loss 04:47 Python coding starts 07:56 Code - undersampling 14:31 Code - oversampling (blind copy) 19:47 Code - oversampling (SMOTE) 24:26 Code - Ensemble 35:48 Exercise Do you want to learn technology from me? Check https://codebasics.io/?utm_source=description&utm_medium=yt&utm_campaign=description&utm_id=description for my affordable video courses. Previous video: https://www.youtube.com/watch?v=lcI8ukTUEbo&list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO&index=20 Deep learning playlist: https://www.youtube.com/playlist?list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO Machine learning playlist: https://www.youtube.com/playlist?list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw 🌎 My Website For Video Courses: https://codebasics.io/?utm_source=description&utm_medium=yt&utm_campaign=description&utm_id=description Need help building software or data analytics and AI solutions? My company https://www.atliq.com/ can help. Click on the Contact button on that website. #️⃣ Social Media #️⃣ 🔗 Discord: https://discord.gg/r42Kbuk 📸 Dhaval's Personal Instagram: https://www.instagram.com/dhavalsays/ 📸 Instagram: https://www.instagram.com/codebasicshub/ 🔊 Facebook: https://www.facebook.com/codebasicshub 📝 Linkedin (Personal): https://www.linkedin.com/in/dhavalsays/ 📝 Linkedin (Codebasics): https://www.linkedin.com/company/codebasics/ 📱 Twitter: https://twitter.com/codebasicshub 🔗 Patreon: https://www.patreon.com/codebasics?fan_landing=true DISCLAIMER: All opinions expressed in this video are my own and not those of my employer.

detail
{'title': 'Handling imbalanced dataset in machine learning | Deep Learning Tutorial 21 (Tensorflow2.0 & Python)', 'heatmap': [{'end': 193.573, 'start': 100.684, 'weight': 0.755}], 'summary': 'The tutorial addresses the challenge of imbalance in fraud detection data, discussing techniques like oversampling using smote and ensemble methods to handle class imbalance in machine learning, resulting in precision increase from 0.63 to 0.71 and an f1 score improvement from 53% to 81%. additionally, it covers the creation of neural network models and the process of determining majority vote for predictions.', 'chapters': [{'end': 44.457, 'segs': [{'end': 44.457, 'src': 'embed', 'start': 0.089, 'weight': 0, 'content': [{'end': 5.134, 'text': 'Fraud detection is a common problem that people try to solve in the field of machine learning.', 'start': 0.089, 'duration': 5.045}, {'end': 12.41, 'text': "But when you're training your model with a training set for fraud transaction, you will often find that you will have 10,", 'start': 5.694, 'duration': 6.716}, {'end': 15.263, 'text': '000 good transaction and only one will be fraud.', 'start': 12.41, 'duration': 2.853}, {'end': 18.646, 'text': 'This creates an imbalance in your data set.', 'start': 15.883, 'duration': 2.763}, {'end': 30.911, 'text': 'And even if you write a simple Python prediction function which returns false all the time, even with that stupid function, you can get 99% accuracy,', 'start': 19.306, 'duration': 11.605}, {'end': 33.112, 'text': 'because majority of the transactions are not fraud.', 'start': 30.911, 'duration': 2.201}, {'end': 37.574, 'text': 'But on the other hand, what you care about is the fraud transaction.', 'start': 33.932, 'duration': 3.642}, {'end': 44.457, 'text': "So although accuracy is 99%, the function is still performing horribly because it's not telling you what is fraud.", 'start': 37.814, 'duration': 6.643}], 'summary': 'Imbalanced data in fraud detection causes misleading accuracy, emphasizing need for focus on fraud transactions.', 'duration': 44.368, 'max_score': 0.089, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo89.jpg'}], 'start': 0.089, 'title': 'Imbalance in fraud detection data', 'summary': 'Discusses the challenge of imbalance in fraud detection data, highlighting how a model with 99% accuracy may struggle to identify fraud due to a high volume of legitimate transactions compared to fraudulent ones.', 'chapters': [{'end': 44.457, 'start': 0.089, 'title': 'Imbalance in fraud detection data', 'summary': 'Discusses the imbalance in fraud detection data, where a model with 99% accuracy can still perform horribly in identifying fraud transactions due to a large number of good transactions compared to fraudulent ones.', 'duration': 44.368, 'highlights': ['The imbalance in fraud detection data causes models to have high accuracy despite performing poorly in identifying fraud.', 'Training sets for fraud transactions often have 10,000 good transactions and only one fraudulent transaction, creating an imbalance in the dataset.', 'Even with a simple Python prediction function that always returns false, it can achieve 99% accuracy due to the majority of transactions being non-fraudulent, highlighting the issue of accuracy as a misleading metric.']}], 'duration': 44.368, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo89.jpg', 'highlights': ['Training sets for fraud transactions often have 10,000 
good transactions and only one fraudulent transaction, creating an imbalance in the dataset.', 'The imbalance in fraud detection data causes models to have high accuracy despite performing poorly in identifying fraud.', 'Even with a simple Python prediction function that always returns false, it can achieve 99% accuracy due to the majority of transactions being non-fraudulent, highlighting the issue of accuracy as a misleading metric.']}, {'end': 243.639, 'segs': [{'end': 193.573, 'src': 'heatmap', 'start': 67.717, 'weight': 3, 'content': [{'end': 71.58, 'text': "So please stay till the end and let's get started.", 'start': 67.717, 'duration': 3.863}, {'end': 79.848, 'text': 'The first technique to handle imbalance in your data set is under sampling the majority class.', 'start': 73.382, 'duration': 6.466}, {'end': 89.998, 'text': "Let's say you have 99,000 samples belonging to one class, let's say the green class, and 1,000 samples belonging to the red class.", 'start': 80.309, 'duration': 9.689}, {'end': 94.582, 'text': "Let's say this is a fraud detection scenario where 1,000 transactions are fraud and 99,000 are not fraud transactions.", 'start': 90.358, 'duration': 4.224}, {'end': 100.684, 'text': 'To tackle this imbalance.', 'start': 98.922, 'duration': 1.762}, {'end': 118.105, 'text': 'what you can do is take randomly picked 1000 samples from your 99000 samples and discard remaining samples and then combine that with 1000 red samples and then train your machine learning model.', 'start': 100.684, 'duration': 17.421}, {'end': 124.648, 'text': 'but obviously this is not the best approach, because you are throwing away so much data.', 'start': 118.866, 'duration': 5.782}, {'end': 129.07, 'text': 'so the second option is over sample the minority class.', 'start': 124.648, 'duration': 4.422}, {'end': 130.911, 'text': 'now, how do you over sample it?', 'start': 129.07, 'duration': 1.841}, {'end': 132.311, 'text': 'well, think about it.', 'start': 130.911, 'duration': 1.4}, {'end': 141.855, 'text': 'one obvious technique is you duplicate this thousand transactions 99 times and you get 99,000 transactions.', 'start': 132.311, 'duration': 9.544}, {'end': 149.758, 'text': "it's just a simple copy, and then you train the machine learning model. While this works,", 'start': 141.855, 'duration': 7.903}, {'end': 152.018, 'text': 'you would think there should be a better way.', 'start': 149.758, 'duration': 2.26}, {'end': 153.919, 'text': 'Well, that is your third option.', 'start': 152.298, 'duration': 1.621}, {'end': 157.72, 'text': 'You do oversampling using a technique called SMOTE.', 'start': 154.899, 'duration': 2.821}, {'end': 168.784, 'text': 'So here you use the k nearest neighbors algorithm and try to produce synthetic samples from your 1000 samples.', 'start': 158.28, 'duration': 10.504}, {'end': 174.505, 'text': "That's why it's called Synthetic Minority Oversampling Technique.", 'start': 170.942, 'duration': 3.563}, {'end': 179.709, 'text': 'And in Python, there is a module called imblearn, which can be used for SMOTE.', 'start': 174.545, 'duration': 5.164}, {'end': 183.403, 'text': 'The fourth technique is ensemble.', 'start': 181.661, 'duration': 1.742}, {'end': 188.588, 'text': "So let's say you have 3000 transactions in one class, 1000 in another.", 'start': 184.204, 'duration': 4.384}, {'end': 193.573, 'text': 'What you can do is you can divide those 3000 in three batches.', 'start': 189.449, 'duration': 4.124}], 'summary': 'Handling data imbalance through under sampling, over sampling, SMOTE, and
ensemble techniques.', 'duration': 61.353, 'max_score': 67.717, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo67717.jpg'}, {'end': 243.639, 'src': 'embed', 'start': 158.28, 'weight': 0, 'content': [{'end': 168.784, 'text': 'So here you use the k nearest neighbors algorithm and try to produce synthetic samples from your 1000 samples.', 'start': 158.28, 'duration': 10.504}, {'end': 174.505, 'text': "That's why it's called Synthetic Minority Oversampling Technique.", 'start': 170.942, 'duration': 3.563}, {'end': 179.709, 'text': 'And in Python, there is a module called imblearn, which can be used for SMOTE.', 'start': 174.545, 'duration': 5.164}, {'end': 183.403, 'text': 'The fourth technique is ensemble.', 'start': 181.661, 'duration': 1.742}, {'end': 188.588, 'text': "So let's say you have 3000 transactions in one class, 1000 in another.", 'start': 184.204, 'duration': 4.384}, {'end': 193.573, 'text': 'What you can do is you can divide those 3000 in three batches.', 'start': 189.449, 'duration': 4.124}, {'end': 200.781, 'text': 'Take the first batch, combine it with the 1000 red transactions, build a model, call it model number one.', 'start': 194.534, 'duration': 6.247}, {'end': 207.786, 'text': 'Similarly, you take the second and third batch and create models two and three.', 'start': 202.643, 'duration': 5.143}, {'end': 214.85, 'text': 'So now you have three models and you use a majority vote, something like random forest.', 'start': 208.226, 'duration': 6.624}, {'end': 218.392, 'text': "You know, you have a bunch of trees and you're taking just the majority vote.", 'start': 214.89, 'duration': 3.502}, {'end': 221.665, 'text': 'The fifth method is focal loss,', 'start': 219.443, 'duration': 2.222}, {'end': 231.991, 'text': "where it's a special type of loss function which will penalize the majority class and it will give more weightage to the minority class.", 'start': 221.665, 'duration': 10.326}, {'end': 243.639, 'text': "There is this article on Medium, which I'm going to refer to in the video description below, which talks about the math behind focal loss and why it works.", 'start': 233.152, 'duration': 10.487}], 'summary': 'Using k-nearest neighbors, ensemble, and focal loss methods to address an imbalanced dataset in Python.', 'duration': 85.359, 'max_score': 158.28, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo158280.jpg'}], 'start': 45.017, 'title': 'Handling class imbalance in ml', 'summary': 'Discusses techniques for handling class imbalance in machine learning with a focus on oversampling using SMOTE, ensemble methods, and focal loss.
it also mentions the imblearn module in Python for SMOTE, aiming to provide a comprehensive approach to addressing class imbalance in ML scenarios.', 'chapters': [{'end': 129.07, 'start': 45.017, 'title': 'Handling imbalanced data in machine learning', 'summary': 'Discusses techniques for handling imbalanced data in machine learning, including under sampling the majority class and oversampling the minority class, with a fraud detection scenario as an example, aiming to achieve a balanced dataset for training the machine learning model.', 'duration': 84.053, 'highlights': ['Under sampling the majority class involves randomly selecting 1000 samples from 99,000, and combining with 1000 red samples, aiming to address the imbalance in the dataset for training the model.', 'The imbalance scenario is illustrated with a fraud detection example where 99,000 transactions are not fraud and 1,000 transactions are fraud, highlighting the need for handling imbalanced data in machine learning.']}, {'end': 243.639, 'start': 129.07, 'title': 'Handling class imbalance in machine learning', 'summary': 'Discusses techniques for handling class imbalance in machine learning, including oversampling using SMOTE, ensemble methods, and focal loss, with the mention of the imblearn module in Python for SMOTE.', 'duration': 114.569, 'highlights': ['SMOTE: Using the Synthetic Minority Oversampling Technique (SMOTE) with the k nearest neighbors algorithm to create synthetic samples from the minority class, with the mention of the imblearn module in Python for SMOTE.', 'Ensemble Methods: Creating multiple models by dividing the majority class transactions into batches and combining them with the minority class transactions, followed by using a majority vote ensemble method like random forest.', 'Focal Loss: Discussing the use of focal loss, a special type of loss function that penalizes the majority class and gives more weightage to the minority class.']}], 'duration': 198.622, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo45017.jpg', 'highlights': ['SMOTE: Using the Synthetic Minority Oversampling Technique (SMOTE) with the k nearest neighbors algorithm to create synthetic samples from the minority class, with the mention of the imblearn module in Python for SMOTE.', 'Ensemble Methods: Creating multiple models by dividing the majority class transactions into batches and combining them with the minority class transactions, followed by using a majority vote ensemble method like random forest.', 'Focal Loss: Discussing the use of focal loss, a special type of loss function that penalizes the majority class and gives more weightage to the minority class.', 'Under sampling the majority class involves randomly selecting 1000 samples from 99,000, and combining with 1000 red samples, aiming to address the imbalance in the dataset for training the model.', 'The imbalance scenario is illustrated with a fraud detection example where 99,000 transactions are not fraud and 1,000 transactions are fraud, highlighting the need for handling imbalanced data in machine learning.']}, {'end': 813.7, 'segs': [{'end': 272.898, 'src': 'embed', 'start': 246.269, 'weight': 1, 'content': [{'end': 250.41, 'text': 'These are some of the examples of imbalanced classes.', 'start': 246.269, 'duration': 4.141}, {'end': 252.491, 'text': 'Customer churn prediction.', 'start': 251.131, 'duration': 1.36}, {'end': 261.014, 'text': 'Whenever a company is stable and is doing a good service, the churn rate
will be very less.', 'start': 253.411, 'duration': 7.603}, {'end': 263.395, 'text': 'Similarly, device failures.', 'start': 262.134, 'duration': 1.261}, {'end': 271.017, 'text': 'When IoT devices are sending continuous data and if the device is stable enough, the failure rate will be pretty low,', 'start': 263.455, 'duration': 7.562}, {'end': 272.898, 'text': 'and that creates imbalance in your data set.', 'start': 271.017, 'duration': 1.881}], 'summary': 'Imbalanced classes in customer churn and device failure prediction create data set imbalance.', 'duration': 26.629, 'max_score': 246.269, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo246269.jpg'}, {'end': 355.268, 'src': 'embed', 'start': 298.405, 'weight': 0, 'content': [{'end': 305.071, 'text': "When I made this video, a couple of you commented, don't I have to take the imbalance in the target variable into account?", 'start': 298.405, 'duration': 6.666}, {'end': 308.892, 'text': 'The comment came from teaching G.', 'start': 305.851, 'duration': 3.041}, {'end': 313.154, 'text': 'Also basic Nagar raised the concern about the imbalance in the data set.', 'start': 308.892, 'duration': 4.262}, {'end': 322.717, 'text': 'Same thing with a few other people. I know that there is this problem, because if you look at my notebook which I created in that video.', 'start': 313.994, 'duration': 8.723}, {'end': 325.718, 'text': 'And if you look at the.', 'start': 323.917, 'duration': 1.801}, {'end': 329.871, 'text': 'Precision and Recall for Class 1.', 'start': 327.469, 'duration': 2.402}, {'end': 332.993, 'text': 'Class 1 is how many customers are leaving your business.', 'start': 329.871, 'duration': 3.122}, {'end': 338.757, 'text': 'You will see the F1 score for class 1 is very low whereas the F1 score for class 0 is pretty high.', 'start': 333.833, 'duration': 4.924}, {'end': 345.902, 'text': 'The accuracy is 78% but accuracy is kind of useless if your data set is imbalanced.', 'start': 339.237, 'duration': 6.665}, {'end': 349.564, 'text': 'What matters is the F1 score for individual classes.', 'start': 346.402, 'duration': 3.162}, {'end': 355.268, 'text': 'You want the F1 score for the individual classes, 0 and 1, to be higher.', 'start': 350.145, 'duration': 5.123}], 'summary': 'Imbalance in dataset affects accuracy, f1 score crucial for individual classes.', 'duration': 56.863, 'max_score': 298.405, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo298405.jpg'}, {'end': 545.762, 'src': 'embed', 'start': 501.369, 'weight': 2, 'content': [{'end': 508.335, 'text': 'I took class 0 samples into this data frame and class 1 samples into that data frame.', 'start': 501.369, 'duration': 6.966}, {'end': 511.877, 'text': 'OK, and if you look at the shape.', 'start': 508.435, 'duration': 3.442}, {'end': 524.467, 'text': 'Also, this is the shape actually, so this is DF2.', 'start': 521.585, 'duration': 2.882}, {'end': 530.491, 'text': 'okay, so you can see the balance here.', 'start': 527.569, 'duration': 2.922}, {'end': 531.732, 'text': 'imbalance here.', 'start': 530.491, 'duration': 1.241}, {'end': 539.598, 'text': 'one class has 5163, second class has 1869 samples.', 'start': 531.732, 'duration': 7.866}, {'end': 543.7, 'text': 'so now i will under sample this class 0.', 'start': 539.598, 'duration': 4.102}, {'end': 545.762, 'text': 'now how do you under sample it?', 'start': 543.7, 'duration': 2.062}], 'summary': 'Under-sampling class 0 to address imbalance: 5163 vs 1869
samples.', 'duration': 44.393, 'max_score': 501.369, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo501369.jpg'}, {'end': 739.895, 'src': 'embed', 'start': 706.207, 'weight': 4, 'content': [{'end': 715.515, 'text': 'this has an argument called stratify, which will make sure you have balanced samples.', 'start': 706.207, 'duration': 9.308}, {'end': 724.484, 'text': 'you know, okay, now see, okay, let me give more clarification.', 'start': 715.515, 'duration': 8.969}, {'end': 727.606, 'text': 'so when you do stratify is equal to y, this y.', 'start': 724.484, 'duration': 3.122}, {'end': 739.895, 'text': 'is this, okay, and the samples in x train and x test will have balanced samples from the zero and one classes.', 'start': 727.606, 'duration': 12.289}], 'summary': 'Using the stratify argument ensures balanced samples in x train and x test.', 'duration': 33.688, 'max_score': 706.207, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo706207.jpg'}], 'start': 246.269, 'title': 'Addressing imbalanced classes in data analysis', 'summary': 'Discusses the challenges of imbalanced classes in data analysis, citing examples such as customer churn prediction, device failures, and cancer prediction. it emphasizes the need to address imbalance in the target variable. additionally, it demonstrates the process of under sampling to balance the classes, resulting in an equal number of samples from both classes and improving the f1 score for class 1 from 0.53 to a higher value.', 'chapters': [{'end': 325.718, 'start': 246.269, 'title': 'Imbalanced classes in data analysis', 'summary': 'Discusses imbalanced classes in data analysis, citing examples such as customer churn prediction, device failures, and cancer prediction, and emphasizes the need to address imbalance in the target variable.', 'duration': 79.449, 'highlights': ['The chapter discusses examples of imbalanced classes, including customer churn prediction, device failures, and cancer prediction, highlighting the imbalance in the data sets. (Relevance Score: 5)', 'The need to address imbalance in the target variable is emphasized, as highlighted by comments from viewers regarding the imbalance in the data set and the use of a notebook from a previous tutorial for predicting customer churn. (Relevance Score: 4)', 'The speaker mentions using Python coding and refers to a specific tutorial number 18 in a deep learning series for predicting customer churn.
(Relevance Score: 3)']}, {'end': 813.7, 'start': 327.469, 'title': 'Improving f1 score with under sampling', 'summary': 'Discusses the importance of f1 score for individual classes in an imbalanced dataset and demonstrates the process of under sampling to balance the classes, resulting in an equal number of samples from both classes and ensuring a balanced distribution in the training and testing data, ultimately improving the f1 score for class 1 from 0.53 to an unspecified higher value.', 'duration': 486.231, 'highlights': ['The F1 score for class 1 is initially 0.53, whereas for class 0, it is 85%.', 'The imbalance in the dataset is evident with 1033 samples for class 0 and 374 samples for class 1.', 'Under sampling is performed to balance the classes, resulting in 1869 samples for both class 0 and class 1.', 'The stratify argument is used in the train-test split to ensure balanced samples from both classes in the training and testing data.', 'The F1 score for class 1 is improved through under sampling, although the exact value is not specified.']}], 'duration': 567.431, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo246269.jpg', 'highlights': ['The need to address imbalance in the target variable is emphasized, as highlighted by comments from viewers regarding the imbalance in the data set and the use of a notebook from a previous tutorial for predicting customer churn. (Relevance Score: 4)', 'The chapter discusses examples of imbalanced classes, including customer churn prediction, device failures, and cancer prediction, highlighting the imbalance in the data sets. (Relevance Score: 5)', 'Under sampling is performed to balance the classes, resulting in 1869 samples for both class 0 and class 1. (Relevance Score: 3)', 'The F1 score for class 1 is initially 0.53, whereas for class 0, it is 85%. (Relevance Score: 2)', 'The stratify argument is used in the train-test split to ensure balanced samples from both classes in the training and testing data. 
(Relevance Score: 1)']}, {'end': 1133.782, 'segs': [{'end': 894.412, 'src': 'embed', 'start': 814.681, 'weight': 0, 'content': [{'end': 819.304, 'text': 'You see that my precision and recall is improved.', 'start': 814.681, 'duration': 4.623}, {'end': 823.527, 'text': 'Precision and recall is improved to this.', 'start': 820.685, 'duration': 2.842}, {'end': 824.988, 'text': 'So let me do this.', 'start': 824.127, 'duration': 0.861}, {'end': 826.929, 'text': 'I like using the snipping tool.', 'start': 825.328, 'duration': 1.601}, {'end': 834.754, 'text': "I'm going to take the numbers from imbalance classifier.", 'start': 827.87, 'duration': 6.884}, {'end': 836.876, 'text': 'So this was an imbalance classifier.', 'start': 835.195, 'duration': 1.681}, {'end': 844.101, 'text': 'and that you compare it here.', 'start': 839.497, 'duration': 4.604}, {'end': 850.728, 'text': 'you see so in the imbalance classifier, my precision was 0.63.', 'start': 844.101, 'duration': 6.627}, {'end': 857.034, 'text': 'recall was this and my f1 score was 0.53, which was very low.', 'start': 850.728, 'duration': 6.306}, {'end': 859.816, 'text': 'it improved here 71, pretty good.', 'start': 857.034, 'duration': 2.782}, {'end': 870.887, 'text': "For class 0 from 86 it dropped to 72, but that's okay, because now you are doing a fair treatment for minority and majority class.", 'start': 861.883, 'duration': 9.004}, {'end': 877.509, 'text': "Now let's look at second method which is oversampling.", 'start': 874.288, 'duration': 3.221}, {'end': 886.767, 'text': 'So again I am going to print class count 0 and 1.', 'start': 878.61, 'duration': 8.157}, {'end': 890.129, 'text': 'So 0 has more samples, 1 has less samples.', 'start': 886.767, 'duration': 3.362}, {'end': 894.412, 'text': "So this one data frame that I have, I'm going to oversample.", 'start': 890.169, 'duration': 4.243}], 'summary': 'Precision and recall improved in imbalance classifier, with precision increasing from 0.63 to 0.71, and f1 score improving from 0.53 to 0.71, while oversampling addressed class imbalance by increasing the minority class samples.', 'duration': 79.731, 'max_score': 814.681, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo814681.jpg'}, {'end': 1002.503, 'src': 'embed', 'start': 930.873, 'weight': 4, 'content': [{'end': 936.636, 'text': 'it picked up random samples and copied and somehow created this 2000 samples.', 'start': 930.873, 'duration': 5.763}, {'end': 942.738, 'text': 'So here what we want is class 0.', 'start': 937.376, 'duration': 5.362}, {'end': 946.06, 'text': 'That way I can have 5163 samples in my class 1 as well.', 'start': 942.738, 'duration': 3.322}, {'end': 958.551, 'text': 'and that i am going to store in a variable called df class 1.', 'start': 950.146, 'duration': 8.405}, {'end': 963.253, 'text': 'over means over sampling.', 'start': 958.551, 'duration': 4.702}, {'end': 969.617, 'text': 'okay, and let me just quickly print the shape.', 'start': 963.253, 'duration': 6.364}, {'end': 978.301, 'text': 'you see, now i have this data frame and i have another data frame called df class 0.', 'start': 969.617, 'duration': 8.684}, {'end': 981.827, 'text': 'okay, these two data frame.', 'start': 978.301, 'duration': 3.526}, {'end': 984.669, 'text': 'i have these two data frame.', 'start': 981.827, 'duration': 2.842}, {'end': 987.992, 'text': 'i want to join them and create a one data frame.', 'start': 984.669, 'duration': 3.323}, {'end': 989.633, 'text': 'what is the 
function for that?', 'start': 987.992, 'duration': 1.641}, {'end': 992.856, 'text': 'well, pd dot concat.', 'start': 989.633, 'duration': 3.223}, {'end': 998.3, 'text': "if you've seen my pandas tutorial playlist, you will get an idea of what this is.", 'start': 992.856, 'duration': 5.444}, {'end': 1002.503, 'text': 'so it is just concatenating two data frames and creating a new data frame.', 'start': 998.3, 'duration': 4.203}], 'summary': 'Performed oversampling to create 2000 samples, with 5163 in class 1 and used pd.concat to join data frames.', 'duration': 71.63, 'max_score': 930.873, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo930873.jpg'}, {'end': 1085.085, 'src': 'embed', 'start': 1047.701, 'weight': 5, 'content': [{'end': 1049.142, 'text': 'Your Y is churn.', 'start': 1047.701, 'duration': 1.441}, {'end': 1053.025, 'text': 'And then again, you do train test split.', 'start': 1049.983, 'duration': 3.042}, {'end': 1054.566, 'text': 'Same as this.', 'start': 1053.626, 'duration': 0.94}, {'end': 1056.868, 'text': "You know, I'm just doing copy paste.", 'start': 1054.606, 'duration': 2.262}, {'end': 1060.031, 'text': 'Copy paste is your best friend.', 'start': 1057.809, 'duration': 2.222}, {'end': 1065.312, 'text': 'the best programmer knows how to do copy paste.', 'start': 1061.67, 'duration': 3.642}, {'end': 1079.402, 'text': "okay, so now, when I specify stratify is equal to y, I'm making sure in my train and test the class distribution is equal.", 'start': 1065.312, 'duration': 14.09}, {'end': 1085.085, 'text': 'so y train value count is this; if you look at y test value count, that is also balanced.', 'start': 1079.402, 'duration': 5.683}], 'summary': 'Using stratify=y ensures equal class distribution in train and test data.', 'duration': 37.384, 'max_score': 1047.701, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo1047701.jpg'}, {'end': 1133.782, 'src': 'embed', 'start': 1108.888, 'weight': 8, 'content': [{'end': 1118.211, 'text': "but when I'm trying several different techniques for handling imbalanced data, I am supplying different values of X train, Y train and so on.", 'start': 1108.888, 'duration': 9.323}, {'end': 1121.572, 'text': "that's why I wrapped all this code into one function.", 'start': 1118.211, 'duration': 3.361}, {'end': 1125.295, 'text': 'So now see, I am just calling one method and it kind of works.', 'start': 1122.152, 'duration': 3.143}, {'end': 1128.597, 'text': 'Alright. It is training.', 'start': 1125.695, 'duration': 2.902}, {'end': 1133.782, 'text': 'It will take some time based on what kind of hardware you have.', 'start': 1130.859, 'duration': 2.923}], 'summary': 'Testing different techniques for handling imbalanced data, wrapping code into one function, and training process duration based on hardware.', 'duration': 24.894, 'max_score': 1108.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo1108888.jpg'}], 'start': 814.681, 'title': 'Improving imbalanced data handling', 'summary': 'Discusses the improvement in precision and recall using an imbalance classifier and oversampling, resulting in a precision increase from 0.63 to 0.71 and a minority class sample count of 5163.
it also covers techniques in python, including data frame concatenation, oversampling, stratified train-test split, and creating a tensorflow model, leading to a balanced class distribution for training and testing.', 'chapters': [{'end': 978.301, 'start': 814.681, 'title': 'Improving precision and recall with imbalance classifier and oversampling', 'summary': 'Discusses the improvement in precision and recall by comparing an imbalance classifier with an oversampling method, where precision improved from 0.63 to 0.71 and the minority class was fairly treated through oversampling, resulting in a class 1 sample count of 5163.', 'duration': 163.62, 'highlights': ['Precision improved from 0.63 to 0.71, indicating a significant enhancement in correctly predicted positive cases.', 'Oversampling resulted in a class 1 sample count of 5163, demonstrating the effectiveness of the method in addressing class imbalance.', 'The minority class was fairly treated through oversampling, ensuring a more balanced representation of classes in the dataset.', 'The imbalance classifier had a recall of 72 for class 0, showcasing a positive impact on correctly identified negative cases.', 'The f1 score improved from 0.53 to an undisclosed value, indicating an overall enhancement in model performance.']}, {'end': 1133.782, 'start': 978.301, 'title': 'Handling imbalanced data in python', 'summary': 'Covers the process of handling imbalanced data in python using techniques such as concatenating data frames, oversampling, stratified train test split, and creating a tensorflow model, resulting in a balanced class distribution for training and testing.', 'duration': 155.481, 'highlights': ['The function pd.concat is used to join two data frames into one, resulting in a new data frame with a shape of 5163x2.', 'Oversampling is performed to ensure that both classes (1 and 0) have 5163 samples, creating a balanced data frame.', 'Stratified train test split is utilized to ensure equal class distribution in the training and testing data sets, as evidenced by the uniform value counts of y train and y test.', 'A function is created to handle imbalanced data and train a tensorflow model, allowing for easy testing of different techniques and data sets.']}], 'duration': 319.101, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo814681.jpg', 'highlights': ['Precision improved from 0.63 to 0.71, indicating a significant enhancement in correctly predicted positive cases.', 'Oversampling resulted in a class 1 sample count of 5163, demonstrating the effectiveness of the method in addressing class imbalance.', 'The imbalance classifier had a recall of 72 for class 0, showcasing a positive impact on correctly identified negative cases.', 'The minority class was fairly treated through oversampling, ensuring a more balanced representation of classes in the dataset.', 'The function pd.concat is used to join two data frames into one, resulting in a new data frame with a shape of 5163x2.', 'Stratified train test split is utilized to ensure equal class distribution in the training and testing data sets, as evidenced by the uniform value counts of y train and y test.', 'The f1 score improved from 0.53 to an undisclosed value, indicating an overall enhancement in model performance.', 'Oversampling is performed to ensure that both classes (1 and 0) have 5163 samples, creating a balanced data frame.', 'A function is created to handle imbalanced data and train a tensorflow model, allowing for 
easy testing of different techniques and data sets.']}, {'end': 1854.972, 'segs': [{'end': 1162.769, 'src': 'embed', 'start': 1134.222, 'weight': 0, 'content': [{'end': 1137.385, 'text': "But eventually, don't forget the score bar by the way.", 'start': 1134.222, 'duration': 3.163}, {'end': 1143.329, 'text': "Because some people will comment, oh I don't see the classification score.", 'start': 1138.125, 'duration': 5.204}, {'end': 1145.271, 'text': 'Well dig deeper dude.', 'start': 1143.97, 'duration': 1.301}, {'end': 1151.943, 'text': 'Your F1 score for class 1 is improved to 79%.', 'start': 1146.332, 'duration': 5.611}, {'end': 1155.245, 'text': 'Remember what it was in our original class? It was 0.53.', 'start': 1151.943, 'duration': 3.302}, {'end': 1155.945, 'text': 'See, 0.53 to 0.79.', 'start': 1155.245, 'duration': 0.7}, {'end': 1158.567, 'text': 'The F1 score for the 0th class reduced from 86 to 76.', 'start': 1155.945, 'duration': 2.622}, {'end': 1159.608, 'text': "But again, it's fine.", 'start': 1158.567, 'duration': 1.041}, {'end': 1162.769, 'text': 'Because now you are giving a fair treatment to both of the classes.', 'start': 1159.628, 'duration': 3.141}], 'summary': 'F1 score for class 1 improved to 79% from 0.53, while 0th class reduced from 86 to 76.', 'duration': 28.547, 'max_score': 1134.222, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo1134222.jpg'}, {'end': 1223.103, 'src': 'embed', 'start': 1197.493, 'weight': 2, 'content': [{'end': 1206.247, 'text': 'In this method, when you do .sample, this is just blindly copying your current samples and creating new samples.', 'start': 1197.493, 'duration': 8.754}, {'end': 1207.128, 'text': "so it's just a copy.", 'start': 1206.247, 'duration': 0.881}, {'end': 1209.07, 'text': "it's not a perfect method.", 'start': 1207.128, 'duration': 1.942}, {'end': 1218.179, 'text': 'SMOTE is a little bit better because you are creating new samples out of your current samples and it uses the k nearest neighbor algorithm for that.', 'start': 1209.07, 'duration': 9.109}, {'end': 1221.182, 'text': "I'm not going to go into detailed math.", 'start': 1218.94, 'duration': 2.242}, {'end': 1223.103, 'text': 'You can just Google about SMOTE.', 'start': 1221.542, 'duration': 1.561}], 'summary': 'SMOTE creates new samples using the k-nearest neighbor algorithm, better than blindly copying samples.', 'duration': 25.61, 'max_score': 1197.493, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo1197493.jpg'}, {'end': 1474.502, 'src': 'embed', 'start': 1434.296, 'weight': 1, 'content': [{'end': 1440.845, 'text': "yeah, when you don't specify anything, the batch size is 32.", 'start': 1434.296, 'duration': 6.549}, {'end': 1452.055, 'text': "okay, all right, my training is over and let's see the score.", 'start': 1440.845, 'duration': 11.21}, {'end': 1455.238, 'text': 'oh see, 81 everywhere.', 'start': 1452.055, 'duration': 3.183}, {'end': 1457.3, 'text': 'it is pretty good now.', 'start': 1455.238, 'duration': 2.062}, {'end': 1462.625, 'text': 'so from 53 percent, my f1 score improved to 81 percent.', 'start': 1457.3, 'duration': 5.325}, {'end': 1463.266, 'text': 'hooray, party.', 'start': 1462.625, 'duration': 0.641}, {'end': 1471.021, 'text': 'The fourth method is using ensemble with undersampling.', 'start': 1467.119, 'duration': 3.902}, {'end': 1474.502, 'text': 'So again, I am taking my original data frame, which is df2.', 'start': 1471.681,
'duration': 2.821}], 'summary': 'Improved f1 score from 53% to 81% using SMOTE; ensemble with undersampling is introduced next.', 'duration': 40.206, 'max_score': 1434.296, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo1434296.jpg'}, {'end': 1854.972, 'src': 'embed', 'start': 1823.11, 'weight': 3, 'content': [{'end': 1827.513, 'text': 'So this is a majority class, minority class, 0 to 1495.', 'start': 1823.11, 'duration': 4.403}, {'end': 1835.218, 'text': 'Why 1495? Because of the total samples, the rough ratio is 1 to 3.', 'start': 1827.513, 'duration': 7.705}, {'end': 1840.041, 'text': "So I'm doing 1495, 1495 and whatever is remaining will be your third batch.", 'start': 1835.218, 'duration': 4.823}, {'end': 1850.188, 'text': "Okay, so once you do that, let's do.", 'start': 1842.623, 'duration': 7.565}, {'end': 1854.972, 'text': 'See, 2990. So this is working fine.', 'start': 1852.19, 'duration': 2.782}], 'summary': 'Data is split into majority class batches at a rough 1:3 minority-to-majority ratio, with 1495 samples each; the split at 2990 works fine.', 'duration': 31.862, 'max_score': 1823.11, 'thumbnail':
the method of smote is highlighted as a better approach for balancing classes in the dataset, resulting in an equal number of samples for both classes.', 'duration': 265.151, 'highlights': ['The F1 score for class 1 improved from 0.53 to 0.79, demonstrating a significant enhancement in performance.', 'The method of SMOTE is presented as a better approach for balancing classes in the dataset, resulting in an equal number of samples for both classes, ensuring a fair treatment to both classes.', 'The F1 score for the 0th class reduced from 86 to 76, indicating a decrease in performance for this specific class.', "The overall accuracy remained consistent at 78%, highlighting the stability of the model's performance after the sampling method implementation."]}, {'end': 1854.972, 'start': 1399.554, 'title': 'Implementing ensemble with undersampling', 'summary': 'Discusses implementing ensemble with undersampling to address class imbalance, dividing the majority class into three batches of 1495 samples each and combining them with the minority class to create a new data frame, resulting in improved f1 score from 53% to 81%.', 'duration': 455.418, 'highlights': ['The F1 score improved from 53% to 81% after implementing ensemble with undersampling, addressing class imbalance.', 'The majority class was divided into three batches, each containing 1495 samples, to rectify the class imbalance.', 'The new data frame, created by combining the three batches of the majority class with the minority class, resulted in an improved F1 score.', 'A function was utilized to handle the creation of three batches from the majority class, simplifying the process of addressing class imbalance.']}], 'duration': 720.75, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo1134222.jpg', 'highlights': ['The F1 score for class 1 improved from 0.53 to 0.79, demonstrating a significant enhancement in performance.', 'The F1 score improved from 53% to 81% after implementing ensemble with undersampling, addressing class imbalance.', 'The method of SMOTE is presented as a better approach for balancing classes in the dataset, resulting in an equal number of samples for both classes, ensuring a fair treatment to both classes.', 'The majority class was divided into three batches, each containing 1495 samples, to rectify the class imbalance.']}, {'end': 2073.318, 'segs': [{'end': 1917.354, 'src': 'embed', 'start': 1887.517, 'weight': 0, 'content': [{'end': 1889.938, 'text': 'We are creating three models and taking the average.', 'start': 1887.517, 'duration': 2.421}, {'end': 1894.78, 'text': 'So the second model will be 149 to 2990 and third will be 2990 to remaining.', 'start': 1890.859, 'duration': 3.921}, {'end': 1904.188, 'text': 'Okay, so all three of my models are trained.', 'start': 1901.047, 'duration': 3.141}, {'end': 1907.51, 'text': 'My third model was 2900 to 1430.', 'start': 1904.269, 'duration': 3.241}, {'end': 1910.791, 'text': 'I have three individual model with three prediction.', 'start': 1907.51, 'duration': 3.281}, {'end': 1913.613, 'text': "Okay, don't look at their individual F1 score yet.", 'start': 1911.092, 'duration': 2.521}, {'end': 1917.354, 'text': 'I have Y prediction, Y prediction, one, two, three.', 'start': 1914.433, 'duration': 2.921}], 'summary': 'Three models created for predictions, with f1 scores pending.', 'duration': 29.837, 'max_score': 1887.517, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo1887517.jpg'}, {'end': 2033.148, 'src': 'embed', 'start': 1950.549, 'weight': 1, 'content': [{'end': 1960.633, 'text': 'If you have this kind of case, and if you just do addition, what do you get? 1.', 'start': 1950.549, 'duration': 10.084}, {'end': 1966.415, 'text': 'When you get 1, it means your majority vote is 0.', 'start': 1960.633, 'duration': 5.782}, {'end': 1971.576, 'text': 'If you get 2, majority vote is 1.', 'start': 1966.415, 'duration': 5.161}, {'end': 1973.777, 'text': 'If you get 3, then also majority vote is 1.', 'start': 1971.576, 'duration': 2.201}, {'end': 1980.051, 'text': 'So what is our logic? Anything greater than 1 means 1.', 'start': 1973.777, 'duration': 6.274}, {'end': 1983.152, 'text': "Okay, I think that's pretty straightforward.", 'start': 1980.051, 'duration': 3.101}, {'end': 1995.237, 'text': "Okay, now let's do the length of, so we got YPRED1, YPRED2, YPRED3.", 'start': 1983.612, 'duration': 11.625}, {'end': 1996.977, 'text': 'These are the lengths.', 'start': 1996.277, 'duration': 0.7}, {'end': 2001.239, 'text': 'And I want to create a final prediction.', 'start': 1998.118, 'duration': 3.121}, {'end': 2003.48, 'text': 'So this final prediction is kind of like a union.', 'start': 2001.259, 'duration': 2.221}, {'end': 2006.721, 'text': "It's like not a union, basically a majority vote.", 'start': 2003.5, 'duration': 3.221}, {'end': 2011.465, 'text': 'So, I will just copy from yp1, I will just copy this.', 'start': 2007.641, 'duration': 3.824}, {'end': 2027.419, 'text': 'So, I will create a new numpy array and then I will go through all the samples in yp1 and I will do this.', 'start': 2011.645, 'duration': 15.774}, {'end': 2033.148, 'text': 'So what is this? These are like individual votes.', 'start': 2031.127, 'duration': 2.021}], 'summary': 'Determining majority vote based on addition: 1=0, 2 or 3=1. Creating final prediction by copying from ypred1 and applying majority vote logic.', 'duration': 82.599, 'max_score': 1950.549, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo1950549.jpg'}], 'start': 1856.093, 'title': 'Neural network model and majority vote for predictions', 'summary': 'Discusses creating three neural network models, training them, and obtaining the predictions, with a focus on the process and the f1 scores.
it also covers the process of determining the majority vote for predictions and creating a final prediction based on this logic and individual predictions.', 'chapters': [{'end': 1917.354, 'start': 1856.093, 'title': 'Neural network model and predictions', 'summary': 'Discusses creating three neural network models, training them, and obtaining the predictions, with a focus on the process and the f1 scores.', 'duration': 61.261, 'highlights': ['The process involves creating three neural network models, with the second model covering the range 1495 to 2990, and the third model covering 2990 to the remaining data.', 'The predictions from the three models are stored in prediction 1, prediction 2, and prediction 3, with a mention of the F1 scores not being impressive.', 'The chapter emphasizes the training of three individual models and the need to look at their combined predictions, setting the stage for further analysis and improvement.']}, {'end': 2073.318, 'start': 1918.175, 'title': 'Majority vote for predictions', 'summary': 'Discusses the process of determining the majority vote for predictions, using a logic where anything greater than 1 is considered as 1, and then creating a final prediction based on this logic and individual predictions.', 'duration': 155.143, 'highlights': ['The majority vote is determined by the logic that anything greater than 1 is considered as 1, and a final prediction is made based on this logic and the individual predictions.', 'The process involves creating a new numpy array and conducting a majority vote between YPRED1, YPRED2, and YPRED3 to generate the final prediction.', 'The chapter explains the method of determining the majority vote by adding up the individual predictions and applying the logic that anything greater than 1 is considered as 1.']}], 'duration': 217.225, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo1856093.jpg', 'highlights': ['The process involves creating three neural network models, with the second model covering the range 1495 to 2990, and the third model covering 2990 to the remaining data.', 'The majority vote is determined by the logic that anything greater than 1 is considered as 1, and a final prediction is made based on this logic and the individual predictions.', 'The predictions from the three models are stored in prediction 1, prediction 2, and prediction 3, with a mention of the F1 scores not being impressive.', 'The process involves creating a new numpy array and conducting a majority vote between YPRED1, YPRED2, and YPRED3 to generate the final prediction.', 'The chapter emphasizes the training of three individual models and the need to look at their combined predictions, setting the stage for further analysis and improvement.']}, {'end': 2304.732, 'segs': [{'end': 2163.504, 'src': 'embed', 'start': 2073.318, 'weight': 0, 'content': [{'end': 2078.806, 'text': 'okay, now i can print my classification report.', 'start': 2073.318, 'duration': 5.488}, {'end': 2088.379, 'text': 'you know, with classification report you have to call print and then it does pretty formatting.', 'start': 2083.475, 'duration': 4.904}, {'end': 2091.081, 'text': 'so you still see the score is improved.', 'start': 2088.379, 'duration': 2.702}, {'end': 2093.803, 'text': 'it did not improve that much.', 'start': 2091.081, 'duration': 2.722}, {'end': 2097.887, 'text': 'it went from 53 to 60%.', 'start': 2093.803, 'duration': 4.084}, {'end': 2102.381, 'text': "so when you're trying all these
techniques, It's more.", 'start': 2097.887, 'duration': 4.494}, {'end': 2106.864, 'text': 'see, machine learning is more like art and just trying things out.', 'start': 2102.381, 'duration': 4.483}, {'end': 2110.406, 'text': 'There is no sure sort that OK you try ensemble.', 'start': 2107.544, 'duration': 2.862}, {'end': 2113.128, 'text': 'You will surely get high prediction.', 'start': 2111.206, 'duration': 1.922}, {'end': 2120.692, 'text': "There is no guarantee you have to try different methods, so we try different methods and I think it's more to work the best.", 'start': 2114.588, 'duration': 6.104}, {'end': 2124.254, 'text': "The ensemble did not work best and that's OK.", 'start': 2121.752, 'duration': 2.502}, {'end': 2133.131, 'text': "I also tried focal loss and I don't have a code here, but I tried it and it did not work.", 'start': 2126.605, 'duration': 6.526}, {'end': 2135.814, 'text': 'Actually, it reduced the F1 score.', 'start': 2133.211, 'duration': 2.603}, {'end': 2139.417, 'text': "I don't know why, but based on the different scenarios,", 'start': 2135.854, 'duration': 3.563}, {'end': 2147.104, 'text': 'you can try all the five techniques which I discussed in the presentation and see whatever works best for you.', 'start': 2139.417, 'duration': 7.687}, {'end': 2153.015, 'text': 'Now comes the most interesting part of this tutorial which is an exercise.', 'start': 2148.812, 'duration': 4.203}, {'end': 2157.619, 'text': 'You have to do exercise otherwise you will not be able to learn.', 'start': 2153.416, 'duration': 4.203}, {'end': 2163.504, 'text': 'Simple In this exercise you will use the notebook which I showed in this video.', 'start': 2157.939, 'duration': 5.565}], 'summary': 'Classification report shows improved score, from 53% to 60%. 
various techniques tried, ensemble and focal loss did not work well.', 'duration': 90.186, 'max_score': 2073.318, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo2073318.jpg'}, {'end': 2248.952, 'src': 'embed', 'start': 2220.15, 'weight': 1, 'content': [{'end': 2226.275, 'text': 'click on that link only after you have tried this on your own.', 'start': 2220.15, 'duration': 6.125}, {'end': 2231.68, 'text': 'the second exercise is to use the bank customer churn prediction data set.', 'start': 2226.275, 'duration': 5.405}, {'end': 2235.243, 'text': 'this has a 90 percent to 10 percent imbalance.', 'start': 2231.68, 'duration': 3.563}, {'end': 2240.887, 'text': 'first build a deep learning model and see how your f1 score looks.', 'start': 2236.124, 'duration': 4.763}, {'end': 2248.952, 'text': 'then do the analysis of your classification report and then improve that using, again, all these different techniques.', 'start': 2240.887, 'duration': 8.065}], 'summary': 'Use bank customer churn prediction data with 90-10% imbalance to build and improve a deep learning model.', 'duration': 28.802, 'max_score': 2220.15, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo2220150.jpg'}, {'end': 2304.372, 'src': 'embed', 'start': 2275.881, 'weight': 6, 'content': [{'end': 2278.342, 'text': 'if you do, please give it a thumbs up.', 'start': 2275.881, 'duration': 2.461}, {'end': 2281.222, 'text': "i'm putting a lot of effort in creating this series.", 'start': 2278.342, 'duration': 2.88}, {'end': 2282.943, 'text': 'see, working late at night right now.', 'start': 2281.222, 'duration': 1.721}, {'end': 2294.027, 'text': 'So if you can share this content with other people through WhatsApp, Facebook, whatever the medium, it will help so many people,', 'start': 2284.063, 'duration': 9.964}, {'end': 2300.49, 'text': "because I'm putting all my knowledge, all my experience into this and these videos.", 'start': 2294.027, 'duration': 6.463}, {'end': 2302.711, 'text': 'I hope they are helping you.', 'start': 2300.49, 'duration': 2.221}, {'end': 2304.372, 'text': 'Alright, I will see you in the next video.', 'start': 2302.711, 'duration': 1.661}], 'summary': 'Creating series to help people, asking for support and sharing, putting in a lot of effort.', 'duration': 28.491, 'max_score': 2275.881, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo2275881.jpg'}], 'start': 2073.318, 'title': 'Improving model accuracy and handling imbalanced data', 'summary': 'Discusses experimenting with different techniques to improve model accuracy, resulting in a classification score increase from 53% to 60%.
it also introduces an exercise on handling imbalanced data using multiple techniques and a secondary exercise on bank customer churn prediction data set for deep learning model building and improvement.', 'chapters': [{'end': 2135.814, 'start': 2073.318, 'title': 'Improving model accuracy with various techniques', 'summary': 'Discusses experimenting with different techniques to improve model accuracy, with the classification score increasing from 53% to 60%, emphasizing the trial-and-error nature of machine learning.', 'duration': 62.496, 'highlights': ['The classification score improved from 53% to 60% after trying various techniques.', 'Machine learning is an iterative and experimental process, requiring the trial of different methods for optimal results.', 'Focal loss technique was attempted but resulted in a reduction in the F1 score.']}, {'end': 2304.732, 'start': 2135.854, 'title': 'Imbalanced data techniques exercise', 'summary': 'Introduces an exercise on handling imbalanced data using five different techniques, including neural network, logistic regression, decision tree, and support vector machine, aiming to improve f1 score, and a secondary exercise on bank customer churn prediction data set for deep learning model building and improvement.', 'duration': 168.878, 'highlights': ['The exercise involves handling imbalanced data using five different techniques, including neural network, logistic regression, decision tree, and support vector machine, to improve F1 score.', 'The secondary exercise focuses on building a deep learning model for bank customer churn prediction data set and improving the F1 score using various techniques.', 'The chapter emphasizes the importance of practicing the exercises independently before referring to the solution link to understand and improve skills.', 'The speaker encourages viewers to share the content to help others, highlighting the effort and knowledge invested in creating the tutorial series.', 'The speaker requests viewers to give feedback and share the content, indicating the dedication and late-night efforts put into creating the tutorial series.']}], 'duration': 231.414, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/JnlM4yLFNuo/pics/JnlM4yLFNuo2073318.jpg', 'highlights': ['The classification score improved from 53% to 60% after trying various techniques.', 'The exercise involves handling imbalanced data using five different techniques to improve F1 score.', 'The secondary exercise focuses on building a deep learning model for bank customer churn prediction data set.', 'Machine learning is an iterative and experimental process, requiring the trial of different methods for optimal results.', 'The chapter emphasizes the importance of practicing the exercises independently before referring to the solution link.', 'Focal loss technique was attempted but resulted in a reduction in the F1 score.', 'The speaker encourages viewers to share the content to help others, highlighting the effort and knowledge invested in creating the tutorial series.', 'The speaker requests viewers to give feedback and share the content, indicating the dedication and late-night efforts put into creating the tutorial series.']}], 'highlights': ['Precision increased from 0.63 to 0.71, indicating significant enhancement', 'F1 score improved from 53% to 81% after oversampling with SMOTE', 'SMOTE presented as a better approach for balancing classes, ensuring fair treatment', 'Oversampling resulted in a class 1 sample count of
5163, demonstrating effectiveness', 'Training sets for fraud transactions often have 10,000 good transactions and only one fraudulent transaction, creating an imbalance', 'Ensemble Methods: Creating multiple models by dividing majority class transactions into batches and combining with minority class transactions', 'The imbalance in fraud detection data causes models to have high accuracy despite performing poorly in identifying fraud', 'The need to address imbalance in the target variable is emphasized, as highlighted by comments from viewers', 'The chapter discusses examples of imbalanced classes, including customer churn prediction, device failures, and cancer prediction', 'The imbalance scenario is illustrated with a fraud detection example where 99,000 transactions are not fraud and 1,000 transactions are fraud']}
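
code sketches

The sketches below illustrate, in minimal Python, each technique covered in the detail above. They are not the video's notebook code (see the Code link in the description for that). The synthetic DataFrame, the 'Churn' column name, and the feature columns are stand-in assumptions mirroring the customer churn example used in the video; only the 5163/1869 class counts come from the transcript.

# Shared setup for all sketches below: a synthetic stand-in for the churn
# DataFrame df2 used in the video. Any DataFrame with numeric features and
# a binary target column behaves the same way.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n0, n1 = 5163, 1869  # class counts printed in the video
df2 = pd.DataFrame({
    'tenure': rng.normal(30.0, 10.0, n0 + n1),
    'MonthlyCharges': rng.normal(65.0, 20.0, n0 + n1),
    'Churn': np.array([0] * n0 + [1] * n1),
})
print(df2.Churn.value_counts())  # 0: 5163, 1: 1869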
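The opening point, that accuracy is a misleading metric on imbalanced data, can be reproduced with the "stupid function" from the transcript: on a 99:1 fraud dataset, always predicting not-fraud scores 99% accuracy while catching zero frauds.

# Accuracy trap: a predictor that always returns 0 ("not fraud").
import numpy as np

y_true = np.array([0] * 99000 + [1] * 1000)  # 99,000 good, 1,000 fraud
y_pred = np.zeros_like(y_true)               # always predict "not fraud"
print((y_pred == y_true).mean())             # 0.99 accuracy
print(y_pred[y_true == 1].mean())            # 0.0 recall on the fraud class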
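Technique 1, undersampling the majority class: a sketch against the shared df2 above. pandas' sample does the random pick, and the discarded majority rows are simply never used.

# Undersampling: keep only as many class-0 rows as there are class-1 rows,
# discard the rest, then combine the two frames into one balanced frame.
df_class_0 = df2[df2.Churn == 0]
df_class_1 = df2[df2.Churn == 1]

df_class_0_under = df_class_0.sample(len(df_class_1), random_state=42)
df_under = pd.concat([df_class_0_under, df_class_1], axis=0)
print(df_under.Churn.value_counts())  # 1869 rows in each class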
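Technique 2, oversampling by blind copy (continuing from the sketch above). Sampling the minority frame with replace=True just duplicates existing rows until both classes reach 5163, and pd.concat joins the two frames, exactly as the transcript describes.

# Oversampling (blind copy): duplicate minority rows with replacement.
df_class_1_over = df_class_1.sample(len(df_class_0), replace=True, random_state=42)
df_over = pd.concat([df_class_0, df_class_1_over], axis=0)
print(df_over.Churn.value_counts())  # 5163 rows in each class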
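The stratify argument discussed in the transcript: passing stratify=y to scikit-learn's train_test_split preserves the class ratio in both splits, which is why the y train and y test value counts come out uniform.

# Stratified split: the 0/1 proportions of y are kept in train and test.
from sklearn.model_selection import train_test_split

X = df_over.drop('Churn', axis=1)
y = df_over['Churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=15)
print(y_train.value_counts())  # balanced
print(y_test.value_counts())   # balanced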
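The transcript also mentions wrapping the model-building and training code into one function so every technique can be evaluated the same way. Below is a sketch of such a helper; the layer sizes and epoch count are assumptions, not the notebook's exact architecture.

# Train a small Keras ANN and print per-class precision/recall/F1.
import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import classification_report

def ANN(X_train, y_train, X_test, y_test, epochs=100):
    n_features = X_train.shape[1]
    model = keras.Sequential([
        keras.layers.Dense(n_features, input_shape=(n_features,), activation='relu'),
        keras.layers.Dense(15, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=epochs, verbose=0)  # batch size defaults to 32
    y_pred = (model.predict(X_test).reshape(-1) > 0.5).astype(int)
    print(classification_report(y_test, y_pred))  # the report the video reads off
    return y_pred

Calling ANN(X_train, y_train, X_test, y_test) on each resampled variant reproduces the workflow behind the class-1 F1 scores quoted above (0.53 baseline, then 0.71, 0.79, and 0.81 after the various techniques).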
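Technique 3, SMOTE via the imbalanced-learn package (pip install imbalanced-learn; the import name is imblearn, which the transcript garbles as "imb-learn"). fit_resample synthesizes new minority rows by interpolating between k-nearest neighbours instead of copying them.

# SMOTE: synthesize minority samples until both classes reach 5163.
from imblearn.over_sampling import SMOTE

X = df2.drop('Churn', axis=1)
y = df2['Churn']
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)
print(pd.Series(y_sm).value_counts())  # 0: 5163, 1: 5163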
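Technique 4 plus the majority-vote logic: batch the training-set majority class into slices of minority size (1495 rows in the video's split, since 1869 x 0.8 is roughly 1495, with the remainder in the third batch), train one model per slice combined with the full minority class, and sum the three 0/1 predictions; a sum greater than 1 means at least two models voted 1. A sketch reusing the ANN helper above:

# Ensemble with undersampling + majority vote over a shared test set.
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = df2.drop('Churn', axis=1)
y = df2['Churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=15)

train = X_train.copy()
train['Churn'] = y_train
maj = train[train.Churn == 0].reset_index(drop=True)
mino = train[train.Churn == 1]
batch = len(mino)  # ~1495: rough 1:3 minority-to-majority ratio

votes = []
for i in range(3):
    # slices 0:1495, 1495:2990, and 2990 to whatever remains
    part = maj.iloc[i * batch:(i + 1) * batch] if i < 2 else maj.iloc[2 * batch:]
    df_i = pd.concat([part, mino], axis=0)
    votes.append(ANN(df_i.drop('Churn', axis=1), df_i['Churn'], X_test, y_test))

# Majority vote: sums of 0 or 1 -> predict 0; sums of 2 or 3 -> predict 1.
y_final = (votes[0] + votes[1] + votes[2] > 1).astype(int)
print(classification_report(y_test, y_final))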
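Technique 5, focal loss. The speaker says he tried it and it reduced the F1 score in his case, and his code is not shown, so the sketch below is one common binary formulation (from the Lin et al. paper that the linked Medium article discusses), offered as an assumption rather than the author's implementation. The (1 - p_t)^gamma factor shrinks the loss on easy, well-classified (mostly majority) examples, so the minority class carries relatively more weight.

# Binary focal loss: use via model.compile(loss=binary_focal_loss(), ...).
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)    # prob. of the true class
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)  # per-class weighting
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss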