title
Implementing a Spam Classifier in Python | Natural Language Processing
description
Here is a detailed explanation of how to implement a spam classifier in Python using Natural Language Processing.
GitHub link: https://github.com/krishnaik06/SpamClassifier
NLP playlist: https://www.youtube.com/watch?v=6ZVf1jnEKGI&list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm
You can buy my book, where I have provided a detailed explanation of how we can use Machine Learning and Deep Learning in Finance using Python.
Packt url : https://prod.packtpub.com/in/big-data-and-business-intelligence/hands-python-finance
Amazon url: https://www.amazon.com/Hands-Python-Finance-implementing-strategies-ebook/dp/B07Q5W7GB1/ref=sr_1_1?keywords=Krish+naik&qid=1554285070&s=gateway&sr=8-1-spell
Please subscribe and share with all your friends
detail
{'title': 'Implementing a Spam Classifier in Python | Natural Language Processing', 'heatmap': [{'end': 125.703, 'start': 86.499, 'weight': 1}, {'end': 958.407, 'start': 918.954, 'weight': 0.785}, {'end': 1166.17, 'start': 1147.221, 'weight': 0.753}, {'end': 1290.227, 'start': 1228.347, 'weight': 0.774}], 'summary': 'Covers implementing a spam classifier using NLP techniques like stop word removal, lemmatization, stemming, CountVectorizer, and TF-IDF models, and building a machine learning project to create a spam classifier for email and text messages, achieving around 98% accuracy with potential improvements using lemmatization and a TF-IDF model.', 'chapters': [{'end': 74.524, 'segs': [{'end': 74.524, 'src': 'embed', 'start': 17.202, 'weight': 0, 'content': [{'end': 24.368, 'text': 'we saw something called as bag of words, how we can implement bag of words by using CountVectorizer, and then we also saw TF-IDF models.', 'start': 17.202, 'duration': 7.166}, {'end': 29.054, 'text': 'Now we will try to implement a project, and the project is basically a spam classifier.', 'start': 25.129, 'duration': 3.925}, {'end': 34.16, 'text': "Basically, we'll try to create a model which will be like a spam classifier.", 'start': 30.576, 'duration': 3.584}, {'end': 42.53, 'text': 'This kind of spam classifier will actually classify your spam messages that you usually get in your email.', 'start': 34.881, 'duration': 7.649}, {'end': 48.707, 'text': "or suppose, if you're getting a text message, the spam classifier can actually detect them.", 'start': 42.53, 'duration': 6.177}, {'end': 50.228, 'text': "so we'll try to see that.", 'start': 48.707, 'duration': 1.521}, {'end': 54.811, 'text': 'and today this is a whole machine learning problem statement.', 'start': 50.228, 'duration': 4.583}, {'end': 62.336, 'text': "as of now, i'll be discussing about this machine learning problem statement. we will be starting from reading the data set.", 'start': 54.811, 'duration': 7.525}, {'end': 70.021, 'text': "i'll just explain the data set, from where i have taken this particular data set, and then we'll follow all the processes of stop word removal,", 'start': 62.336, 'duration': 7.685}, {'end': 74.524, 'text': 'lemmatization, stemming, and then try to apply a machine learning model.', 'start': 70.021, 'duration': 4.503}], 'summary': 'Implementing bag-of-words and TF-IDF models for spam classification in machine learning', 'duration': 57.322, 'max_score': 17.202, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC017202.jpg'}], 'start': 0.868, 'title': 'Implementing an NLP spam classifier', 'summary': 'Discusses implementing a spam classifier using NLP techniques like stop word removal, lemmatization, stemming, and the use of CountVectorizer and TF-IDF models, followed by a machine learning project to create a spam classifier for email and text messages.', 'chapters': [{'end': 74.524, 'start': 0.868, 'title': 'NLP spam classifier project', 'summary': 'Discusses implementing a spam classifier using NLP techniques like stop word removal, lemmatization, stemming, and the use of CountVectorizer and TF-IDF models, followed by a machine learning project to create a spam classifier for email and text messages.', 'duration': 73.656, 'highlights': ['The chapter will focus on implementing a spam classifier using NLP techniques like stop word removal, lemmatization, stemming, CountVectorizer, and TF-IDF models. The discussion involves implementing NLP techniques such as stop word removal, lemmatization, stemming, CountVectorizer, and TF-IDF models for the spam classifier.', 'A project to create a spam classifier for email and text messages will be initiated. The project aims to create a model that can classify spam messages in emails and text messages, serving as a spam classifier.', 'The chapter will cover the entire process from reading the dataset to applying a machine learning model for the spam classifier. The discussion will encompass the complete process of reading the dataset, implementing NLP techniques, and applying a machine learning model for the spam classifier.']}], 'duration': 73.656, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0868.jpg', 'highlights': ['The chapter will cover the entire process from reading the dataset to applying a machine learning model for the spam classifier.', 'A project to create a spam classifier for email and text messages will be initiated.', 'The chapter will focus on implementing a spam classifier using NLP techniques like stop word removal, lemmatization, stemming, CountVectorizer, and TF-IDF models.']}, {'end': 321.402, 'segs': [{'end': 125.703, 'src': 'heatmap', 'start': 75.154, 'weight': 2, 'content': [{'end': 80.296, 'text': 'then try to find out whether this, whether this message, is a spam or not.', 'start': 75.154, 'duration': 5.142}, {'end': 86.499, 'text': 'so basically, this particular data set i have taken from the uci website.', 'start': 80.296, 'duration': 6.203}, {'end': 95.942, 'text': "okay, there is a machine learning repository over here and you just search in google for 'sms spam collection data set' and you will be getting the complete data set as it is.", 'start': 86.499, 'duration': 9.443}, {'end': 97.923, 'text': 'so you can download this particular data set.', 'start': 95.942, 'duration': 1.981}, {'end': 98.983, 'text': 'as soon as you download,', 'start': 97.923, 'duration': 1.06}, {'end': 108.678, 'text': "you'll be getting two files, and your two files will be looking something like this: a readme, and there will be something called as SMS spam collection.", 'start': 98.983, 'duration': 9.695}, {'end': 112.099, 'text': 'okay, now, if you just go, this one is the readme file.', 'start': 108.678, 'duration': 3.421}, {'end': 116.001, 'text': 'the readme file has some information about this spam collection.', 'start': 112.099, 'duration': 3.902}, {'end': 125.703, 'text': 'okay, and when you try to open this particular file, you know, and if you just do edit, you can see that here is your whole data.', 'start': 116.001, 'duration': 9.702}], 'summary': 'Analyzing the SMS spam message data set from the UCI repository for machine learning.', 'duration': 40.847, 'max_score': 75.154, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC075154.jpg'}, {'end': 204.327, 'src': 'embed', 'start': 179.733, 'weight': 0, 'content': [{'end': 187.493, 'text': 'instead it is tab separated. the reason why i am telling you it is tab separated is because when you try to read this particular data set, okay,', 'start': 179.733, 'duration': 7.76}, {'end': 191.757, 'text': 'by using read_csv, we will try to put a delimiter.', 'start': 187.493, 'duration': 4.264}, {'end': 192.958, 'text': 'That is slash T.', 'start': 191.757, 'duration': 1.201}, {'end': 194.159, 'text': 'slash T indicates that.',
'start': 192.958, 'duration': 1.201}, {'end': 198.182, 'text': 'now let us just continue and try to see how we can read this particular data set.', 'start': 194.159, 'duration': 4.023}, {'end': 204.327, 'text': 'now, you know that I have a folder called as SMS spam collection and this is the SMS spam collection file.', 'start': 198.182, 'duration': 6.145}], 'summary': 'The transcript discusses reading tab-separated data using read_csv with a tab delimiter.', 'duration': 24.594, 'max_score': 179.733, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0179733.jpg'}, {'end': 336.304, 'src': 'embed', 'start': 304.186, 'weight': 1, 'content': [{'end': 305.607, 'text': "I'll just go and see my data.", 'start': 304.186, 'duration': 1.421}, {'end': 307.508, 'text': 'It is something like label and message.', 'start': 305.887, 'duration': 1.621}, {'end': 310.77, 'text': 'Now you can see that my data frame is basically having two columns.', 'start': 307.548, 'duration': 3.222}, {'end': 312.25, 'text': 'One is label and one is message.', 'start': 310.89, 'duration': 1.36}, {'end': 317.013, 'text': 'In the label, I have ham and spam, okay? And in the messages, I have normal messages.', 'start': 312.571, 'duration': 4.442}, {'end': 321.402, 'text': 'okay, and this particular label indicates whether this message is a ham or a spam.', 'start': 317.461, 'duration': 3.941}, {'end': 330.663, 'text': 'okay now, the next thing is that, as i told you, as we discussed in the previous nlp playlist, right, if you have not seen the playlist, guys,', 'start': 321.402, 'duration': 9.261}, {'end': 336.304, 'text': "please go and see those playlists, because i'm going to use those concepts and i'm going to implement this whole.", 'start': 330.663, 'duration': 5.641}], 'summary': 'Data frame has 2 columns: label (ham/spam) and message, to be used for NLP implementation.', 'duration': 32.118, 'max_score': 304.186, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0304186.jpg'}], 'start': 75.154, 'title': 'Building a spam detection model and reading tab-separated data using pandas', 'summary': 'Covers building a spam detection model using a dataset from the UCI machine learning repository, aiming to predict whether messages are spam or not. Additionally, it explains how to read tab-separated files using pandas, creating a data frame with two columns for categorizing SMS messages as ham or spam.', 'chapters': [{'end': 161.027, 'start': 75.154, 'title': 'Spam detection model from SMS data', 'summary': 'Discusses the process of building a spam detection model using a dataset from the UCI machine learning repository, which contains messages labeled as spam or ham, and aims to create a model for predicting whether messages are spam or not.', 'duration': 85.873, 'highlights': ["The dataset is obtained from the UCI machine learning repository by searching for 'SMS spam collection data set' on Google. The source of the dataset is the UCI machine learning repository, which can be accessed by searching for 'SMS spam collection data set' on Google.", "The dataset contains two files: a 'readme' and the 'SMS spam collection' file, with the 'readme' file providing information about the dataset. The dataset comprises two files, a 'readme' and the 'SMS spam collection' file, with the 'readme' file containing information about the dataset.", "Each message in the dataset is labeled as either 'spam' or 'ham', with 'ham' indicating that the message is not spam. The dataset labels each message as either 'spam' or 'ham', with 'ham' denoting that the message is not spam.", "The goal is to create a model to predict whether messages are classified as 'ham' or 'spam' based on the dataset. The objective is to develop a model for predicting whether messages are categorized as 'ham' or 'spam' using the dataset."]}, {'end': 321.402, 'start': 161.027, 'title': 'Reading tab-separated data using pandas', 'summary': 'Discusses how to read a tab-separated file using pandas, specifying a tab delimiter, and creating a data frame with two columns - label and message - for categorizing SMS messages as ham or spam.', 'duration': 160.375, 'highlights': ['The chapter demonstrates how to use pandas to read a tab-separated file and specify a tab delimiter. By utilizing pandas to read the tab-separated file with the specified tab delimiter, the chapter illustrates an efficient method for data processing.', 'The process involves creating a data frame with two columns - label and message, where label denotes ham or spam, and message contains the SMS content. The chapter outlines the creation of a data frame with two columns, label and message, to categorize SMS messages as ham or spam, facilitating efficient data organization and analysis.', "The data set comprises two columns - label and message - with 'ham' and 'spam' labels for categorizing messages. The data set contains two columns, label and message, with 'ham' and 'spam' labels for categorizing messages, providing structured data for analysis."]}], 'duration': 246.248, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC075154.jpg', 'highlights': ['The chapter demonstrates how to use pandas to read a tab-separated file and specify a tab delimiter.', 'The process involves creating a data frame with two columns - label and message, where label denotes ham or spam, and message contains the SMS content.', "The dataset contains two files: a 'readme' and the 'SMS spam collection' file, with the 'readme' file providing information about the dataset.", "The dataset is obtained from the UCI machine learning repository by searching for 'SMS spam collection data set' on Google."]}, {'end': 761.404, 'segs': [{'end': 369.369, 'src': 'embed', 'start': 340.885, 'weight': 0, 'content': [{'end': 344.826, 'text': "sorry, i'm going to use stemming, and i'm going to create a bag of words.", 'start': 340.885, 'duration': 3.941}, {'end': 347.649, 'text': 'so all those concepts have been discussed in the playlist.', 'start': 345.186, 'duration': 2.463}, {'end': 352.454, 'text': 'if you have not gone through that, first of all, go through that, then you come and try to understand this particular use case,', 'start': 347.649, 'duration': 4.805}, {'end': 354.977, 'text': "how i've integrated all these concepts.", 'start': 352.454, 'duration': 2.523}, {'end': 357.76, 'text': "now, the next thing is that i'm going to clean.", 'start': 354.977, 'duration': 2.783}, {'end': 360.381, 'text': 'do some data cleaning and data pre-processing.', 'start': 358.399, 'duration': 1.982}, {'end': 369.369, 'text': 'okay now, in these messages, if i go and see my messages column, okay now, in this messages column, we have a lot of things like commas, full stops,', 'start': 360.381, 'duration': 8.988}], 'summary': 'Utilizing stemming and bag of words for data cleaning and pre-processing.', 'duration': 28.484, 'max_score': 340.885, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0340885.jpg'}, {'end': 445.73, 'src': 'embed', 'start': 412.829, 'weight': 1, 'content': [{'end': 415.77, 'text': 're is basically a library which is used for regular expressions.', 'start': 412.829, 'duration': 2.941}, {'end': 418.691, 'text': "The next library that I'm going to use is NLTK.", 'start': 416.41, 'duration': 2.281}, {'end': 427.354, 'text': "I hope you all know why NLTK is used, because everything, when you're using stop words, when you're doing lemmatization, when you're doing stemming,", 'start': 419.211, 'duration': 8.143}, {'end': 431.215, 'text': "whenever you're doing bag of words, all the libraries are present inside NLTK.", 'start': 427.354, 'duration': 3.861}, {'end': 433.276, 'text': "So you don't have to worry about that.", 'start': 431.235, 'duration': 2.041}, {'end': 436.187, 'text': 'Sorry, with respect to bag of words.', 'start': 433.646, 'duration': 2.541}, {'end': 445.73, 'text': "we'll be using scikit-learn, but most of the libraries, like stop words, you know, and some more libraries, like stemming and lemmatization,", 'start': 436.187, 'duration': 9.543}], 'summary': 're for regular expressions, NLTK for text processing with stop words, lemmatization, and stemming, scikit-learn for bag of words.', 'duration': 32.901, 'max_score': 412.829, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0412829.jpg'}, {'end': 718.983, 'src': 'embed', 'start': 689.261, 'weight': 2, 'content': [{'end': 693.602, 'text': 'okay. or if there is a word like going, it will become go, okay.', 'start': 689.261, 'duration': 4.341}, {'end': 696.203, 'text': 'so that will actually be done by the stemming process.', 'start': 693.602, 'duration': 2.601}, {'end': 701.452, 'text': "Then in this review, I'll be getting again the list of words which will be in the base root form.", 'start': 696.809, 'duration': 4.643}, {'end': 705.995, 'text': 'And remember all the stop words will be removed from that whole list of words.', 'start': 701.952, 'duration': 4.043}, {'end': 707.976, 'text': "Okay, I'm going to join those words.", 'start': 706.395, 'duration': 1.581}, {'end': 710.718, 'text': "Okay, I'm going to join those words.", 'start': 708.676, 'duration': 2.042}, {'end': 714.38, 'text': "Sorry, I'm just going to join all the list of words into a sentence.", 'start': 711.098, 'duration': 3.282}, {'end': 718.983, 'text': "And then I'm appending that in my new list that I created, that is called as corpus.", 'start': 714.86, 'duration': 4.123}], 'summary': 'Stemming process reduces words to their base form, removes stop words, and joins remaining words into a sentence.', 'duration': 29.722, 'max_score': 689.261, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0689261.jpg'}], 'start': 321.402, 'title': 'NLP and text data pre-processing', 'summary': 'Covers NLP techniques like stop word removal, lemmatization, and bag of words for cleaning and pre-processing in a spam classifier, emphasizing the importance of lowercasing words and removing irrelevant words.
It also discusses the use of regular expressions, NLTK, and the Porter stemmer for text data pre-processing, aiming to improve the efficiency of text analysis on a dataset of around 5,572 records.', 'chapters': [{'end': 397.746, 'start': 321.402, 'title': 'NLP data pre-processing for spam classifier', 'summary': 'Discusses the use of NLP concepts such as stop word removal, lemmatization, stemming, and creating a bag of words for data cleaning and pre-processing in a spam classifier, emphasizing the importance of lowercasing words and removing irrelevant words.', 'duration': 76.344, 'highlights': ["The importance of lowercasing words to remove irrelevant words and enhance the spam classifier's accuracy. Lowercasing words is crucial to remove irrelevant words and enhance the spam classifier's accuracy.", 'Using NLP concepts such as stop word removal, lemmatization, stemming, and creating a bag of words for data cleaning and pre-processing in a spam classifier. Discussing the use of NLP concepts such as stop word removal, lemmatization, stemming, and creating a bag of words for data cleaning and pre-processing in a spam classifier.', 'Emphasizing the need to go through the previous NLP playlist to understand the integrated concepts for the spam classifier. Stressing the importance of going through the previous NLP playlist to understand the integrated concepts for the spam classifier.']}, {'end': 592.996, 'start': 397.746, 'title': 'Text data pre-processing', 'summary': 'Discusses the use of regular expressions and NLTK for text data pre-processing, including the removal of unnecessary characters, importing stop words, and initializing a Porter stemmer for stemming purposes.', 'duration': 195.25, 'highlights': ['Regular expressions and NLTK are used for text data pre-processing, including the removal of unnecessary characters such as numbers, full stops, commas, exclamation marks, and question marks from messages.', 'NLTK is utilized for various text processing tasks like stop word removal, lemmatization, stemming, and bag of words generation, eliminating the need to worry about other libraries.', "Importing stop words from nltk.corpus to remove common words like 'if', 'off', 'then', and 'to' from the text data, contributing to data cleaning and processing.", 'The use of the Porter stemmer for stemming purposes, which helps in finding the base root form of words, is highlighted as an essential part of the data pre-processing stage.']}, {'end': 761.404, 'start': 592.996, 'title': 'Text processing and analysis', 'summary': 'Explains the process of text processing and analysis, including steps such as cleaning, lowering, and splitting, with a focus on removing stop words and applying stemming to obtain the base root form of words, aiming to improve the efficiency of text analysis on a dataset of around 5,572 records.', 'duration': 168.408, 'highlights': ['The chapter emphasizes the importance of text processing and analysis, detailing steps such as cleaning, lowering, and splitting to enhance the efficiency of text analysis on a dataset of around 5,572 records. dataset of around 5,572 records', 'The process involves removing unnecessary characters like full stops, commas, and other punctuation marks, as well as applying stemming to obtain the base root form of words, ultimately aiming to improve the quality of text analysis. removing unnecessary characters, applying stemming', 'The chapter highlights the significance of removing stop words from the text to streamline the analysis process, ensuring that the focus remains on meaningful words and reducing the impact of duplicate words. reducing impact of duplicate words']}], 'duration': 440.002, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0321402.jpg', 'highlights': ['Using NLP concepts such as stop word removal, lemmatization, stemming, and creating a bag of words for data cleaning and pre-processing in a spam classifier.', 'Regular expressions and NLTK are used for text data pre-processing, including the removal of unnecessary characters such as numbers, full stops, commas, exclamation marks, and question marks from messages.', 'The process involves removing unnecessary characters like full stops, commas, and other punctuation marks, as well as applying stemming to obtain the base root form of words, ultimately aiming to improve the quality of text analysis.']}, {'end': 1166.17, 'segs': [{'end': 792.057, 'src': 'embed', 'start': 761.784, 'weight': 0, 'content': [{'end': 766.305, 'text': 'Then for each and every word, I have found out whether that word is present in the stop words or not.', 'start': 761.784, 'duration': 4.521}, {'end': 770.446, 'text': 'If it is not present in the stop words, I have actually, you know,', 'start': 766.385, 'duration': 4.061}, {'end': 776.607, 'text': 'moved to the stemming part and I have found out the stemming or the base root form of that particular word.', 'start': 770.446, 'duration': 6.161}, {'end': 781.628, 'text': 'Finally, I have joined all the words and appended it into another list which is called as corpus.', 'start': 776.967, 'duration': 4.661}, {'end': 783.289, 'text': 'Now it has got executed.', 'start': 782.229, 'duration': 1.06}, {'end': 786.214, 'text': 'Now let us see the corpus over here.', 'start': 783.329, 'duration': 2.885}, {'end': 792.057, 'text': 'okay, so this is my corpus and this was my real messages data frame.', 'start': 786.214, 'duration': 5.843}], 'summary': 'Identified stop words, performed stemming, and created the corpus list; analyzed the corpus and messages data frame.', 'duration': 30.273, 'max_score': 761.784, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0761784.jpg'}, {'end': 854.91, 'src': 'embed', 'start': 830.283, 'weight': 1, 'content': [{'end': 835.764, 'text': "Now this data, after cleaning, I'm going to convert into bag of words.", 'start': 830.283, 'duration': 5.481}, {'end': 840.726, 'text': 'If I say bag of words, you should remember that in the previous class we have discussed about bag of words.', 'start': 835.784, 'duration': 4.942}, {'end': 842.646, 'text': 'It is nothing but a document matrix.', 'start': 841.026, 'duration': 1.62}, {'end': 844.687, 'text': 'Document matrix with respect to the word.', 'start': 842.946, 'duration': 1.741}, {'end': 849.208, 'text': 'Again, the explanation is given in the previous videos about how to create a bag of words.', 'start': 845.607, 'duration': 3.601}, {'end': 850.308, 'text': 'Please see that.', 'start': 849.608, 'duration': 0.7}, {'end': 854.91, 'text': "I'll provide the link in the description box about the NLP playlist.", 'start': 851.068, 'duration': 3.842}], 'summary': 'Data is cleaned and converted into bag of words for NLP.', 'duration': 24.627, 'max_score': 830.283, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0830283.jpg'}, {'end': 958.407, 'src': 'heatmap', 'start': 918.954, 'weight': 0.785, 'content': [{'end': 922.255, 'text': 'How many total number of columns I have over here?', 'start': 918.954, 'duration': 3.301}, {'end': 922.716, 'text': 'You see this?', 'start': 922.275, 'duration': 0.441}, {'end': 924.436, 'text': 'How many total number of columns I have?', 'start': 922.936, 'duration': 1.5}, {'end': 926.857, 'text': 'I have somewhere around, you know, 6296 columns.', 'start': 924.776, 'duration': 2.081}, {'end': 935.031, 'text': 'Okay, now, when I have 6296 columns, I told you that I have to take the most frequent elements,', 'start': 928.567, 'duration': 6.464}, {'end': 944.738, 'text': 'because there will be some words, like some of the names, which may be coming just once or twice,', 'start': 935.031, 'duration': 9.707}, {'end': 947.66, 'text': 'right?.
And there will be some of the words which will be just.', 'start': 944.738, 'duration': 2.922}, {'end': 950.722, 'text': 'it will not be that frequent when compared to the other words, right?', 'start': 947.66, 'duration': 3.062}, {'end': 954.484, 'text': 'So because of that, we should not take all these columns, that is 6296.', 'start': 951.022, 'duration': 3.462}, {'end': 958.407, 'text': 'Instead, we should just select some frequent columns.', 'start': 954.484, 'duration': 3.923}], 'summary': 'Select frequent columns from 6296 total columns.', 'duration': 39.453, 'max_score': 918.954, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0918954.jpg'}, {'end': 1059.353, 'src': 'embed', 'start': 1035.323, 'weight': 2, 'content': [{'end': 1044.587, 'text': "Now I basically mean that I'm just taking the top 5,000 most frequent words when comparing to all the features, right?", 'start': 1035.323, 'duration': 9.264}, {'end': 1048.749, 'text': 'In that particular whole document or the sentences that I had.', 'start': 1044.967, 'duration': 3.782}, {'end': 1053.291, 'text': "what I'm doing is that I'm taking the 5,000 most frequent words, okay?", 'start': 1048.749, 'duration': 4.542}, {'end': 1055.592, 'text': 'And now you can see that this is my data set.', 'start': 1053.691, 'duration': 1.901}, {'end': 1057.232, 'text': 'This is my whole training data set.', 'start': 1055.692, 'duration': 1.54}, {'end': 1059.353, 'text': 'This is my whole training data set.', 'start': 1057.913, 'duration': 1.44}], 'summary': 'Selecting the top 5,000 most frequent words for the training dataset.', 'duration': 24.03, 'max_score': 1035.323, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC01035323.jpg'}, {'end': 1176.234, 'src': 'heatmap', 'start': 1147.221, 'weight': 3, 'content': [{'end': 1153.244, 'text': 'And in order to convert this into dummy variables, we will be using something called as pandas.get_dummies.', 'start': 1147.221, 'duration': 6.023}, {'end': 1158.406, 'text': "So as soon as I execute this, you see this, and I'm just passing the label column.", 'start': 1153.824, 'duration': 4.582}, {'end': 1166.17, 'text': "So as soon as I execute this, now you can see that my Y column will get converted into two categories, two columns.", 'start': 1159.347, 'duration': 6.823}, {'end': 1170.571, 'text': 'So ham will actually get specified with one, spam will actually get specified with zero.', 'start': 1166.19, 'duration': 4.381}, {'end': 1176.234, 'text': 'So in all these columns, wherever there is spam, it will get classified with zero.', 'start': 1171.152, 'duration': 5.082}], 'summary': 'Using pandas.get_dummies to convert the y column into two categories: ham (1) and spam (0).', 'duration': 29.013, 'max_score': 1147.221, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC01147221.jpg'}], 'start': 761.784, 'title': 'NLP text preprocessing and feature extraction', 'summary': 'Covers NLP text preprocessing techniques including stop word removal, stemming, and lemmatization, and creating a bag of words using CountVectorizer and dummy variable conversion for model training.', 'chapters': [{'end': 830.223, 'start': 761.784, 'title': 'Text preprocessing in NLP', 'summary': 'Explains the process of text preprocessing in NLP, including stop word removal, stemming, and lemmatization, with a demonstration of the resulting changes in a corpus and data frame.', 'duration': 68.439, 'highlights': ["Stop words are removed from the text to create a list called corpus, resulting in the removal of words like 'see', 'go', 'until', and 'has' from the messages.", 'Stemming is applied to find the base root form of words, though the stemmed roots are sometimes truncated forms that are not dictionary words.', 'Lemmatization is mentioned as an alternative to stemming, with a consideration for accuracy and time efficiency.']}, {'end': 1166.17, 'start': 830.283, 'title': 'Creating bag of words and feature extraction', 'summary': 'Discusses the process of creating a bag of words using CountVectorizer, extracting the top 5,000 most frequent words, and converting the label column into dummy variables, to prepare the data for the model training.', 'duration': 335.887, 'highlights': ['CountVectorizer is used to convert the data into bag of words, resulting in a document matrix with 6296 columns representing unique words and their frequency. CountVectorizer is used to create a document matrix with 6296 columns, indicating unique words and their frequency.', 'The top 5,000 most frequent words are extracted to reduce the dimensionality of the data and prepare it for model training. By extracting the top 5,000 most frequent words, the dimensionality of the data is reduced, providing a more manageable feature set for model training.', 'The label column is converted into dummy variables using pandas.get_dummies to prepare the dependent feature for model training. The label column is converted into dummy variables using pandas.get_dummies to prepare the dependent feature for model training.']}], 'duration': 404.386, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC0761784.jpg', 'highlights': ["Stop words are removed from the text to create a list called corpus, resulting in the removal of words like 'see', 'go', 'until', and 'has' from the messages.", 'CountVectorizer is used to convert the data into bag of words, resulting in a document matrix with 6296 columns representing unique words and their frequency.', 'The top 5,000 most frequent words are extracted to reduce the dimensionality of the data and prepare it for model training.', 'The label column is converted into dummy variables using pandas.get_dummies to prepare the dependent feature for model training.', 'Stemming is applied to find the base root form of words, though the stemmed roots are sometimes truncated forms that are not dictionary words.']}, {'end': 1737.329, 'segs': [{'end': 1192.307, 'src': 'embed', 'start': 1166.19, 'weight': 1, 'content': [{'end': 1170.571, 'text': 'So ham will actually get specified with one, spam will actually get specified with zero.', 'start': 1166.19, 'duration': 4.381}, {'end': 1176.234, 'text': 'So in all these columns, wherever there is spam, it will get classified with zero.', 'start': 1171.152, 'duration': 5.082}, {'end': 1179.544, 'text': 'Sorry, wherever there is ham, that is 1.', 'start': 1176.823, 'duration': 2.721}, {'end': 1183.025, 'text': 'Otherwise, if it is not ham, this will become 0.', 'start': 1179.544, 'duration': 3.481}, {'end': 1185.185, 'text': 'Similarly, when spam is there, it will be 0.', 'start': 1183.025, 'duration': 2.16}, {'end': 1188.306, 'text': 'When spam is not there, it will be 1.', 'start': 1185.185, 'duration': 3.121}, {'end': 1191.147, 'text': 'So this basically indicates that whenever.', 'start': 1188.306, 'duration': 2.841}, {'end': 1192.307, 'text': 'Okay, sorry, sorry.', 'start': 1191.147, 'duration': 1.16}], 'summary': 'Spam is classified as 0, ham as 1 in the specified columns.', 'duration': 26.117, 'max_score': 1166.19, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC01166190.jpg'}, {'end': 1290.227, 'src': 'heatmap', 'start': 1228.347, 'weight': 0.774, 'content': [{'end': 1235.371, 'text': 'you know, because you can see that when the spam is zero, this basically indicates that this column will be ham, right.', 'start': 1228.347, 'duration': 7.024}, {'end': 1240.234, 'text': "so ham value, i'm specifying it as zero, and spam value, i'm specifying it as one.", 'start': 1235.371, 'duration': 4.863}, {'end': 1243.096, 'text': 'so what you can do is that you can remove one column from that.', 'start': 1240.234, 'duration': 2.862}, {'end': 1249.28, 'text': "so in order to remove this column, what i'm doing is that i'm writing just y.iloc[:, 1].values.", 'start': 1243.096, 'duration': 6.184}, {'end': 1257.203, 'text': "as soon as i execute this and see my y value, you can see that i'm just taking one column, that is the spam or the ham column.", 'start': 1249.719, 'duration': 7.484}, {'end': 1258.564, 'text': 'you can take any of the columns.', 'start': 1257.203, 'duration': 1.361}, {'end': 1263.086, 'text': 'okay, because one column will actually specify both the information.', 'start': 1258.564, 'duration': 4.522}, {'end': 1266.128, 'text': 'if it is zero, it basically specifies that it is ham.', 'start': 1263.086, 'duration': 3.042}, {'end': 1270.09, 'text': 'if it is one, it basically specifies that it is spam, something like that.', 'start': 1266.128, 'duration': 3.962}, {'end': 1276.473, 'text': "so you don't have to use two categorical features and you can just use one category feature, and this is also called as the dummy variable trap.", 'start': 1270.09, 'duration': 6.383}, {'end': 1278.761, 'text': 'You should not get into this particular trap.', 'start': 1277.14, 'duration': 1.621}, {'end': 1283.904, 'text': "And there are some problems with this, which I'll be discussing in the data science interview questions.", 'start': 1279.221, 'duration': 4.683}, {'end': 1290.227, 'text': 'What does it mean?
So now my X, which is my independent feature, is ready.', 'start': 1284.084, 'duration': 6.143}], 'summary': 'Avoiding the dummy variable trap: keep a single column, where one specifies spam and zero specifies ham.', 'duration': 61.88, 'max_score': 1228.347, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC01228347.jpg'}, {'end': 1349.328, 'src': 'embed', 'start': 1319.373, 'weight': 3, 'content': [{'end': 1320.513, 'text': 'For doing the train test split.', 'start': 1319.373, 'duration': 1.14}, {'end': 1326.856, 'text': "I'll just import train_test_split from sklearn.model_selection and I will make a test size of 20%.", 'start': 1320.513, 'duration': 6.343}, {'end': 1328.517, 'text': 'you know, 20% of the whole data.', 'start': 1326.856, 'duration': 1.661}, {'end': 1330.338, 'text': 'So let me just execute this.', 'start': 1328.697, 'duration': 1.641}, {'end': 1337.383, 'text': 'So here it is, my X-train, X-test, everything is getting executed and I have all the data over here.', 'start': 1331.82, 'duration': 5.563}, {'end': 1342.485, 'text': 'Now you can see that my X-test is having 1115 rows, whereas my X-train is having 4457.', 'start': 1337.643, 'duration': 4.842}, {'end': 1345.086, 'text': 'If I add up these, it will be equal to the size of X.', 'start': 1342.785, 'duration': 2.301}, {'end': 1349.328, 'text': 'Now in order to solve this problem.', 'start': 1346.247, 'duration': 3.081}], 'summary': 'Using train_test_split from sklearn.model_selection with 20% test size, X-train has 4457 rows and X-test has 1115.', 'duration': 29.955, 'max_score': 1319.373, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC01319373.jpg'}, {'end': 1639.707, 'src': 'embed', 'start': 1601.289, 'weight': 0, 'content': [{'end': 1605.011, 'text': 'Fine. Now you can see that your accuracy is around 98%, guys.', 'start': 1601.289, 'duration': 3.722}, {'end': 1606.952, 'text': 'See the accuracy.', 'start': 1606.131, 'duration': 0.821}, {'end': 1610.459, 'text': 'It is a wonderful accuracy of 98%.', 'start': 1607.012, 'duration': 3.447}, {'end': 1616.141, 'text': 'and this is how you have basically implemented a spam classifier here, how i have implemented it.', 'start': 1610.459, 'duration': 5.682}, {'end': 1620.822, 'text': 'i have just used basic nlp concepts like stemming and lemmatization.', 'start': 1616.141, 'duration': 4.681}, {'end': 1623.763, 'text': "i'm not using lemmatization over here, but, guys,", 'start': 1620.822, 'duration': 2.941}, {'end': 1629.364, 'text': 'if you want, you please go and try to use lemmatization and try to see; in this particular sentence,', 'start': 1623.763, 'duration': 5.601}, {'end': 1632.185, 'text': 'you can use lemmatization and you can increase the performance.', 'start': 1629.364, 'duration': 2.821}, {'end': 1639.707, 'text': 'but here the accuracy is almost 98%, because we have done, actually, we have done a great job and created a good model.', 'start': 1632.185, 'duration': 7.522}], 'summary': 'Implemented a spam classifier with 98% accuracy using basic NLP concepts.', 'duration': 38.418, 'max_score': 1601.289, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC01601289.jpg'}], 'start': 1166.19, 'title': 'Categorical variable representation and Naive Bayes classifier implementation', 'summary': 'Explains representing categorical variables, train-test split with a 20% test size, and implementing a Naive Bayes spam classifier with around 98% accuracy using the sklearn library, and potential improvements using lemmatization and a TF-IDF model.', 'chapters': [{'end': 1337.383, 'start': 1166.19, 'title': 'Categorical variable representation', 'summary': 'Explains the representation of categorical variables in a dataset, highlighting the conversion of spam and ham categories to 0 and 1, the importance of avoiding the dummy variable trap, and the process of train test split with a test size of 20%.', 'duration': 171.193, 'highlights': ['The conversion of spam and ham categories to 0 and 1 is explained, where 1 indicates the ham category and 0 indicates the spam category.', 'The concept of avoiding the dummy variable trap is discussed, emphasizing the use of one category variable instead of two to represent the information, which can otherwise lead to problems in data analysis.', 'The process of train test split is demonstrated, with the import of train_test_split from sklearn.model_selection and a test size of 20% being specified.']}, {'end': 1737.329, 'start': 1337.643, 'title': 'Implementing Naive Bayes classifier', 'summary': 'Demonstrates the implementation of a Naive Bayes spam classifier, achieving an accuracy of around 98%, using the sklearn library, and discusses potential improvements using lemmatization and a TF-IDF model.', 'duration': 399.686, 'highlights': ['Implemented Naive Bayes spam classifier with around 98% accuracy Implemented a Naive Bayes spam classifier with an accuracy of around 98% using the sklearn library, demonstrating successful model creation and prediction.', 'Discussed potential improvements using lemmatization and TF-IDF model Discussed potential improvements in accuracy by suggesting the use of lemmatization and a TF-IDF model instead of CountVectorizer, highlighting their relevance in improving model performance for NLP problems.', "Explained theoretical concept of Naive Bayes and its applicability in NLP Mentioned the theoretical concept of Naive Bayes and its relevance in NLP problems, indicating a plan to explain it in a future video, providing insights into the classifier's working principle."]}], 'duration': 571.139, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/fA5TSFELkC0/pics/fA5TSFELkC01166190.jpg', 'highlights': ['Implemented Naive Bayes spam classifier with around 98% accuracy using the sklearn library', 'Explained representing spam and ham categories as 0 and 1, respectively', 'Discussed potential improvements using lemmatization and a TF-IDF model', 'Demonstrated train-test split process with 20% test size using sklearn.model_selection']}], 'highlights': ['Implemented Naive Bayes spam classifier with around 98% accuracy using the sklearn library', 'The chapter will cover the entire process from reading the dataset to applying a machine learning model for the spam classifier.', 'A project to create a spam classifier for email and text messages will be initiated.', 'The chapter will focus on implementing a spam classifier using NLP techniques like stop word removal, lemmatization, stemming, CountVectorizer, and TF-IDF models.', 'Using NLP concepts such as stop word removal, lemmatization, stemming, and creating a bag of words for data cleaning and pre-processing in a spam classifier.', 'The process involves creating a data frame with two columns - label and message, where label denotes ham or spam, and message contains the SMS content.', "The dataset contains two files: a 'readme' and the 'SMS spam collection' file, with the 'readme' file providing information about the dataset.", "The dataset is obtained from the UCI machine learning repository by searching for 'SMS spam collection data set' on Google.", 'Regular expressions and NLTK are used for text data pre-processing, including the removal of unnecessary characters such as numbers, full stops, commas, exclamation marks, and question marks from messages.', 'The process involves removing unnecessary characters like full stops, commas, and other punctuation marks, as well as applying stemming to obtain the base root form of words, ultimately aiming to improve the quality of text analysis.', "Stop words are removed from the text to create a list called corpus, resulting in the removal of words like 'see', 'go', 'until', and 'has' from the messages.", 'CountVectorizer is used to convert the data into bag of words, resulting in a document matrix with 6296 columns representing unique words and their frequency.', 'The top 5,000 most frequent words are extracted to reduce the dimensionality of the data and prepare it for model training.', 'The label column is converted into dummy variables using pandas.get_dummies to prepare the dependent feature for model training.', 'Stemming is applied to find the base root form of words, though the stemmed roots are sometimes truncated forms that are not dictionary words.', 'Explained representing spam and ham categories as 0 and 1, respectively', 'Discussed potential improvements using lemmatization and a TF-IDF model', 'Demonstrated train-test split process with 20% test size using sklearn.model_selection']}
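code sketches
The pipeline summarized above can be reproduced in a few short Python sketches. First, the data-loading step: a minimal sketch assuming the UCI 'SMSSpamCollection' file has been downloaded and extracted into the working directory (the file name follows the UCI archive; adjust the path to your own layout).

import pandas as pd

# The UCI file is tab-separated, so '\t' is passed as the separator, and the
# two columns are named: the label ('ham'/'spam') and the raw SMS text.
messages = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "message"])
print(messages.shape)   # about (5572, 2) for this dataset
print(messages.head())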
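Next, the cleaning loop the transcript walks through: keep only letters, lowercase, split, drop English stop words, stem with the Porter stemmer, and rejoin each message into a cleaned corpus list. This continues from the previous sketch and assumes the NLTK stop-word list has been downloaded once.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("stopwords", quiet=True)   # one-time download of the stop-word list
ps = PorterStemmer()
stop_words = set(stopwords.words("english"))

corpus = []
for text in messages["message"]:
    review = re.sub("[^a-zA-Z]", " ", text)   # strip digits and punctuation
    words = review.lower().split()            # lowercase and tokenize
    words = [ps.stem(w) for w in words if w not in stop_words]
    corpus.append(" ".join(words))            # rejoin into one cleaned sentence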
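Then the feature-extraction step: a bag-of-words matrix limited to the 5,000 most frequent terms, plus the label encoding via pd.get_dummies. Keeping a single dummy column is exactly the dummy-variable-trap point made above: one 0/1 column already carries both categories. (Which index holds 'spam' depends on the alphabetical column order; here column 1 is 'spam'.)

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=5000)   # keep only the 5,000 most frequent words
X = cv.fit_transform(corpus).toarray()    # document-term count matrix

y = pd.get_dummies(messages["label"])     # two columns, 'ham' and 'spam'
y = y.iloc[:, 1].astype(int).values       # keep one column (1 = spam, 0 = ham)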
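Finally, the modeling step: an 80/20 train/test split and a Naive Bayes classifier. The transcript only says Naive Bayes; MultinomialNB is the variant usually paired with word-count features, so treat that choice as an assumption. The roughly 98% accuracy is the figure reported in the video; the exact number depends on the random split.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

model = MultinomialNB().fit(X_train, y_train)   # fit Naive Bayes on the word counts
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))   # around 0.98 in the video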
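For the suggested improvement, only the vectorizer changes: TfidfVectorizer down-weights words that appear in almost every message, which can help the classifier focus on discriminative terms; the rest of the pipeline (split, fit, score) stays the same.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(corpus).toarray()
# ...then repeat the train/test split and Naive Bayes fit on X_tfidf,
# and, per the video's suggestion, try a WordNetLemmatizer in place of PorterStemmer.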