title
Natural Language Processing (NLP) Tutorial with Python & NLTK
description
This video will provide you with a comprehensive and detailed knowledge of Natural Language Processing, popularly known as NLP. You will also learn about the different steps involved in processing the human language like Tokenization, Stemming, Lemmatization and more. Python, NLTK, & Jupyter Notebook are used to demonstrate the concepts.
This tutorial was developed by Edureka.
🔗NLP Certification Training: https://goo.gl/kn2H8T
🔗Subscribe to the Edureka YouTube channel: https://www.youtube.com/user/edurekaIN
🔗Edureka Online Training: https://www.edureka.co/
--
Learn to code for free and get a developer job: https://www.freecodecamp.org
Read hundreds of articles on programming: https://medium.freecodecamp.org
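The tokenization step named in the description can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code shown in the video: the helper names `tokenize` and `ngrams` are hypothetical, and the regular expression is a simple stand-in for NLTK's `word_tokenize`. In NLTK itself, `FreqDist` and the helpers in `nltk.util` play the roles that `Counter` and `ngrams` play here.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical helper: keep runs of word characters as tokens and
    # treat every other non-space character as its own token.
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    # Hypothetical helper: slide a window of n consecutive tokens over the list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The dog ate the cat."
tokens = tokenize(sentence)

# Count how often each lower-cased token occurs.
counts = Counter(token.lower() for token in tokens)

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(tokens)
print(counts.most_common(3))
print(bigrams)
```

Lower-casing before counting is a design choice made here so that "The" and "the" are counted together; whether that is appropriate depends on the task.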
detail
{'title': 'Natural Language Processing (NLP) Tutorial with Python & NLTK', 'heatmap': [{'end': 577.947, 'start': 546.726, 'weight': 0.842}, {'end': 717.871, 'start': 638.484, 'weight': 0.791}, {'end': 1105.041, 'start': 1047.987, 'weight': 0.75}, {'end': 1585.786, 'start': 1529.775, 'weight': 0.785}], 'summary': 'This tutorial covers the introduction and applications of natural language processing (nlp) in various industries, including tokenization using nltk with a focus on frequency and length, text analysis, stemming techniques, nlp text processing with nltk involving lemmatization, stop word removal, and part-of-speech tagging, and ner application for language understanding and processing.', 'chapters': [{'end': 39.986, 'segs': [{'end': 28.95, 'src': 'embed', 'start': 0.249, 'weight': 0, 'content': [{'end': 1.93, 'text': 'Welcome everyone to freeCodeCamp.', 'start': 0.249, 'duration': 1.681}, {'end': 9.416, 'text': 'I, Kislay, on behalf of Edureka, will take this session on natural language processing, popularly known as NLP.', 'start': 2.271, 'duration': 7.145}, {'end': 16.561, 'text': 'Now, Edureka is a global e-learning company that provides online training courses on the latest trending technologies.', 'start': 9.856, 'duration': 6.705}, {'end': 20.444, 'text': "So without any further delay, let's have a look at the agenda for this session.", 'start': 17.101, 'duration': 3.343}, {'end': 28.95, 'text': "So I'll start off by explaining the evolution of the human language, then we'll understand what NLP is and how it came into the picture.", 'start': 21.264, 'duration': 7.686}], 'summary': 'Edureka provides online training on nlp. 
session covers evolution of human language and nlp introduction.', 'duration': 28.701, 'max_score': 0.249, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM249.jpg'}], 'start': 0.249, 'title': 'Introduction to nlp in industry', 'summary': 'Introduces natural language processing (nlp) and its applications in the industry, covering the evolution of human language, the definition of nlp, its applications, and the challenges faced in implementation.', 'chapters': [{'end': 39.986, 'start': 0.249, 'title': 'Introduction to nlp in industry', 'summary': 'Introduces natural language processing (nlp) and its applications in the industry, covering the evolution of human language, the definition of nlp, its applications, and the challenges faced in implementation.', 'duration': 39.737, 'highlights': ['The session covers the evolution of human language and the emergence of NLP, followed by an exploration of NLP applications in the industry.', 'Edureka is a global e-learning company providing online training courses on trending technologies, and the session is conducted by Kislay on behalf of Edureka.', 'The agenda includes explanations of the evolution of human language, the definition of NLP, its applications in the industry, and the challenges faced during implementation.']}], 'duration': 39.737, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM249.jpg', 'highlights': ['The session covers the evolution of human language and the emergence of NLP, followed by an exploration of NLP applications in the industry.', 'The agenda includes explanations of the evolution of human language, the definition of NLP, its applications in the industry, and the challenges faced during implementation.', 'Edureka is a global e-learning company providing online training courses on trending technologies, and the session is conducted by Kislay on behalf of Edureka.']}, {'end': 551.47, 
'segs': [{'end': 362.525, 'src': 'embed', 'start': 333.511, 'weight': 1, 'content': [{'end': 338.813, 'text': 'Now, it involves text planning, which includes retrieving the relevant contents from the knowledge base.', 'start': 333.511, 'duration': 5.302}, {'end': 346.976, 'text': 'It involves sentence planning, which includes choosing the required words, forming meaningful phrases, and setting the tone of the sentences.', 'start': 339.352, 'duration': 7.624}, {'end': 349.998, 'text': 'And finally we have text realization.', 'start': 347.737, 'duration': 2.261}, {'end': 353.56, 'text': 'It is mapping the sentence plan into the sentence structure.', 'start': 350.358, 'duration': 3.202}, {'end': 362.525, 'text': "Now, we'll learn about this later in this video, and usually natural language understanding, which is NLU, is much, much harder than NLG.", 'start': 354.16, 'duration': 8.365}], 'summary': 'Text planning, sentence planning, and text realization are key steps in natural language generation, with nlu being much harder than nlg.', 'duration': 29.014, 'max_score': 333.511, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM333511.jpg'}, {'end': 527.28, 'src': 'embed', 'start': 452.708, 'weight': 0, 'content': [{'end': 461.651, 'text': "So this is a kind of syntactical ambiguity which is often very hard to infer for a new person, or I'd rather say a computer,", 'start': 452.708, 'duration': 8.943}, {'end': 467.193, 'text': 'because it means the meaning of the sentence is different for the different tones or in different aspects.', 'start': 461.651, 'duration': 5.542}, {'end': 472.775, 'text': 'So for example, if I look at the last statement, I saw the man with the binoculars.', 'start': 467.773, 'duration': 5.002}, {'end': 476.856, 'text': 'So do I have the binoculars, or does the man have the binoculars?', 'start': 473.455, 'duration': 3.401}, {'end': 484.331, 'text': 'It might be possible that you might be thinking that I saw the man with 
binoculars means that I have the binoculars.', 'start': 477.556, 'duration': 6.775}, {'end': 490.916, 'text': 'But some people might think that the man I am seeing has the binoculars rather than me.', 'start': 484.331, 'duration': 6.585}, {'end': 492.917, 'text': 'So that is syntactical ambiguity.', 'start': 490.916, 'duration': 2.001}, {'end': 496.279, 'text': 'Now coming to the third ambiguity, which is the referential ambiguity.', 'start': 492.917, 'duration': 3.362}, {'end': 500.742, 'text': 'Now, this ambiguity arises when we refer to something using pronouns.', 'start': 496.279, 'duration': 4.463}, {'end': 505.506, 'text': 'Now, the boy told his father about the theft. He was very upset.', 'start': 500.742, 'duration': 4.764}, {'end': 510.429, 'text': 'Now, when we talk about he was very upset, if you focus on the italicized word he,', 'start': 505.506, 'duration': 4.923}, {'end': 513.948, 'text': 'does this mean that the boy was upset,', 'start': 510.967, 'duration': 2.981}, {'end': 517.072, 'text': 'or the thief was upset, or the father was upset?', 'start': 513.948, 'duration': 3.124}, {'end': 518.133, 'text': 'Nobody knows.', 'start': 517.611, 'duration': 0.522}, {'end': 520.014, 'text': 'This is referential ambiguity.', 'start': 518.673, 'duration': 1.341}, {'end': 527.28, 'text': 'Now coming back to NLP for using NLP onto our system or doing any natural language processing.', 'start': 520.674, 'duration': 6.606}], 'summary': 'Syntactical and referential ambiguities make nlp challenging for new users or computers.', 'duration': 74.572, 'max_score': 452.708, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM452708.jpg'}], 'start': 39.986, 'title': 'Natural language processing', 'summary': "Covers the evolution of human communication, language development, and grammar's significance in forming meaningful sentences. 
it also introduces nlp, emphasizing its role in processing unstructured data, applications in various industries, and challenges in natural language understanding.", 'chapters': [{'end': 120.113, 'start': 39.986, 'title': 'Steps in natural language processing', 'summary': 'Covers the evolution of human communication, the development of language, and the significance of grammar in forming meaningful sentences, emphasizing the steps and parts involved in natural language processing.', 'duration': 80.127, 'highlights': ["Humans' success is attributed to their ability to communicate and share information, which differentiates them from other animals.", 'The development of language involved standardizing drawings and creating different languages with their own sets of alphabets and grammar rules.', 'The significance of grammar in forming meaningful sentences is emphasized, highlighting the rules that govern word combinations.']}, {'end': 551.47, 'start': 120.113, 'title': 'Introduction to natural language processing', 'summary': "Introduces natural language processing (nlp), stating that only 21% of available data is structured, and it explains nlp's role in processing text data, its applications in various industries, and the challenges in natural language understanding due to lexical, syntactical, and referential ambiguities.", 'duration': 431.357, 'highlights': ['NLP applications include automation, summarization, machine translation, named entity recognition, relationship extraction, sentimental analysis, speech recognition, and topic segmentations.', 'NLP is used in spell checking, keyword search, extracting information from websites or documents, advertisement matching, sentimental analysis, speaker recognition, chatbot implementation, and machine translation.', 'NLP involves natural language understanding (NLU) and natural language generation (NLG), where NLU is more challenging due to lexical, syntactical, and referential ambiguities.', 'The NLTK library 
provides interfaces to over 50 corpora and lexical resources, along with text processing libraries for classification, tokenization, stemming, tagging, and more.']}], 'duration': 511.484, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM39986.jpg', 'highlights': ['NLP applications include automation, summarization, machine translation, named entity recognition, relationship extraction, sentimental analysis, speech recognition, and topic segmentations.', 'The development of language involved standardizing drawings and creating different languages with their own sets of alphabets and grammar rules.', 'The significance of grammar in forming meaningful sentences is emphasized, highlighting the rules that govern word combinations.', "Humans' success is attributed to their ability to communicate and share information, which differentiates them from other animals.", 'NLP is used in spell checking, keyword search, extracting information from websites or documents, advertisement matching, sentimental analysis, speaker recognition, chatbot implementation, and machine translation.', 'The NLTK library provides interfaces to over 50 corpora and lexical resources, along with text processing libraries for classification, tokenization, stemming, tagging, and more.', 'NLP involves natural language understanding (NLU) and natural language generation (NLG), where NLU is more challenging due to lexical, syntactical, and referential ambiguities.']}, {'end': 906.781, 'segs': [{'end': 717.871, 'src': 'heatmap', 'start': 638.484, 'weight': 0.791, 'content': [{'end': 643.528, 'text': 'So let me show you guys how you can implement tokenization using the NLTK library.', 'start': 638.484, 'duration': 5.044}, {'end': 648.172, 'text': "So here I'm using Jupyter Notebook.", 'start': 645.61, 'duration': 2.562}, {'end': 651.475, 'text': 'You are free to use any sort of IDE also.', 'start': 649.253, 'duration': 2.222}, {'end': 654.217, 'text': 'My 
personal preference is Jupyter Notebook.', 'start': 652.115, 'duration': 2.102}, {'end': 660.502, 'text': "So first of all, let's import the os module, the NLTK library which we have downloaded, and the NLTK corpus.", 'start': 654.797, 'duration': 5.705}, {'end': 666.427, 'text': "Now let's have a look at the corpora which are provided by NLTK.", 'start': 662.203, 'duration': 4.224}, {'end': 668.531, 'text': 'That is the whole data.', 'start': 667.27, 'duration': 1.261}, {'end': 674.437, 'text': 'So as you can see, we have so many files and all of these files have different functionalities.', 'start': 669.172, 'duration': 5.265}, {'end': 678.401, 'text': 'Some have textual data, some have different functions associated with it.', 'start': 674.497, 'duration': 3.904}, {'end': 682.105, 'text': 'We have stopwords, as you can see here, state_union, names.', 'start': 678.441, 'duration': 3.664}, {'end': 684.187, 'text': 'We had to just sample data.', 'start': 682.486, 'duration': 1.701}, {'end': 686.97, 'text': 'We have different kinds of data and different kinds of functions here.', 'start': 684.608, 'duration': 2.362}, {'end': 690.674, 'text': "So let's take the brown corpus into consideration.", 'start': 688.031, 'duration': 2.643}, {'end': 693.059, 'text': 'As you can see here, we have brown and brown.zip.', 'start': 691.198, 'duration': 1.861}, {'end': 700.062, 'text': "So first, all we need to do is import brown, and then let's have a look at the words which are present in brown.", 'start': 693.299, 'duration': 6.763}, {'end': 705.405, 'text': "You can see we have The Fulton County Grand Jury said, and it's going on and on.", 'start': 700.603, 'duration': 4.802}, {'end': 709.127, 'text': "Now let's have a look at the different Gutenberg files.", 'start': 706.285, 'duration': 2.842}, {'end': 713.829, 'text': 'So, as you can see, under the Gutenberg files we have the Austen Emma text,', 'start': 709.547, 'duration': 4.282}, {'end': 717.871, 'text': 'we have the Bible Text, 
we have the Blake Poems, the Carroll Alice Text.', 'start': 713.829, 'duration': 4.042}], 'summary': 'Demonstration of tokenization using nltk in jupyter notebook with various corpora and data samples.', 'duration': 79.387, 'max_score': 638.484, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM638484.jpg'}, {'end': 682.105, 'src': 'embed', 'start': 654.797, 'weight': 2, 'content': [{'end': 660.502, 'text': "So first of all, let's import the os module, the NLTK library which we have downloaded, and the NLTK corpus.", 'start': 654.797, 'duration': 5.705}, {'end': 666.427, 'text': "Now let's have a look at the corpora which are provided by NLTK.", 'start': 662.203, 'duration': 4.224}, {'end': 668.531, 'text': 'That is the whole data.', 'start': 667.27, 'duration': 1.261}, {'end': 674.437, 'text': 'So as you can see, we have so many files and all of these files have different functionalities.', 'start': 669.172, 'duration': 5.265}, {'end': 678.401, 'text': 'Some have textual data, some have different functions associated with it.', 'start': 674.497, 'duration': 3.904}, {'end': 682.105, 'text': 'We have stopwords, as you can see here, state_union, names.', 'start': 678.441, 'duration': 3.664}], 'summary': 'Imported os, nltk library, and corpus. 
explored diverse functionalities of corpus files.', 'duration': 27.308, 'max_score': 654.797, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM654797.jpg'}, {'end': 771.343, 'src': 'embed', 'start': 747.754, 'weight': 0, 'content': [{'end': 754.917, 'text': 'So if we have a look at the first 500 words of this textual paragraph or what we say the textual file.', 'start': 747.754, 'duration': 7.163}, {'end': 761.559, 'text': "So I'm using here for word in hamlet and I'm using the colon and 500, that is the end point.", 'start': 755.537, 'duration': 6.022}, {'end': 771.343, 'text': 'So, as you can see, it starts as The Tragedie of Hamlet by William Shakespeare 1599, Actus Primus, Scoena Prima, and it goes on and on.', 'start': 762.019, 'duration': 9.324}], 'summary': 'Analyzing first 500 words of hamlet reveals its origin and structure.', 'duration': 23.589, 'max_score': 747.754, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM747754.jpg'}], 'start': 552.25, 'title': 'Nltk download and tokenization', 'summary': 'Demonstrates the process of downloading nltk and tokenization using nltk, focusing on the frequency and length of tokens.', 'chapters': [{'end': 906.781, 'start': 552.25, 'title': 'Nltk download and tokenization', 'summary': 'Demonstrates the process of downloading nltk and tokenization using nltk, showing how to download nltk and use the word_tokenize function from nltk.tokenize to tokenize text, with a focus on the frequency of tokens and the length of the tokens.', 'duration': 354.531, 'highlights': ['The process of downloading NLTK involves selecting all options in the NLTK downloader and clicking on the download button, which will download all the corpora and packages into a chosen directory.', 'Tokenization is the process of breaking up strings into tokens, and can be implemented using the NLTK library to divide a paragraph into tokens, with an example showing 273 
tokens from a given paragraph.', "The FreqDist (frequency distribution) class in NLTK is used to count the frequency of individual words in a given set of tokens, with examples showing the frequency of certain words such as 'comma' and 'intelligence'."]}], 'duration': 354.531, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM552250.jpg', 'highlights': ['Tokenization is the process of breaking up strings into tokens, and can be implemented using the NLTK library to divide a paragraph into tokens, with an example showing 273 tokens from a given paragraph.', 'The process of downloading NLTK involves selecting all options in the NLTK downloader and clicking on the download button, which will download all the corpora and packages into a chosen directory.', "The FreqDist (frequency distribution) class in NLTK is used to count the frequency of individual words in a given set of tokens, with examples showing the frequency of certain words such as 'comma' and 'intelligence'."]}, {'end': 1341.611, 'segs': [{'end': 1055.469, 'src': 'embed', 'start': 1027.186, 'weight': 2, 'content': [{'end': 1033.432, 'text': 'Now coming back to our tokenization part, we have bigrams, trigrams, and n-grams.', 'start': 1027.186, 'duration': 6.246}, {'end': 1037.126, 'text': 'Now, bigrams are tokens of two consecutive written words.', 'start': 1033.905, 'duration': 3.221}, {'end': 1042.126, 'text': 'Similarly, trigrams refer to tokens of three consecutive written words,', 'start': 1037.786, 'duration': 4.34}, {'end': 1047.207, 'text': 'and n-grams generally refer to tokens of any number n of consecutive written words.', 'start': 1042.126, 'duration': 5.081}, {'end': 1055.469, 'text': 'So let us see how we can implement the same using NLTK libraries for bigrams, trigrams, and the n-grams.', 'start': 1047.987, 'duration': 7.482}], 'summary': 'Tokenization involves bigrams, trigrams, and ngrams for consecutive written words using nltk 
libraries.', 'duration': 28.283, 'max_score': 1027.186, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1027185.jpg'}, {'end': 1105.041, 'src': 'heatmap', 'start': 1047.987, 'weight': 0.75, 'content': [{'end': 1055.469, 'text': 'So let us see how we can implement the same using NLTK libraries for bigrams, trigrams, and the n-grams.', 'start': 1047.987, 'duration': 7.482}, {'end': 1061.19, 'text': 'So first what we need to do is import bigrams, trigrams, and ngrams from nltk.util.', 'start': 1056.369, 'duration': 4.821}, {'end': 1067.918, 'text': "So let's take a string: the best and the most real thing in the world cannot be seen or even touched.", 'start': 1062.915, 'duration': 5.003}, {'end': 1069.739, 'text': 'They must be felt with the heart.', 'start': 1068.558, 'duration': 1.181}, {'end': 1070.8, 'text': 'What a beautiful quote.', 'start': 1069.959, 'duration': 0.841}, {'end': 1078.785, 'text': 'So let us now first create the tokens of our string using word_tokenize, as I did earlier. Now, to create a bigram,', 'start': 1071.38, 'duration': 7.405}, {'end': 1086.65, 'text': 'what we need to do is use the list function and inside that we are going to use nltk.bigrams and pass on the tokens.', 'start': 1078.825, 'duration': 7.825}, {'end': 1092.033, 'text': 'So as you can see it has created a bigram of the given document.', 'start': 1088.11, 'duration': 3.923}, {'end': 1101.879, 'text': 'Similarly, if we create the trigrams and the n-grams, all you need to do is change the bigrams to trigrams, and it will give you the trigram list.', 'start': 1092.631, 'duration': 9.248}, {'end': 1105.041, 'text': 'Now, let us create an n-gram list, okay?', 'start': 1102.499, 'duration': 2.542}], 'summary': 'Using nltk libraries to implement bigrams, trigrams, and n-grams for given text.', 'duration': 57.054, 'max_score': 1047.987, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1047987.jpg'}, {'end': 1190.143, 'src': 'embed', 'start': 1160.43, 'weight': 1, 'content': [{'end': 1166.134, 'text': 'Now, this indiscriminate cutting can be successful in some occasions, but not always,', 'start': 1160.43, 'duration': 5.704}, {'end': 1170.516, 'text': 'and that is why we affirm that this approach presents some limitations.', 'start': 1166.134, 'duration': 4.382}, {'end': 1177.46, 'text': "Now, let's see how we can implement stemming and we'll see what are the limitations of stemming and how we can overcome them.", 'start': 1171.196, 'duration': 6.264}, {'end': 1180.336, 'text': 'Now there are quite a few different types of stemmer.', 'start': 1178.294, 'duration': 2.042}, {'end': 1182.377, 'text': "So let's start with the Porter stemmer.", 'start': 1180.716, 'duration': 1.661}, {'end': 1190.143, 'text': "So for that we are going to use from nltk.stem import PorterStemmer, and let's see what this gives us.", 'start': 1182.778, 'duration': 7.365}], 'summary': 'Exploring limitations of indiscriminate cutting and implementing porter stemmer from nltk.', 'duration': 29.713, 'max_score': 1160.43, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1160430.jpg'}, {'end': 1262.76, 'src': 'embed', 'start': 1205.276, 'weight': 0, 'content': [{'end': 1212.638, 'text': 'So for words in words to stem, print words, and we are using the Porter stemmer, which is the pst.stem method. As you can see,', 'start': 1205.276, 'duration': 7.362}, {'end': 1216.92, 'text': 'it has given us the output give, give, given and gave.', 'start': 1212.638, 'duration': 4.282}, {'end': 1221.601, 'text': 'So you can see the stemmer removed only the ing and replaced it with e.', 'start': 1216.92, 'duration': 4.681}, {'end': 1224.622, 'text': 'Now there is another stemmer, which is known as the Lancaster stemmer.', 'start': 
1221.601, 'duration': 3.021}, {'end': 1229.504, 'text': "So let's try to stem the same thing using the Lancaster stemmer and see what is the difference.", 'start': 1225.022, 'duration': 4.482}, {'end': 1236.198, 'text': "So let's stem the same thing using the Lancaster stemmer and see what are the differences we have here.", 'start': 1230.495, 'duration': 5.703}, {'end': 1244.203, 'text': 'So, first of all, we are going to import the Lancaster stemmer and we are going to define lst, which is the Lancaster stemmer instance,', 'start': 1236.679, 'duration': 7.524}, {'end': 1248.706, 'text': 'and in the similar manner that we did for the Porter stemmer, let us execute the Lancaster stemmer also.', 'start': 1244.203, 'duration': 4.503}, {'end': 1251.867, 'text': 'So as you can see here the stemmer has stemmed all the words.', 'start': 1249.246, 'duration': 2.621}, {'end': 1258.071, 'text': 'As a result of it you can conclude that the Lancaster stemmer is more aggressive than the Porter stemmer.', 'start': 1252.188, 'duration': 5.883}, {'end': 1262.76, 'text': 'Now the use of each of the stemmers depends on the type of the task you want to perform.', 'start': 1258.479, 'duration': 4.281}], 'summary': 'Comparing porter and lancaster stemmers, lancaster is more aggressive, producing different stems for the same words.', 'duration': 57.484, 'max_score': 1205.276, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1205276.jpg'}], 'start': 907.181, 'title': 'Text analysis, tokenization, and stemming techniques', 'summary': 'Covers the use of functions like frequency distribution and tokenization to analyze a paragraph, resulting in 121 distinct tokens from 273 tokens and identifying the top 10 most recurring words. 
it also discusses tokenization, including bigrams, trigrams, and ngrams, and the implementation of stemming techniques like porter, lancaster, and snowball stemmers, their differences, and use cases.', 'chapters': [{'end': 1004.71, 'start': 907.181, 'title': 'Text analysis and tokenization', 'summary': 'Covers the use of functions like frequency distribution and tokenization to analyze a given paragraph, resulting in 121 distinct tokens from 273 tokens, and identifying the top 10 most recurring words.', 'duration': 97.529, 'highlights': ['Using the FreqDist function, the paragraph was analyzed to yield 121 distinct tokens from the initial 273 tokens, showcasing the effectiveness of the function.', "Identifying the top 10 most recurring words, it was found that the word 'comma' appeared 30 times, while 'is' appeared five times, providing valuable insight into the frequency of specific words in the paragraph.", 'The process of tokenizing the paragraph using the blank line tokenizer resulted in 9 tokens, demonstrating the successful application of the tokenizer to segment the paragraph.']}, {'end': 1341.611, 'start': 1004.71, 'title': 'Tokenization and stemming techniques', 'summary': 'Discusses tokenization, including bigrams, trigrams, and ngrams, using nltk libraries, and the implementation of stemming techniques like porter, lancaster, and snowball stemmers, their differences, and use cases.', 'duration': 336.901, 'highlights': ['The chapter discusses tokenization, including bigrams, trigrams, and ngrams, using NLTK libraries', 'The implementation of stemming techniques like Porter, Lancaster, and Snowball stemmers, their differences, and use cases', 'Explaining the concept of stemming and its limitations', 'Comparison of different stemmers and their aggressiveness', 'Introduction to lemmatization and its differences from stemming']}], 'duration': 434.43, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM907181.jpg', 
'highlights': ['Using the FreqDist function, the paragraph was analyzed to yield 121 distinct tokens from the initial 273 tokens, showcasing the effectiveness of the function.', "Identifying the top 10 most recurring words, it was found that the word 'comma' appeared 30 times, while 'is' appeared five times, providing valuable insight into the frequency of specific words in the paragraph.", 'The process of tokenizing the paragraph using the blank line tokenizer resulted in 9 tokens, demonstrating the successful application of the tokenizer to segment the paragraph.', 'The chapter discusses tokenization, including bigrams, trigrams, and ngrams, using NLTK libraries', 'The implementation of stemming techniques like Porter, Lancaster, and Snowball stemmers, their differences, and use cases', 'Explaining the concept of stemming and its limitations', 'Comparison of different stemmers and their aggressiveness', 'Introduction to lemmatization and its differences from stemming']}, {'end': 1805.139, 'segs': [{'end': 1383.43, 'src': 'embed', 'start': 1358.637, 'weight': 2, 'content': [{'end': 1365.803, 'text': 'So for example, if you look back to our output given by the Lancaster stemmer, you can see it has given us the output giv, which is not a word.', 'start': 1358.637, 'duration': 7.166}, {'end': 1368.425, 'text': 'So the output of lemmatization is a proper word.', 'start': 1366.243, 'duration': 2.182}, {'end': 1376.032, 'text': 'And for example, if you take a lemmatization of gone, going, it all goes into the word go.', 'start': 1369.246, 'duration': 6.786}, {'end': 1379.868, 'text': 'Now again, we can see how it works with the same example of words.', 'start': 1376.587, 'duration': 3.281}, {'end': 1383.43, 'text': "Let's try lemmatization using the NLTK library.", 'start': 1380.529, 'duration': 2.901}], 'summary': "The lancaster stemmer output 'giv' is not a word, but the output of 'lemmatization' is a proper word, as seen in the nltk library example.", 'duration': 24.793, 
'max_score': 1358.637, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1358637.jpg'}, {'end': 1477.065, 'src': 'embed', 'start': 1425.739, 'weight': 0, 'content': [{'end': 1432.124, 'text': "This is because we haven't assigned any POS tags and hence it has assumed all the words to be nouns.", 'start': 1425.739, 'duration': 6.385}, {'end': 1440.546, 'text': "Now, we'll learn about POS later in this video, but just to give you a hint of what POS is, POS is basically parts of speech.", 'start': 1432.782, 'duration': 7.764}, {'end': 1448.709, 'text': 'So as to define which word is a noun, which is a pronoun, and which is a subject, and much more.', 'start': 1441.326, 'duration': 7.383}, {'end': 1450.31, 'text': 'Now, do you know', 'start': 1449.43, 'duration': 0.88}, {'end': 1457.453, 'text': 'there are several words in the English language, such as I, at, for, begin, gone,', 'start': 1450.31, 'duration': 7.143}, {'end': 1465.862, 'text': "no, where, is, which are thought of as useful in the formation of sentences, and without them the sentences won't even make sense.", 'start': 1457.453, 'duration': 8.409}, {'end': 1469.503, 'text': 'But these do not provide any help in NLP.', 'start': 1466.762, 'duration': 2.741}, {'end': 1473.104, 'text': 'So these lists of words are known as stop words.', 'start': 1470.223, 'duration': 2.881}, {'end': 1477.065, 'text': 'So you might be confused as to whether they are helpful or not.', 'start': 1473.864, 'duration': 3.201}], 'summary': "Pos tags define word types; stop words don't aid nlp.", 'duration': 51.326, 'max_score': 1425.739, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1425739.jpg'}, {'end': 1585.786, 'src': 'heatmap', 'start': 1529.775, 'weight': 0.785, 'content': [{'end': 1537.561, 'text': 'They are like special digits and characters which do not add any value to the language processing and hence they can be removed.', 'start': 
1529.775, 'duration': 7.786}, {'end': 1545.887, 'text': 'First of all, we will use compile from the re module to create a pattern that matches any digit or special character.', 'start': 1538.922, 'duration': 6.965}, {'end': 1551.011, 'text': "Now we'll create an empty list and append the words without any punctuation into the list.", 'start': 1546.468, 'duration': 4.543}, {'end': 1553.953, 'text': "So I'm naming this as post punctuation.", 'start': 1551.812, 'duration': 2.141}, {'end': 1558.616, 'text': 'And if you have a look at the output of the post punctuation,', 'start': 1554.573, 'duration': 4.043}, {'end': 1566.321, 'text': 'so as you can see it has removed all the various numbers and digits and the comma and the different elements.', 'start': 1558.616, 'duration': 7.705}, {'end': 1578.021, 'text': 'Now, when I was talking about POS, which is parts of speech, it is, generally speaking, the grammatical type of the word: the verb, the noun, adjective,', 'start': 1567.482, 'duration': 10.539}, {'end': 1579.362, 'text': 'adverb and article.', 'start': 1578.021, 'duration': 1.341}, {'end': 1585.786, 'text': 'Now it indicates how a word functions in meaning as well as grammatically within the sentence.', 'start': 1579.862, 'duration': 5.924}], 'summary': 'Removing non-value adding characters, creating word list, understanding parts of speech.', 'duration': 56.011, 'max_score': 1529.775, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1529775.jpg'}, {'end': 1669.067, 'src': 'embed', 'start': 1641.923, 'weight': 4, 'content': [{'end': 1645.306, 'text': 'So the first sentence is the waiter cleared the plates from the table.', 'start': 1641.923, 'duration': 3.383}, {'end': 1652.151, 'text': 'So, as you can see, from the start, if we take the word the into consideration, it is a determiner.', 'start': 1645.826, 'duration': 6.325}, {'end': 1654.273, 'text': 'Now, waiter is a noun, cleared', 'start': 
1652.151, 'duration': 2.122}, {'end': 1656.615, 'text': 'is a verb, the is again a determiner.', 'start': 1654.273, 'duration': 2.342}, {'end': 1660.3, 'text': 'Plates is a noun; from is not defined here.', 'start': 1656.977, 'duration': 3.323}, {'end': 1663.202, 'text': 'The is again a determiner and table is again a noun.', 'start': 1660.56, 'duration': 2.642}, {'end': 1669.067, 'text': 'Now again, if we take into consideration the sentence, the dog ate the cat.', 'start': 1663.842, 'duration': 5.225}], 'summary': 'Analyzing grammatical structures and parts of speech.', 'duration': 27.144, 'max_score': 1641.923, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1641923.jpg'}], 'start': 1342.771, 'title': 'Nlp text processing with nltk', 'summary': 'Covers nlp text processing using nltk, including lemmatization, stop word removal, part-of-speech tagging, and named entity recognition, emphasizing the importance of pos tagging and highlighting challenges in nlp.', 'chapters': [{'end': 1383.43, 'start': 1342.771, 'title': 'Lemmatization vs stemming', 'summary': 'Explains the concept of lemmatization, highlighting how it groups different word forms into a common root, ensuring the output is a proper word and providing a comparison with stemming, showcasing the advantages of lemmatization over stemming.', 'duration': 40.659, 'highlights': ['Lemmatization groups different word forms into a common root, ensuring the output is a proper word.', "It is similar to stemming but maps several words into one common root, as exemplified by the transformation of 'gone' and 'going' to the word 'go'.", "Comparison with stemming is made by highlighting the disadvantage of stemming, exemplified by the output 'giv' which is not a word."]}, {'end': 1805.139, 'start': 1384.431, 'title': 'Nlp text processing with nltk', 'summary': 'Covers nlp text processing using nltk, including lemmatization, stop word removal, 
part-of-speech tagging, and named entity recognition, with emphasis on the importance of pos tagging and the challenges in nlp.', 'duration': 420.708, 'highlights': ['The chapter covers NLP text processing using NLTK, including lemmatization, stop word removal, part-of-speech tagging, and named entity recognition', 'The importance of POS tagging and the challenges in NLP', 'The English stop word list contains 179 words', 'The shortcomings of POS taggers when tagging words', 'The process of named entity recognition involves three phases: noun phrase identification, phrase classification, and entity disambiguation']}], 'duration': 462.368, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1342771.jpg', 'highlights': ['The process of named entity recognition involves three phases: noun phrase identification, phrase classification, and entity disambiguation', 'The English stop word list contains 179 words', 'Lemmatization groups different word forms into a common root, ensuring the output is a proper word', "It is similar to stemming but maps several words into one common root, as exemplified by the transformation of 'gone' and 'going' to the word 'go'", "Comparison with stemming is made by highlighting the disadvantage of stemming, exemplified by the output 'giv' which is not a word", 'The chapter covers NLP text processing using NLTK, including lemmatization, stop word removal, part-of-speech tagging, and named entity recognition', 'The importance of POS tagging and the challenges in NLP', 'The shortcomings of POS taggers when tagging words']}, {'end': 2289.345, 'segs': [{'end': 1857.895, 'src': 'embed', 'start': 1832.28, 'weight': 1, 'content': [{'end': 1842.006, 'text': 'This is an additional layer on top of the POS tagging so as to clarify and give us more depth into what the sentence is about and what the sentence is conveying to us.', 'start': 1832.28, 'duration': 
9.726}, {'end': 1849.27, 'text': 'So for using NER in Python you need to import ne_chunk from the NLTK module.', 'start': 1842.606, 'duration': 6.664}, {'end': 1857.895, 'text': 'So once we have imported ne_chunk, let us take this sentence into consideration, which is the US president stays in the White House.', 'start': 1850.01, 'duration': 7.885}], 'summary': "Additional layer on top of pos tagging adds depth to sentence understanding. import ne_chunk from the nltk module in python for usage. example: 'us president stays in white house.'", 'duration': 25.615, 'max_score': 1832.28, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1832280.jpg'}, {'end': 1914.01, 'src': 'embed', 'start': 1892.706, 'weight': 5, 'content': [{'end': 1902.074, 'text': 'adding a layer of dictionary and then adding the tags and then creating the named entity recognition makes the understanding of language so much easier.', 'start': 1892.706, 'duration': 9.368}, {'end': 1910.12, 'text': 'Now, as you can see in the NER entity list, we have geo-socio-political group, geopolitical entity,', 'start': 1903.935, 'duration': 6.185}, {'end': 1914.01, 'text': 'we have facility, location, organization and person.', 'start': 1910.12, 'duration': 3.89}], 'summary': 'Enhancing language understanding with ner: geo-socio-political group, geopolitical entity, facility, location, organization, and person.', 'duration': 21.304, 'max_score': 1892.706, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1892706.jpg'}, {'end': 1967.025, 'src': 'embed', 'start': 1942.284, 'weight': 7, 'content': [{'end': 1947.609, 'text': 'So we have certain rules as to what part of a sentence should come at what position.', 'start': 1942.284, 'duration': 5.325}, {'end': 1953.614, 'text': 'Now with these rules we create a syntax tree whenever there is a sentence as 
an input.', 'start': 1948.189, 'duration': 5.425}, {'end': 1961.42, 'text': 'So a syntax tree, in simple terms, is basically a tree representation of the syntactic structure of sentences or strings.', 'start': 1954.194, 'duration': 7.226}, {'end': 1967.025, 'text': 'Now it is a way of representing the syntax of a programming language as a hierarchical tree-like structure.', 'start': 1961.941, 'duration': 5.084}], 'summary': 'Rules dictate sentence structure for syntax trees, a hierarchical representation of language syntax.', 'duration': 24.741, 'max_score': 1942.284, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1942284.jpg'}, {'end': 2114.509, 'src': 'embed', 'start': 2079.768, 'weight': 0, 'content': [{'end': 2084.411, 'text': 'Now this also helps in the processing of the language.', 'start': 2079.768, 'duration': 4.643}, {'end': 2093.358, 'text': 'So suppose someone is asking, whom did we catch this morning? So the regular response to this question would be we caught the Pink Panther.', 'start': 2085.252, 'duration': 8.106}, {'end': 2098.282, 'text': 'Now the question is asking whom, so the Pink Panther becomes a noun phrase basically.', 'start': 2093.779, 'duration': 4.503}, {'end': 2099.586, 'text': 'So this is something.', 'start': 2098.646, 'duration': 0.94}, {'end': 2103.707, 'text': 'what the chunking does is that it understands the language,', 'start': 2099.586, 'duration': 4.121}, {'end': 2114.509, 'text': 'and once it has understood, it basically picks up the individual pieces of information and groups them into chunks so that it is easier for us to process that data.', 'start': 2103.707, 'duration': 10.802}], 'summary': 'Chunking helps process language by grouping information into chunks for easier processing.', 'duration': 34.741, 'max_score': 2079.768, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM2079768.jpg'}], 'start': 1805.479, 'title': 'Nlp basics and ner with nltk', 'summary': 'Discusses nlp basics, including tokenization and pos tagging, and demonstrates ner using nltk to identify and categorize entities in text. it also emphasizes the application of these techniques for language understanding and processing.', 'chapters': [{'end': 1892.706, 'start': 1805.479, 'title': 'Named entity recognition with nltk', 'summary': 'Discusses the application of named entity recognition (ner) using nltk for identifying and categorizing entities like organizations, locations, and persons in a given text, and demonstrates the process with examples using the nltk module in python.', 'duration': 87.227, 'highlights': ['The process involves using popular knowledge graphs like Google Knowledge Graph, IBM Watson, and Wikipedia for Named Entity Recognition (NER).', "The chapter explains the process of importing 'ne_chunk' from the NLTK module in Python for implementing NER using NLTK.", "It demonstrates the application of NER with examples like 'the Google CEO, Sundar Pichai, introduced a new Pixel at Minnesota Roy Center event' and 'the US president stays in the White House' for identifying entities and their types.", "The output of the NER process includes categorizing entities like 'Google' as an organization, 'Sundar Pichai' as a person, 'Minnesota' as a location, 'Roy Center event' as an organization, and 'U.S.' 
as an organization and 'White House' as a facility."]}, {'end': 2289.345, 'start': 1892.706, 'title': 'Nlp basics and application', 'summary': 'Discusses the basics of nlp, including tokenization, pos tagging, named entity recognition, syntax tree creation, and chunking, emphasizing how these techniques are used to understand and process language.', 'duration': 396.639, 'highlights': ['Named Entity Recognition (NER)', 'Syntax Tree Creation', 'Chunking', 'Tokenization and POS Tagging']}], 'duration': 483.866, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/X2vAabgKiuM/pics/X2vAabgKiuM1805479.jpg', 'highlights': ['The process involves using popular knowledge graphs like Google Knowledge Graph, IBM Watson, and Wikipedia for Named Entity Recognition (NER).', "The chapter explains the process of importing 'ne_chunk' from the NLTK module in Python for implementing NER using NLTK.", "It demonstrates the application of NER with examples like 'the Google CEO, Sundar Pichai, introduced a new Pixel at Minnesota Roy Center event' and 'the US president stays in the White House' for identifying entities and their types.", "The output of the NER process includes categorizing entities like 'Google' as an organization, 'Sundar Pichai' as a person, 'Minnesota' as a location, 'Roy Center event' as an organization, and 'U.S.' 
as an organization and 'White House' as a facility.", 'Named Entity Recognition (NER)', 'Syntax Tree Creation', 'Chunking', 'Tokenization and POS Tagging']}], 'highlights': ['NLP applications include automation, summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.', 'Tokenization is the process of breaking up strings into tokens, and can be implemented using the NLTK library to divide a paragraph into tokens, with an example showing 273 tokens from a given paragraph.', 'The NLTK library provides interfaces to over 50 corpora and lexical resources, along with text processing libraries for classification, tokenization, stemming, tagging, and more.', 'The process of named entity recognition involves three phases: noun phrase identification, phrase classification, and entity disambiguation', 'The English stop word list contains 179 words', 'The process involves using popular knowledge graphs like Google Knowledge Graph, IBM Watson, and Wikipedia for Named Entity Recognition (NER).', 'The agenda includes explanations of the evolution of human language, the definition of NLP, its applications in the industry, and the challenges faced during implementation.', 'Downloading the NLTK data involves selecting all options in the NLTK downloader and clicking on the download button, which will download all the corpora and packages into a chosen directory.', 'Using the FreqDist class, the paragraph was analyzed to yield 121 distinct tokens from the initial 273 tokens.', 'The implementation of stemming techniques like Porter, Lancaster, and Snowball stemmers, their differences, and use cases']}