title
Introduction to Text Analytics with R Part 1 | Overview
description
This data science series introduces the viewer to the exciting world of text analytics with R programming. As exemplified by the popularity of blogging and social media, textual data is far from dead – it is increasing exponentially! Not surprisingly, knowledge of text analytics is a critical skill for data scientists if this wealth of information is to be harvested and incorporated into data products. This data science training provides introductory coverage of the following tools and techniques:
– Tokenization, stemming, and n-grams
– The bag-of-words and vector space models
– Feature engineering for textual data (e.g. cosine similarity between documents)
– Feature extraction using singular value decomposition (SVD)
– Training classification models using textual data
– Evaluating the accuracy of the trained classification models
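Tokenization, the bag-of-words model, and cosine similarity from the list above can be illustrated on a toy corpus. The following is a minimal sketch in base R, not the code used in the series (which, per the video, relies on the quanteda package). The three messages, the `tokenize` and `cosine` helper names, and the simple whitespace tokenizer are all assumptions made for illustration; stemming and n-grams are left out.

```r
# Toy corpus: three short messages, made up for illustration.
docs <- c(
  d1 = "Free prize! Claim your free prize now",
  d2 = "Are we still meeting for lunch today",
  d3 = "Claim your prize today"
)

# Tokenization (hypothetical helper): lower-case the text, strip anything
# that is not a letter or a space, then split on runs of spaces.
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)
  strsplit(trimws(x), " +")[[1]]
}
tokens <- lapply(docs, tokenize)

# Bag-of-words: one row per document, one column per vocabulary term,
# each cell holding how often the term occurs in the document.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Cosine similarity (hypothetical helper) between two term-count vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim_13 <- cosine(dtm["d1", ], dtm["d3", ])  # share "claim", "your", "prize"
sim_12 <- cosine(dtm["d1", ], dtm["d2", ])  # share no terms

print(dtm)
print(c(d1_vs_d3 = sim_13, d1_vs_d2 = sim_12))
```

Documents with no terms in common score exactly zero, which is why cosine similarity is a convenient engineered feature in the vector space model.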
This overview video introduces text analytics as a whole and sets expectations for the rest of the series. It also includes specific coverage of:
– Overview of the spam dataset used throughout the series
– Loading the data and initial data cleaning
– Some initial data analysis, feature engineering, and data visualization
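The loading, cleaning, and first feature-engineering steps listed above might look roughly like the sketch below. It is written under stated assumptions rather than copied from the series: the inline `csv_text` is a made-up stand-in for the Kaggle `spam.csv` file (uninformative column names plus three mostly empty trailing columns, as described in the video), and the names `spam_raw`, `spam`, `Label`, `Text`, and `TextLength` are illustrative choices.

```r
# Made-up stand-in for spam.csv so the example runs without the download.
# With the real file, the call would instead be something like
# read.csv("spam.csv", stringsAsFactors = FALSE).
csv_text <- 'v1,v2,,,
ham,"Ok, see you at lunch",,,
spam,"WINNER! Claim your free prize now, text WIN to 80000",,,
ham,Running late,,,
'

# stringsAsFactors = FALSE keeps the messages as raw character strings
# instead of converting them to categorical (factor) variables.
spam_raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Keep only the label and text columns and give them descriptive names.
spam <- spam_raw[, 1:2]
names(spam) <- c("Label", "Text")

# Find missing data: complete.cases() is TRUE for rows without any NA.
n_incomplete <- sum(!complete.cases(spam))

# Explore the data: share of ham versus spam messages.
class_share <- prop.table(table(spam$Label))

# Feature engineering: message length in characters.
spam$TextLength <- nchar(spam$Text)

print(n_incomplete)
print(class_share)
print(spam[, c("Label", "TextLength")])
```

Selecting columns by position sidesteps whatever names R assigns to the unnamed trailing columns.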
Kaggle Dataset:
https://www.kaggle.com/uciml/sms-spam-collection-dataset
The data and R code used in this series are available here:
https://code.datasciencedojo.com/datasciencedojo/tutorials/tree/master/Introduction%20to%20Text%20Analytics%20with%20R
Table of Contents:
0:00 Introduction
11:06 Packages
13:21 Read CSV
17:04 Find missing data
19:05 Explore the data
23:14 Text length
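For the "Packages" step, the video names ggplot2, e1071, caret, quanteda, irlba, and random forest (the CRAN package is randomForest). The sketch below is one cautious way to check which of them are already installed before downloading anything; the variable names are illustrative, and the `install.packages()` call is left commented out so the snippet has no side effects.

```r
# Packages named in the video.
pkgs <- c("ggplot2", "e1071", "caret", "quanteda", "irlba", "randomForest")

# Which of them are not yet in the local library?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to download from CRAN
} else {
  message("All packages for the series are already installed.")
}
```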
#textanalytics #rprogramming
detail
{'title': 'Introduction to Text Analytics with R Part 1 | Overview', 'heatmap': [{'end': 846.274, 'start': 824.623, 'weight': 0.713}, {'end': 929.221, 'start': 858.434, 'weight': 0.703}, {'end': 1160.436, 'start': 1136.032, 'weight': 0.741}, {'end': 1692.243, 'start': 1650.945, 'weight': 0.714}], 'summary': 'Series introduces text analytics with r, featuring dave langer with 20 years of experience. it emphasizes the importance of text analytics, expectation setting, github repository, svd, feature extraction, data loading, data exploration in sms classification analysis, and text length analysis for sms spam detection.', 'chapters': [{'end': 103.931, 'segs': [{'end': 103.931, 'src': 'embed', 'start': 54.624, 'weight': 0, 'content': [{'end': 57.365, 'text': "But most importantly, I've been a data professional.", 'start': 54.624, 'duration': 2.741}, {'end': 66.427, 'text': 'So I have an extensive background in business intelligence, in data warehousing and analytics, both traditional analytics,', 'start': 57.645, 'duration': 8.782}, {'end': 68.168, 'text': 'descriptive analytics as well as predictive.', 'start': 66.427, 'duration': 1.741}, {'end': 76.083, 'text': 'My last job before joining Data Science Dojo was as a senior director of BI and analytics at Microsoft.', 'start': 69.161, 'duration': 6.922}, {'end': 78.543, 'text': 'At my last job,', 'start': 77.263, 'duration': 1.28}, {'end': 88.286, 'text': "I had a team of technical program managers and we owned all of the data platforms that ran Microsoft's $10 billion plus supply chain operation.", 'start': 78.543, 'duration': 9.743}, {'end': 97.769, 'text': 'That included everything from BI to data warehousing, to analytics anything you know cubes, OLAP,', 'start': 89.306, 'duration': 8.463}, {'end': 103.931, 'text': 'all the way up to R and Spark running on a Hadoop cluster in the cloud and everything in between.', 'start': 97.769, 'duration': 6.162}], 'summary': "Experienced data professional with background in 
bi and analytics. led team managing microsoft's $10 billion+ supply chain data platforms.", 'duration': 49.307, 'max_score': 54.624, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw54624.jpg'}], 'start': 4.98, 'title': 'Introduction to text analytics with r', 'summary': "Introduces the host, dave langer, who has over 20 years of experience in technology and data science, including leading microsoft's $10 billion plus supply chain operation.", 'chapters': [{'end': 103.931, 'start': 4.98, 'title': 'Introduction to text analytics with r', 'summary': "Introduces the host, dave langer, who has over 20 years of experience in technology and data science, including leading microsoft's $10 billion plus supply chain operation.", 'duration': 98.951, 'highlights': ["Dave Langer has over 20 years of experience in technology and data science, including roles as a senior director of BI and analytics at Microsoft, where he led a team managing Microsoft's $10 billion plus supply chain operation.", 'Langer has extensive background in business intelligence, data warehousing, and analytics, covering traditional and predictive analytics, as well as managing data platforms including BI, data warehousing, and analytics at Microsoft.']}], 'duration': 98.951, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw4980.jpg', 'highlights': ["Dave Langer has over 20 years of experience in technology and data science, including leading Microsoft's $10 billion plus supply chain operation.", 'Langer has extensive background in business intelligence, data warehousing, and analytics, covering traditional and predictive analytics.']}, {'end': 261.998, 'segs': [{'end': 159.349, 'src': 'embed', 'start': 128.316, 'weight': 0, 'content': [{'end': 130.338, 'text': "And we'll see that as we go through this video series.", 'start': 128.316, 'duration': 2.022}, {'end': 135.603, 'text': "So I joined 
Data Science Dojo to realize the company's vision.", 'start': 132.202, 'duration': 3.401}, {'end': 140.344, 'text': 'The mission statement of Data Science Dojo is data science for everyone.', 'start': 135.703, 'duration': 4.641}, {'end': 141.464, 'text': 'We really do believe that.', 'start': 140.484, 'duration': 0.98}, {'end': 145.845, 'text': 'We believe that you do not need a PhD in statistics.', 'start': 142.184, 'duration': 3.661}, {'end': 153.587, 'text': 'you do not need a PhD in machine learning to learn data science tools and techniques and apply them in your daily work.', 'start': 145.845, 'duration': 7.742}, {'end': 155.127, 'text': 'and derive business value.', 'start': 154.247, 'duration': 0.88}, {'end': 159.349, 'text': 'We absolutely believe that, and this video series is proof positive of that.', 'start': 155.388, 'duration': 3.961}], 'summary': 'Data science dojo promotes data science for everyone, without needing a phd, to derive business value.', 'duration': 31.033, 'max_score': 128.316, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw128316.jpg'}, {'end': 216.087, 'src': 'embed', 'start': 176.96, 'weight': 1, 'content': [{'end': 180.422, 'text': 'First up, there is going to be a GitHub repository for this tutorial.', 'start': 176.96, 'duration': 3.462}, {'end': 188.225, 'text': "You will be able to get all of the code that you see demonstrated throughout this video series from this GitHub repo, and here's the URL for it.", 'start': 180.942, 'duration': 7.283}, {'end': 194.066, 'text': "Next up, if you like what we're doing, if you like what we're doing in this video series,", 'start': 190.262, 'duration': 3.804}, {'end': 200.313, 'text': "if you like what we're doing on some of the other things that we do on our YouTube channel, please follow us on social media.", 'start': 194.066, 'duration': 6.247}, {'end': 204.377, 'text': 'We actually provide a wealth of content not only our own,', 
'start': 200.613, 'duration': 3.764}, {'end': 210.985, 'text': 'but also hand-curated goodness of data science from a variety of sources across the internet.', 'start': 204.377, 'duration': 6.608}, {'end': 216.087, 'text': 'And we have a presence on all of the social media channels that you would expect.', 'start': 211.665, 'duration': 4.422}], 'summary': 'Github repo for tutorial code. follow us on social media for curated data science content.', 'duration': 39.127, 'max_score': 176.96, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw176960.jpg'}, {'end': 261.998, 'src': 'embed', 'start': 232.814, 'weight': 3, 'content': [{'end': 239.84, 'text': "So, if you like, please subscribe to the YouTube channel and you'll get not only the stuff that I'm doing in this particular video series,", 'start': 232.814, 'duration': 7.026}, {'end': 243.583, 'text': "but you'll get updated when we produce more content and more tutorials as well.", 'start': 239.84, 'duration': 3.743}, {'end': 247.645, 'text': "And lastly, don't forget about our five-day boot camp.", 'start': 245.083, 'duration': 2.562}, {'end': 254.572, 'text': 'If you like what you see here, if you like how I teach, if you like the kind of information that you get, the kind of skills that you build,', 'start': 248.126, 'duration': 6.446}, {'end': 258.394, 'text': 'we do teach a five-day intensive boot camp at Data Science Dojo.', 'start': 254.572, 'duration': 3.822}, {'end': 261.998, 'text': 'And of course, you can find more about that at datasciencedojo.com.', 'start': 258.535, 'duration': 3.463}], 'summary': 'Subscribe to the youtube channel for updates and consider enrolling in the five-day boot camp at data science dojo.', 'duration': 29.184, 'max_score': 232.814, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw232814.jpg'}], 'start': 104.977, 'title': 'Introduction to text analytics', 'summary': 
"Introduces the importance of text analytics as a tool for data professionals, aligning with data science dojo's mission to democratize data science and offering resources for professional development.", 'chapters': [{'end': 261.998, 'start': 104.977, 'title': 'Introduction to text analytics', 'summary': 'Emphasizes the importance of text analytics as a valuable tool for data professionals, promoting the mission of data science dojo to make data science accessible to everyone, and providing resources for further learning and professional development.', 'duration': 157.021, 'highlights': ["Data Science Dojo's mission is to make data science accessible to everyone, without the need for a PhD in statistics or machine learning, as demonstrated in the video series. (relevance: 5)", 'The tutorial provides a GitHub repository for accessing all the demonstrated code throughout the series. (relevance: 4)', 'Data Science Dojo offers a wealth of content on social media and encourages following on various platforms for updates and additional tutorials. (relevance: 3)', "The chapter encourages subscribing to Data Science Dojo's YouTube channel for more content, including tutorials and updates. (relevance: 2)", 'Data Science Dojo also offers a five-day intensive boot camp for those interested in further learning and skill-building. (relevance: 1)']}], 'duration': 157.021, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw104977.jpg', 'highlights': ["Data Science Dojo's mission is to make data science accessible to everyone, without the need for a PhD in statistics or machine learning, as demonstrated in the video series. (relevance: 5)", 'The tutorial provides a GitHub repository for accessing all the demonstrated code throughout the series. (relevance: 4)', 'Data Science Dojo offers a wealth of content on social media and encourages following on various platforms for updates and additional tutorials. 
(relevance: 3)', "The chapter encourages subscribing to Data Science Dojo's YouTube channel for more content, including tutorials and updates. (relevance: 2)", 'Data Science Dojo also offers a five-day intensive boot camp for those interested in further learning and skill-building. (relevance: 1)']}, {'end': 487.07, 'segs': [{'end': 352.036, 'src': 'embed', 'start': 285.683, 'weight': 1, 'content': [{'end': 291.165, 'text': 'Because we will be using, while not necessarily super advanced aspects of R programming.', 'start': 285.683, 'duration': 5.482}, {'end': 297.828, 'text': "we will be doing some things like we'll be creating our own functions and we'll be using the apply function using those functions.", 'start': 291.165, 'duration': 6.663}, {'end': 299.109, 'text': 'That sort of thing.', 'start': 298.468, 'duration': 0.641}, {'end': 302.992, 'text': 'So I will make a certain amount of assumption around your R skill in general.', 'start': 299.149, 'duration': 3.843}, {'end': 309.818, 'text': 'The focus of this tutorial is not R programming, but how you do text analytics with R programming.', 'start': 303.432, 'duration': 6.386}, {'end': 311.86, 'text': 'So I will make some assumptions about your R skills.', 'start': 309.858, 'duration': 2.002}, {'end': 319.266, 'text': 'If you need to brush up on your R skills, we have some YouTube assets and tutorials on our YouTube channel that can help you with that.', 'start': 312.981, 'duration': 6.285}, {'end': 324.632, 'text': "Okay, you'll need to have some math background.", 'start': 321.528, 'duration': 3.104}, {'end': 335.425, 'text': 'So, to be clear, the focus of this particular tutorial series will be most definitely practice and not theory, but we will have some math.', 'start': 325.333, 'duration': 10.092}, {'end': 337.207, 'text': "Unfortunately, that's inevitable.", 'start': 336.166, 'duration': 1.041}, {'end': 343.692, 'text': 'We will have to have some math coverage of some kind Otherwise, for example, 
TF-IDF.', 'start': 337.247, 'duration': 6.445}, {'end': 350.095, 'text': 'We need to explain the mathematics behind it because that will also explain the benefit that you get for using it.', 'start': 344.012, 'duration': 6.083}, {'end': 352.036, 'text': 'TF-IDF is mighty.', 'start': 350.915, 'duration': 1.121}], 'summary': 'The tutorial focuses on text analytics using r programming with a primary emphasis on practical application and assumes a certain level of r skill and math background.', 'duration': 66.353, 'max_score': 285.683, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw285683.jpg'}, {'end': 399.003, 'src': 'embed', 'start': 375.704, 'weight': 0, 'content': [{'end': 383.789, 'text': 'Text analytics, text mining, natural language processing, they all kind of fall under this same banner, the same umbrella.', 'start': 375.704, 'duration': 8.085}, {'end': 387.192, 'text': "So there's lots and lots to learn, lots and lots of goodness to be had.", 'start': 383.869, 'duration': 3.323}, {'end': 389.593, 'text': 'But this is going to be very much an introduction.', 'start': 387.612, 'duration': 1.981}, {'end': 399.003, 'text': 'In particular, the focus of this series is going to be on the creation of predictive classification models over textual data.', 'start': 390.914, 'duration': 8.089}], 'summary': 'Introduction to text analytics, text mining, and nlp with a focus on creating predictive classification models over textual data.', 'duration': 23.299, 'max_score': 375.704, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw375704.jpg'}, {'end': 487.07, 'src': 'embed', 'start': 421.338, 'weight': 2, 'content': [{'end': 431.962, 'text': 'we will also be focusing on the 20% of the tools and techniques that you use to build predictive classification models over textual data that can be used in 80% of your projects,', 'start': 421.338, 'duration': 10.624}, 
{'end': 434.143, 'text': "right?. So we're really trying to do the 80-20 rule here.", 'start': 431.962, 'duration': 2.181}, {'end': 436.444, 'text': 'Introduce you to the 20%.', 'start': 435.244, 'duration': 1.2}, {'end': 442.787, 'text': "that's 80% useful, but note that this is very much an introduction and it is a broad, broad topic and there's a lot to learn.", 'start': 436.444, 'duration': 6.343}, {'end': 450.086, 'text': "And lastly, since this is an introduction, we're going to use decision tree ensembles.", 'start': 445.14, 'duration': 4.946}, {'end': 456.054, 'text': 'The predictive models that we were going to build will be ensembles of decision trees.', 'start': 451.268, 'duration': 4.786}, {'end': 458.977, 'text': "And in particular, we'll use the mighty random forest algorithm.", 'start': 456.094, 'duration': 2.883}, {'end': 461.941, 'text': "And the reason for that is that it's relatively simple.", 'start': 459.558, 'duration': 2.383}, {'end': 467.043, 'text': 'There are other models that can be used support vector machines, for example,', 'start': 463.062, 'duration': 3.981}, {'end': 473.606, 'text': "or boosted decision trees but we're going to focus mainly on the mighty random forest, because it is a relatively simple model,", 'start': 467.043, 'duration': 6.563}, {'end': 475.246, 'text': 'but it is also relatively powerful.', 'start': 473.606, 'duration': 1.64}, {'end': 478.667, 'text': "Once again, it's an introduction.", 'start': 476.847, 'duration': 1.82}, {'end': 480.928, 'text': "What we'll teach is broadly useful.", 'start': 479.527, 'duration': 1.401}, {'end': 487.07, 'text': "You can always do more research on your own at the end of this video series if you're interested in using support vector machines, for example.", 'start': 481.768, 'duration': 5.302}], 'summary': 'Introduction to building predictive classification models over textual data using the 80-20 rule, with a focus on decision tree ensembles, particularly the mighty random 
forest algorithm.', 'duration': 65.732, 'max_score': 421.338, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw421338.jpg'}], 'start': 264.611, 'title': 'Text analytics and expectation setting', 'summary': 'Introduces text analytics with r programming, focusing on expectation setting for viewers with coding experience and a math background, and emphasizing the creation of predictive classification models over textual data. it also emphasizes the 80-20 rule in using tools and techniques for projects.', 'chapters': [{'end': 352.036, 'start': 264.611, 'title': 'Text analytics with r: expectation setting and focus', 'summary': 'Emphasizes the expectation setting for viewers with coding experience in r and a math background, and focuses on text analytics with r programming, including creating functions and using the apply function.', 'duration': 87.425, 'highlights': ['The tutorial assumes viewers have coding experience with R, including creating functions and using the apply function, but not necessarily super advanced aspects of R programming.', 'The focus is on text analytics with R programming and assumes viewers have some math background, with the tutorial series emphasizing practice over theory.', 'The chapter provides resources for viewers to brush up on their R skills through YouTube assets and tutorials on the channel.', 'Explanation of the mathematics behind TF-IDF is included to elucidate its benefits in text analytics with R.']}, {'end': 487.07, 'start': 352.296, 'title': 'Introduction to text analytics', 'summary': 'Introduces the broad and deep domain of text analytics, focusing on the creation of predictive classification models over textual data, and emphasizes on the 80-20 rule in using tools and techniques for projects.', 'duration': 134.774, 'highlights': ['The chapter emphasizes on the broad and deep domain of text analytics, focusing on the creation of predictive classification models over 
textual data.', 'The chapter highlights the 80-20 rule, emphasizing on the 20% of tools and techniques that can be used in 80% of projects for building predictive classification models over textual data.', 'The introduction focuses on the use of decision tree ensembles, particularly the mighty random forest algorithm, for building predictive models.', 'The chapter mentions the potential use of other models like support vector machines and boosted decision trees, but mainly focuses on the mighty random forest for its simplicity and power.']}], 'duration': 222.459, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw264611.jpg', 'highlights': ['The chapter emphasizes the creation of predictive classification models over textual data.', 'The tutorial assumes viewers have coding experience with R, including creating functions and using the apply function.', 'The chapter highlights the 80-20 rule, emphasizing on the 20% of tools and techniques that can be used in 80% of projects for building predictive classification models over textual data.', 'The tutorial assumes viewers have some math background, with the tutorial series emphasizing practice over theory.', 'The chapter provides resources for viewers to brush up on their R skills through YouTube assets and tutorials on the channel.', 'The introduction focuses on the use of decision tree ensembles, particularly the mighty random forest algorithm, for building predictive models.', 'Explanation of the mathematics behind TF-IDF is included to elucidate its benefits in text analytics with R.', 'The chapter mentions the potential use of other models like support vector machines and boosted decision trees, but mainly focuses on the mighty random forest for its simplicity and power.']}, {'end': 744.946, 'segs': [{'end': 521.538, 'src': 'embed', 'start': 488.482, 'weight': 0, 'content': [{'end': 491.827, 'text': "As I mentioned earlier, there's going to be a GitHub for this 
video series.", 'start': 488.482, 'duration': 3.345}, {'end': 494.892, 'text': "And I wanted just to show it to everyone, just to make sure they're familiar with it.", 'start': 492.107, 'duration': 2.785}, {'end': 495.993, 'text': "So here's the URL.", 'start': 494.992, 'duration': 1.001}, {'end': 498.978, 'text': 'This is the same URL that was in the PowerPoint that we just saw.', 'start': 496.133, 'duration': 2.845}, {'end': 503.339, 'text': "Now, as you can see right now, there's nothing in it.", 'start': 500.276, 'duration': 3.063}, {'end': 510.827, 'text': "That is because I haven't uploaded the content yet, but by the time you see this video, this will be flushed out.", 'start': 504.4, 'duration': 6.427}, {'end': 517.534, 'text': 'You will see a nice introductory README, and there will be an R file corresponding to this first video in the series as well.', 'start': 510.887, 'duration': 6.647}, {'end': 521.538, 'text': "So you can get all of the code that you'll see in this video series from this GitHub.", 'start': 517.573, 'duration': 3.965}], 'summary': 'Github for video series will contain introductory readme and r file with all code.', 'duration': 33.056, 'max_score': 488.482, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw488482.jpg'}, {'end': 627.389, 'src': 'embed', 'start': 547.613, 'weight': 3, 'content': [{'end': 559.421, 'text': "The reason why we're not using the UCIML datasets directly is because the folks that created this version of the dataset on Kaggle did some data munging for us and created a nice CSV file.", 'start': 547.613, 'duration': 11.808}, {'end': 562.423, 'text': 'So we want to use that because it makes our lives easier.', 'start': 560.541, 'duration': 1.882}, {'end': 571.202, 'text': 'Now, in particular, this is a classic, classic scenario for text analytics with classification models.', 'start': 563.536, 'duration': 7.666}, {'end': 574.144, 'text': 'In particular, this is 
spam.', 'start': 571.442, 'duration': 2.702}, {'end': 585.773, 'text': "We're going to have our data set be composed of SMS messages, texts of various kinds, and we have a label to say whether they're ham,", 'start': 574.945, 'duration': 10.828}, {'end': 589.256, 'text': "they're legit or spam, they're bad.", 'start': 585.773, 'duration': 3.483}, {'end': 592.118, 'text': 'Okay, ham versus spam.', 'start': 589.876, 'duration': 2.242}, {'end': 601.348, 'text': 'Now I will warn you that this is raw text data of actual people sending texts back and forth.', 'start': 592.118, 'duration': 9.23}, {'end': 606.193, 'text': 'So there is some stuff in here that is not G-rated.', 'start': 601.348, 'duration': 4.845}, {'end': 612.758, 'text': 'There is profanity, there is adult material in here.', 'start': 606.193, 'duration': 6.565}, {'end': 619.803, 'text': "It's nothing too untoward, but just be aware that this is not a G-rated data set by any stretch of the imagination.", 'start': 612.758, 'duration': 7.045}, {'end': 627.389, 'text': "So just wanted to let you know that, because we are working with real-world textual data, it's the best kind to get your feet wet and unfortunately,", 'start': 619.904, 'duration': 7.485}], 'summary': 'Dataset on kaggle munged for text analytics, spam classification with sms messages, includes profanity and adult material.', 'duration': 79.776, 'max_score': 547.613, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw547613.jpg'}, {'end': 754.012, 'src': 'embed', 'start': 723.095, 'weight': 5, 'content': [{'end': 727.157, 'text': "And in particular, we're going to be using ggplot2 for some visualizations.", 'start': 723.095, 'duration': 4.062}, {'end': 729.377, 'text': "We'll be using e1071 via the caret package.", 'start': 728.177, 'duration': 1.2}, {'end': 735.08, 'text': 'caret depends upon e1071, so we need to have it.', 'start': 732.538, 'duration': 
2.542}, {'end': 737.541, 'text': "We'll be using the quanteda package.", 'start': 735.7, 'duration': 1.841}, {'end': 740.803, 'text': 'This will be our main package for actually doing text analytics.', 'start': 737.581, 'duration': 3.222}, {'end': 741.764, 'text': "It's very, very cool.", 'start': 740.883, 'duration': 0.881}, {'end': 744.946, 'text': 'It has a lot of really good functionality for doing text analytics.', 'start': 741.784, 'duration': 3.162}, {'end': 754.012, 'text': "We'll be using IRLBA for doing some singular value decomposition, some SVD, some feature extraction, which is really cool.", 'start': 745.726, 'duration': 8.286}], 'summary': 'Using ggplot2, e1071, caret, quanteda, and irlba for text analytics and visualizations.', 'duration': 30.917, 'max_score': 723.095, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw723095.jpg'}], 'start': 488.482, 'title': 'Github and text analytics', 'summary': 'Covers the introduction of a github repository for a video series with upcoming uploads, and an introduction to a real-world sms dataset sourced from kaggle for text analytics using r packages ggplot2, caret, and quanteda.', 'chapters': [{'end': 521.538, 'start': 488.482, 'title': 'Github for video series', 'summary': 'Introduces a github repository for the video series, providing a url and mentioning the upcoming upload of an introductory readme and an r file corresponding to the first video.', 'duration': 33.056, 'highlights': ['The GitHub repository for the video series will include an introductory README and an R file corresponding to the first video.', 'The provided URL will allow access to all the code demonstrated in this video series.', 'The repository is currently empty but will be populated with content before the video is released.']}, {'end': 744.946, 'start': 523.501, 'title': 'Introduction to text analytics with spam classification', 'summary': "Introduces the dataset for the series, 
sourced from kaggle, based on sms messages labeled as 'ham' or 'spam', and emphasizes the real-world nature of the data, including profanity and adult material, with a focus on using r packages ggplot2, caret, and quanteda for text analytics.", 'duration': 221.445, 'highlights': ["The dataset for the series is sourced from Kaggle and contains SMS messages labeled as 'ham' or 'spam'. The curated dataset from Kaggle, originally from UCIML, includes SMS messages labeled as 'ham' or 'spam', making it suitable for text analytics with classification models.", 'The dataset contains real-world textual data, including profanity and adult material. The dataset contains real-world textual data with profanity and adult material, emphasizing the need to be aware of its content.', 'The R packages ggplot2, caret, and quanteda will be used for visualizations and text analytics. The tutorial series will use R packages ggplot2, caret, and quanteda for visualizations and text analytics, with a possibility of adding more packages later.']}], 'duration': 256.464, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw488482.jpg', 'highlights': ['The provided URL will allow access to all the code demonstrated in this video series.', 'The GitHub repository for the video series will include an introductory README and an R file corresponding to the first video.', 'The repository is currently empty but will be populated with content before the video is released.', "The dataset for the series is sourced from Kaggle and contains SMS messages labeled as 'ham' or 'spam'.", "The curated dataset from Kaggle, originally from UCIML, includes SMS messages labeled as 'ham' or 'spam', making it suitable for text analytics with classification models.", 'The R packages ggplot2, caret, and quanteda will be used for visualizations and text analytics.', 'The dataset contains real-world textual data, including profanity and adult material, emphasizing the 
need to be aware of its content.']}, {'end': 1140.737, 'segs': [{'end': 784.678, 'src': 'embed', 'start': 745.726, 'weight': 0, 'content': [{'end': 754.012, 'text': "We'll be using IRLBA for doing some singular value decomposition, some SVD, some feature extraction, which is really cool.", 'start': 745.726, 'duration': 8.286}, {'end': 756.954, 'text': "And then lastly, we'll be using the Mighty Random Forest.", 'start': 754.532, 'duration': 2.422}, {'end': 758.975, 'text': "It's one of my favorite algorithms.", 'start': 757.794, 'duration': 1.181}, {'end': 763.019, 'text': "It has a lot of power, it has a lot of capability, it's very easy to tune.", 'start': 759.756, 'duration': 3.263}, {'end': 769.745, 'text': "You can get really, really far with Random Forest if you do good feature engineering, and that's where SVD will come in, as we'll talk about later.", 'start': 763.739, 'duration': 6.006}, {'end': 783.417, 'text': 'Okay, so if I run this line of code You can see here that R will reach out across the internet start grabbing these binaries for these packages and start downloading them,', 'start': 769.765, 'duration': 13.652}, {'end': 784.678, 'text': 'installing them on my machine here.', 'start': 783.417, 'duration': 1.261}], 'summary': 'Using irlba for svd feature extraction and the mighty random forest for powerful, easy-to-tune algorithms.', 'duration': 38.952, 'max_score': 745.726, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw745726.jpg'}, {'end': 853.411, 'src': 'heatmap', 'start': 811.161, 'weight': 3, 'content': [{'end': 814.621, 'text': "But let me just quickly demonstrate that for some folks if they're not used to doing this.", 'start': 811.161, 'duration': 3.46}, {'end': 824.503, 'text': 'So if you go to session, set working directory, choose directory, this is how you tell R where the location of your data files are.', 'start': 814.801, 'duration': 9.702}, {'end': 831.805, 'text': 'This 
tells R, the R studio environment, hey, read files from here and write files to here if I so choose.', 'start': 824.623, 'duration': 7.182}, {'end': 840.29, 'text': "So if I do that and you'll get a similar experience on Mac, you'll get a finder window and you can just navigate, using this,", 'start': 832.825, 'duration': 7.465}, {'end': 846.274, 'text': "to the location on disk where you've loaded the spam file, the spam CSV file.", 'start': 840.29, 'duration': 5.984}, {'end': 853.411, 'text': "Okay And as always, let's go ahead and take a look at the read.csv function in the help file.", 'start': 847.135, 'duration': 6.276}], 'summary': 'Demonstrating how to set working directory and read files in r environment.', 'duration': 42.25, 'max_score': 811.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw811161.jpg'}, {'end': 929.221, 'src': 'heatmap', 'start': 858.434, 'weight': 0.703, 'content': [{'end': 869.362, 'text': 'But not surprisingly, read.csv says, okay, R, go out to my file system, read in the CSV file, convert it into a data frame, and store it in memory.', 'start': 858.434, 'duration': 10.928}, {'end': 879.012, 'text': "And, in particular, one of the things that we're actually going to tell R to do is hey, by the way, R, when you read in this CSV file,", 'start': 870.629, 'duration': 8.383}, {'end': 883.593, 'text': 'when you read in string data, when you read in textual data, do not make it a factor.', 'start': 879.012, 'duration': 4.581}, {'end': 891.895, 'text': 'By default, R makes string data into categorical variables, factor variables.', 'start': 885.293, 'duration': 6.602}, {'end': 893.936, 'text': "And we don't want that, because we're doing text analytics.", 'start': 892.055, 'duration': 1.881}, {'end': 895.036, 'text': 'We want the raw text.', 'start': 894.016, 'duration': 1.02}, {'end': 900.238, 'text': "So we tell R, hey, don't make the strings into categorical variables, into 
factors.", 'start': 895.637, 'duration': 4.601}, {'end': 908.315, 'text': 'So if we run these two lines of code, we get a nice spreadsheet view in R.', 'start': 902.253, 'duration': 6.062}, {'end': 909.975, 'text': "That's what this View function right here does.", 'start': 908.315, 'duration': 1.66}, {'end': 914.316, 'text': 'Capital V View brings up this nice spreadsheet view in R.', 'start': 910.355, 'duration': 3.961}, {'end': 915.696, 'text': 'And we can take a look at the raw data.', 'start': 914.316, 'duration': 1.38}, {'end': 918.237, 'text': 'And a couple things jump out at us right away.', 'start': 916.437, 'duration': 1.8}, {'end': 929.221, 'text': "First up, these column names aren't particularly useful and it looks like we've got some funkiness going on here in these three columns,", 'start': 918.257, 'duration': 10.964}], 'summary': 'Using r to read csv data, avoiding string conversion to factors for text analytics.', 'duration': 70.787, 'max_score': 858.434, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw858434.jpg'}, {'end': 1038.55, 'src': 'embed', 'start': 1009.333, 'weight': 4, 'content': [{'end': 1011.475, 'text': "Okay, and lastly, let's go ahead and run the View code again.", 'start': 1009.333, 'duration': 2.142}, {'end': 1019.841, 'text': "And you can see here we now have a nice, well-ordered, well-structured data frame where I've got my label column, ham and spam,", 'start': 1012.876, 'duration': 6.965}, {'end': 1022.844, 'text': 'and I got my text column. Cool.', 'start': 1019.841, 'duration': 3.003}, {'end': 1032.207, 'text': 'Okay, so next up, we can see here that we have 5,572 observations.', 'start': 1025.103, 'duration': 7.104}, {'end': 1038.55, 'text': "We've got 5,572, that's the number of rows that we have in our data frame, in our data table.", 'start': 1032.347, 'duration': 6.203}], 'summary': 'Data frame contains 5572 rows with ham and spam labels.', 'duration': 29.217, 
'max_score': 1009.333, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw1009333.jpg'}], 'start': 745.726, 'title': 'Svd, feature extraction, and data loading in r', 'summary': 'Discusses the use of irlba for singular value decomposition, feature extraction, and the application of random forest, emphasizing its power and ease of tuning. additionally, it demonstrates data loading in r, involving setting the working directory, loading a csv file, cleaning the data by filtering columns and renaming variables, with 5572 observations and no missing data.', 'chapters': [{'end': 784.678, 'start': 745.726, 'title': 'Svd, feature extraction, and random forest', 'summary': 'Discusses the use of irlba for singular value decomposition, feature extraction, and the application of the random forest algorithm, emphasizing its power and ease of tuning, with the potential to achieve significant results through good feature engineering.', 'duration': 38.952, 'highlights': ["The Mighty Random Forest algorithm is highlighted for its power, capability, and ease of tuning, making it one of the presenter's favorite algorithms.", 'Feature engineering is emphasized as a crucial factor for achieving significant results with the Random Forest algorithm.', 'IRLBA is mentioned as a tool for performing singular value decomposition and feature extraction, highlighting its relevance to the overall process.']}, {'end': 1140.737, 'start': 785.578, 'title': 'Loading and cleaning data in r', 'summary': 'Demonstrates how to set the working directory, load a csv file, read it into a data frame, clean the data by filtering columns and renaming variables, and check for missing values, with 5572 observations and no missing data.', 'duration': 355.159, 'highlights': ['The chapter demonstrates how to set the working directory, load a CSV file, read it into a data frame, clean the data by filtering columns and renaming variables, and check for missing 
values. Setting working directory, loading CSV file, reading into data frame, cleaning data, checking for missing values', 'The data frame has complete data in it, with 5572 observations and no missing data. Complete data in data frame, no missing data']}], 'duration': 395.011, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw745726.jpg', 'highlights': ['Random Forest algorithm is highlighted for its power, capability, and ease of tuning.', 'Feature engineering is emphasized as a crucial factor for achieving significant results with the Random Forest algorithm.', 'IRLBA is mentioned as a tool for performing singular value decomposition and feature extraction, highlighting its relevance to the overall process.', 'The chapter demonstrates setting the working directory, loading a CSV file, reading it into a data frame, cleaning the data by filtering columns and renaming variables, and checking for missing values.', 'The data frame has complete data with 5572 observations and no missing data.']}, {'end': 1404.781, 'segs': [{'end': 1189.084, 'src': 'embed', 'start': 1160.436, 'weight': 0, 'content': [{'end': 1175.861, 'text': "and especially when you're starting to do anything like classification of text documents whether that's sentiment analysis or spam you name it is that you may have to understand what the class distribution of your labels are.", 'start': 1160.436, 'duration': 15.425}, {'end': 1186.503, 'text': "Because, for example, in a ham versus spam scenario, it's not uncommon for spam to be actually relatively rare, thank goodness,", 'start': 1177.301, 'duration': 9.202}, {'end': 1189.084, 'text': 'compared to legitimate textual documents.', 'start': 1186.503, 'duration': 2.581}], 'summary': 'Understanding class distribution is crucial for text classification, e.g., in ham vs. 
spam where spam may be relatively rare.', 'duration': 28.648, 'max_score': 1160.436, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw1160436.jpg'}, {'end': 1404.781, 'src': 'embed', 'start': 1322.193, 'weight': 2, 'content': [{'end': 1322.953, 'text': '4,825 hams, 747 spams.', 'start': 1322.193, 'duration': 0.76}, {'end': 1333.25, 'text': 'What prop.table does is actually convert that into percentages.', 'start': 1329.926, 'duration': 3.324}, {'end': 1334.932, 'text': 'So our data set is 86.6% ham.', 'start': 1333.53, 'duration': 1.402}, {'end': 1335.572, 'text': "It's 86.6% legit.", 'start': 1334.952, 'duration': 0.62}, {'end': 1345.203, 'text': "And it's unfortunately 13.4%", 'start': 1335.612, 'duration': 9.591}, {'end': 1351.97, 'text': "spam. Now, this is a non-trivial class imbalance.", 'start': 1345.203, 'duration': 6.767}, {'end': 1353.492, 'text': "It's not bad, right?", 'start': 1351.97, 'duration': 1.522}, {'end': 1362.802, 'text': "It's not like 1% spam and 99% ham, but it may be problematic as we move on. When we start building our machine learning models,", 'start': 1353.492, 'duration': 9.31}, {'end': 1368.806, 'text': 'we may want to try out some techniques to account for this imbalance between the two.', 'start': 1362.802, 'duration': 6.004}, {'end': 1379.409, 'text': '13% is not terrible, but we may actually get better performance out of our model if we do some magic to kind of account for this disparity.', 'start': 1368.826, 'duration': 10.583}, {'end': 1381.729, 'text': "So that's just something to keep in the back of your mind.", 'start': 1380.269, 'duration': 1.46}, {'end': 1387.311, 'text': 'But generally speaking, this is a good practice, especially in text analytics and classification in text analytics.', 'start': 1382.19, 'duration': 5.121}, {'end': 1390.552, 'text': 'Take a look at your distributions of your 
labels.', 'start': 1387.831, 'duration': 2.721}, {'end': 1392.852, 'text': "It's extremely important, extremely important.", 'start': 1390.752, 'duration': 2.1}, {'end': 1404.781, 'text': "Okay, next up, let's just get a general feel for some more about our data, right?", 'start': 1395.013, 'duration': 9.768}], 'summary': 'Data set is 86.6% ham, 13.4% spam, may need techniques to account for imbalance in machine learning models.', 'duration': 82.588, 'max_score': 1322.193, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw1322193.jpg'}], 'start': 1141.218, 'title': 'Data exploration in text analytics and sms classification analysis', 'summary': "Emphasizes the importance of exploring data in text analytics, particularly in understanding class distribution of labels such as spam in text classification, where spam is relatively rare compared to legitimate documents. It also discusses the binary classification problem of sms texts as either 'ham' or 'spam', with 86.6% labeled as 'ham' and 13.4% as 'spam', emphasizing the potential class imbalance and the need to account for it in machine learning models.", 'chapters': [{'end': 1211.112, 'start': 1141.218, 'title': 'Data exploration for text analytics', 'summary': 'Emphasizes the importance of exploring data in text analytics, particularly in understanding the class distribution of labels such as spam in text classification, as spam is relatively rare compared to legitimate documents, and in sentiment analysis, negative tweets about a brand are also relatively rare.', 'duration': 69.894, 'highlights': ['Understanding class distribution of labels is crucial in text analytics, as in scenarios like spam classification, spam is relatively rare compared to legitimate textual documents, and in sentiment analysis, negative tweets about a brand are also relatively rare.', 'Exploring data is always the first step in text analytics, especially in text classification and 
sentiment analysis, to understand the distribution of labels and ensure the rarity of spam or negative sentiment in the data.']}, {'end': 1404.781, 'start': 1211.172, 'title': 'Sms classification analysis', 'summary': "Introduces a binary classification problem of sms texts as either 'ham' or 'spam', with 86.6% labeled as 'ham' and 13.4% as 'spam', highlighting the potential class imbalance and the need to account for it in machine learning models.", 'duration': 193.609, 'highlights': ["86.6% of the dataset is labeled as 'ham' and 13.4% as 'spam', indicating a potential class imbalance. The dataset consists of 86.6% 'ham' and 13.4% 'spam', highlighting the class distribution.", 'The importance of accounting for the class imbalance in machine learning models to improve performance is emphasized. The text stresses the need to address the class imbalance in machine learning models to potentially enhance model performance.', 'Early consideration of label distributions is crucial in text analytics and classification, particularly in addressing potential class imbalances. 
The chapter emphasizes the significance of early evaluation of label distributions in text analytics and classification, especially in managing class imbalances.']}], 'duration': 263.563, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw1141218.jpg', 'highlights': ['Understanding class distribution of labels is crucial in text analytics, as in scenarios like spam classification, spam is relatively rare compared to legitimate textual documents, and in sentiment analysis, negative tweets about a brand are also relatively rare.', 'Exploring data is always the first step in text analytics, especially in text classification and sentiment analysis, to understand the distribution of labels and ensure the rarity of spam or negative sentiment in the data.', "86.6% of the dataset is labeled as 'ham' and 13.4% as 'spam', indicating a potential class imbalance.", 'The importance of accounting for the class imbalance in machine learning models to improve performance is emphasized.', 'Early consideration of label distributions is crucial in text analytics and classification, particularly in addressing potential class imbalances.']}, {'end': 1824.448, 'segs': [{'end': 1602.979, 'src': 'embed', 'start': 1545.432, 'weight': 0, 'content': [{'end': 1552.738, 'text': 'the value at which half the texts are shorter than this, and the value at which half the texts are longer than this is 61 characters.', 'start': 1545.432, 'duration': 7.306}, {'end': 1555.52, 'text': "So that's a pretty big disparity here.", 'start': 1553.658, 'duration': 1.862}, {'end': 1556.941, 'text': 'Pretty big disparity.', 'start': 1556.18, 'duration': 0.761}, {'end': 1560.384, 'text': "And you'll also notice that the mean is what 19 characters?", 'start': 1557.562, 'duration': 2.822}, {'end': 1568.565, 'text': "fully 19 characters higher than the median, which means that there's probably a certain amount of skew here,", 'start': 1562.318, 'duration': 6.247}, 
{'end': 1570.767, 'text': 'which is indicative right here of the third quartile.', 'start': 1568.565, 'duration': 2.202}, {'end': 1576.593, 'text': 'So 75% of the text messages are 121 characters or less, and only 25% are 122 or more.', 'start': 1571.448, 'duration': 5.145}, {'end': 1588.045, 'text': 'So there may be some skew here to account for these two differences.', 'start': 1584.201, 'duration': 3.844}, {'end': 1593.752, 'text': "So let's take a look at the distribution using a histogram using the mighty ggplot2.", 'start': 1588.786, 'duration': 4.966}, {'end': 1594.993, 'text': "That's what we'll do next.", 'start': 1593.792, 'duration': 1.201}, {'end': 1597.917, 'text': "So we're going to go ahead and load up ggplot2.", 'start': 1595.294, 'duration': 2.623}, {'end': 1602.979, 'text': 'Oh, and look at that.', 'start': 1602.199, 'duration': 0.78}], 'summary': 'Median text length is 61 characters, with 75% of messages being 121 characters or less.', 'duration': 57.547, 'max_score': 1545.432, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw1545432.jpg'}, {'end': 1692.243, 'src': 'heatmap', 'start': 1631.166, 'weight': 2, 'content': [{'end': 1643.578, 'text': "So that'll give me some sort of graphical understanding of the relative proportion of HAM versus SPAM in the context of text lengths the lengths of the texts.", 'start': 1631.166, 'duration': 12.412}, {'end': 1646.06, 'text': 'And this is super useful.', 'start': 1645.019, 'duration': 1.041}, {'end': 1650.945, 'text': "The reason why we had to make label a factor earlier was because, if it wasn't, this,", 'start': 1646.14, 'duration': 4.805}, {'end': 1654.588, 'text': "fill wouldn't work and we wouldn't get the awesomeness that we're about ready to see.", 'start': 1650.945, 'duration': 3.643}, {'end': 1658.112, 'text': "And you'll notice here I've arbitrarily picked a bin width of five.", 'start': 1655.429, 'duration': 2.683}, {'end': 1661.795, 
'text': 'And I did that mainly because, just eyeballing this data,', 'start': 1658.132, 'duration': 3.663}, {'end': 1669.138, 'text': '10 seemed like it was probably a little too big of a bin width and 1 was probably way too small.', 'start': 1664.197, 'duration': 4.941}, {'end': 1669.878, 'text': 'So I just used 5.', 'start': 1669.138, 'duration': 0.74}, {'end': 1674.479, 'text': "Okay, so let's go ahead and run this line of code.", 'start': 1669.878, 'duration': 4.601}, {'end': 1689.822, 'text': 'And here we go, we have this nice handy-dandy plot, where, as you can see here, the turquoise color is the spam, and the orange color is the ham.', 'start': 1678.36, 'duration': 11.462}, {'end': 1692.243, 'text': "And notice, it's super interesting, right?", 'start': 1690.963, 'duration': 1.28}], 'summary': 'Graphical representation of ham vs spam text proportions illustrated with a bin width of five, providing valuable insight.', 'duration': 23.422, 'max_score': 1631.166, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw1631166.jpg'}, {'end': 1741.452, 'src': 'embed', 'start': 1700.134, 'weight': 3, 'content': [{'end': 1709.301, 'text': 'by eyeballing the data, we had an intuition that, in general, spam would tend to be probably longer texts,', 'start': 1700.134, 'duration': 9.167}, {'end': 1711.923, 'text': 'and shorter texts would probably tend to be ham.', 'start': 1709.301, 'duration': 2.622}, {'end': 1716.967, 'text': 'And we see that actually quite clearly, quite clearly.', 'start': 1712.583, 'duration': 4.384}, {'end': 1718.847, 'text': 'Quite clearly.', 'start': 1718.347, 'duration': 0.5}, {'end': 1723.928, 'text': 'Now, as we move on later in the series and start talking about feature engineering,', 'start': 1719.267, 'duration': 4.661}, {'end': 1734.35, 'text': "we may actually end up creating a few features that aren't actually based on the raw text itself of what was actually typed into the SMS text 
message,", 'start': 1723.928, 'duration': 10.422}, {'end': 1737.951, 'text': 'but we will derive some features potentially that may be useful, like this one.', 'start': 1734.35, 'duration': 3.601}, {'end': 1741.452, 'text': 'This one may actually prove to be quite useful, actually from a classification perspective, right?', 'start': 1737.971, 'duration': 3.481}], 'summary': 'Intuition suggests spam is longer, ham is shorter. feature engineering may create useful features for classification.', 'duration': 41.318, 'max_score': 1700.134, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw1700134.jpg'}, {'end': 1789.187, 'src': 'embed', 'start': 1763.654, 'weight': 4, 'content': [{'end': 1771.183, 'text': "the extremely long lengths of text, though, are actually the vast majority of them are in fact ham, they're not spam.", 'start': 1763.654, 'duration': 7.529}, {'end': 1773.125, 'text': "So there's this kind of sweet spot right?", 'start': 1771.203, 'duration': 1.922}, {'end': 1779.032, 'text': "So maybe this working in conjunction with some textual based features that we'll engineer later on,", 'start': 1773.766, 'duration': 5.266}, {'end': 1782.096, 'text': 'may actually be pretty powerful in determining ham from spam.', 'start': 1779.032, 'duration': 3.064}, {'end': 1789.187, 'text': "Okay, so we're gonna try and keep these videos around a half hour or so.", 'start': 1783.843, 'duration': 5.344}], 'summary': 'Text lengths are mostly ham, not spam. 
text-based features may be powerful in determining ham from spam.', 'duration': 25.533, 'max_score': 1763.654, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw1763654.jpg'}], 'start': 1404.821, 'title': 'Text length and sms spam detection analysis', 'summary': 'Focuses on analyzing the length of text messages, revealing a significant disparity between the mean and median text length, and explores the analysis of sms data to identify patterns indicating spam messages, potentially leading to effective feature engineering for classification.', 'chapters': [{'end': 1674.479, 'start': 1404.821, 'title': 'Text length analysis', 'summary': 'Focuses on analyzing the length of text messages in a dataset, revealing a significant disparity between the mean and median text length and a potential skew in the distribution, with 75% of messages being 121 characters or less and the rest being 122 or more, and aims to visualize this distribution using a histogram.', 'duration': 269.658, 'highlights': ['The median text length is 61 characters, with a mean of 19 characters higher than the median, indicating potential skew in the distribution. The median value of text length is 61 characters, with the mean being 19 characters higher, suggesting potential skew in the distribution.', '75% of the text messages are 121 characters or less, while 25% are 122 or more, indicating a potential skew in the distribution. 75% of the text messages have a length of 121 characters or less, while 25% are 122 or more, suggesting a potential skew in the distribution.', 'The chapter aims to visualize the distribution of text lengths using a histogram to understand the relative proportion of legitimate (HAM) and spam messages. 
The chapter aims to use a histogram to visualize the distribution of text lengths and understand the relative proportion of legitimate (HAM) and spam messages.']}, {'end': 1824.448, 'start': 1678.36, 'title': 'Sms spam detection analysis', 'summary': 'Explores the analysis of sms data to identify patterns indicating spam messages, revealing a clear distinction in text length between spam and ham messages, potentially leading to effective feature engineering for classification.', 'duration': 146.088, 'highlights': ['The analysis shows a clear distinction in text length between spam and ham messages, with longer texts tending to be spam and shorter texts tending to be ham, providing valuable insight for feature engineering and potential classification effectiveness.', "The majority of extremely long texts are found to be ham rather than spam, indicating a potential 'sweet spot' in text length for distinguishing between ham and spam messages.", 'Feature engineering may involve creating new features not solely based on raw text, which could prove useful for classification purposes.', 'The video concludes with a call-to-action for viewers to engage with the Data Science Dojo community and upcoming data science boot camps.']}], 'duration': 419.627, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/4vuw0AsHeGw/pics/4vuw0AsHeGw1404821.jpg', 'highlights': ['The median text length is 61 characters, with a mean of 19 characters higher than the median, indicating potential skew in the distribution.', '75% of the text messages are 121 characters or less, while 25% are 122 or more, indicating a potential skew in the distribution.', 'The chapter aims to visualize the distribution of text lengths using a histogram to understand the relative proportion of legitimate (HAM) and spam messages.', 'The analysis shows a clear distinction in text length between spam and ham messages, with longer texts tending to be spam and shorter texts tending to be ham, 
providing valuable insight for feature engineering and potential classification effectiveness.', "The majority of extremely long texts are found to be ham rather than spam, indicating a potential 'sweet spot' in text length for distinguishing between ham and spam messages.", 'Feature engineering may involve creating new features not solely based on raw text, which could prove useful for classification purposes.']}], 'highlights': ["Dave Langer has over 20 years of experience in technology and data science, including leading Microsoft's $10 billion-plus supply chain operation.", 'Langer has an extensive background in business intelligence, data warehousing, and analytics, covering traditional and predictive analytics.', "Data Science Dojo's mission is to make data science accessible to everyone, without the need for a PhD in statistics or machine learning, as demonstrated in the video series.", 'The tutorial provides a GitHub repository for accessing all the demonstrated code throughout the series.', 'The chapter emphasizes the creation of predictive classification models over textual data.', 'The tutorial assumes viewers have coding experience with R, including creating functions and using the apply function.', 'The chapter highlights the 80-20 rule, emphasizing the 20% of tools and techniques that can be used in 80% of projects for building predictive classification models over textual data.', 'The tutorial assumes viewers have some math background, with the tutorial series emphasizing practice over theory.', 'The introduction focuses on the use of decision tree ensembles, particularly the mighty random forest algorithm, for building predictive models.', 'Explanation of the mathematics behind TF-IDF is included to elucidate its benefits in text analytics with R.', 'The provided URL will allow access to all the code demonstrated in this video series.', 'The GitHub repository for the video series will include an introductory README and an R file corresponding to the 
first video.', 'The repository is currently empty but will be populated with content before the video is released.', "The dataset for the series is sourced from Kaggle and contains SMS messages labeled as 'ham' or 'spam'.", "The curated dataset from Kaggle, originally from UCIML, includes SMS messages labeled as 'ham' or 'spam', making it suitable for text analytics with classification models.", 'The R packages ggplot2, caret, and quanteda will be used for visualizations and text analytics.', 'Random Forest algorithm is highlighted for its power, capability, and ease of tuning.', 'Feature engineering is emphasized as a crucial factor for achieving significant results with the Random Forest algorithm.', 'IRLBA is mentioned as a tool for performing singular value decomposition and feature extraction, highlighting its relevance to the overall process.', 'Understanding class distribution of labels is crucial in text analytics, as in scenarios like spam classification, spam is relatively rare compared to legitimate textual documents, and in sentiment analysis, negative tweets about a brand are also relatively rare.', 'Exploring data is always the first step in text analytics, especially in text classification and sentiment analysis, to understand the distribution of labels and ensure the rarity of spam or negative sentiment in the data.', "86.6% of the dataset is labeled as 'ham' and 13.4% as 'spam', indicating a potential class imbalance.", 'The importance of accounting for the class imbalance in machine learning models to improve performance is emphasized.', 'Early consideration of label distributions is crucial in text analytics and classification, particularly in addressing potential class imbalances.', 'The median text length is 61 characters, with a mean of 19 characters higher than the median, indicating potential skew in the distribution.', '75% of the text messages are 121 characters or less, while 25% are 122 or more, indicating a potential skew in the 
distribution.', 'The chapter aims to visualize the distribution of text lengths using a histogram to understand the relative proportion of legitimate (HAM) and spam messages.', 'The analysis shows a clear distinction in text length between spam and ham messages, with longer texts tending to be spam and shorter texts tending to be ham, providing valuable insight for feature engineering and potential classification effectiveness.', "The majority of extremely long texts are found to be ham rather than spam, indicating a potential 'sweet spot' in text length for distinguishing between ham and spam messages.", 'Feature engineering may involve creating new features not solely based on raw text, which could prove useful for classification purposes.']}
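
The loading, cleaning, and exploration steps described in the segments above can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not the code from the series repository: the four-row inline CSV is an invented stand-in for the spam CSV file, the raw column names (v1, v2, X) and the variable names (spam.raw, Label, Text, TextLength) are illustrative choices, and the plot is only drawn if ggplot2 happens to be installed. Note that since R 4.0.0 read.csv no longer converts strings to factors by default, so passing stringsAsFactors = FALSE is harmless on current versions and still required on older ones.

```r
# Hypothetical four-row stand-in for the spam CSV file; these rows are invented
# for illustration and are NOT taken from the real dataset.
csv_text <- paste(
  "v1,v2,X",
  "ham,See you at lunch,",
  "spam,WINNER claim your free prize now by replying YES to this message,",
  "ham,Ok,",
  "ham,Running late be there in ten,",
  sep = "\n"
)

# Read the data, keeping message text as character data rather than factors.
spam.raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Keep only the first two columns and give them descriptive names (assumed names).
spam.raw <- spam.raw[, 1:2]
names(spam.raw) <- c("Label", "Text")

# Count rows that contain any missing value.
n_incomplete <- sum(!complete.cases(spam.raw))

# Treat the label as a categorical variable and inspect the class balance.
spam.raw$Label <- as.factor(spam.raw$Label)
label_share <- prop.table(table(spam.raw$Label))
print(label_share)

# Derive a text-length feature (number of characters) and summarise it.
spam.raw$TextLength <- nchar(spam.raw$Text)
print(summary(spam.raw$TextLength))

# Histogram of text length filled by label, with a bin width of 5.
# Skipped silently when ggplot2 is not available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(spam.raw, aes(x = TextLength, fill = Label)) +
    geom_histogram(binwidth = 5) +
    labs(x = "Length of text", y = "Text count")
  print(p)
}
```

To run the same steps against the real file, the read.csv(text = csv_text, ...) call would be replaced by a call that reads the downloaded CSV from the working directory; the file name and any encoding argument depend on how the file was saved and are not assumed here.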