title
Data Cleaning in Pandas | Python Pandas Tutorials
description
Take my Full Python Course Here: https://www.analystbuilder.com/courses/pandas-for-data-analysis
In this series we will be walking through everything you need to know to get started in Pandas! In this video, we learn about Data Cleaning in Pandas.
Datasets in GitHub:
https://github.com/AlexTheAnalyst/PandasYouTubeSeries/blob/main/Customer%20Call%20List.xlsx
Code in GitHub: https://github.com/AlexTheAnalyst/PandasYouTubeSeries/blob/main/Pandas%20101%20-%20Data%20Cleaning%20in%20Pandas.ipynb
Favorite Pandas Course:
Data Analysis with Pandas and Python - https://bit.ly/3KHMLlu
____________________________________________
SUBSCRIBE!
Do you want to become a Data Analyst? That's what this channel is all about! My goal is to help you learn everything you need in order to start your career or even switch your career into Data Analytics. Be sure to subscribe to not miss out on any content!
____________________________________________
RESOURCES:
Coursera Courses:
📖Google Data Analyst Certification: https://coursera.pxf.io/5bBd62
📖Data Analysis with Python - https://coursera.pxf.io/BXY3Wy
📖IBM Data Analysis Specialization - https://coursera.pxf.io/AoYOdR
📖Tableau Data Visualization - https://coursera.pxf.io/MXYqaN
Udemy Courses:
📖Python for Data Analysis and Visualization - https://bit.ly/3hhX4LX
📖Statistics for Data Science - https://bit.ly/37jqDbq
📖SQL for Data Analysts (SSMS) - https://bit.ly/3fkqEij
📖Tableau A-Z - http://bit.ly/385lYvN
*Please note I may earn a small commission for any purchase through these links - Thanks for supporting the channel!*
____________________________________________
BECOME A MEMBER -
Want to support the channel? Consider becoming a member! I do monthly livestreams, and members get some awesome emojis to use in chat and comments!
https://www.youtube.com/channel/UC7cs8q-gJRlGwj4A8OmCmXg/join
____________________________________________
Websites:
💻Website: AlexTheAnalyst.com
💾GitHub: https://github.com/AlexTheAnalyst
📱Instagram: @Alex_The_Analyst
____________________________________________
0:00 Intro
0:41 First Look at Data
2:34 Removing Duplicates
3:41 Dropping Columns
5:10 Strip
12:15 Cleaning/Standardizing Phone Numbers
21:29 Splitting Columns
24:58 Standardizing Column Values using Replace
28:40 Fill Null Values
29:42 Filtering Down Rows of Data
36:42 Outro
*All opinions or statements in this video are my own and do not reflect the opinion of the company I work for or have ever worked for*
detail
Summary: Learn data cleaning in pandas, including importing an Excel file, removing duplicates, stripping special characters, cleaning diverse phone number formats, handling null values, standardizing values, and filtering out "do not contact" entries, resulting in a cleaner, standardized dataset for quicker customer calls and improved data accuracy.

Chapter 1 (0:00-0:38) - Cleaning Data Using Pandas
Summary: Introduces data cleaning in pandas by importing an Excel file and taking a first look at the data that will be cleaned.
Highlights:
- The Excel workbook is read in with pandas' read_excel function and assigned to a DataFrame variable, the starting point for every cleaning step that follows.
- The presenter emphasizes why data cleaning in pandas matters when preparing datasets for analysis.
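A minimal sketch of that import step, assuming you have downloaded the Customer Call List workbook from the GitHub link above; the path is a placeholder, and read_excel needs an Excel engine such as openpyxl installed.

```python
import pandas as pd

# Placeholder path: point this at your downloaded copy of "Customer Call List.xlsx".
df = pd.read_excel(r"C:\Users\you\Downloads\Customer Call List.xlsx")

# Take a first look at the raw data before cleaning anything.
print(df)
```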
Chapter 2 (0:38-5:10) - First Look at the Data, Removing Duplicates, and Dropping Columns
Summary: Walks through what needs cleaning in the imported data frame, then starts with the easy steps. The scenario: this customer list has to be standardized and cleaned before it is handed to the people who will actually make the calls.
Highlights:
- The customer ID runs from 1001 to 1020, i.e. 20 customer records, and looks fine as-is.
- The last name column contains errors such as forward slashes, dots, and null values that need to be cleaned up.
- The phone number column mixes several formats (and NAs) and has to be standardized.
- The address column mixes street address, an additional location, and zip code, so it needs to be split for consistency.
- The paying customer and "do not contact" columns use inconsistent yes/no values, rows marked "do not contact" should eventually be removed, and the "Not Useful" column can simply be dropped.
- df.drop_duplicates(), assigned back to the data frame, removes the one duplicate row (see the sketch that follows), the easiest step before the harder cleaning begins.
Removing unneeded columns comes next, since there is no point cleaning data that will never be used: the one unwanted "Not Useful" column is dropped with df.drop.
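A sketch of that drop, assuming the throwaway column is named "Not_Useful_Column" as in the linked workbook; check df.columns for the exact name in your copy.

```python
# Drop the column that will not be cleaned or handed off.
# "Not_Useful_Column" is an assumed name; adjust it to match your file.
df = df.drop(columns="Not_Useful_Column")
```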
Chapter 3 (5:10-11:50) - Cleaning the Last Name Column with Strip
Summary: Cleans up the last name column by removing special characters such as forward slashes, periods, underscores, and stray digits so the customer records are accurate; the customer ID and first name columns look fine and are left alone.
Highlights:
- str.lstrip only removes characters from the left-hand side of a string and str.rstrip only from the right; str.strip removes them from both ends.
- Passing the characters as a list does not work here; they have to be combined into a single string value.
- Applying strip (or an equivalent replace) to the last name column removes the dots, underscores, and forward slashes, and the result is assigned back so the change only affects that column.
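A sketch of that cleanup, assuming the column is named "Last_Name"; the character set covers the stray digits, dots, underscores, and slashes described above, and should be adjusted to whatever actually appears in your column.

```python
# str.lstrip trims only the left end and str.rstrip only the right;
# str.strip with a character set trims any of those characters from both ends.
# Note the characters are passed as one string, not a list.
df["Last_Name"] = df["Last_Name"].str.strip("123._/")
```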
'text': "Now, we do have this NA, because we had an N-A for, I don't remember, maybe that was Creed Bratton.", 'start': 899.462, 'duration': 6.083}], 'summary': 'Replacing non-alphanumeric characters with empty string successfully.', 'duration': 27.763, 'max_score': 877.782, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bDhvCp3_lYw/pics/bDhvCp3_lYw877782.jpg'}, {'end': 1097.276, 'src': 'embed', 'start': 1073.236, 'weight': 2, 'content': [{'end': 1080.081, 'text': 'So we need to change this entire column into strings before we can apply this, um, formatting.', 'start': 1073.236, 'duration': 6.845}, {'end': 1085.282, 'text': "Now, when I was creating this, if I'm being honest, my first thought when I was doing this was to do it like this.", 'start': 1080.6, 'duration': 4.682}, {'end': 1088.064, 'text': 'String DF phone number.', 'start': 1085.823, 'duration': 2.241}, {'end': 1089.485, 'text': "Let's just run that.", 'start': 1088.824, 'duration': 0.661}, {'end': 1092.293, 'text': 'This is what the values look like.', 'start': 1090.872, 'duration': 1.421}, {'end': 1097.276, 'text': "And I don't remember why or why it was doing this.", 'start': 1093.053, 'duration': 4.223}], 'summary': 'Converting entire column into strings, with a mention of formatting and running a string df phone number.', 'duration': 24.04, 'max_score': 1073.236, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bDhvCp3_lYw/pics/bDhvCp3_lYw1073236.jpg'}, {'end': 1181.032, 'src': 'embed', 'start': 1147.557, 'weight': 0, 'content': [{'end': 1150.639, 'text': "So we're just going to copy this, apply it.", 'start': 1147.557, 'duration': 3.082}, {'end': 1158.904, 'text': "Good And now we're going to take this and we're going to run this again.", 'start': 1153.08, 'duration': 5.824}, {'end': 1161.446, 'text': 'Just ignore all my commented out stuff.', 'start': 1159.285, 'duration': 2.161}, {'end': 1162.587, 'text': "Pretend I don't have that.", 'start': 1161.466, 'duration': 1.121}, {'end': 1165.569, 'text': 'So now when we run this, it should work.', 'start': 1163.688, 'duration': 1.881}, {'end': 1166.75, 'text': 'There we go.', 'start': 1166.33, 'duration': 0.42}, {'end': 1181.032, 'text': "Now, if we look at these numbers 123-545-5421, And it does that for every single one where there's values, even when there's NAN or NA,", 'start': 1167.691, 'duration': 13.341}], 'summary': 'Applying a process to the data, resulting in successful execution and consistent formatting of numbers.', 'duration': 33.475, 'max_score': 1147.557, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bDhvCp3_lYw/pics/bDhvCp3_lYw1147557.jpg'}], 'start': 710.6, 'title': 'Cleaning up and formatting phone numbers', 'summary': 'Covers techniques for cleaning up phone numbers in a data frame using regular expressions, standardizing the format, handling null values, and formatting phone numbers using a lambda function to rearrange digits and add dashes, with a focus on converting float values to strings and replacing na values in a data frame. 
- Slicing a float raises "'float' object is not subscriptable", so the whole column is cast to strings before the formatting is applied.
- A lambda then rebuilds each value with dashes after the third and sixth digits, e.g. 123-545-5421, for every row that has a value.
- The cast turns missing values into the literal strings "nan--" and "Na--", which are replaced with empty strings so the final column has no leftovers.
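A sketch of the formatting pass described by those points; it assumes the regex cleanup above has already run on the same "Phone_Number" column.

```python
# Slicing a float fails, so cast every value (including NaN) to a string first.
df["Phone_Number"] = df["Phone_Number"].apply(lambda x: str(x))

# Re-insert dashes after the third and sixth characters: "1235455421" -> "123-545-5421".
df["Phone_Number"] = df["Phone_Number"].apply(lambda x: x[0:3] + "-" + x[3:6] + "-" + x[6:10])

# The cast turned missing values into "nan--" / "Na--"; blank them out.
df["Phone_Number"] = df["Phone_Number"].str.replace("nan--", "")
df["Phone_Number"] = df["Phone_Number"].str.replace("Na--", "")
```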
Chapter 5 (21:00-30:38) - Splitting Columns, Standardizing Values, and Filling Nulls
Summary: Splits the address column into separate street address, state, and zip code columns, standardizes the yes/no columns with replace, and blanks out the remaining missing values across the whole data frame.
Highlights:
- Because the address parts are separated by commas, splitting on the comma turns the one address column into three new columns, which makes the zip code and state easier to read and use.
- The new columns are created alongside the original address column rather than replacing it, so the original data stays available.
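A sketch of the split, with assumed names for the source column ("Address") and the three new columns; n=2 caps the number of splits so anything after the second comma stays in the last piece.

```python
# Split "Address" on commas into street address, state, and zip code.
# expand=True returns a DataFrame so the pieces can be assigned to new columns;
# the original Address column is left in place.
df[["Street_Address", "State", "Zip_Code"]] = df["Address"].str.split(",", n=2, expand=True)
```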
- The paying customer and "do not contact" columns are standardized by replacing "N" with "No" and "Y" with "Yes" so every row uses the same values.
- Rather than cleaning the leftover "N/a" and NaN values column by column, a replace/fill is applied to the entire data frame at once, so every cell without a real value ends up blank; with that, the per-column cleaning is done.
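A sketch of the standardizing and null-filling steps, with assumed column names ("Paying Customer", "Do_Not_Contact"); Series.replace is used for the flags because it matches whole values, so an existing "No" is not turned into "Noo" the way a substring replace could.

```python
# Standardize the yes/no flags by whole-value replacement.
df["Paying Customer"] = df["Paying Customer"].replace({"Y": "Yes", "N": "No"})
df["Do_Not_Contact"] = df["Do_Not_Contact"].replace({"Y": "Yes", "N": "No"})

# Blank out the remaining placeholders and NaN values across the whole frame at once.
df = df.replace("N/a", "")
df = df.fillna("")
```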
Chapter 6 (30:39-38:24) - Filtering Down Rows and Resetting the Index
Summary: Filters the cleaned list down to the customers who can actually be called, resulting in a smaller, standardized call list that is ready to hand off and easier to use for analysis.
Highlights:
- Looping over df.index and reading each row's "do not contact" value with df.loc lets every row flagged yes be dropped; inplace=True (or assigning the result back to df) is needed for the drop to take effect.
- An attempt to fill the blank "do not contact" values with "N" on the fly runs into trouble and is set aside.
- The same loop-and-drop pattern removes rows whose phone number is blank, since there is nothing to call, leaving a noticeably smaller dataset.
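A sketch of those two filters, assuming the flags were standardized to "Yes"/"No" as in the earlier sketch (compare against "Y" instead if they were not) and that empty phone numbers are empty strings after the fill above. A boolean mask such as df[df["Do_Not_Contact"] != "Yes"] would achieve the same result without the explicit loop.

```python
# Drop every row flagged "do not contact".
for x in df.index:
    if df.loc[x, "Do_Not_Contact"] == "Yes":
        df.drop(x, inplace=True)

# Drop every row with no phone number left to call.
for x in df.index:
    if df.loc[x, "Phone_Number"] == "":
        df.drop(x, inplace=True)
```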
column.", 'duration': 125.539, 'highlights': ["The process involves looping through the dataframe column and dropping rows based on a specific condition. The speaker outlines the approach of looping through the dataframe column and dropping rows based on a specific condition, aiming to remove entries with a value of 'Y' in the 'do not contact' column.", "The speaker plans to use the 'df.loc' method to identify and drop rows with a value of 'Y' in the specified column. The speaker plans to utilize the 'df.loc' method to pinpoint and eliminate rows with a value of 'Y' in the specified column, ensuring the removal of unwanted entries from the dataframe."]}, {'end': 2304.144, 'start': 1964.596, 'title': 'Data cleaning in pandas', 'summary': 'Covers the process of cleaning and standardizing data in pandas, including removing missing values and standardizing formats, resulting in a more visually better and standardized dataset that is crucial for visualizations and data analysis.', 'duration': 339.548, 'highlights': ["The dataset had missing values for 'one', 'three', and '18', which were removed, resulting in a cleaner dataset.", "An attempt to populate blank values with 'do not contact' for 'wise' was made, but encountered issues with handling the values.", "The process of removing blank values in the 'phone number' column resulted in a much smaller dataset, indicating successful cleaning and standardization.", 'Resetting the index was suggested before handing off the dataset to the client or customer call list to avoid confusion or misinterpretation of missing numbers.', 'Alternative methods for dropping null values in the dataset were discussed, providing additional insights into data cleaning techniques.']}], 'duration': 465.107, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bDhvCp3_lYw/pics/bDhvCp3_lYw1839037.jpg', 'highlights': ['The process involves looping through the dataframe column and dropping rows based on a specific condition.', "The speaker plans to utilize the 'df.loc' method to pinpoint and eliminate rows with a value of 'Y' in the specified column.", "The process of removing blank values in the 'phone number' column resulted in a much smaller dataset, indicating successful cleaning and standardization.", "The dataset had missing values for 'one', 'three', and '18', which were removed, resulting in a cleaner dataset.", 'Resetting the index was suggested before handing off the dataset to the client or customer call list to avoid confusion or misinterpretation of missing numbers.']}], 'highlights': ['Automating cleaning process for 20,000 phone numbers using regular expressions and standardizing format.', "The process of removing blank values in the 'phone number' column resulted in a much smaller dataset, indicating successful cleaning and standardization.", "The process of removing duplicates from the dataset using 'df.drop_duplicates' results in the elimination of 1 duplicate row, enhancing the dataset's cleanliness and accuracy.", 'The process of creating different columns without replacing the original data is demonstrated, allowing easy access to the new columns while retaining the original data.', 'The process involves looping through the dataframe column and dropping rows based on a specific condition.', 'The last name column contains special characters such as forward slashes, null values, periods, and underscores, which need to be cleaned up to ensure accurate customer records and improve call quality.', 'The chapter provides a detailed 
Overall highlights:
- The Excel file is imported with pandas' read_excel function and assigned to a data frame, the practical starting point for the cleaning work.
- drop_duplicates eliminates the one duplicate row, improving the dataset's cleanliness and accuracy, and the one unwanted column is dropped.
- The last name column's forward slashes, periods, underscores, and null values are stripped out; strip and replace take the characters combined into one value rather than a list.
- The phone number cleanup is automated with regular expressions and a standard format, an approach that would scale to 20,000 phone numbers, and removing blank phone numbers leaves a much smaller, call-ready dataset.
- Addresses are split into separate columns on the comma separator for easier readability, and the new columns are created without replacing the original data.
- The paying customer and "do not contact" columns are standardized to "Yes"/"No" and empty entries are removed.
- Rows are filtered by looping through the data frame and dropping those that match a condition.