title
Python Pandas Tutorial 19 | How to Identify and Drop Duplicate Values | Removing duplicate values

description
In this video I have talked about how you can identify and drop duplicate values in Python. The pandas library gives you two very straightforward functions, duplicated() and drop_duplicates(), to perform these operations, and in this video I have shown you how to apply these functions along with their various parameters like keep, inplace and subset.
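The identification step described above can be sketched as follows. This is a minimal example: the video reads a superstore sales sheet with pd.read_excel into an object called orders, but here a small hypothetical DataFrame stands in for that data.

```python
import pandas as pd

# Hypothetical stand-in for the superstore "orders" sheet used in the video,
# which is actually loaded with pd.read_excel.
orders = pd.DataFrame({
    "CustomerName": ["Alice", "Bob", "Alice", "Bob", "Carol"],
    "ShipMode":     ["Air", "Ground", "Air", "Air", "Ground"],
})

# duplicated() marks a row True if an identical row appeared earlier.
mask = orders.duplicated()
print(mask.tolist())               # [False, False, True, False, False]

# Summing the boolean mask counts duplicated rows (True counts as 1).
print(orders.duplicated().sum())   # 1

# Boolean indexing with the mask filters the data set down to the duplicates.
print(orders[orders.duplicated()])
```

Only the third row repeats an earlier row in full, so it is the only one flagged.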

detail
{'title': 'Python Pandas Tutorial 19 | How to Identify and Drop Duplicate Values | Removing duplicate values', 'heatmap': [], 'summary': 'The tutorial covers using pandas in python to identify and remove duplicate values in a superstore sales dataset with 8399 rows and 21 columns, focusing on the process of importing data, detecting and managing duplicate rows, and analyzing data set with 795 duplicate entries in the customername column.', 'chapters': [{'end': 120.272, 'segs': [{'end': 46.317, 'src': 'embed', 'start': 23.963, 'weight': 0, 'content': [{'end': 33.111, 'text': 'which is a sample superstore sales data containing sales of a superstore and then, after reading it from the excel file,', 'start': 23.963, 'duration': 9.148}, {'end': 40.898, 'text': 'using the function pd.readexcel and storing that sheet into the orders, I am viewing the top two rows.', 'start': 33.111, 'duration': 7.787}, {'end': 43.814, 'text': "So let's go ahead and execute that.", 'start': 41.812, 'duration': 2.002}, {'end': 46.317, 'text': 'All right, here are my top.', 'start': 45.076, 'duration': 1.241}], 'summary': 'Analyzing sample superstore sales data using pd.readexcel function to view top two rows.', 'duration': 22.354, 'max_score': 23.963, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U23963.jpg'}, {'end': 120.272, 'src': 'embed', 'start': 73.937, 'weight': 1, 'content': [{'end': 78.458, 'text': "And similarly, let's say product category, customer segment and all of those things.", 'start': 73.937, 'duration': 4.521}, {'end': 83.039, 'text': "But sometimes these are necessary and you really don't want to remove that,", 'start': 78.958, 'duration': 4.081}, {'end': 87.72, 'text': 'because they actually represent the data which should be there in your data set.', 'start': 83.039, 'duration': 4.681}, {'end': 95.862, 'text': "But let's say for example, you don't have a requirement where you need the multiple customer names in 
the data set.", 'start': 87.84, 'duration': 8.022}, {'end': 99.863, 'text': 'You just want to have a unique one maybe for some research requirement or something.', 'start': 95.902, 'duration': 3.961}, {'end': 106.565, 'text': 'You want to get to know about your data set by looking at the unique customer name.', 'start': 100.483, 'duration': 6.082}, {'end': 112.247, 'text': 'Or maybe for some other reasons you just need the unique values for a particular column.', 'start': 106.805, 'duration': 5.442}, {'end': 120.272, 'text': "So how we can do that I'll show you in this video which is pretty straightforward using the methods given in the pandas library.", 'start': 113.007, 'duration': 7.265}], 'summary': 'Removing duplicate values from a data set using pandas library.', 'duration': 46.335, 'max_score': 73.937, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U73937.jpg'}], 'start': 0.189, 'title': 'Identifying and removing duplicate reports in python', 'summary': 'Discusses using pandas in python to identify and remove duplicate values in a superstore sales dataset, highlighting the process of importing data, identifying duplicates, and obtaining unique values for specific columns.', 'chapters': [{'end': 120.272, 'start': 0.189, 'title': 'Identifying and removing duplicate reports in python with pandas', 'summary': "Discusses how to identify and remove duplicate values in a superstore sales dataset using python's pandas library, showcasing the process of importing data, identifying duplicate values, and employing pandas methods to obtain unique values for specific columns.", 'duration': 120.083, 'highlights': ['The process involves importing the Pandas library, reading a sample superstore sales dataset from an Excel file, and displaying the top two rows to identify the presence of duplicate values. 
The speaker demonstrates the initial steps of importing the Pandas library, reading the dataset from an Excel file, and displaying the top two rows to identify duplicate values.', 'The speaker explains the necessity of certain duplicate values in a dataset while also outlining scenarios where obtaining unique values for specific columns is required, providing a clear understanding of the motivation behind removing duplicates. The speaker emphasizes the importance of understanding the necessity of duplicate values in a dataset while also highlighting specific scenarios where obtaining unique values for specific columns is essential.', 'The speaker promises to demonstrate a straightforward method for removing duplicate values using Pandas library methods, catering to research requirements or other purposes necessitating unique values for specific columns. The speaker assures viewers of a straightforward method for removing duplicate values using Pandas library methods, addressing research requirements or other purposes necessitating unique values for specific columns.']}], 'duration': 120.083, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U189.jpg', 'highlights': ['The process involves importing the Pandas library, reading a sample superstore sales dataset from an Excel file, and displaying the top two rows to identify the presence of duplicate values.', 'The speaker explains the necessity of certain duplicate values in a dataset while also outlining scenarios where obtaining unique values for specific columns is required, providing a clear understanding of the motivation behind removing duplicates.', 'The speaker promises to demonstrate a straightforward method for removing duplicate values using Pandas library methods, catering to research requirements or other purposes necessitating unique values for specific columns.']}, {'end': 439.875, 'segs': [{'end': 147.929, 'src': 'embed', 'start': 120.932, 'weight': 
3, 'content': [{'end': 128.437, 'text': 'So the first thing is looking at the entire data set which is in the object orders and using the method duplicate it.', 'start': 120.932, 'duration': 7.505}, {'end': 129.858, 'text': 'What it will do?', 'start': 128.497, 'duration': 1.361}, {'end': 141.725, 'text': 'that wherever it will identify the duplicate row, it will mark it as duplicate and show true or false, based on whether it is duplicate or not.', 'start': 129.858, 'duration': 11.867}, {'end': 145.968, 'text': "so I'll just go ahead and execute this.", 'start': 141.725, 'duration': 4.243}, {'end': 147.929, 'text': 'and here, as you can see,', 'start': 145.968, 'duration': 1.961}], 'summary': "Analyzing the 'orders' object to identify and mark duplicate rows as true or false.", 'duration': 26.997, 'max_score': 120.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U120932.jpg'}, {'end': 242.216, 'src': 'embed', 'start': 212.375, 'weight': 2, 'content': [{'end': 222.462, 'text': 'all right, okay, here it is after 29, you have three dots, and so there may be a scenario that you may have a true over here.', 'start': 212.375, 'duration': 10.087}, {'end': 233.114, 'text': "so let's say how, if you want to know about that, first thing is getting the count how many rows are actually duplicated?", 'start': 222.462, 'duration': 10.652}, {'end': 239.936, 'text': 'so for that again, orders.duplicated.sum.', 'start': 233.114, 'duration': 6.822}, {'end': 242.216, 'text': 'sum basically true.', 'start': 239.936, 'duration': 2.28}], 'summary': 'After 29, there are three duplicated rows in the dataset.', 'duration': 29.841, 'max_score': 212.375, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U212375.jpg'}, {'end': 338.835, 'src': 'embed', 'start': 316.054, 'weight': 1, 'content': [{'end': 324.079, 'text': 'what you can do is copy this entire piece, this previous 
command, and write orders and within that, paste it.', 'start': 316.054, 'duration': 8.025}, {'end': 332.945, 'text': 'so what it is doing is wherever the true is coming, it will filter out the data set by that and since it will be so many rows seven, six, zero,', 'start': 324.079, 'duration': 8.866}, {'end': 335.647, 'text': "four so let's see the first two rows.", 'start': 332.945, 'duration': 2.702}, {'end': 338.835, 'text': 'so head and two, all right.', 'start': 335.647, 'duration': 3.188}], 'summary': "Filter data using 'orders' to retrieve first two rows.", 'duration': 22.781, 'max_score': 316.054, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U316054.jpg'}, {'end': 413.87, 'src': 'embed', 'start': 387.694, 'weight': 0, 'content': [{'end': 394.638, 'text': 'but there is an option if you go down keep first, last and false, and by default is first.', 'start': 387.694, 'duration': 6.944}, {'end': 405.085, 'text': 'so probably, if you want to use the last row in this data set, you can use keep is equals to last.', 'start': 394.638, 'duration': 10.447}, {'end': 413.87, 'text': 'so in this case, Since we are looking at the last row that we want to keep it, you are getting a different set of results.', 'start': 405.085, 'duration': 8.785}], 'summary': "The option 'keep=last' can be used to retain the last row in a dataset, resulting in different set of results.", 'duration': 26.176, 'max_score': 387.694, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U387694.jpg'}], 'start': 120.932, 'title': 'Identifying duplicate rows', 'summary': 'Demonstrates techniques for identifying and managing duplicate rows in a data set using the orders object. 
it covers detecting duplicates, counting duplicated rows, and filtering data based on duplicate values.', 'chapters': [{'end': 439.875, 'start': 120.932, 'title': 'Detecting and managing duplicate rows in data sets', 'summary': 'Demonstrates how to identify and manage duplicate rows in a data set using the orders object, including detecting duplicates, counting duplicated rows, and filtering data based on duplicate values.', 'duration': 318.943, 'highlights': ["The method 'duplicated()' is used to identify duplicate rows in the 'orders' data set and marks them as true or false, indicating whether they are duplicate or not, with the default result showing no duplicate rows.", "The count of duplicated rows in the 'orders' data set is obtained using 'orders.duplicated().sum()', resulting in no duplicated rows, and the process is explained using the sum of true values as 1 and false values as 0.", "The chapter explains how to filter the data set based on duplicate values by copying the previous command and filtering out the data set based on the 'true' values, providing a practical example of the first two duplicated rows using the 'head(2)' function.", "The concept of 'keep_first', 'keep_last', and 'false' arguments is introduced to manage duplicate rows, allowing users to specify whether to keep the first or last duplicate row or include all duplicate rows in the data set."]}], 'duration': 318.943, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U120932.jpg', 'highlights': ["The concept of 'keep_first', 'keep_last', and 'false' arguments is introduced to manage duplicate rows, allowing users to specify whether to keep the first or last duplicate row or include all duplicate rows in the data set.", "The chapter explains how to filter the data set based on duplicate values by copying the previous command and filtering out the data set based on the 'true' values, providing a practical example of the first two 
duplicated rows using the 'head(2)' function.", "The count of duplicated rows in the 'orders' data set is obtained using 'orders.duplicated().sum()', resulting in no duplicated rows, and the process is explained using the sum of true values as 1 and false values as 0.", "The method 'duplicated()' is used to identify duplicate rows in the 'orders' data set and marks them as true or false, indicating whether they are duplicate or not, with the default result showing no duplicate rows."]}, {'end': 650.708, 'segs': [{'end': 592.631, 'src': 'embed', 'start': 538.275, 'weight': 0, 'content': [{'end': 546.18, 'text': "so maybe let's say drop underscore, duplicate subset and i want to indicate that look at just couple of columns.", 'start': 538.275, 'duration': 7.905}, {'end': 556.889, 'text': "so maybe I don't want to specify it over here, I want to specify it over here that look at just customer name or maybe some other column name,", 'start': 546.18, 'duration': 10.709}, {'end': 558.591, 'text': 'maybe ship mode.', 'start': 556.889, 'duration': 1.702}, {'end': 565.157, 'text': 'so this will subset and look at only these two column names when dropping the duplicates.', 'start': 558.591, 'duration': 6.566}, {'end': 575.624, 'text': 'all right, and if I show you the other parameter, shift tab tab, The next parameter is keep is equals to first.', 'start': 565.157, 'duration': 10.467}, {'end': 578.886, 'text': 'so as I indicated it earlier, that by default it is first.', 'start': 575.624, 'duration': 3.262}, {'end': 583.107, 'text': 'but if you want, you can use the last.', 'start': 578.886, 'duration': 4.221}, {'end': 592.631, 'text': 'that means drop the last row because you are dropping the duplicates and if you want, you can drop the first row or you can use the false one.', 'start': 583.107, 'duration': 9.524}], 'summary': 'Dropping duplicates can be specified for specific columns using the keep parameter, with options for first, last or false.', 'duration': 54.356, 
'max_score': 538.275, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U538275.jpg'}, {'end': 644.906, 'src': 'embed', 'start': 616.271, 'weight': 1, 'content': [{'end': 621.236, 'text': 'getting impacted on your data set you need to use in place is equals to true.', 'start': 616.271, 'duration': 4.965}, {'end': 630.481, 'text': 'So whatever changes, drop, duplicates or anything else you are doing related to this, like keep first or subset by indicating in place,', 'start': 622.318, 'duration': 8.163}, {'end': 631.941, 'text': 'is equals to true.', 'start': 630.481, 'duration': 1.46}, {'end': 639.484, 'text': 'your final data set, or the data set that you are using, will be having that changes once you finally view it.', 'start': 631.941, 'duration': 7.543}, {'end': 644.906, 'text': "So that's about identifying the duplicates and finally dropping it.", 'start': 640.444, 'duration': 4.462}], 'summary': 'Using inplace=true to impact data set changes, including dropping duplicates.', 'duration': 28.635, 'max_score': 616.271, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U616271.jpg'}], 'start': 440.475, 'title': 'Data set analysis and duplicate removal', 'summary': 'Focuses on analyzing a data set with 8399 rows and 21 columns, specifically examining the customername column, identifying and removing 795 duplicate entries, and using subset arguments to focus on specific columns for duplicate removal. 
it also explains the parameters keep, in place, and the impact of using them while identifying and dropping duplicates in a dataset in pandas, emphasizing default settings and options for altering data, with the importance of in place parameter for data set changes.', 'chapters': [{'end': 565.157, 'start': 440.475, 'title': 'Data set analysis and duplicate removal', 'summary': 'Focuses on analyzing a data set with 8399 rows and 21 columns, specifically examining the customername column, identifying and removing 795 duplicate entries, and using subset arguments to focus on specific columns for duplicate removal.', 'duration': 124.682, 'highlights': ['The data set contains 8399 rows and 21 columns. The dataset consists of 8399 rows and 21 columns, providing a comprehensive overview of the data structure.', '795 duplicate entries were identified and removed from the customerName column. After removing duplicates from the customerName column, the dataset now contains 795 unique customer names.', "Subset arguments can be used to focus on specific columns during duplicate removal. 
Subset arguments allow for the selection of specific columns, such as 'customer name' or 'ship mode', when removing duplicate entries."]}, {'end': 650.708, 'start': 565.157, 'title': 'Identifying and dropping duplicates in pandas', 'summary': 'Explains the parameters keep, in place, and the impact of using them while identifying and dropping duplicates in a dataset in pandas, emphasizing default settings and options for altering data, with the importance of in place parameter for data set changes.', 'duration': 85.551, 'highlights': ['The in place parameter, when set to true, ensures that the changes made to the data set while dropping duplicates are reflected in the final data set, emphasizing the importance of this parameter for maintaining the changes in the data set for future use.', 'The keep parameter allows the choice of keeping the first or last occurrence when dropping duplicates, providing flexibility in managing duplicate data within the dataset.', 'The explanation of the parameters keep, in place, and their impact on the dataset provides a clear understanding of how to handle and manipulate duplicate data effectively in Pandas, enhancing the overall data management process.']}], 'duration': 210.233, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ix9iGffOA5U/pics/ix9iGffOA5U440475.jpg', 'highlights': ['Subset arguments allow for the selection of specific columns during duplicate removal.', 'The in place parameter, when set to true, ensures that the changes made to the data set while dropping duplicates are reflected in the final data set.', 'The keep parameter allows the choice of keeping the first or last occurrence when dropping duplicates.']}], 'highlights': ['The process involves importing the Pandas library, reading a sample superstore sales dataset from an Excel file, and displaying the top two rows to identify the presence of duplicate values.', 'The speaker explains the necessity of certain duplicate values in a 
dataset while also outlining scenarios where obtaining unique values for specific columns is required, providing a clear understanding of the motivation behind removing duplicates.', 'The speaker promises to demonstrate a straightforward method for removing duplicate values using Pandas library methods, catering to research requirements or other purposes necessitating unique values for specific columns.', "The concept of 'keep_first', 'keep_last', and 'false' arguments is introduced to manage duplicate rows, allowing users to specify whether to keep the first or last duplicate row or include all duplicate rows in the data set.", "The chapter explains how to filter the data set based on duplicate values by copying the previous command and filtering out the data set based on the 'true' values, providing a practical example of the first two duplicated rows using the 'head(2)' function.", "The count of duplicated rows in the 'orders' data set is obtained using 'orders.duplicated().sum()', resulting in no duplicated rows, and the process is explained using the sum of true values as 1 and false values as 0.", "The method 'duplicated()' is used to identify duplicate rows in the 'orders' data set and marks them as true or false, indicating whether they are duplicate or not, with the default result showing no duplicate rows.", 'Subset arguments allow for the selection of specific columns during duplicate removal.', 'The in place parameter, when set to true, ensures that the changes made to the data set while dropping duplicates are reflected in the final data set.', 'The keep parameter allows the choice of keeping the first or last occurrence when dropping duplicates.']}
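The dropping step and the keep, subset and inplace parameters discussed in the transcript can be sketched like this. Again a small hypothetical DataFrame stands in for the superstore orders data from the video.

```python
import pandas as pd

# Hypothetical stand-in for the "orders" data set used in the video.
orders = pd.DataFrame({
    "CustomerName": ["Alice", "Alice", "Bob", "Carol"],
    "ShipMode":     ["Air", "Ground", "Air", "Air"],
})

# Default behavior: full-row duplicates are dropped, keeping the first
# occurrence. Here no full row repeats, so nothing is removed.
print(len(orders.drop_duplicates()))              # 4

# subset= restricts the duplicate check to the listed columns, e.g. to get
# one row per unique customer name.
unique_customers = orders.drop_duplicates(subset=["CustomerName"])
print(unique_customers["CustomerName"].tolist())  # ['Alice', 'Bob', 'Carol']

# keep="last" retains the last occurrence instead of the first;
# keep=False would drop every copy of a duplicated value.
last_kept = orders.drop_duplicates(subset=["CustomerName"], keep="last")
print(last_kept["ShipMode"].tolist())             # ['Ground', 'Air', 'Air']

# inplace=True mutates orders itself instead of returning a new DataFrame,
# so the change persists when you next view the data set.
orders.drop_duplicates(subset=["CustomerName"], inplace=True)
print(len(orders))                                # 3
```

Note that drop_duplicates returns None when inplace=True, which is why the result is not assigned in the last step.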