title
Daniel Chen: Cleaning and Tidying Data in Pandas | PyData DC 2018

description
PyData DC 2018

Most of your time is going to involve processing/cleaning/munging data. How do you know your data is clean? Sometimes you know what you need beforehand, but other times you don't. We'll cover the basics of looking at your data and getting started with the Pandas Python library, and then focus on how to "tidy" and reshape data. We'll finish with applying customized processing functions on our data.

===

www.pydata.org

PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

0:00 Introduction
0:18 Setup: GitHub Repo, Jupyter Setup
5:35 Loading Datasets - pandas.read_csv()
7:43 Dataset / Dataframe At A Glance
7:53 Get First Rows: df.head()
8:58 Get Columns: df.columns
9:15 Get Index: df.index
9:37 Get Body: df.values
10:46 Get Shape: df.shape
12:04 Get Summary Info: df.info()
13:12 Filtering, Slicing a Dataset / Dataframe
13:25 Extract a Single Column: df['col_name']
14:12 Dataframe vs Series
14:41 Extract N Columns: df[['col1_name', 'col2_name']]
15:51 Pandas Version: pd.__version__
16:26 Extract Rows: df.iloc
17:30 Extract Rows: df.loc vs df.iloc vs df.ix
18:45 Extract Rows: df.iloc
19:37 Extract Rows: df.ix - Deprecated
20:38 Extract Multiple Rows and Columns
22:00 Extract Rows using Boolean Subsetting
23:24 Extract Rows using Multiple Boolean Subsetting
24:55 Cleaning a Dataset / Dataframe
25:38 General Issues according to the "Tidy Data" Research Paper
29:45 Issue 1: Column Headers are Values and not Variable Names
30:19 Load Pew Dataset
32:55 Transform Columns into Rows: pd.melt()
36:59 Load Billboard Dataset
37:05 Transform Columns into Rows: pd.melt()
42:00 Issue 2: Multiple Variables are Stored in 1 Column
43:06 Load Ebola Dataset
46:22 Transform Columns into Rows: pd.melt()
47:14 Split Column using String Manipulation through Accessors
51:19 Extract Column / Series from Accessor Split: accessor.get()
53:13 Add Column to Dataframe
54:13 Contracted Form for pd.melt() and Accessor String Manipulation
56:10 Issue 3: Variables Stored in Rows and Columns
56:25 Load Weather Dataset
58:30 Transform Columns into Rows: pd.melt()
1:01:00 Transform Rows into Columns
1:02:00 Transform Rows into Columns: pd.pivot() vs pd.pivot_table()
1:04:30 Transform Rows into Columns: pd.pivot_table()
1:06:19 Flatten Nested / Hierarchical Table: df.reset_index()
1:07:42 Issue 4: Multiple Types of Observational Unit in Same Table (i.e. De-normalized Table)
1:09:43 Extract Observational Unit into New Dataframe, Drop Duplicates
1:11:30 Create "key" for Extracted Observational Unit Dataframe
1:12:11 Save New Dataframe: df.to_csv()
1:13:22 Merge / Join Dataframes on Common Columns
1:16:25 Randomly Sample a Dataframe
1:17:15 Note on Memory Consumption between All 3 Dataframes
1:18:25 Summary from "Tidy Data" Research Paper
1:20:06 Q&A
1:21:21 Q&A 1: Simulating R's Chaining in Python
1:24:49 Q&A 2: Best Practices on Bracket Notation vs Chaining

Huge shout-out to https://github.com/KMurphs for the video timestamps! Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVideoTimestamps
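As a quick companion to the outline above, here is a minimal sketch of the first steps the tutorial covers (loading a dataset and taking a first look). The file name, relative path, and tab separator are assumptions about how the tutorial repository is laid out:

    import pandas as pd

    # Path and separator are assumptions; the Gapminder file used in the
    # tutorial is tab-delimited, so read_csv needs an explicit sep.
    df = pd.read_csv('data/gapminder.tsv', sep='\t')

    print(df.head())  # first five rows of the data frame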

detail
{'title': 'Daniel Chen: Cleaning and Tidying Data in Pandas | PyData DC 2018', 'heatmap': [{'end': 577.765, 'start': 523.136, 'weight': 0.715}, {'end': 2100.165, 'start': 1991.197, 'weight': 0.938}, {'end': 3416.861, 'start': 3350.853, 'weight': 0.743}, {'end': 3984.246, 'start': 3930.376, 'weight': 0.966}, {'end': 4098.491, 'start': 4034.97, 'weight': 0.748}, {'end': 4251.252, 'start': 4140.603, 'weight': 1}], 'summary': 'Tutorial covers accessing GitHub and using JupyterLab, an introductory tutorial on pandas for data analysis, dataframe operations, tidy data and data transformation, reshaping and transforming data, dataset transformation and manipulation, data storage and pivot functions, and merging, tidying, and chaining data in Python for efficient data cleaning and manipulation.', 'chapters': [{'end': 243.884, 'segs': [{'end': 106.759, 'src': 'embed', 'start': 0.829, 'weight': 0, 'content': [{'end': 8.392, 'text': "So those of you guys who are coming in, this is the GitHub repository for everything that we'll be working on.", 'start': 0.829, 'duration': 7.563}, {'end': 12.753, 'text': "So even after the tutorial, the notebooks, I'll publish them all here.", 'start': 8.512, 'duration': 4.241}, {'end': 21.156, 'text': "What I do want you guys to do is if you can go to that link that's highlighted up here, and if you don't know your way around GitHub,", 'start': 13.593, 'duration': 7.563}, {'end': 28.158, 'text': 'you can click this green button here and click Download ZIP and that will at least download everything,', 'start': 22.216, 'duration': 5.942}, {'end': 34.56, 'text': "and the most important part about that is because the data sets that we'll be working with are all stored in this repository.", 'start': 28.158, 'duration': 6.402}, {'end': 38.981, 'text': 'So if you want to follow along at least, you can follow along that way.', 'start': 34.56, 'duration': 4.421}, {'end': 54.104, 'text': "So I'll just keep talking a little bit about setting up.", 'start': 51.763, 'duration': 2.341}, {'end': 57.226, 'text': 'So when you download this?', 'start': 54.745, 'duration': 2.481}, {'end': 63.669, 'text': "If you download it, I'll assume that you downloaded the ZIP version of it, and I'm not going to assume that everyone in the room knows Git.", 'start': 57.226, 'duration': 6.443}, {'end': 70.912, 'text': "Unzip it somewhere that's convenient, so the desktop is probably a useful place, at least for the tutorial.", 'start': 65.029, 'duration': 5.883}, {'end': 78.816, 'text': "For those of you on Windows and Macs, I sort of assume that you're running the Anaconda distribution of Python.", 'start': 72.073, 'duration': 6.743}, {'end': 86.863, 'text': "And so if you open up the Anaconda Navigator, if it's the first time you're opening it, it might take a while, take a little bit of time.", 'start': 79.537, 'duration': 7.326}, {'end': 90.566, 'text': 'But there is a button in there that says Jupyter Notebook or Jupyter Lab.', 'start': 86.943, 'duration': 3.623}, {'end': 94.469, 'text': 'And once you click that, a browser should open up.', 'start': 91.226, 'duration': 3.243}, {'end': 100.274, 'text': 'What I am going to show is a little bit different, because this is a Linux machine.', 'start': 95.87, 'duration': 4.404}, {'end': 106.759, 'text': 'But let me just hide all this random other stuff.', 'start': 100.294, 'duration': 6.465}], 'summary': 'GitHub repository for tutorial materials; download ZIP to access datasets; use Anaconda for Python', 'duration': 105.93, 'max_score': 0.829, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU829.jpg'}, {'end': 191.323, 'src': 'embed', 'start': 157.232, 'weight': 4, 'content': [{'end': 160.853, 'text': "I'll probably end up using JupyterLab for this tutorial.", 'start': 157.232, 'duration': 3.621}, {'end': 167.207, 'text': 'And JupyterLab is, you know, you can think of it as like the next iteration of the notebook system.', 'start': 162.503, 'duration': 4.704}, {'end': 172.351, 'text': "It's pretty handy because you'll have a lot more convenient access to everything.", 'start': 167.247, 'duration': 5.104}, {'end': 174.432, 'text': 'So you can browse your files right away.', 'start': 172.391, 'duration': 2.041}, {'end': 178.776, 'text': "What's also really cool is CSV files can also be rendered as spreadsheets in JupyterLab,", 'start': 174.512, 'duration': 4.264}, {'end': 181.017, 'text': "which is really useful if you're just playing around with data.", 'start': 178.776, 'duration': 2.241}, {'end': 191.323, 'text': "What's also really useful is you can have multiple notebooks open at the same time and you can have like two views of the same thing,", 'start': 181.738, 'duration': 9.585}], 'summary': 'Jupyterlab provides convenient access, renders csv as spreadsheets, and allows multiple open notebooks.', 'duration': 34.091, 'max_score': 157.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU157232.jpg'}], 'start': 0.829, 'title': 'Accessing github and using jupyterlab', 'summary': 'Emphasizes the importance of accessing the github repository for downloading notebooks and datasets, providing guidance on unzipping files. it also explains how to open and use jupyterlab in anaconda navigator, highlighting its advantages such as convenient file access and the ability to have multiple notebooks open at the same time.', 'chapters': [{'end': 78.816, 'start': 0.829, 'title': 'Github repository and data sets', 'summary': 'Emphasizes the importance of accessing the github repository for downloading the necessary notebooks and datasets, providing guidance on unzipping the files and assuming the audience is using the anaconda distribution of python.', 'duration': 77.987, 'highlights': ['The GitHub repository will contain all the notebooks and datasets for the tutorial, allowing participants to access and download the necessary resources.', "Instructions are provided for downloading the content from the repository by clicking the green button and selecting 'Download ZIP', enabling participants to obtain the required files easily.", 'Participants are advised to unzip the downloaded files to a convenient location, such as the desktop, to facilitate easy access during the tutorial.', 'Assumption is made that the audience is using the Anaconda distribution of Python, indicating the expected environment for the participants.']}, {'end': 243.884, 'start': 79.537, 'title': 'Using jupyterlab for data analysis', 'summary': 'Explains how to open and use jupyterlab in anaconda navigator, highlighting its advantages such as convenient access to files, rendering csv files as spreadsheets, and the ability to have multiple notebooks open at the same time.', 'duration': 164.347, 'highlights': ['JupyterLab provides convenient access to files, rendering CSV files as spreadsheets, and allows having multiple notebooks open at the same time, enhancing the data analysis process.', 'Opening JupyterLab in Anaconda Navigator may take some time for the 
first time.', 'JupyterLab is considered the next iteration of the notebook system, offering more convenient access to files and multiple notebook views.', 'CSV files can be rendered as spreadsheets in JupyterLab, providing a useful tool for data analysis.']}], 'duration': 243.055, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU829.jpg', 'highlights': ['The GitHub repository contains all the necessary notebooks and datasets for the tutorial.', 'Instructions for downloading content from the repository are provided, enabling easy access to required files.', 'Participants are advised to unzip downloaded files to a convenient location for easy access during the tutorial.', 'Assumption is made that the audience is using the Anaconda distribution of Python.', 'JupyterLab provides convenient access to files and allows having multiple notebooks open simultaneously.', 'Opening JupyterLab in Anaconda Navigator may take some time for the first time.', 'JupyterLab is considered the next iteration of the notebook system, offering more convenient access to files and multiple notebook views.', 'CSV files can be rendered as spreadsheets in JupyterLab, providing a useful tool for data analysis.']}, {'end': 688.609, 'segs': [{'end': 275.38, 'src': 'embed', 'start': 244.304, 'weight': 4, 'content': [{'end': 245.485, 'text': 'I am pulling up my notes.', 'start': 244.304, 'duration': 1.181}, {'end': 247.907, 'text': 'All right.', 'start': 247.687, 'duration': 0.22}, {'end': 252.451, 'text': "Oh, that's not helpful.", 'start': 251.23, 'duration': 1.221}, {'end': 261.894, 'text': 'Render All right, that works.', 'start': 254.372, 'duration': 7.522}, {'end': 266.017, 'text': 'All right, so this is a very introductory tutorial.', 'start': 262.396, 'duration': 3.621}, {'end': 275.38, 'text': "So if you've never worked with pandas before or worked with tabular data in Python, this is generally the target audience that I'm aiming for.", 'start': 266.077, 'duration': 9.303}], 'summary': 'Introductory tutorial for working with pandas and tabular data in python.', 'duration': 31.076, 'max_score': 244.304, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU244304.jpg'}, {'end': 351.093, 'src': 'embed', 'start': 302.821, 'weight': 0, 'content': [{'end': 308.562, 'text': "If you've seen other tutorials that I've given, you'll notice that I pretty much jump from chapter one to chapter six.", 'start': 302.821, 'duration': 5.741}, {'end': 312.683, 'text': "But we'll cover holes along the way, so don't worry too much about it.", 'start': 309.063, 'duration': 3.62}, {'end': 316.689, 'text': "So hopefully you guys have The data set's downloaded.", 'start': 313.664, 'duration': 3.025}, {'end': 319.851, 'text': 'And you have this notebook system open.', 'start': 317.69, 'duration': 2.161}, {'end': 323.995, 'text': "And we'll sort of just live code along and just talk through the commands as I go through them.", 'start': 320.352, 'duration': 3.643}, {'end': 327.658, 'text': 'So Python comes with a bunch of libraries.', 'start': 324.996, 'duration': 2.662}, {'end': 333.102, 'text': "And by default, just to make things faster to load, they don't load up everything at one go.", 'start': 328.258, 'duration': 4.844}, {'end': 336.305, 'text': 'So we have to import our, oops, let me make this bigger.', 'start': 333.723, 'duration': 2.582}, {'end': 341.804, 'text': 'So we have to import libraries.', 'start': 340.202, 'duration': 
1.602}, {'end': 344.867, 'text': "So the library we're going to use is called pandas.", 'start': 342.284, 'duration': 2.583}, {'end': 347.93, 'text': 'And this is our library for reading in tabular data.', 'start': 345.067, 'duration': 2.863}, {'end': 351.093, 'text': 'And so if we want, we can now use pandas.', 'start': 349.292, 'duration': 1.801}], 'summary': 'Tutorial on importing and using pandas for tabular data in python.', 'duration': 48.272, 'max_score': 302.821, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU302821.jpg'}, {'end': 577.765, 'src': 'heatmap', 'start': 523.136, 'weight': 0.715, 'content': [{'end': 524.578, 'text': "There's the column names on the top.", 'start': 523.136, 'duration': 1.442}, {'end': 526.679, 'text': 'There is this index on the left.', 'start': 524.818, 'duration': 1.861}, {'end': 530.161, 'text': 'So in this case, the index is really the row name, the row number.', 'start': 526.759, 'duration': 3.402}, {'end': 534.303, 'text': "And then there's the body of the data frame.", 'start': 530.921, 'duration': 3.382}, {'end': 538.645, 'text': 'So you can access those three components however you want.', 'start': 535.643, 'duration': 3.002}, {'end': 540.946, 'text': 'So we could say df.columns.', 'start': 538.985, 'duration': 1.961}, {'end': 544.412, 'text': 'And that will give us just the columns of our data set.', 'start': 542.17, 'duration': 2.242}, {'end': 547.034, 'text': "You'll notice that columns does not have round brackets.", 'start': 544.772, 'duration': 2.262}, {'end': 549.135, 'text': "It's just an attribute of the data frame.", 'start': 547.214, 'duration': 1.921}, {'end': 551.397, 'text': "It's not really like a function that's being called.", 'start': 549.155, 'duration': 2.242}, {'end': 557.321, 'text': "If we want to access the quote unquote row names, it's df.index.", 'start': 552.758, 'duration': 4.563}, {'end': 559.903, 'text': "And that's how you get the index portion.", 'start': 558.502, 'duration': 1.401}, {'end': 564.246, 'text': 'So if you end up going down pandas and you work with a lot of time series data,', 'start': 560.363, 'duration': 3.883}, {'end': 572.522, 'text': "You'll end up setting the date time as the index so you can access different rows by the time of day that way.", 'start': 564.998, 'duration': 7.524}, {'end': 577.765, 'text': "And then in the middle right here, the actual body of our data frame, that's called the values.", 'start': 573.442, 'duration': 4.323}], 'summary': 'The data frame consists of columns, index, and values, which can be accessed using df.columns, df.index, and df.values respectively.', 'duration': 54.629, 'max_score': 523.136, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU523135.jpg'}, {'end': 559.903, 'src': 'embed', 'start': 530.921, 'weight': 3, 'content': [{'end': 534.303, 'text': "And then there's the body of the data frame.", 'start': 530.921, 'duration': 3.382}, {'end': 538.645, 'text': 'So you can access those three components however you want.', 'start': 535.643, 'duration': 3.002}, {'end': 540.946, 'text': 'So we could say df.columns.', 'start': 538.985, 'duration': 1.961}, {'end': 544.412, 'text': 'And that will give us just the columns of our data set.', 'start': 542.17, 'duration': 2.242}, {'end': 547.034, 'text': "You'll notice that columns does not have round brackets.", 'start': 544.772, 'duration': 2.262}, {'end': 549.135, 'text': "It's just an attribute of the 
data frame.", 'start': 547.214, 'duration': 1.921}, {'end': 551.397, 'text': "It's not really like a function that's being called.", 'start': 549.155, 'duration': 2.242}, {'end': 557.321, 'text': "If we want to access the quote unquote row names, it's df.index.", 'start': 552.758, 'duration': 4.563}, {'end': 559.903, 'text': "And that's how you get the index portion.", 'start': 558.502, 'duration': 1.401}], 'summary': 'Access data frame components using df.columns and df.index.', 'duration': 28.982, 'max_score': 530.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU530921.jpg'}, {'end': 674.698, 'src': 'embed', 'start': 644.898, 'weight': 2, 'content': [{'end': 647.38, 'text': 'So what else do you do when you first load up your data set for the first time?', 'start': 644.898, 'duration': 2.482}, {'end': 653.386, 'text': "Well, there's this attribute called shape, and it gives you the number of rows and columns in your data set.", 'start': 648.121, 'duration': 5.265}, {'end': 657.669, 'text': 'So this data set, this Gapminder data set, has 1, 704 rows and six columns.', 'start': 653.866, 'duration': 3.803}, {'end': 666.177, 'text': 'And so, if you can imagine, you have a million rows or hundreds of columns instead of inspecting each and one of them,', 'start': 659.131, 'duration': 7.046}, {'end': 667.438, 'text': 'this is a quick way to make sure.', 'start': 666.177, 'duration': 1.261}, {'end': 669.3, 'text': 'did I get my data properly?', 'start': 667.438, 'duration': 1.862}, {'end': 674.698, 'text': "If you're reading data from instruments, this stuff usually ends up being really specific.", 'start': 670.134, 'duration': 4.564}], 'summary': "Using the 'shape' attribute to quickly verify data set with 1,704 rows and 6 columns.", 'duration': 29.8, 'max_score': 644.898, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU644898.jpg'}], 'start': 244.304, 'title': 'Pandas and data analysis', 'summary': 'Covers an introductory tutorial on working with pandas for tidying data in python, targeting new users and focusing on data manipulation. 
it also introduces the process of reading and analyzing tabular data using the pandas library in python, emphasizing the read_csv function, data frame components, and data set shape for quality assurance.', 'chapters': [{'end': 323.995, 'start': 244.304, 'title': 'Intro to pandas: tidying data in python', 'summary': 'Covers an introductory tutorial on working with pandas for tidying data in python, targeting those new to pandas and tabular data, aiming to skip over unnecessary details and focus on data manipulation, with a live coding approach.', 'duration': 79.691, 'highlights': ['The chapter targets beginners who have not worked with pandas or tabular data in Python, focusing on the tidying data portion and data manipulations, aiming to provide a magical experience for learners.', 'The tutorial aims to skip over unnecessary details and directly focus on tidying data, providing a faster pace for learners, with a focus on live coding and explaining commands as they are executed.', 'The chapter encourages learners to have the data set downloaded and the notebook system open, emphasizing a hands-on, interactive learning approach.']}, {'end': 688.609, 'start': 324.996, 'title': 'Reading and analyzing tabular data with pandas', 'summary': 'Introduces the process of reading tabular data using the pandas library in python, with a focus on using the read_csv function, accessing data frame components, and checking the shape of the data set for quality assurance.', 'duration': 363.613, 'highlights': ['Pandas library is used for reading in tabular data, with a specific function called read CSV for reading CSV files. The chapter emphasizes the use of the pandas library for reading tabular data, using the read_csv function to read CSV files into a data frame.', 'Accessing components of a data frame, including column names, index, and values, is demonstrated using df.columns, df.index, and df.values. The process of accessing different components of a data frame in Pandas is explained, showcasing the methods for accessing column names, index, and values within the data frame.', "The shape attribute of a data frame provides the number of rows and columns in the data set, offering a quick way to inspect the data's dimensions. 
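A short sketch of the shape and info checks described here, including the "tuple object is not callable" mistake the transcript warns about below; the file path is an assumption:

    import pandas as pd

    df = pd.read_csv('data/gapminder.tsv', sep='\t')  # path is an assumption

    print(df.shape)  # (1704, 6) for the tutorial's Gapminder data
    df.info()        # column dtypes, non-null counts, memory usage

    # df.shape() would raise TypeError: 'tuple' object is not callable,
    # because shape is an attribute that has already returned a tuple.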
The shape attribute of a data frame is highlighted as a quick method to determine the number of rows and columns in the data set, providing a rapid way to verify data integrity."]}], 'duration': 444.305, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU244304.jpg', 'highlights': ['The tutorial aims to skip over unnecessary details and directly focus on tidying data, providing a faster pace for learners, with a focus on live coding and explaining commands as they are executed.', 'The chapter encourages learners to have the data set downloaded and the notebook system open, emphasizing a hands-on, interactive learning approach.', "The shape attribute of a data frame provides the number of rows and columns in the data set, offering a quick way to inspect the data's dimensions.", 'Accessing components of a data frame, including column names, index, and values, is demonstrated using df.columns, df.index, and df.values.', 'The chapter targets beginners who have not worked with pandas or tabular data in Python, focusing on the tidying data portion and data manipulations, aiming to provide a magical experience for learners.', 'Pandas library is used for reading in tabular data, with a specific function called read CSV for reading CSV files.']}, {'end': 1485.673, 'segs': [{'end': 731.532, 'src': 'embed', 'start': 690.097, 'weight': 3, 'content': [{'end': 695.52, 'text': "These attributes, like columns, sometimes I'll put columns with round brackets, and you just might forget.", 'start': 690.097, 'duration': 5.423}, {'end': 699.622, 'text': 'And so if you end up doing something like df.shape with round brackets, meaning like hey,', 'start': 695.98, 'duration': 3.642}, {'end': 705.766, 'text': "I'm calling the shape function on the data frame and not just the shape attribute, you'll get an error message, something like this", 'start': 699.622, 'duration': 6.144}, {'end': 708.167, 'text': "In this case, it says it's a tuple object.", 'start': 706.386, 'duration': 1.781}, {'end': 711.689, 'text': "It's not callable because what got returned back is a tuple.", 'start': 708.207, 'duration': 3.482}, {'end': 714.091, 'text': "So think of that as like Python list that you can't change.", 'start': 711.729, 'duration': 2.362}, {'end': 718.513, 'text': "And so it doesn't know what this as a function does.", 'start': 714.631, 'duration': 3.882}, {'end': 719.914, 'text': "That's why you get an error like that.", 'start': 718.693, 'duration': 1.221}, {'end': 724.009, 'text': 'So other stuff that you do.', 'start': 723.069, 'duration': 0.94}, {'end': 731.532, 'text': "If you don't really have too many columns, one really useful thing is just saying like, hey, give me the info of this data frame.", 'start': 726.67, 'duration': 4.862}], 'summary': 'Be cautious with python dataframe attributes and functions to avoid errors.', 'duration': 41.435, 'max_score': 690.097, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU690097.jpg'}, {'end': 907.394, 'src': 'embed', 'start': 883.666, 'weight': 1, 'content': [{'end': 890.438, 'text': "So what happens if we want more columns in our dataset? 
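The lines that follow answer this question with a Python list of column names; as a minimal sketch (file path assumed as before):

    import pandas as pd

    df = pd.read_csv('data/gapminder.tsv', sep='\t')  # path is an assumption

    # Bracket access is the safe way to grab 'pop', since df.pop is
    # already a DataFrame method name.
    country = df['country']                  # one name -> a Series
    subset = df[['country', 'year', 'pop']]  # a list of names -> a DataFrame
    print(type(country), type(subset))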
Sorry, let's see, this is subset.", 'start': 883.666, 'duration': 6.772}, {'end': 898.35, 'text': "So sometimes what you want to do is like, hey, I have this giant data set, but I don't care about all of these other columns.", 'start': 893.288, 'duration': 5.062}, {'end': 899.991, 'text': 'I only care about a subset over these columns.', 'start': 898.37, 'duration': 1.621}, {'end': 907.394, 'text': "So what you can do in Python is if you're trying to specify by name multiple things, you can put in a Python list.", 'start': 900.831, 'duration': 6.563}], 'summary': 'In python, specify a subset of columns using a list.', 'duration': 23.728, 'max_score': 883.666, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU883666.jpg'}, {'end': 1269.223, 'src': 'embed', 'start': 1244.513, 'weight': 0, 'content': [{'end': 1250.935, 'text': "Let's say we want to subset our columns and also start filtering by rows.", 'start': 1244.513, 'duration': 6.422}, {'end': 1254.617, 'text': 'So you end up using LOC to subset both rows and columns.', 'start': 1251.556, 'duration': 3.061}, {'end': 1260.96, 'text': 'And under the hood, if you really want to be really verbose on how you type this stuff.', 'start': 1256.358, 'duration': 4.602}, {'end': 1264.261, 'text': "LOC, if you've worked in R, the notation will be really similar.", 'start': 1260.96, 'duration': 3.301}, {'end': 1269.223, 'text': "You can have a comma, and there's a portion to the left of the comma and a portion to the right of the comma.", 'start': 1265.441, 'duration': 3.782}], 'summary': 'Using loc in python for subsetting rows and columns, similar to r notation.', 'duration': 24.71, 'max_score': 1244.513, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU1244513.jpg'}, {'end': 1436.013, 'src': 'embed', 'start': 1392.85, 'weight': 2, 'content': [{'end': 1404.313, 'text': "If you're trying to do this Boolean subsetting and you want to do, for example, multiple conditions or multiple Boolean cases,", 'start': 1392.85, 'duration': 11.463}, {'end': 1409.794, 'text': 'the trick is you have to wrap each individual statement in a set of round brackets.', 'start': 1404.313, 'duration': 5.481}, {'end': 1413.695, 'text': "So let's say we wanted year, that's 1967, and then also df pop.", 'start': 1411.514, 'duration': 2.181}, {'end': 1434.572, 'text': "greater than, I don't know, what is this? A million? 
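A minimal sketch of the loc/iloc comma notation and the Boolean subsetting being walked through here; the column names follow the Gapminder example and the path is an assumption:

    import pandas as pd

    df = pd.read_csv('data/gapminder.tsv', sep='\t')  # path is an assumption

    # loc: rows go to the left of the comma, columns to the right.
    print(df.loc[0:5, ['country', 'pop']])

    # iloc: the same idea, but everything is an integer position.
    print(df.iloc[0:5, [0, 4]])

    # Multiple Boolean conditions: wrap each one in round brackets and
    # combine with & (and) or | (or); 1_000_000 uses the underscore as a
    # visual marker inside the integer.
    print(df.loc[(df['year'] == 1967) & (df['pop'] > 1_000_000)])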
That seems kind of low.", 'start': 1422.349, 'duration': 12.223}, {'end': 1436.013, 'text': 'Maybe this will work.', 'start': 1435.412, 'duration': 0.601}], 'summary': 'Use round brackets for multiple boolean conditions in boolean subsetting.', 'duration': 43.163, 'max_score': 1392.85, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU1392850.jpg'}], 'start': 690.097, 'title': 'Dataframe operations and subsetting in python', 'summary': "Discusses the significance of correct syntax in dataframe operations, highlighting the impact of using round brackets instead of square brackets, the usefulness of the 'info' function, and methods for subsetting and filtering data frames in python, including column and row subset methods, loc and iloc usage, and boolean subsetting with multiple conditions.", 'chapters': [{'end': 731.532, 'start': 690.097, 'title': 'Avoiding errors in dataframe operations', 'summary': "Discusses the importance of using the correct syntax when operating on a dataframe, highlighting the consequences of using round brackets instead of square brackets and the usefulness of the 'info' function.", 'duration': 41.435, 'highlights': ["Using round brackets instead of square brackets for dataframe operations can lead to errors, such as receiving a 'tuple object not callable' message.", "The 'info' function is useful for obtaining information about the dataframe, especially when dealing with a smaller number of columns."]}, {'end': 1485.673, 'start': 732.312, 'title': 'Subsetting data frames in python', 'summary': 'Discusses how to subset and filter data frames in python, including methods for subsetting columns and rows, using loc and iloc, and performing boolean subsetting with multiple conditions.', 'duration': 753.361, 'highlights': ['Using LOC and ILOC to subset columns and rows The chapter covers the usage of LOC and ILOC to subset both columns and rows in data frames, with examples of specifying rows to the left of the comma and columns to the right, and using slicing notation for filtering.', "Performing Boolean subsetting with multiple conditions It explains the technique of using round brackets to wrap individual statements and using ampersand for 'AND' and vertical pipe for 'OR' to perform Boolean subsetting with multiple conditions, as well as using the underscore as a visual marker for integers in Python.", 'Subsetting data frames using column names It elaborates on using square brackets to subset data frames by specifying column names and saving the subset to a variable, as well as using Python lists to subset by multiple column names simultaneously.']}], 'duration': 795.576, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU690097.jpg', 'highlights': ['Using LOC and ILOC to subset columns and rows The chapter covers the usage of LOC and ILOC to subset both columns and rows in data frames, with examples of specifying rows to the left of the comma and columns to the right, and using slicing notation for filtering.', 'Subsetting data frames using column names It elaborates on using square brackets to subset data frames by specifying column names and saving the subset to a variable, as well as using Python lists to subset by multiple column names simultaneously.', "Performing Boolean subsetting with multiple conditions It explains the technique of using round brackets to wrap individual statements and using ampersand for 'AND' and vertical pipe for 'OR' to perform 
Boolean subsetting with multiple conditions, as well as using the underscore as a visual marker for integers in Python.", "Using round brackets instead of square brackets for dataframe operations can lead to errors, such as receiving a 'tuple object not callable' message.", "The 'info' function is useful for obtaining information about the dataframe, especially when dealing with a smaller number of columns."]}, {'end': 2114.753, 'segs': [{'end': 1574.99, 'src': 'embed', 'start': 1550.254, 'weight': 0, 'content': [{'end': 1555.736, 'text': "So let's just talk about tidy data and what does it mean when your dataset is quote unquote clean.", 'start': 1550.254, 'duration': 5.482}, {'end': 1559.719, 'text': 'And so there is this really nice paper by Hadley Wickham.', 'start': 1557.417, 'duration': 2.302}, {'end': 1564.382, 'text': 'He comes from the R world, but stuff like this is totally relevant.', 'start': 1560.56, 'duration': 3.822}, {'end': 1565.643, 'text': "It's language agnostic.", 'start': 1564.422, 'duration': 1.221}, {'end': 1570.467, 'text': "And so what does it mean to have clean data sets? And that's sort of the reason what this paper talks about.", 'start': 1566.224, 'duration': 4.243}, {'end': 1574.99, 'text': "It's like, hey, we have these two data sets defined by table one and table two.", 'start': 1570.867, 'duration': 4.123}], 'summary': "Tidy data means clean datasets, according to Hadley Wickham's paper.", 'duration': 24.736, 'max_score': 1550.254, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU1550254.jpg'}, {'end': 1693.361, 'src': 'embed', 'start': 1670.604, 'weight': 1, 'content': [{'end': 1678.409, 'text': "But the key is once you get your data in a clean or tidy format, it's pretty easy to convert it to one another and that's what we'll talk about.", 'start': 1670.604, 'duration': 7.805}, {'end': 1684.674, 'text': 'The formal definition of tidy data is what Hadley Wickham talked about in his paper.', 'start': 1680.311, 'duration': 4.363}, {'end': 1690.759, 'text': 'Each variable forms a column, each observation forms a row, each type of observational unit forms a table.', 'start': 1685.375, 'duration': 5.384}, {'end': 1693.361, 'text': "We'll probably really only talk about the first two.", 'start': 1691.439, 'duration': 1.922}], 'summary': 'Tidy data: each variable as a column, each observation as a row.', 'duration': 22.757, 'max_score': 1670.604, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU1670604.jpg'}, {'end': 1769.697, 'src': 'embed', 'start': 1745.046, 'weight': 4, 'content': [{'end': 1752.112, 'text': "And so that's what we'll go through in this tutorial, is just walk through these first three things as defined in this paper.", 'start': 1745.046, 'duration': 7.066}, {'end': 1757.546, 'text': "And if you do end up reading this paper, we'll actually work with the exact same data sets, except for one of them.", 'start': 1753.342, 'duration': 4.204}, {'end': 1759.928, 'text': "You really don't need to read the entire thing.", 'start': 1758.407, 'duration': 1.521}, {'end': 1765.893, 'text': 'By the time you get to chapter four, or end of, yeah.', 'start': 1761.009, 'duration': 4.884}, {'end': 1769.697, 'text': "By the time you get to section four, that's probably all you need to know.", 'start': 1765.973, 'duration': 3.724}], 'summary': 'Tutorial covers first three things as defined in paper, working with same datasets, limited need to read 
entire paper.', 'duration': 24.651, 'max_score': 1745.046, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU1745046.jpg'}, {'end': 1967.691, 'src': 'embed', 'start': 1938.001, 'weight': 3, 'content': [{'end': 1944.123, 'text': "So you can think of long data as a lot more rows, since it's long, versus wide data, which is just a bunch of columns.", 'start': 1938.001, 'duration': 6.122}, {'end': 1952.286, 'text': 'So in this case, this is an example of wide data, because we have a bunch of columns that we really want into long format.', 'start': 1944.943, 'duration': 7.343}, {'end': 1957.383, 'text': "So hopefully if I show you the code and run it, it'll make a little bit more sense.", 'start': 1953.74, 'duration': 3.643}, {'end': 1967.691, 'text': "So if we are going from wide data to long data and this is the reason why the function is named this way you can think about if you have a sword and you're trying to melt the sword.", 'start': 1958.104, 'duration': 9.587}], 'summary': 'Long data has more rows, while wide data has more columns. Converting from wide to long data is like melting a sword.', 'duration': 29.69, 'max_score': 1938.001, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU1938001.jpg'}, {'end': 2100.165, 'src': 'heatmap', 'start': 1991.197, 'weight': 0.938, 'content': [{'end': 1993.299, 'text': "And then there's a bunch of other parameters in there.", 'start': 1991.197, 'duration': 2.102}, {'end': 1998.302, 'text': 'So one of those parameters is called ID VARS.', 'start': 1993.479, 'duration': 4.823}, {'end': 2007.588, 'text': "And ID VARS, essentially what you are trying to do is you're trying to take a bunch of columns and turn them into one column.", 'start': 2000.603, 'duration': 6.985}, {'end': 2009.829, 'text': "So you're going from wide to long.", 'start': 2008.628, 'duration': 1.201}, {'end': 2015.803, 'text': "And ID vars is essentially saying hey, what are the column or columns that you don't want to touch right?", 'start': 2010.339, 'duration': 5.464}, {'end': 2020.086, 'text': 'And in this example, we want to leave the religion column alone.', 'start': 2016.103, 'duration': 3.983}, {'end': 2022.849, 'text': 'We actually want to melt down all of the other columns.', 'start': 2020.207, 'duration': 2.642}, {'end': 2030.514, 'text': "So we can say something like ID vars is religion, meaning that is the column that we don't want changed.", 'start': 2025.951, 'duration': 4.563}, {'end': 2035.418, 'text': 'So let me save that into a variable called pew long.', 'start': 2033.397, 'duration': 2.021}, {'end': 2046.65, 'text': "And if we look at it, we now took all of those columns that weren't specified by ID vars, so religion, and we converted them,", 'start': 2038.38, 'duration': 8.27}, {'end': 2049.331, 'text': 'or made them long into a column by default.', 'start': 2046.65, 'duration': 2.681}, {'end': 2050.553, 'text': "It's called variable and value.", 'start': 2049.351, 'duration': 1.202}, {'end': 2056.398, 'text': "So if I don't use head, hopefully this will show.", 'start': 2052.393, 'duration': 4.005}, {'end': 2061.201, 'text': 'You can see for a given religion, we have an income bracket and then like a count.', 'start': 2056.938, 'duration': 4.263}, {'end': 2064.063, 'text': 'And that is the tidy version of that data set.', 'start': 2061.261, 'duration': 2.802}, {'end': 2070.443, 'text': 'Each column is a variable, so religion, in this case, variable and 
value.', 'start': 2066.4, 'duration': 4.043}, {'end': 2072.284, 'text': 'Each row is now an observation.', 'start': 2070.862, 'duration': 1.422}, {'end': 2091.637, 'text': 'Say again? The index, are the indices unique? I believe the indices are unique.', 'start': 2077.608, 'duration': 14.029}, {'end': 2096.219, 'text': 'I have to, I never actually ran into a situation where I was playing with indices.', 'start': 2091.657, 'duration': 4.562}, {'end': 2100.165, 'text': "when I'm doing this, so I don't know the actual answer to your question.", 'start': 2096.822, 'duration': 3.343}], 'summary': 'Demonstration of converting wide to long format using id vars parameter', 'duration': 108.968, 'max_score': 1991.197, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU1991197.jpg'}], 'start': 1485.673, 'title': 'Tidy data and data transformation', 'summary': 'Introduces the concept of tidy data, its characteristics, and versatility, emphasizing the formal definition and the process of transforming wide data to long data, with a focus on specific examples and functions.', 'chapters': [{'end': 1670.123, 'start': 1485.673, 'title': 'Introduction to tidy data', 'summary': 'Introduces the concept of tidy data by explaining its characteristics, the implications of data cleanliness, and the versatility of tidy data in different data processing tasks.', 'duration': 184.45, 'highlights': ['The chapter introduces the concept of tidy data by explaining its characteristics and implications, and mentions a paper by Hadley Wickham as a reference. The chapter discusses the characteristics of tidy data and refers to a paper by Hadley Wickham, emphasizing the importance of clean data sets.', 'The chapter explains the implications of having data in a tidy format, highlighting its ease of use in processing and its suitability for various tasks like statistical modeling and result presentation. The chapter highlights the implications of tidy data, emphasizing its suitability for statistical modeling and result presentation, as well as its ease of use in data processing.', 'Different versions of data sets are discussed, emphasizing that each version serves different purposes and there is no single best version. 
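A minimal sketch of the melt call just described, using the Pew data set from the transcript; the file name and path are assumptions:

    import pandas as pd

    pew = pd.read_csv('data/pew.csv')  # path is an assumption

    # id_vars names the column(s) to leave alone; every other column is
    # melted down into the default 'variable' and 'value' columns.
    pew_long = pd.melt(pew, id_vars='religion')
    print(pew_long.head())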
The chapter discusses different versions of data sets, emphasizing that each version serves different purposes and there is no single best version.']}, {'end': 1859.523, 'start': 1670.604, 'title': 'Tidying data for analysis', 'summary': 'Discusses the formal definition of tidy data, emphasizing the importance of each variable forming a column, each observation forming a row, and the process of cleaning or tidying data sets, with a focus on the first three problems defined in the paper.', 'duration': 188.919, 'highlights': ['The formal definition of tidy data is presented as each variable forming a column, each observation forming a row, and each type of observational unit forming a table, with a focus on the first two aspects.', 'The tutorial focuses on walking through the first three problems defined in the paper, offering insights into the process of cleaning or tidying data sets.', 'The chapter emphasizes the significance of tidying data for analysis, providing practical examples using the Pandas library to import and tidy datasets, demonstrating the process of simplifying typing and data loading.']}, {'end': 2114.753, 'start': 1863.891, 'title': 'Transforming wide data to long data', 'summary': "Discusses the process of transforming wide data to long data using the example of a dataset involving religion, income brackets, and counts, emphasizing the use of the 'melt' function and the concept of id vars.", 'duration': 250.862, 'highlights': ['The concept of long data and wide data is explained, with long data having more rows and wide data having multiple columns.', "The 'melt' function in pandas is used to transform wide data to long data, with an example involving the 'PEW' dataset and the 'ID VARS' parameter.", 'The process results in a dataset where each column represents a variable, such as religion, income bracket, and count, and each row serves as an observation.']}], 'duration': 629.08, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU1485673.jpg', 'highlights': ['The chapter introduces the concept of tidy data by explaining its characteristics and implications, and mentions a paper by Hadley Wickham as a reference.', 'The formal definition of tidy data is presented as each variable forming a column, each observation forming a row, and each type of observational unit forming a table, with a focus on the first two aspects.', 'The chapter explains the implications of having data in a tidy format, highlighting its ease of use in processing and its suitability for various tasks like statistical modeling and result presentation.', 'The concept of long data and wide data is explained, with long data having more rows and wide data having multiple columns.', 'The tutorial focuses on walking through the first three problems defined in the paper, offering insights into the process of cleaning or tidying data sets.', 'The chapter emphasizes the significance of tidying data for analysis, providing practical examples using the Pandas library to import and tidy datasets, demonstrating the process of simplifying typing and data loading.']}, {'end': 2660.069, 'segs': [{'end': 2203.271, 'src': 'embed', 'start': 2114.773, 'weight': 0, 'content': [{'end': 2124.338, 'text': "All right, so if we look at this, anything that you didn't specify in IDVars get thrown into this other parameter.", 'start': 2114.773, 'duration': 9.565}, {'end': 2127.395, 'text': 'called value vars.', 'start': 2125.495, 'duration': 1.9}, {'end': 2131.897, 'text': 'So 
the function takes really two things, ID vars and value vars.', 'start': 2127.876, 'duration': 4.021}, {'end': 2138.438, 'text': 'And whatever you specify in value vars, if you leave ID vars alone, that will be in there and vice versa.', 'start': 2132.017, 'duration': 6.421}, {'end': 2139.078, 'text': "So you don't have to.", 'start': 2138.458, 'duration': 0.62}, {'end': 2142.359, 'text': "Whatever is easier or shorter to type, that's the one that you use.", 'start': 2139.158, 'duration': 3.201}, {'end': 2146.2, 'text': "You can also use both of them if you're also trying to subset columns at the same time.", 'start': 2142.979, 'duration': 3.221}, {'end': 2151.71, 'text': 'but the other two parameters in here, var name and value name.', 'start': 2147.568, 'duration': 4.142}, {'end': 2155.913, 'text': "that's how we change the default value variable and value columns right?", 'start': 2151.71, 'duration': 4.203}, {'end': 2171.422, 'text': 'So if we take the same exact piece of code and we say var name is now income, we can also say value name is.', 'start': 2156.973, 'duration': 14.449}, {'end': 2172.923, 'text': 'I think I have count.', 'start': 2171.422, 'duration': 1.501}, {'end': 2187.388, 'text': 'Now if we look at our data set, we actually get some sensible variable names for us.', 'start': 2175.684, 'duration': 11.704}, {'end': 2195.049, 'text': "So if you ever end up in a situation where you have a bunch of columns and you're like hey, those are clearly variables, I want to put into a model,", 'start': 2187.688, 'duration': 7.361}, {'end': 2203.271, 'text': "that's all you have to do is turn this, melt your data set, specify the columns you want fixed or the columns you want melted,", 'start': 2195.049, 'duration': 8.222}], 'summary': 'The function takes two things, id vars and value vars to subset columns, with options to change default variable and column names.', 'duration': 88.498, 'max_score': 2114.773, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU2114773.jpg'}, {'end': 2488.161, 'src': 'embed', 'start': 2392.848, 'weight': 4, 'content': [{'end': 2407.727, 'text': "For a given week, what is that song's rating? Why might you want to do something like this? 
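To make the var_name and value_name parameters described above concrete, a minimal sketch (same assumed Pew file as before):

    import pandas as pd

    pew = pd.read_csv('data/pew.csv')  # path is an assumption

    # var_name and value_name rename the default 'variable' and 'value'
    # columns, so the tidy result reads as income/count.
    pew_long = pd.melt(pew, id_vars='religion',
                       var_name='income', value_name='count')
    print(pew_long.head())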
Again, you can do this for modeling purposes.", 'start': 2392.848, 'duration': 14.879}, {'end': 2410.508, 'text': 'This is actually the definition of like tidy data.', 'start': 2407.787, 'duration': 2.721}, {'end': 2416.03, 'text': 'If you work with databases, databases prefer data in long format.', 'start': 2411.688, 'duration': 4.342}, {'end': 2420.811, 'text': 'So before you dump something into a database, you probably will end up doing something like this.', 'start': 2416.51, 'duration': 4.301}, {'end': 2428.505, 'text': 'I had this example where we had like this 1, 000 by 1, 000 row data set, and we tried to dump it into Postgres.', 'start': 2421.611, 'duration': 6.894}, {'end': 2433.97, 'text': "And it really didn't like the fact that we were trying to put in a data set or a table that had 1, 000 columns.", 'start': 2428.686, 'duration': 5.284}, {'end': 2442.977, 'text': 'And so what I had to do was actually turn this into this many, many thousand row object with two columns.', 'start': 2434.45, 'duration': 8.527}, {'end': 2445.119, 'text': 'And then it was fine.', 'start': 2443.478, 'duration': 1.641}, {'end': 2449.223, 'text': 'So different things need different forms.', 'start': 2445.179, 'duration': 4.044}, {'end': 2454.507, 'text': "But if you're also thinking about databases, this is also the format databases prefer to do things in.", 'start': 2449.323, 'duration': 5.184}, {'end': 2466.348, 'text': 'So remember shape, so billboard.shape.', 'start': 2463.826, 'duration': 2.522}, {'end': 2470.35, 'text': 'This billboard data set really had, originally had 317 rows and 81 columns.', 'start': 2466.768, 'duration': 3.582}, {'end': 2479.035, 'text': 'Uh-oh, what do I call this? Melt.', 'start': 2470.37, 'duration': 8.665}, {'end': 2484.999, 'text': 'Say again? 
Yeah.', 'start': 2482.758, 'duration': 2.241}, {'end': 2488.161, 'text': "So I'll just talk what I just said.", 'start': 2486.34, 'duration': 1.821}], 'summary': "Transforming data into long format for database compatibility and modeling purposes using the 'melt' function.", 'duration': 95.313, 'max_score': 2392.848, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU2392848.jpg'}, {'end': 2535.872, 'src': 'embed', 'start': 2503.999, 'weight': 6, 'content': [{'end': 2512.508, 'text': "Well, it's really the same number of cells, but one is definitely more favorable in terms of a format for data storage and just modeling in general.", 'start': 2503.999, 'duration': 8.509}, {'end': 2522.426, 'text': "All right, so let's go.", 'start': 2520.765, 'duration': 1.661}, {'end': 2525.187, 'text': "so that's usually pd.melt.", 'start': 2522.426, 'duration': 2.761}, {'end': 2527.128, 'text': "that's what you do when you have that first problem.", 'start': 2525.187, 'duration': 1.941}, {'end': 2529.029, 'text': 'like I have a bunch of columns and I just want to melt it down.', 'start': 2527.128, 'duration': 1.901}, {'end': 2533.891, 'text': 'What happens if you have multiple variables stored in one column?', 'start': 2530.37, 'duration': 3.521}, {'end': 2535.872, 'text': 'The paper.', 'start': 2534.772, 'duration': 1.1}], 'summary': 'Comparison of data storage formats and usage of pd.melt for data manipulation.', 'duration': 31.873, 'max_score': 2503.999, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU2503999.jpg'}, {'end': 2603.15, 'src': 'embed', 'start': 2563.441, 'weight': 7, 'content': [{'end': 2567.364, 'text': 'And then we have the same, and it goes all the way up to like male 65 and above.', 'start': 2563.441, 'duration': 3.923}, {'end': 2570.066, 'text': 'And then we have the same thing for female 0 to 14, et cetera, et cetera.', 'start': 2567.484, 'duration': 2.582}, {'end': 2574.249, 'text': 'Right. so we know the first problem is like we need to do something with melt,', 'start': 2571.287, 'duration': 2.962}, {'end': 2580.103, 'text': "because what we would really want is a column that's like male and an age group.", 'start': 2574.249, 'duration': 5.854}, {'end': 2583.985, 'text': 'Right? But the column name itself is storing two bits of information.', 'start': 2580.103, 'duration': 3.882}, {'end': 2586.485, 'text': "So that's a problem.", 'start': 2585.025, 'duration': 1.46}, {'end': 2588.426, 'text': 'So how do we work with that?', 'start': 2587.406, 'duration': 1.02}, {'end': 2603.15, 'text': 'So, in our data set, we have a, sorry.', 'start': 2591.187, 'duration': 11.963}], 'summary': 'Data set needs restructuring to separate gender and age groups.', 'duration': 39.709, 'max_score': 2563.441, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU2563441.jpg'}, {'end': 2660.069, 'src': 'embed', 'start': 2630.672, 'weight': 9, 'content': [{'end': 2633.832, 'text': 'She currently works at a school of public health,', 'start': 2630.672, 'duration': 3.16}, {'end': 2642.037, 'text': 'I believe at Hopkins, and she curated this data set from the 2014 Ebola outbreak.', 'start': 2633.832, 'duration': 8.205}, {'end': 2647.461, 'text': 'So at the time, Ebola was, like, the first big outbreak in West Africa.', 'start': 2642.257, 'duration': 5.204}, {'end': 2658.268, 'text': 'Countries were giving reports, and she was kind of reading those reports and just putting down these cases and death counts', 'start': 2648.541, 'duration': 9.727}, {'end': 2660.069, 'text': 'that the countries were reporting into a data set.', 'start': 2658.268, 'duration': 1.801}], 'summary': 'Curated data set from the 2014 Ebola outbreak in West Africa.', 'duration': 29.397, 'max_score': 2630.672, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU2630672.jpg'}], 'start': 2114.773, 'title': 'Reshaping and transforming data', 'summary': 'Covers reshaping data with id and value variables, using the melt function to transform a dataset into a tidy format, and addressing challenges of working with large datasets while discussing the importance of data format for storage and modeling.', 'chapters': [{'end': 2225.862, 'start': 2114.773, 'title': 'Reshaping data with id and value variables', 'summary': 'Explains how to use id vars and value vars to reshape a dataset, and how to change default value variable and value columns using var name and value name parameters in the process of creating a tidy data set.', 'duration': 111.089, 'highlights': ['The function takes two essential parameters, ID vars and value vars, where any unspecified variables in IDVars are placed into the value vars parameter.', 'The var name and value name parameters are utilized to change the default value variable and value columns, enabling the transformation of variable names in the dataset.', 'The process of melting the dataset and specifying fixed or melted columns results in the creation of a tidy dataset, facilitating subsequent data analysis and modeling.']}, {'end': 2420.811, 'start': 2231.683, 'title': 'Melt function and tidy data', 'summary': 'Discusses the use of the melt function to transform a data set called billboard.csv, which contains song ratings from week one to 72, into a tidy format, where the columns are values instead of variables, and demonstrates how this can be beneficial for modeling purposes and database compatibility.', 'duration': 189.128, 'highlights': ["The data set 'billboard.csv' contains song ratings from week one to 72 and needs to be transformed into a tidy format. 
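A minimal sketch of the Billboard reshaping being summarized here; the file path and the exact id_vars list are assumptions based on the columns the talk names:

    import pandas as pd

    billboard = pd.read_csv('data/billboard.csv')  # path is an assumption

    # Keep the identifying columns fixed and melt the weekly rating
    # columns down into week/rating pairs.
    billboard_long = pd.melt(
        billboard,
        id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
        var_name='week', value_name='rating')

    print(billboard.shape)       # the talk quotes 317 rows x 81 columns
    print(billboard_long.shape)  # far more rows, only a handful of columns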
The dataset 'billboard.csv' contains song ratings from week one to 72, and the goal is to transform it into a tidy format using the melt function.", 'The melt function is used to transform the data set into a tidy format, where the columns are values instead of variables. The melt function is employed to convert the data set into a tidy format, ensuring that the columns are values instead of variables.', 'Tidy data is preferred for modeling purposes and databases, as it allows for easier manipulation and compatibility. Tidy data is beneficial for modeling purposes and database compatibility due to its ease of manipulation and preferred format for databases.']}, {'end': 2660.069, 'start': 2421.611, 'title': 'Data transformation and storage', 'summary': 'Discusses the challenges of working with large datasets, the importance of data format for storage and modeling, and the process of transforming data using pd.melt and addressing issues with column names storing multiple pieces of information, using a 2014 ebola outbreak dataset as an example.', 'duration': 238.458, 'highlights': ['The importance of data format for storage and modeling is highlighted through an example of transforming a 1000x1000 row dataset with 1000 columns into a many-thousand row object with two columns, which was more favorable.', 'The process of transforming data using pd.melt is explained in the context of dealing with multiple variables stored in one column, with a demonstration using a dataset from the 2014 Ebola outbreak.', 'Challenges with column names storing multiple pieces of information are discussed, emphasizing the need to address this problem when working with datasets, illustrated through an example of column names storing demographic and age group information in the 2014 Ebola outbreak dataset.', 'The significance of shaping data for storage and modeling is emphasized, showcasing the transformation of a dataset with 317 rows and 81 columns to a more favorable format for data storage and modeling.', 'An example of curating a dataset from the 2014 Ebola outbreak, detailing the process of collecting and organizing case and death counts reported by countries during the outbreak, is presented, highlighting the practical application of data transformation and storage.']}], 'duration': 545.296, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU2114773.jpg', 'highlights': ['The process of melting the dataset and specifying fixed or melted columns results in the creation of a tidy dataset, facilitating subsequent data analysis and modeling.', 'The var name and value name parameters are utilized to change the default value variable and value columns, enabling the transformation of variable names in the dataset.', 'The function takes two essential parameters, ID vars and value vars, where any unspecified variables in IDVars are placed into the value vars parameter.', 'The melt function is used to transform the data set into a tidy format, where the columns are values instead of variables.', 'Tidy data is preferred for modeling purposes and databases, as it allows for easier manipulation and compatibility.', 'The importance of data format for storage and modeling is highlighted through an example of transforming a 1000x1000 row dataset with 1000 columns into a many-thousand row object with two columns, which was more favorable.', 'The process of transforming data using pd.melt is explained in the context of dealing with multiple variables stored in one column, 
Data transformation and storage (40:21 - 44:20)

The shape of the data matters for storage as well as modeling. The speaker recalls melting a table with 1,000 columns into a long object with many thousands of rows but only two columns, which was far more favorable to store; the billboard table's 317 rows by 81 columns likewise melt down into a storage-friendly long format. pd.melt then reappears for the next tidy-data issue, multiple variables stored in one column, demonstrated on a dataset from the 2014 Ebola outbreak, whose column names each store both a status (case count or death count) and a country.

From the talk (~43:00): "But the column name itself is storing two bits of information. So that's a problem. So how do we work with that?"

On the dataset's origin (~43:50): the researcher who curated it works at a school of public health, the speaker believes at Hopkins, and she assembled the data during the 2014 Ebola outbreak, the first big outbreak in West Africa. Countries were publishing reports, and she read those reports and recorded the case and death counts each country reported into a single dataset.

Dataset Transformation and String Splitting in Python (44:22 - 56:09)

This part transforms the Ebola dataset so that it has columns for date, day, country, whether a value is a case count or a death count, and the count itself, then uses string splitting to pull the overloaded column apart into new columns.

Dataset transformation (44:22 - 45:52)

Before the break, an audience question: can you slice columns without the usual indexers? Yes, but you have to write other Python code to do it. After a short break: we have this dataset, and what we want are the specific columns date, day, country, whether the value is a case count or a death count, and the actual value. The first step is the familiar one: if we melt the count columns down, keeping date and day fixed, that gets us part of the way there.
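A minimal sketch of that first step; the file name and the capitalized Date/Day column names are assumptions based on the dataset's conventional form.

```python
import pandas as pd

ebola = pd.read_csv('data/country_timeseries.csv')  # path assumed

# keep Date and Day fixed; every case/death column melts into rows
ebola_long = pd.melt(ebola, id_vars=['Date', 'Day'])
# ebola_long['variable'] now holds values like 'Cases_Guinea':
# two variables (status and country) packed into one column
```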
All right.", 'start': 2764.472, 'duration': 5.464}, {'end': 2774.6, 'text': 'So we know the first thing that we want to do.', 'start': 2771.017, 'duration': 3.583}, {'end': 2778.683, 'text': 'If we melted those columns down, that gets us part of the way there.', 'start': 2774.86, 'duration': 3.823}, {'end': 2781.434, 'text': "All right, so let's say, let's do that first.", 'start': 2779.753, 'duration': 1.681}], 'summary': 'In a data analysis discussion, a plan is made to slice specific columns and melt them down for further analysis.', 'duration': 72.741, 'max_score': 2708.693, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU2708693.jpg'}, {'end': 2839.64, 'src': 'embed', 'start': 2817.689, 'weight': 2, 'content': [{'end': 2826.534, 'text': 'And then the only problem here is this variable column now contains two bits of information.', 'start': 2817.689, 'duration': 8.845}, {'end': 2830.457, 'text': "So that is the, we did the first part, and now let's work on the second part.", 'start': 2827.115, 'duration': 3.342}, {'end': 2839.64, 'text': "So if we look at this, if I were to ask you like hey, so there's like case counts, death counts and then a country, how would you split this?", 'start': 2832.315, 'duration': 7.325}], 'summary': 'The variable column contains two bits of information, requiring splitting into case counts, death counts, and country.', 'duration': 21.951, 'max_score': 2817.689, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU2817689.jpg'}], 'start': 2662.441, 'title': 'Data set transformation and data manipulation in python', 'summary': 'Discusses transforming the data set to include specific columns for date, day, count type, country, and count to resolve issues with the current data structure, and covers the process of manipulating data and splitting strings in python to separate case count, death count, and country, creating new columns from the split data.', 'chapters': [{'end': 2752.133, 'start': 2662.441, 'title': 'Data set transformation', 'summary': 'Discusses the need to transform a data set to include columns for date, day, type of count, country, and count, aiming to resolve issues with the current data structure, and takes a short break before continuing.', 'duration': 89.692, 'highlights': ['The data set requires transformation to include columns for date, day, count type, country, and count to resolve existing data structure issues.', 'The chapter takes a short break at 2.16 and plans to resume at 2.21 to continue discussing the data set transformation.', "The need for specific columns like date, day, count type, country, and count is emphasized to address the current data set's problems."]}, {'end': 3369.973, 'start': 2752.543, 'title': 'Data manipulation and string splitting in python', 'summary': 'Covers the process of melting specific columns, using string split to separate case count, death count, and country, and creating new columns from the split data.', 'duration': 617.43, 'highlights': ["Using pd.melt to melt specific columns like date, day, and value to reshape the data, which goes up to 4 o'clock.", "Demonstrating how to use Python's string manipulation functions, such as split, to split the variable column into case count, death count, and country.", 'Explaining the process of treating a column as a string using the str accessor in pandas to open up string manipulation methods like split.', 'Illustrating how to create new columns for 
Data Storage and Pivot Functions (56:10 - 1:12:03)

This part shows how to recognize data stored in both rows and columns, fixes the weather dataset with melt, contrasts pivot with pivot_table (duplicate handling, aggregation options, and a small API change), and ends with flattening hierarchical results and normalizing tables for storage.

Identifying and resolving data storage issues (56:10 - 1:00:51)

In the weather dataset, days are stored in columns and the element column holds the values tmin and tmax, so variables are stored in rows as well as in columns. The symptom to look for is rows that are duplicated in every value except one. We don't want a column called element whose values are tmin and tmax, because then there is no minimum- or maximum-temperature variable to fit a model against. The first fix is one we have now seen three times: melt the day columns down, passing in the weather data, and rename the melted columns for clarity and consistency.
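A sketch of the weather melt; the column names are assumptions based on the dataset's conventional form (id, year, month, element, then one column per day).

```python
import pandas as pd

weather = pd.read_csv('data/weather.csv')  # path assumed

weather_melt = pd.melt(
    weather,
    id_vars=['id', 'year', 'month', 'element'],  # element holds 'tmin'/'tmax'
    var_name='day',
    value_name='temp',
)
```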
Pivot vs pivot_table (1:00:51 - 1:04:21)

pivot assumes there are no duplicate entries for a given combination of index and columns; if there are duplicates, the code is going to crash and throw an error. pivot_table instead lets you specify an aggregation function, so you can say: I want the sum, I want the mean, the min, the max, and so on. Those are the subtle differences between the two. The example here is duplicate temperature readings for the same year, month, and day, which pivot cannot handle. (The speaker also notes a small API change in newer versions of pandas that unified the interface, so melt can be called either as pd.melt or as a DataFrame method.)

Pivot_table and flattening hierarchical frames (1:04:21 - 1:12:03)

Passing the values parameter tells pivot_table which column, here temp, to use to populate the cells after the manipulation. The result has id, year, month, and day columns plus new tmax and tmin columns. Depending on your version of pandas there is also a dropna parameter that drops missing values, which is why January shows only day 30 rather than a pile of NaNs. This is almost exactly what we want, but it prints out oddly: there is a hierarchy in the rows and columns. Since the speaker jumps between R and Python on any given day, he prefers to bring everything down to the lowest common denominator and work with a regular flat data frame, which is what the reset_index method gives you.
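A sketch of the pivot-and-flatten step, continuing from the melted weather frame above; note that aggfunc defaults to the mean, which is how any duplicates would be resolved here.

```python
weather_tidy = weather_melt.pivot_table(
    index=['id', 'year', 'month', 'day'],  # what identifies one observation
    columns='element',                     # 'tmin'/'tmax' become columns
    values='temp',                         # fill the cells from temp
    dropna=False,                          # keep missing day/element combinations
)

# flatten the hierarchical index back into a regular flat data frame
weather_flat = weather_tidy.reset_index()
```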
There are some weird edge cases, but if you just want a regular flat data frame, run reset_index and it flattens everything down for you; sometimes you have to run reset_index twice, and that is nothing to worry about. With that, the weather data is a nice tidy data frame.

Multiple observational units in one table (1:07:42 - 1:12:03)

Last part: what do we do when multiple types of observational units are stored in the same table? The billboard dataset is an example. If you think about how databases work, the tidy (melted) version of the billboard data repeats year, artist, track, time, and date entered on every weekly row; the track "Loser" by Three Doors Down, for instance, is repeated 76 times. That is not good, so how do we fix it up to be more storage- and database-friendly? It is really just a combination of the column filtering shown earlier and drop_duplicates: create a songs dataset that holds only the information about each song.
It's just year.", 'start': 4196.783, 'duration': 2.521}, {'end': 4207.119, 'text': 'artist track time.', 'start': 4200.635, 'duration': 6.484}, {'end': 4210.342, 'text': "I'll leave the date entered separate, but it's fine.", 'start': 4207.7, 'duration': 2.642}, {'end': 4213.644, 'text': 'So we have our billboard songs.', 'start': 4211.983, 'duration': 1.661}, {'end': 4220.349, 'text': 'So this is the stuff that was repeated 76 times.', 'start': 4216.806, 'duration': 3.543}, {'end': 4225.712, 'text': 'So if we look at our shape, we have 24, 000 rows in this.', 'start': 4220.609, 'duration': 5.103}, {'end': 4237.297, 'text': 'But if we run this command called drop duplicates, IT WILL RETURN BACK ANOTHER DATA FRAME SUCH THAT ALL OF THE DUPLICATE ROWS ARE NO LONGER THERE.', 'start': 4226.573, 'duration': 10.724}, {'end': 4241.301, 'text': "SO LET'S REASSIGN THIS BACK TO OUR DATA SET.", 'start': 4238.358, 'duration': 2.943}, {'end': 4245.586, 'text': "AND NOW, IF WE LOOK AT SHAPE, IT'S REALLY LIKE 317, RIGHT?", 'start': 4241.301, 'duration': 4.285}, {'end': 4246.847, 'text': 'SO YOU CAN CLEARLY SEE.', 'start': 4245.926, 'duration': 0.921}, {'end': 4251.252, 'text': 'LIKE. WE WENT FROM 20, 000 TO 317, AND THIS JUST FOR DATA STORAGE.', 'start': 4246.847, 'duration': 4.405}], 'summary': "The song 'loser' by three doors down was repeated 76 times in the data set, but after using the command 'drop duplicates,' the number of rows reduced to 317, greatly improving data storage.", 'duration': 110.649, 'max_score': 4140.603, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iYie42M1ZyU/pics/iYie42M1ZyU4140603.jpg'}], 'start': 3370.553, 'title': 'Data storage and pivot functions', 'summary': "Discusses identifying and resolving data storage issues, recognizing symptoms, and using the melt function in the 'weather' dataset. 
Merging, Tidying, and Chaining Data (1:12:03 - 1:27:13)

This part merges the songs table back with the melted billboard data to create a ratings table, saves both as CSV files, looks at what tidying does for memory and storage, and closes with Q&A comparing chaining in R and Python.

Merging and saving billboard data (1:12:03 - 1:17:31)

With a key column on the songs table, merging it with the original melted dataset produces a billboard ratings dataset that carries the song ID and can be used for further analysis. The merge involves a left and a right data frame, a how parameter (inner, outer, left, or right), and on to name the common columns to join across, so it pays to understand how the two tables align. After filtering columns and dropping duplicates, both tables are saved out with to_csv, as billboard_songs.csv and billboard_ratings.csv.
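A sketch of the merge and save, continuing from the frames above with the same assumed column names.

```python
# attach each song's id to its weekly rating rows via the shared columns
billboard_ratings = billboard_melt.merge(
    billboard_songs,
    on=['year', 'artist', 'track', 'time'],
    how='left',
)[['id', 'date.entered', 'week', 'rating']]

billboard_songs.to_csv('billboard_songs.csv', index=False)
billboard_ratings.to_csv('billboard_ratings.csv', index=False)
```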
Tidy data for efficient storage (1:17:31 - 1:20:42)

One of the things df.info() reports at the bottom is the actual memory usage of the data frame. The original billboard frame is 1.3+ megabytes, while the separate songs and ratings frames are much smaller. That is the point of the last tidy-data principle, and it is mostly about storage: tidying separates out variables stored in rows and columns and gives each type of observational unit its own table. If you are actually going to combine data or fit a model, you merge the pieces back together and load the full 1.3-megabyte dataset into memory; but for storage, the cleaned-up, normalized pieces are far easier to keep, whether you have to zip them up and email them to someone or download them over a poor conference Wi-Fi connection.
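A sketch of the memory check; the talk uses plain info(), while memory_usage='deep' is an added option that counts object (string) columns more accurately.

```python
# info() prints dtypes plus the frame's memory footprint at the bottom
billboard_melt.info(memory_usage='deep')   # ~1.3+ MB for the combined frame
billboard_songs.info(memory_usage='deep')  # the normalized pieces are much smaller
```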
When you are ready to do your analysis, you stop making the data adhere to that last principle: you deliberately combine the tables back together, so the resulting dataset contains both song data and rating data, because that is what the analysis needs. And with that, the speaker wraps up the tutorial portion on time.

Chaining in R and Python (1:20:42 - 1:27:13)

Q&A: is there a preference or best practice around square-bracket access versus dot notation if you know you are going to be chaining extensively? The speaker doesn't know of an actual best practice; he uses the square-bracket notation a lot, simply because that is how his R scripts look (see the sketch below).
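A sketch contrasting the two styles on the billboard frame; the artist string is taken from the spoken example and may not match the exact value in the data.

```python
# bracket-style, one step at a time
subset = billboard_melt[billboard_melt['artist'] == 'Three Doors Down']
subset = subset[['track', 'week', 'rating']]

# dot-notation method chaining, wrapped in parentheses for readability
subset_chained = (
    billboard_melt
    .loc[billboard_melt['artist'] == 'Three Doors Down', ['track', 'week', 'rating']]
    .reset_index(drop=True)
)
```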
More broadly, R and Python are fundamentally different in their handling of functional and object-oriented programming, which shapes how chained commands and code end up structured in each language. While there is no strict best practice, extensive chaining can benefit from dot notation for clarity. The speaker's own approach is to write code so that it works first and potentially refactor it for clarity later, and he prefers not to chain commands too extensively, which avoids complications and keeps debugging easy.

Overall highlights

- The tutorial covers accessing the GitHub repository, which contains all the necessary notebooks and datasets, and working in JupyterLab, which gives convenient access to files and allows multiple notebooks open at once.
- It targets beginners who have not worked with pandas or tabular data in Python, and deliberately skips unnecessary details to focus on tidying data.
- The shape attribute of a data frame gives the number of rows and columns in the dataset.
- Tidy data means each variable forms a column and each observation forms a row; melting a dataset while specifying the fixed and melted columns produces that form, via the id_vars and value_vars parameters (any column not named in id_vars is melted).
- The Ebola dataset is transformed to have date, day, country, status (case or death count), and count columns; the weather dataset's day columns are melted down and renamed for clarity and consistency.
- The songs table is merged back with the melted billboard data to create a ratings table, and the normalized tables are saved as CSV files for easier storage.
- The Q&A compares chaining techniques in R and Python and the trade-offs between bracket and dot notation.