title
Hands-on dplyr tutorial for faster data manipulation in R

description
dplyr is a new R package for data manipulation. Using a series of examples on a dataset you can download, this tutorial covers the five basic dplyr "verbs" as well as a dozen other dplyr functions. Watch the follow-up tutorial: http://youtu.be/2mh1PqfsXVI View the R Markdown document: http://rpubs.com/justmarkham/dplyr-tutorial Download the source document: https://github.com/justmarkham/dplyr-tutorial Read about why I love dplyr: https://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/ Tutorial contents: 1. Introduction to dplyr (starts at 0:00) 2. Loading dplyr and the example dataset (starts at 2:29) 3. Understanding "local data frames" (starts at 3:23) 4. Verb #1: `filter` (starts at 5:17) 5. Verb #2: `select`, plus `contains`, `starts_with`, `ends_with`, `matches` (starts at 7:54) 6. Using chaining syntax for more readable code (starts at 9:34) 7. Verb #3: `arrange` (starts at 12:53) 8. Verb #4: `mutate` (starts at 13:55) 9. Verb #5: `summarise`, plus `group_by`, `summarise_each`, `n`, `n_distinct`, `tally` (starts at 15:31) 10. Window functions: `min_rank`, `top_n`, `lag` (starts at 26:47) 11. Convenience functions: `sample_n`, `sample_frac`, `glimpse` (starts at 32:44) 12. Connecting to databases (starts at 34:21) == RESOURCES == Reference manual and vignettes: http://cran.r-project.org/web/packages/dplyr/index.html July 2014 webinar: http://pages.rstudio.net/Webinar-Series-Recording-Essential-Tools-for-R.html July 2014 webinar code: https://github.com/rstudio/webinars/tree/master/2014-01 Tutorial by Hadley Wickham: https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a GitHub repo: https://github.com/hadley/dplyr List of releases: https://github.com/hadley/dplyr/releases == LET'S CONNECT! == Newsletter: https://www.dataschool.io/subscribe/ Twitter: https://twitter.com/justmarkham Facebook: https://www.facebook.com/DataScienceSchool/ LinkedIn: https://www.linkedin.com/in/justmarkham/

detail
{'title': 'Hands-on dplyr tutorial for faster data manipulation in R', 'heatmap': [{'end': 497.951, 'start': 440.828, 'weight': 0.974}, {'end': 939.294, 'start': 909.158, 'weight': 0.715}, {'end': 1009.861, 'start': 980.091, 'weight': 0.747}], 'summary': 'Tutorial on dplyr in r introduces its time-saving performance, basic verbs, working with databases, and examples with the latest release version 0.2. it covers data frame manipulation, filtering, column selection, chaining, and quantifiable examples. the tutorial also explains chaining in dplyr, data manipulation, and the use of dplyr functions for data analysis and database queries, including connections to various databases and additional learning resources.', 'chapters': [{'end': 288.546, 'segs': [{'end': 28.994, 'src': 'embed', 'start': 1.223, 'weight': 0, 'content': [{'end': 6.906, 'text': 'This video is an introduction to the dplyr package in R.', 'start': 1.223, 'duration': 5.683}, {'end': 18.051, 'text': 'dplyr is a package for data manipulation and exploration written by Hadley Wickham of ggplot2 fame and author of many other packages.', 'start': 6.906, 'duration': 11.145}, {'end': 24.453, 'text': "Dplyr is the next iteration of Plyr, and it's focused exclusively on data frames.", 'start': 19.332, 'duration': 5.121}, {'end': 28.994, 'text': 'So I use and I love Dplyr because it saves me time.', 'start': 25.113, 'duration': 3.881}], 'summary': 'Introduction to dplyr package in r for data manipulation, created by hadley wickham, focused on data frames.', 'duration': 27.771, 'max_score': 1.223, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc1223.jpg'}], 'start': 1.223, 'title': 'Introduction to dplyr in r', 'summary': 'Introduces the dplyr package in r, highlighting its time-saving performance, code readability, basic verbs, working with databases, joins, window functions, and examples with the latest release version 0.2.', 'chapters': [{'end': 73.142, 'start': 1.223, 'title': 'Introduction to dplyr package in r', 'summary': 'Introduces the dplyr package in r, highlighting its time-saving performance and code readability, and mentions its functionality and intended use for data frames.', 'duration': 71.919, 'highlights': ['Dplyr is a package for data manipulation and exploration in R, known for its time-saving performance and code readability.', 'It is the next iteration of Plyr, focused exclusively on data frames, and is authored by Hadley Wickham.', 'The speaker finds Dplyr valuable for saving time in writing and reading code.']}, {'end': 288.546, 'start': 73.763, 'title': 'Introduction to dplyr in r', 'summary': 'Introduces dplyr, a package in r for data manipulation, including basic verbs, working with databases, joins, window functions, and examples with the latest release version 0.2.', 'duration': 214.783, 'highlights': ['dplyr can work with data stored in databases and data tables dplyr allows working with data stored in databases and data tables, enabling users to manipulate data without transferring it, or learning SQL.', 'dplyr provides four different types of joins: inner, left, semi-join, and anti-join dplyr offers four types of joins, including inner, left, semi-join, and anti-join, providing flexibility in merging datasets.', 'The examples are based on the latest release, version 0.2, which was released in May The examples in the presentation are based on the latest release, version 0.2, providing up-to-date information for users.', 'Introduction to dplyr, a package in R for data manipulation, including basic verbs, working with databases, joins, and window functions The chapter introduces dplyr, a package in R for data manipulation, covering basic verbs, working with databases, joins, and window functions.']}], 'duration': 287.323, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc1223.jpg', 'highlights': ['Dplyr is a package for data manipulation and exploration in R, known for its time-saving performance and code readability.', 'dplyr can work with data stored in databases and data tables, enabling users to manipulate data without transferring it, or learning SQL.', 'dplyr provides four different types of joins: inner, left, semi-join, and anti-join, providing flexibility in merging datasets.', 'The examples in the presentation are based on the latest release, version 0.2, providing up-to-date information for users.']}, {'end': 634.392, 'segs': [{'end': 315.631, 'src': 'embed', 'start': 288.546, 'weight': 0, 'content': [{'end': 292.469, 'text': 'and this will show me 20 rows instead of the default of 10..', 'start': 288.546, 'duration': 3.923}, {'end': 299.575, 'text': "And then sometimes you just want to take a look at all the columns, and you don't want to do the str command.", 'start': 292.469, 'duration': 7.106}, {'end': 310.183, 'text': 'So the easiest way I found is just do a head of the local data frame and then coerce it to a regular data frame.', 'start': 300.495, 'duration': 9.688}, {'end': 315.631, 'text': "And as you can see, we're now looking at kind of a regular DataFrame printout.", 'start': 311.243, 'duration': 4.388}], 'summary': 'Show 20 rows instead of default 10, view all columns without str command, and convert to regular dataframe.', 'duration': 27.085, 'max_score': 288.546, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc288546.jpg'}, {'end': 400.467, 'src': 'embed', 'start': 371.124, 'weight': 3, 'content': [{'end': 377.286, 'text': "So the dplyr approach, is simpler to write and to read, and it's this right here.", 'start': 371.124, 'duration': 6.162}, {'end': 381.769, 'text': "And so I'll run that, and you'll see it produces the identical result.", 'start': 377.306, 'duration': 4.463}, {'end': 392.237, 'text': 'Note that the command structure is to first write the name of the local data frame or regular data frame,', 'start': 382.73, 'duration': 9.507}, {'end': 397.621, 'text': 'then write a condition and a comma and then write any other condition.', 'start': 392.237, 'duration': 5.384}, {'end': 400.467, 'text': 'And the comma, it interprets as an and.', 'start': 398.326, 'duration': 2.141}], 'summary': 'The dplyr approach is simpler to write and produces identical results.', 'duration': 29.343, 'max_score': 371.124, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc371124.jpg'}, {'end': 497.951, 'src': 'heatmap', 'start': 440.828, 'weight': 0.974, 'content': [{'end': 445.452, 'text': "So if you store anything in a row name, you'll want to write that to an explicit variable.", 'start': 440.828, 'duration': 4.624}, {'end': 451.077, 'text': 'Note that you can use a pipe for the OR condition.', 'start': 447.754, 'duration': 3.323}, {'end': 459.399, 'text': 'And you can also use the in infix operator if you prefer that.', 'start': 453.275, 'duration': 6.124}, {'end': 467.843, 'text': 'Just note that in the case of using this first construct, you do have to write the variable name twice.', 'start': 460.499, 'duration': 7.344}, {'end': 471.125, 'text': 'So the in infix operator is kind of nice.', 'start': 468.804, 'duration': 2.321}, {'end': 473.154, 'text': "So that's filter.", 'start': 472.133, 'duration': 1.021}, {'end': 480.038, 'text': 'Select is the next verb, and this is where you pick columns by name.', 'start': 474.475, 'duration': 5.563}, {'end': 485.242, 'text': "Now, again, I'll show you the base R approach for selecting these three columns.", 'start': 480.779, 'duration': 4.463}, {'end': 486.623, 'text': 'Just run that.', 'start': 485.742, 'duration': 0.881}, {'end': 497.951, 'text': 'And basically, I have to do open bracket comma and then create a character vector and then put these annoying quotation marks around each column name.', 'start': 487.423, 'duration': 10.528}], 'summary': 'Using filter and select in r to manipulate data columns.', 'duration': 57.123, 'max_score': 440.828, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc440828.jpg'}, {'end': 526.152, 'src': 'embed', 'start': 500.844, 'weight': 5, 'content': [{'end': 506.366, 'text': 'sometimes even hard to remember, the dplyr approach instead looks exactly like the filter command.', 'start': 500.844, 'duration': 5.522}, {'end': 517.369, 'text': 'First you write the name of the data frame and then you just separate between, with commas, the names of the columns you want to include.', 'start': 507.126, 'duration': 10.243}, {'end': 526.152, 'text': "So this is named select just because it does the same thing as a SQL select and I'll just run this right now.", 'start': 518.65, 'duration': 7.502}], 'summary': 'The dplyr approach for data manipulation resembles the filter command and uses a sql-like select function.', 'duration': 25.308, 'max_score': 500.844, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc500844.jpg'}], 'start': 288.546, 'title': 'Data frame manipulation and dplyr commands', 'summary': "Covers data frame manipulation techniques including changing row limits, displaying all columns, and using the 'filter' verb, and also explores dplyr commands for filtering, column selection, and chaining, with quantifiable examples.", 'chapters': [{'end': 346.197, 'start': 288.546, 'title': 'Data frame filtering and display', 'summary': "Discusses changing default row limits of a data frame, demonstrating a method to view all columns, and introduces the 'filter' verb for keeping rows matching a criteria using base r commands.", 'duration': 57.651, 'highlights': ['The chapter demonstrates changing the default row limit of a data frame to 20 from 10 for better visibility. Changing default row limit to 20, enhancing visibility', "The chapter explains a method to view all columns of a local data frame without using the 'str' command by using 'head' and coercing it to a regular data frame. Viewing all columns without 'str' command", "The chapter introduces the 'filter' verb for keeping rows matching a criteria using base R commands and provides an example demonstrating the filtering of rows. Introduction to 'filter' verb, filtering rows matching criteria"]}, {'end': 634.392, 'start': 346.809, 'title': 'Dplyr commands for data manipulation', 'summary': 'Explains the dplyr commands for filtering, selecting columns by name, using chaining/pipelining for easiest readability, and provides examples of base r approaches versus dplyr approaches with quantifiable data.', 'duration': 287.583, 'highlights': ['The dplyr approach for filtering is simpler to write and read, and produces identical results as the base R approach, with the command structure of writing the name of the local data frame, followed by conditions and commas. dplyr filtering approach simplicity and identical results, structure of command', 'The dplyr approach for selecting columns by name is simpler and more intuitive compared to the base R approach, as it only requires writing the data frame name and separating column names with commas. simplicity and intuitiveness of dplyr column selection approach compared to base R approach', 'The explanation of chaining or pipelining emphasizes writing dplyr code for easiest readability, demonstrated through examples of selecting unique carrier and departure delay columns and filtering for delays over 60 minutes. emphasized readability in dplyr code, examples of selecting unique carrier and departure delay columns, filtering for delays over 60 minutes']}], 'duration': 345.846, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc288546.jpg', 'highlights': ['Changing default row limit to 20, enhancing visibility', "Viewing all columns without 'str' command", "Introduction to 'filter' verb, filtering rows matching criteria", 'dplyr filtering approach simplicity and identical results, structure of command', 'simplicity and intuitiveness of dplyr column selection approach compared to base R approach', 'emphasized readability in dplyr code, examples of selecting unique carrier and departure delay columns, filtering for delays over 60 minutes']}, {'end': 1259.603, 'segs': [{'end': 719.392, 'src': 'embed', 'start': 659.963, 'weight': 2, 'content': [{'end': 669.712, 'text': 'but it uses this little operator percent greater than percent, which is an operator from the Magritter package.', 'start': 659.963, 'duration': 9.749}, {'end': 673.896, 'text': "It's imported from Magritter, so you don't have to use dplyr to use this.", 'start': 670.413, 'duration': 3.483}, {'end': 679.385, 'text': 'But basically, you read it out loud as the word then.', 'start': 675.083, 'duration': 4.302}, {'end': 690.489, 'text': 'So you say flights, then select unique carrier and departure delay, then filter by departure delay greater than 60.', 'start': 680.105, 'duration': 10.384}, {'end': 691.73, 'text': "And I'll run that.", 'start': 690.489, 'duration': 1.241}, {'end': 694.976, 'text': "It's much easier to read.", 'start': 693.655, 'duration': 1.321}, {'end': 700.4, 'text': "You'll notice that I don't have to repeat flights because basically the uh.", 'start': 695.276, 'duration': 5.124}, {'end': 712.167, 'text': 'this operator says take, take the um, take what you just did and put it as the first argument and then take the output of this, and then, uh,', 'start': 700.4, 'duration': 11.767}, {'end': 713.708, 'text': 'put it as the first argument here.', 'start': 712.167, 'duration': 1.541}, {'end': 715.029, 'text': "So you don't have to rewrite it.", 'start': 713.769, 'duration': 1.26}, {'end': 719.392, 'text': 'And again, you can just write commands kind of in a natural order.', 'start': 715.75, 'duration': 3.642}], 'summary': 'Using the percent greater than percent operator from magritter package for easier and more readable data manipulation.', 'duration': 59.429, 'max_score': 659.963, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc659963.jpg'}, {'end': 834.174, 'src': 'embed', 'start': 802.878, 'weight': 0, 'content': [{'end': 811.884, 'text': 'So flights, then you select which columns, then you just define which column you want to sort by.', 'start': 802.878, 'duration': 9.006}, {'end': 816.367, 'text': "And it's ascending sorting by default.", 'start': 811.904, 'duration': 4.463}, {'end': 819.89, 'text': 'So it starts with the smallest negative number and then goes up.', 'start': 816.387, 'duration': 3.503}, {'end': 828.072, 'text': 'And then if you want to do descending, you just wrap the variable name with DESC.', 'start': 821.23, 'duration': 6.842}, {'end': 834.174, 'text': "And so now you'll see that our largest departure delays are at the top of the list.", 'start': 828.952, 'duration': 5.222}], 'summary': 'Sorting flights data by departure delays, displaying largest delays at the top.', 'duration': 31.296, 'max_score': 802.878, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc802878.jpg'}, {'end': 939.294, 'src': 'heatmap', 'start': 909.158, 'weight': 0.715, 'content': [{'end': 915.223, 'text': "So I'm running this on flights with the mutate command and then I'm assigning it to flights.", 'start': 909.158, 'duration': 6.065}, {'end': 923.519, 'text': "Now also note that I didn't actually have to select things in order to use them for mutation.", 'start': 917.15, 'duration': 6.369}, {'end': 926.202, 'text': 'I was just doing that so you could see the results.', 'start': 923.599, 'duration': 2.603}, {'end': 929.867, 'text': "But you can mutate anything even if you haven't selected it.", 'start': 926.663, 'duration': 3.204}, {'end': 933.713, 'text': 'And our final verb.', 'start': 931.79, 'duration': 1.923}, {'end': 939.294, 'text': 'is Summarize, which is to reduce variables to values.', 'start': 934.572, 'duration': 4.722}], 'summary': 'Using mutate command to assign and manipulate data without selection in r programming.', 'duration': 30.136, 'max_score': 909.158, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc909158.jpg'}, {'end': 1009.861, 'src': 'heatmap', 'start': 980.091, 'weight': 0.747, 'content': [{'end': 987.257, 'text': 'So, if my goal is to calculate the average arrival delay to each destination,', 'start': 980.091, 'duration': 7.166}, {'end': 1004.876, 'text': 'what I do is I group by the destination And then I say for each group summarize it by calculating the mean arrival delay and assign that to a new variable called average underscore delay.', 'start': 987.257, 'duration': 17.619}, {'end': 1009.861, 'text': 'Now, because there are NAs, I have to do this na.rm equals true.', 'start': 1005.536, 'duration': 4.325}], 'summary': 'Calculate average arrival delay to each destination by grouping and summarizing, handling nas with na.rm=true.', 'duration': 29.77, 'max_score': 980.091, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc980091.jpg'}], 'start': 635.032, 'title': 'Dplyr data manipulation', 'summary': 'Covers chaining in dplyr using the percent greater than percent operator to streamline code, reordering rows, and adding new variables. it also explains the dplyr approach to data manipulation, including the mutate command to create new variables and the summarize function to reduce variables to values, with examples and comparisons to base r functions.', 'chapters': [{'end': 862.727, 'start': 635.032, 'title': 'Chaining in dplyr', 'summary': 'Explains the concept of chaining in dplyr, using the percent greater than percent operator from the magritter package to streamline code and enhance readability, as well as demonstrating its application for reordering rows and adding new variables.', 'duration': 227.695, 'highlights': ['Chaining using the percent greater than percent operator streamlines dplyr code and enhances readability Chaining simplifies dplyr code and improves readability by using the percent greater than percent operator, making it easier to understand and write commands in a natural order.', 'Application of chaining for reordering rows in dplyr with ascending and descending sorting Chaining in dplyr allows for easy reordering of rows with ascending and descending sorting, simplifying the process compared to the base R approach.', 'Demonstration of using chaining to add new variables in dplyr, such as calculating speed in miles per hour Chaining in dplyr facilitates the addition of new variables, such as calculating speed in miles per hour, streamlining the process and improving efficiency.']}, {'end': 1259.603, 'start': 862.727, 'title': 'Dplyr: mutate and summarize functions', 'summary': 'Explains the dplyr approach to data manipulation, including using the mutate command to create new variables and the summarize function to reduce variables to values, with examples and comparisons to base r functions.', 'duration': 396.876, 'highlights': ['The dplyr approach uses the mutate command to create new variables and does not require selecting columns in order to use them for mutation.', 'The summarize function in dplyr is used to group data by variables and aggregate it with functions, with an example of calculating the average arrival delay for each destination.', 'The summarizeEach function in dplyr allows applying the same summary function to multiple columns at once, with an example of calculating the percentage of flights canceled or diverted for each carrier.']}], 'duration': 624.571, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc635032.jpg', 'highlights': ['Chaining using the percent greater than percent operator streamlines dplyr code and enhances readability', 'Demonstration of using chaining to add new variables in dplyr, such as calculating speed in miles per hour', 'Application of chaining for reordering rows in dplyr with ascending and descending sorting', 'The dplyr approach uses the mutate command to create new variables and does not require selecting columns in order to use them for mutation', 'The summarize function in dplyr is used to group data by variables and aggregate it with functions, with an example of calculating the average arrival delay for each destination', 'The summarizeEach function in dplyr allows applying the same summary function to multiple columns at once, with an example of calculating the percentage of flights canceled or diverted for each carrier']}, {'end': 1673.651, 'segs': [{'end': 1529.91, 'src': 'embed', 'start': 1497.665, 'weight': 1, 'content': [{'end': 1499.727, 'text': 'So we know how to do the total number of flights.', 'start': 1497.665, 'duration': 2.062}, {'end': 1503.309, 'text': 'We summarize and then flight count equals n.', 'start': 1500.687, 'duration': 2.622}, {'end': 1509.794, 'text': 'equals n distinct tailnum.', 'start': 1506.751, 'duration': 3.043}, {'end': 1517.56, 'text': "Now note that when we calculate plane count, it's not referring to this in some way, so just forget that this is here.", 'start': 1510.194, 'duration': 7.366}, {'end': 1528.569, 'text': "So n distinct is basically saying within each group how many distinct tailnum, and tailnum is an identifier for a plane that's unique.", 'start': 1518.2, 'duration': 10.369}, {'end': 1529.91, 'text': "So let's run this.", 'start': 1529.129, 'duration': 0.781}], 'summary': 'Total number of flights summarized to n distinct tailnum.', 'duration': 32.245, 'max_score': 1497.665, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc1497665.jpg'}, {'end': 1622.468, 'src': 'embed', 'start': 1593.851, 'weight': 0, 'content': [{'end': 1604.455, 'text': 'And just remember that you can pass, you can use other base r functions in combination with these dplyr functions.', 'start': 1593.851, 'duration': 10.604}, {'end': 1605.776, 'text': 'And this is a good example.', 'start': 1604.555, 'duration': 1.221}, {'end': 1616.142, 'text': "Now we're going to move on to window functions which are a bit more advanced and the code gets a little longer and a little more complicated to write and read.", 'start': 1607.287, 'duration': 8.855}, {'end': 1622.468, 'text': "Basically, I'm going to contrast window functions with aggregation functions.", 'start': 1618.206, 'duration': 4.262}], 'summary': 'Dplyr functions can be used with other base r functions. moving on to window functions, which are more advanced and make the code longer and more complex.', 'duration': 28.617, 'max_score': 1593.851, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc1593851.jpg'}, {'end': 1683.922, 'src': 'embed', 'start': 1650.752, 'weight': 2, 'content': [{'end': 1657.498, 'text': 'So these are things like ranking and ordering functions, offset functions, and cumulative aggregates.', 'start': 1650.752, 'duration': 6.746}, {'end': 1666.665, 'text': "So if I say here's a vector of 10 numbers and I want you to rank them, you're going to give me back 10 ranks.", 'start': 1658.238, 'duration': 8.427}, {'end': 1670.228, 'text': "That's why it's a window function, not an aggregation function.", 'start': 1667.105, 'duration': 3.123}, {'end': 1673.651, 'text': "So let's do an example here.", 'start': 1671.529, 'duration': 2.122}, {'end': 1683.922, 'text': 'This is for each carrier, we want to calculate which two days of the year they had their longest departure delays.', 'start': 1674.477, 'duration': 9.445}], 'summary': 'Window functions include ranking and offset functions, used to calculate longest departure delays for carriers.', 'duration': 33.17, 'max_score': 1650.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc1650752.jpg'}], 'start': 1259.683, 'title': 'Data analysis functions and dplyr functions', 'summary': 'Demonstrates the use of na removal, defining helper functions, grouping and summarizing flights data, with examples such as finding the minimum and maximum arrival and departure delays. it also introduces dplyr functions like summarize, tally, and group_by with examples of calculating flight count and plane count.', 'chapters': [{'end': 1410.384, 'start': 1259.683, 'title': 'Data analysis functions and summarization', 'summary': 'Demonstrates the use of na removal, defining helper functions n and n distinct, grouping and summarizing flights data, and arranging data in descending order, with examples such as finding the minimum and maximum arrival and departure delays and counting total flights for each day of the year.', 'duration': 150.701, 'highlights': ['The chapter demonstrates the use of NA removal when running functions like min and max, providing guidance on writing functions and utilizing chaining.', 'Introduction of helper functions n and n distinct, which count the number of rows in a group and the number of unique items in a vector, respectively.', 'Illustration of grouping and summarizing flights data by month and day of the month, including examples of counting the total number of flights and arranging the data in descending order.', 'Examples include finding the minimum and maximum arrival and departure delays for each carrier, and counting the total number of flights for each day of the year, with specific flight counts for certain dates.']}, {'end': 1673.651, 'start': 1410.944, 'title': 'Dplyr functions and window functions', 'summary': 'Introduces dplyr functions such as summarize, tally, and group_by with examples of calculating flight count and plane count using n and n distinct, and demonstrates the contrast between window functions and aggregation functions.', 'duration': 262.707, 'highlights': ['The chapter introduces dplyr functions such as summarize, tally, and group_by with examples of calculating flight count and plane count using n and n distinct. The chapter explains the use of dplyr functions like summarize, tally, and group_by to calculate flight count and plane count using n and n distinct.', 'The chapter demonstrates the contrast between window functions and aggregation functions. It contrasts window functions, such as ranking and ordering, with aggregation functions, like mean, by explaining that window functions take n inputs and return n values, while aggregation functions take n inputs and return one value.']}], 'duration': 413.968, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc1259683.jpg', 'highlights': ['The chapter demonstrates the use of NA removal when running functions like min and max, providing guidance on writing functions and utilizing chaining.', 'Illustration of grouping and summarizing flights data by month and day of the month, including examples of counting the total number of flights and arranging the data in descending order.', 'Introduction of helper functions n and n distinct, which count the number of rows in a group and the number of unique items in a vector, respectively.', 'The chapter introduces dplyr functions such as summarize, tally, and group_by with examples of calculating flight count and plane count using n and n distinct.', 'The chapter demonstrates the contrast between window functions and aggregation functions, explaining that window functions take n inputs and return n values, while aggregation functions take n inputs and return one value.']}, {'end': 2013.023, 'segs': [{'end': 1862.437, 'src': 'embed', 'start': 1827.157, 'weight': 0, 'content': [{'end': 1831.44, 'text': "So it saves a lot of typing and it's conceptually kind of makes a lot of sense.", 'start': 1827.157, 'duration': 4.283}, {'end': 1834.523, 'text': 'I want the top two in that group.', 'start': 1831.901, 'duration': 2.622}, {'end': 1840.568, 'text': "And it's selected automatically by departure delay.", 'start': 1835.944, 'duration': 4.624}, {'end': 1848.974, 'text': "There's a way, there's supposed to be a way you can choose what column top end will select.", 'start': 1841.869, 'duration': 7.105}, {'end': 1862.437, 'text': 'That currently has a bug in the current release 0.2, so right now, by default, it just selects depDelay as the one that you want to take,', 'start': 1849.355, 'duration': 13.082}], 'summary': 'Automated selection based on departure delay saves typing and makes sense. current release 0.2 has a bug where it defaults to selecting depdelay.', 'duration': 35.28, 'max_score': 1827.157, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc1827157.jpg'}, {'end': 1985.265, 'src': 'embed', 'start': 1958.673, 'weight': 1, 'content': [{'end': 1962.794, 'text': 'And as you can see, n is used here, but the change column looks exactly the same.', 'start': 1958.673, 'duration': 4.121}, {'end': 1968.355, 'text': 'So there are a few more convenience functions that I like to use.', 'start': 1964.394, 'duration': 3.961}, {'end': 1974.097, 'text': 'This is an easy way to randomly sample a fixed number of rows.', 'start': 1969.396, 'duration': 4.701}, {'end': 1977.042, 'text': "And by default, it's without replacement.", 'start': 1974.941, 'duration': 2.101}, {'end': 1985.265, 'text': "So in this case, it does show me row numbers, but it's pulling back five random rows from the data set.", 'start': 1977.602, 'duration': 7.663}], 'summary': "Demonstrates use of 'n' and convenience functions to sample 5 random rows from the dataset.", 'duration': 26.592, 'max_score': 1958.673, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc1958673.jpg'}], 'start': 1674.477, 'title': 'Airline delays and data analysis in r', 'summary': "Discusses the longest departure delays for 15 carriers, including american airlines with a 970-minute delay on december 12th and a 677-minute delay on november 19th. it also explores the use of r's window functions for data filtering, ranking, arranging, and summarizing, emphasizing the impact of functions like top end, lag, lead, and sample frac on data manipulation and analysis.", 'chapters': [{'end': 1733.018, 'start': 1674.477, 'title': 'Longest departure delays by carrier', 'summary': 'Discusses the process of calculating the two days of the year with the longest departure delays for each of the 15 carriers, citing instances for american airlines with a 970-minute delay on december 12th and a 677-minute delay on november 19th.', 'duration': 58.541, 'highlights': ['The process involves determining the two days of the year with the longest departure delays for each of the 15 carriers, with specific examples for American Airlines such as a 970-minute delay on December 12th and a 677-minute delay on November 19th.', 'The method includes grouping by carrier, selecting relevant columns (month, day of month, departure delay), and utilizing a key function to perform the calculations.']}, {'end': 2013.023, 'start': 1733.619, 'title': 'Data analysis window functions', 'summary': 'Explores the use of window functions in r for filtering, ranking, arranging, and summarizing data, while highlighting the convenience of functions like top end, lag, lead, and sample frac, as well as their impact on data manipulation and analysis.', 'duration': 279.404, 'highlights': ['The chapter discusses filtering rows by min rank of descending departure delay to obtain the largest departure delays, using the top end function to simplify code and automatically select the top two in a group, and addressing a bug in the current release regarding column selection. Filtering rows by min rank of descending departure delay, using top end function to simplify code, addressing bug in current release', 'The explanation of the lag function for calculating the change from the previous month, including grouping by month, summarizing flight counts, and utilizing the lag function to determine the change. Explanation of lag function for calculating change from previous month, grouping by month, utilizing lag function', 'Demonstration of using the sample frac function to sample a fraction of rows, and the use of replace equals true argument for sampling with replacement. Using sample frac function to sample a fraction of rows, using replace equals true for sampling with replacement', 'Introduction to the str function for viewing the structure of an object, providing a quick tool for analyzing data structure. Introduction to str function for viewing object structure, quick tool for analyzing data structure']}], 'duration': 338.546, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc1674477.jpg', 'highlights': ['The process involves determining the two days of the year with the longest departure delays for each of the 15 carriers, with specific examples for American Airlines such as a 970-minute delay on December 12th and a 677-minute delay on November 19th.', 'The method includes grouping by carrier, selecting relevant columns (month, day of month, departure delay), and utilizing a key function to perform the calculations.', 'The chapter discusses filtering rows by min rank of descending departure delay to obtain the largest departure delays, using the top end function to simplify code and automatically select the top two in a group, and addressing a bug in the current release regarding column selection.', 'Demonstration of using the sample frac function to sample a fraction of rows, and the use of replace equals true argument for sampling with replacement.', 'The explanation of the lag function for calculating the change from the previous month, including grouping by month, summarizing flight counts, and utilizing the lag function to determine the change.']}, {'end': 2332.781, 'segs': [{'end': 2123.787, 'src': 'embed', 'start': 2093.36, 'weight': 0, 'content': [{'end': 2099.344, 'text': "It only generates select statements, so you can't update your data using dplyr.", 'start': 2093.36, 'duration': 5.984}, {'end': 2110.351, 'text': 'Here are the databases that are currently supported, SQLite, Postgres, Redshift, MySQL, MariaDB, BigQuery, MonetDB.', 'start': 2100.845, 'duration': 9.506}, {'end': 2120.163, 'text': "And I'm going to show an example using an SQLite database that contains the HFlights data.", 'start': 2112.754, 'duration': 7.409}, {'end': 2122.766, 'text': 'If you want to do this on your own.', 'start': 2121.044, 'duration': 1.722}, {'end': 2123.787, 'text': "there's an excellent.", 'start': 2122.766, 'duration': 1.021}], 'summary': 'Dplyr only supports select statements, databases include sqlite, postgres, redshift, mysql, mariadb, bigquery, monetdb.', 'duration': 30.427, 'max_score': 2093.36, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc2093360.jpg'}, {'end': 2304.473, 'src': 'embed', 'start': 2242.672, 'weight': 1, 'content': [{'end': 2249.697, 'text': "And it shows you the SQL right here, and it shows you the query execution plan in case you're interested.", 'start': 2242.672, 'duration': 7.025}, {'end': 2251.926, 'text': "So that's about it.", 'start': 2250.805, 'duration': 1.121}, {'end': 2255.931, 'text': "I've listed a couple of resources that you might find useful.", 'start': 2252.827, 'duration': 3.104}, {'end': 2261.717, 'text': 'A lot of this, these are the resources I used when building this presentation.', 'start': 2257.132, 'duration': 4.585}, {'end': 2269.225, 'text': 'The official documentation on CRAN and the vignettes are very well written.', 'start': 2264.219, 'duration': 5.006}, {'end': 2272.672, 'text': 'and they cover a lot that I did not cover here.', 'start': 2269.83, 'duration': 2.842}, {'end': 2280.357, 'text': 'Hadley Wickham recently did a webinar about dplier, and it also included ggViz.', 'start': 2273.592, 'duration': 6.765}, {'end': 2286.541, 'text': 'And that was more of a high-level, conceptual webinar.', 'start': 2281.798, 'duration': 4.743}, {'end': 2293.926, 'text': 'Hadley also recently did a tutorial that was very detailed and excellent.', 'start': 2287.561, 'duration': 6.365}, {'end': 2298.088, 'text': 'It was at the USAR conference in 2014.', 'start': 2294.546, 'duration': 3.542}, {'end': 2304.473, 'text': "And it has a similar format to what I've used here, but it uses a lot more examples.", 'start': 2298.088, 'duration': 6.385}], 'summary': 'Covered sql and resources for building presentation, including official cran documentation and recent webinars by hadley wickham.', 'duration': 61.801, 'max_score': 2242.672, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc2242672.jpg'}], 'start': 2013.963, 'title': 'Using dplyr for data analysis and database queries', 'summary': 'Covers using dplyr for data analysis, improving readability with glimpse, connecting to various databases like sqlite, postgres, redshift, mysql, mariadb, bigquery, and monetdb. it also demonstrates using dplyr for database queries, running queries with data frame and database, writing sql commands, and utilizing the explain function. additional resources for further learning on dplyr and ggviz are also provided.', 'chapters': [{'end': 2153.475, 'start': 2013.963, 'title': 'Using dplyr for data analysis', 'summary': 'Discusses using dplyr for data analysis, covering topics such as improving readability with glimpse, connecting to databases, and the supported databases like sqlite, postgres, redshift, mysql, mariadb, bigquery, and monetdb.', 'duration': 139.512, 'highlights': ["dplyr's glimpse provides improved formatting and readability with no wrapping and automatic fitting of data on the screen. Glimpse in dplyr offers better formatting and readability by avoiding wrapping and automatically adjusting data to fit the screen, enhancing the user experience.", 'Demonstrates how dplyr can connect to databases like SQLite, Postgres, Redshift, MySQL, MariaDB, BigQuery, and MonetDB, and conduct queries as if the data were loaded into a data frame. The chapter demonstrates the capability of dplyr to connect to various databases and perform queries as if the data were already loaded into a data frame, showcasing its versatility in data analysis.', 'Provides an example of connecting to an SQLite database containing the HFlights data and references a vignette on CRAN for creating the database. An example is provided on connecting to an SQLite database containing the HFlights data and references a CRAN vignette for creating the database, aiding in practical application and learning.']}, {'end': 2332.781, 'start': 2154.216, 'title': 'Using dplyr for database queries', 'summary': 'Demonstrates using dplyr to connect to a database, running queries with data frame and database, writing sql commands, using explain function, and provides additional resources for further learning on dplyr and ggviz.', 'duration': 178.565, 'highlights': ['Using dplyr to connect to a database and running queries with data frame and database.', 'Swapping from flights to flights tbl to run the same identical query, with a slower execution time.', 'Writing SQL commands directly, running them quickly, and using the explain function to view the SQL commands and query execution plan.', 'Providing additional resources for further learning on dplyr and ggViz, including official documentation, webinars, tutorials, and GitHub repository.']}], 'duration': 318.818, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/jWjqLW-u3hc/pics/jWjqLW-u3hc2013963.jpg', 'highlights': ["Demonstrates dplyr's versatility in connecting to various databases and performing queries as if the data were already loaded into a data frame.", 'Provides an example of connecting to an SQLite database containing the HFlights data and references a CRAN vignette for creating the database, aiding in practical application and learning.', 'Using dplyr to connect to a database, running queries with data frame and database, and writing SQL commands directly for quick execution.', 'Glimpse in dplyr offers better formatting and readability by avoiding wrapping and automatically adjusting data to fit the screen, enhancing the user experience.', 'Providing additional resources for further learning on dplyr and ggViz, including official documentation, webinars, tutorials, and GitHub repository.']}], 'highlights': ['dplyr provides four different types of joins: inner, left, semi-join, and anti-join, providing flexibility in merging datasets.', 'Demonstration of using chaining to add new variables in dplyr, such as calculating speed in miles per hour', 'The process involves determining the two days of the year with the longest departure delays for each of the 15 carriers, with specific examples for American Airlines such as a 970-minute delay on December 12th and a 677-minute delay on November 19th.', "Demonstrates dplyr's versatility in connecting to various databases and performing queries as if the data were already loaded into a data frame.", 'The chapter introduces dplyr functions such as summarize, tally, and group_by with examples of calculating flight count and plane count using n and n distinct.']}