title
Intro To Web Scraping With Puppeteer
description
In this video, we will look at Puppeteer to scrape data from a web page.
💻 Code:
https://github.com/bradtraversy/courses-scrape
Puppeteer Docs:
https://pptr.dev/
⭐ All Courses:
https://traversymedia.com
💖 Show Support
Patreon: https://www.patreon.com/traversymedia
PayPal: https://paypal.me/traversymedia
👇 Follow Traversy Media On Social Media:
Twitter: https://twitter.com/traversymedia
Instagram: https://www.instagram.com/traversymedia
Linkedin: https://www.linkedin.com/in/bradtraversy
Timestamps:
0:00 - Intro
0:36 - Install & Setup
3:36 - Init Browser & Page Object
5:02 - Screenshot & PDF
6:54 - Targeting HTML, Text, and Links
11:22 - Scraping Courses
17:08 - $$eval()
18:40 - Save JSON Data
detail
{'title': 'Intro To Web Scraping With Puppeteer', 'heatmap': [{'end': 221.109, 'start': 152.557, 'weight': 0.824}, {'end': 350.629, 'start': 332.936, 'weight': 0.713}], 'summary': 'Learn web scraping with puppeteer, a headless chrome browser tool, covering capabilities like accessing dom, javascript parsing, and creating screenshots and pdfs. demonstrates scraping course data and setting up puppeteer in node.js, manipulating web pages, scraping techniques, and javascript dom manipulation for data extraction.', 'chapters': [{'end': 137.025, 'segs': [{'end': 45.196, 'src': 'embed', 'start': 20.023, 'weight': 0, 'content': [{'end': 24.865, 'text': "But what if the data you want isn't available in any kind of API?", 'start': 20.023, 'duration': 4.842}, {'end': 30.347, 'text': "Well, in many cases you can scrape that data yourself and there's a lot of different tools that you can do this with.", 'start': 25.225, 'duration': 5.122}, {'end': 33.369, 'text': 'But Puppeteer is extremely powerful.', 'start': 30.928, 'duration': 2.441}, {'end': 35.389, 'text': "It's used for more than just web scraping.", 'start': 33.529, 'duration': 1.86}, {'end': 41.032, 'text': "It's essentially a headless Chrome browser, so Anything you can do in the browser.", 'start': 35.609, 'duration': 5.423}, {'end': 45.196, 'text': 'normally you can basically do that programmatically through Puppeteer.', 'start': 41.032, 'duration': 4.164}], 'summary': 'Puppeteer is a powerful tool for web scraping and automating browser tasks.', 'duration': 25.173, 'max_score': 20.023, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI20023.jpg'}, {'end': 116.587, 'src': 'embed', 'start': 65.612, 'weight': 1, 'content': [{'end': 67.314, 'text': 'So Traverse Media dot com.', 'start': 65.612, 'duration': 1.702}, {'end': 72.357, 'text': "we're going to take all the courses and get the title, the Udemy link,", 'start': 67.314, 'duration': 5.043}, {'end': 81.384, 'text': 'the course level and the promo code for all the courses and put them into a JSON array and then save that to a file on my system.', 'start': 72.357, 'duration': 9.027}, {'end': 82.125, 'text': 'All right.', 'start': 81.844, 'duration': 0.281}, {'end': 89.25, 'text': "So obviously there's different things you can do with different data, but I'm just going to show you how you can how you can get that data.", 'start': 82.225, 'duration': 7.025}, {'end': 89.75, 'text': 'All right.', 'start': 89.49, 'duration': 0.26}, {'end': 90.591, 'text': "So let's get started.", 'start': 89.79, 'duration': 0.801}, {'end': 97.064, 'text': 'right, guys.', 'start': 96.644, 'duration': 0.42}, {'end': 100.407, 'text': "so we're going to get into this and i would recommend that you follow along.", 'start': 97.064, 'duration': 3.343}, {'end': 102.149, 'text': "you don't have to, but i would recommend it.", 'start': 100.407, 'duration': 1.742}, {'end': 105.812, 'text': 'so this is the puppeteer website where you can find all the documentation.', 'start': 102.149, 'duration': 3.663}, {'end': 108.554, 'text': "it's pptr.dev.", 'start': 105.812, 'duration': 2.742}, {'end': 116.587, 'text': 'okay, now, what we want to scrape, the data we want to get is at traversimedia.com we have these courses.', 'start': 108.554, 'duration': 8.033}], 'summary': 'Scraping data from traversimedia.com to create a json array of courses with title, udemy link, course level, and promo code.', 'duration': 50.975, 'max_score': 65.612, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI65612.jpg'}], 'start': 0.129, 'title': 'Introduction to web scraping with puppeteer', 'summary': 'Introduces web scraping with puppeteer, highlighting its capabilities as a headless chrome browser for accessing dom, event firing, javascript parsing, and creating screenshots and pdfs. it demonstrates scraping course data from a website and saving it to a json file.', 'chapters': [{'end': 137.025, 'start': 0.129, 'title': 'Introduction to web scraping with puppeteer', 'summary': 'Introduces web scraping with puppeteer, showcasing its capabilities as a headless chrome browser that provides access to dom, event firing, javascript parsing, and creating screenshots and pdfs programmatically. demonstrates scraping course data from a website and saving it to a json file.', 'duration': 136.896, 'highlights': ['Puppeteer is a powerful tool used for web scraping and more, functioning as a headless Chrome browser with capabilities to access the DOM, fire events, parse JavaScript, and create screenshots and PDFs of websites programmatically. Puppeteer is essentially a headless Chrome browser, providing complete access to the DOM, event firing, JavaScript parsing, and programmatically creating screenshots and PDFs.', 'Demonstrates scraping course data from a website (traversimedia.com) to retrieve the title, Udemy link, course level, and promo code for all the courses and saving it to a JSON array and then to a file. Shows the process of scraping course data from traversimedia.com, including the title, Udemy link, course level, and promo code, and saving it to a JSON array and then to a file.', 'Encourages viewers to explore Puppeteer further by referencing the official website (pptr.dev) for documentation and recommends following along with the demonstration. Encourages further exploration of Puppeteer by referring to the official website for documentation and recommending viewers to follow along with the demonstration.']}], 'duration': 136.896, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI129.jpg', 'highlights': ['Puppeteer is a powerful tool used for web scraping and more, functioning as a headless Chrome browser with capabilities to access the DOM, fire events, parse JavaScript, and create screenshots and PDFs of websites programmatically.', 'Demonstrates scraping course data from a website (traversimedia.com) to retrieve the title, Udemy link, course level, and promo code for all the courses and saving it to a JSON array and then to a file.', 'Encourages viewers to explore Puppeteer further by referencing the official website (pptr.dev) for documentation and recommends following along with the demonstration.']}, {'end': 301.155, 'segs': [{'end': 221.109, 'src': 'heatmap', 'start': 152.557, 'weight': 0.824, 'content': [{'end': 158.899, 'text': "so we're going to first run npm init and just add a dash y, so we don't have to go through the questions,", 'start': 152.557, 'duration': 6.342}, {'end': 165.981, 'text': 'and that will just initialize a package.json file which holds our scripts and and dependencies and all that,', 'start': 158.899, 'duration': 7.082}, {'end': 172.042, 'text': 'and then the only dependency we want to install with npm installer, npm i, is puppeteer.', 'start': 165.981, 'duration': 6.061}, {'end': 183.863, 'text': "so let's do that and that will create our node modules folder with puppeteer and all its dependencies, and it'll get added here.", 'start': 172.042, 'duration': 11.821}, {'end': 187.744, 'text': 'now the file that basically our entry point file.', 'start': 183.863, 'duration': 3.881}, {'end': 190.844, 'text': "i'm going to create that and call it index.js.", 'start': 187.744, 'duration': 3.1}, {'end': 196.325, 'text': "you can call it whatever you'd like and then i'm going to create a start script for that which you don't have to.", 'start': 190.844, 'duration': 5.481}, {'end': 199.426, 'text': "you can just run node index, but i'm just going to make it.", 'start': 196.325, 'duration': 3.101}, {'end': 205.579, 'text': 'so we can run npm start and here we want to run node index, All right.', 'start': 199.426, 'duration': 6.153}, {'end': 216.286, 'text': "So if we save that and then in our index JS, we'll just do a console log and then down here in our terminal go NPM start and we should see that log.", 'start': 205.619, 'duration': 10.667}, {'end': 221.109, 'text': 'OK, now the first thing we want to do, of course, is bring in puppeteer.', 'start': 217.487, 'duration': 3.622}], 'summary': 'Setting up a node.js project with puppeteer for web scraping and automation.', 'duration': 68.552, 'max_score': 152.557, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI152557.jpg'}, {'end': 187.744, 'src': 'embed', 'start': 158.899, 'weight': 0, 'content': [{'end': 165.981, 'text': 'and that will just initialize a package.json file which holds our scripts and and dependencies and all that,', 'start': 158.899, 'duration': 7.082}, {'end': 172.042, 'text': 'and then the only dependency we want to install with npm installer, npm i, is puppeteer.', 'start': 165.981, 'duration': 6.061}, {'end': 183.863, 'text': "so let's do that and that will create our node modules folder with puppeteer and all its dependencies, and it'll get added here.", 'start': 172.042, 'duration': 11.821}, {'end': 187.744, 'text': 'now the file that basically our entry point file.', 'start': 183.863, 'duration': 3.881}], 'summary': 'Initializing package.json with puppeteer as the only dependency.', 'duration': 28.845, 'max_score': 158.899, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI158899.jpg'}, {'end': 277.312, 'src': 'embed', 'start': 234.301, 'weight': 2, 'content': [{'end': 238.066, 'text': "So we're going to have everything in an asynchronous function.", 'start': 234.301, 'duration': 3.765}, {'end': 240.49, 'text': "So let's create an async function.", 'start': 238.146, 'duration': 2.344}, {'end': 241.751, 'text': 'You can call it whatever you want.', 'start': 240.53, 'duration': 1.221}, {'end': 242.672, 'text': "I'm going to call it run.", 'start': 241.811, 'duration': 0.861}, {'end': 244.874, 'text': "And then we're going to run it.", 'start': 243.974, 'duration': 0.9}, {'end': 250.356, 'text': 'Now, the first thing we want to do here is open the browser or launch a browser.', 'start': 245.514, 'duration': 4.842}, {'end': 256.476, 'text': "So we're going to create a variable called browser and then set that to a way puppeteer dot launch.", 'start': 250.376, 'duration': 6.1}, {'end': 264.859, 'text': 'So this is essentially just launching a browser programmatically so that we can do things like access different pages and elements on the pages,', 'start': 257.096, 'duration': 7.763}, {'end': 266.58, 'text': 'fire off events, things like that.', 'start': 264.859, 'duration': 1.721}, {'end': 275.13, 'text': 'Now, to to access a page, we need to initialize a page variable and we can do that with a weight and then browser.', 'start': 267.12, 'duration': 8.01}, {'end': 277.312, 'text': "And then there's a new page method.", 'start': 275.59, 'duration': 1.722}], 'summary': 'Creating an async function to launch a browser and access pages using puppeteer.', 'duration': 43.011, 'max_score': 234.301, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI234301.jpg'}], 'start': 137.765, 'title': 'Setting up puppeteer in node.js', 'summary': 'Demonstrates how to set up puppeteer in node.js to scrape websites, including initializing a package.json file, installing the puppeteer dependency, creating an entry point file, and launching a browser programmatically to access and interact with web pages.', 'chapters': [{'end': 301.155, 'start': 137.765, 'title': 'Setting up puppeteer in node.js', 'summary': 'Demonstrates how to set up puppeteer in node.js to scrape websites, including initializing a package.json file, installing the puppeteer dependency, creating an entry point file, and launching a browser programmatically to access and interact with web pages.', 'duration': 163.39, 'highlights': ['Initializing a package.json file with npm init and adding the puppeteer dependency with npm i Demonstrates the process of initializing a package.json file and installing the puppeteer dependency using npm, streamlining the setup process for the project.', 'Creating an entry point file (index.js) and setting up a start script in package.json Illustrates the creation of an entry point file and the setup of a start script in package.json, providing a convenient way to run the Node.js application.', 'Launching a browser programmatically with puppeteer.launch and accessing a page with page.goto Details the use of puppeteer to launch a browser programmatically and access a specific web page using page.goto, enabling automated interaction with web content.', 'Using asynchronous functions and methods to handle browser and page operations Emphasizes the use of asynchronous functions and methods to handle browser and page operations, ensuring proper execution and control flow in the Node.js application.']}], 'duration': 163.39, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI137765.jpg', 'highlights': ['Demonstrates the process of initializing a package.json file and installing the puppeteer dependency using npm, streamlining the setup process for the project.', 'Illustrates the creation of an entry point file and the setup of a start script in package.json, providing a convenient way to run the Node.js application.', 'Details the use of puppeteer to launch a browser programmatically and access a specific web page using page.goto, enabling automated interaction with web content.', 'Emphasizes the use of asynchronous functions and methods to handle browser and page operations, ensuring proper execution and control flow in the Node.js application.']}, {'end': 568.024, 'segs': [{'end': 330.572, 'src': 'embed', 'start': 301.575, 'weight': 1, 'content': [{'end': 307.981, 'text': 'OK And then within here is where we can access DOM elements and do pretty much anything we want.', 'start': 301.575, 'duration': 6.406}, {'end': 314.487, 'text': "Now there's some some other cool things I want to show you before we get into targeting content and data.", 'start': 308.522, 'duration': 5.965}, {'end': 322.695, 'text': "So first of all we can create a screenshot and we can do that by a weight page dot and then there's a screenshot.", 'start': 314.607, 'duration': 8.088}, {'end': 330.572, 'text': 'method that we can call and we just pass in an object with a path, which is going to be where we want this to go.', 'start': 323.88, 'duration': 6.692}], 'summary': 'Access dom elements and create screenshots using weight page method.', 'duration': 28.997, 'max_score': 301.575, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI301575.jpg'}, {'end': 357.012, 'src': 'heatmap', 'start': 332.936, 'weight': 0.713, 'content': [{'end': 339.184, 'text': "okay, so if i run npm start, it's going to run.", 'start': 332.936, 'duration': 6.248}, {'end': 346.347, 'text': 'our index file takes a two seconds or so and then you can see up here example.png has been created.', 'start': 339.184, 'duration': 7.163}, {'end': 350.629, 'text': "now it's going to set it to a specific size and it's not the whole page.", 'start': 346.347, 'duration': 4.282}, {'end': 357.012, 'text': 'what you can do is you can add in another property here of full page and set that to true.', 'start': 350.629, 'duration': 6.383}], 'summary': "Running 'npm start' creates example.png in 2 seconds, adjusting to specific size and full page option available.", 'duration': 24.076, 'max_score': 332.936, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI332936.jpg'}, {'end': 435.948, 'src': 'embed', 'start': 407.71, 'weight': 2, 'content': [{'end': 414.031, 'text': 'So if you need to, for some reason, generate a PDF of a specific web page, you can do that.', 'start': 407.71, 'duration': 6.321}, {'end': 418.432, 'text': 'Now, I want to get into targeting some of the content now.', 'start': 414.811, 'duration': 3.621}, {'end': 420.572, 'text': "So let's comment that PDF line out.", 'start': 418.532, 'duration': 2.04}, {'end': 425.417, 'text': "and let's look at how to get the entire html of a page.", 'start': 421.432, 'duration': 3.985}, {'end': 435.948, 'text': "so we'll create a variable called html and we're going to set that to a weight and then there's a method called content that we can call that will give us all of the html.", 'start': 425.417, 'duration': 10.531}], 'summary': 'Learn to generate pdfs from web pages and extract entire html content using code.', 'duration': 28.238, 'max_score': 407.71, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI407710.jpg'}, {'end': 485.077, 'src': 'embed', 'start': 450.45, 'weight': 0, 'content': [{'end': 451.712, 'text': "I'm just console logging it.", 'start': 450.45, 'duration': 1.262}, {'end': 454.075, 'text': "So that'll get the HTML.", 'start': 452.313, 'duration': 1.762}, {'end': 456.439, 'text': "Let's let's comment that out.", 'start': 454.115, 'duration': 2.324}, {'end': 467.787, 'text': "Now, if you want to get the title or really anything, if you want to target h3s or whatever, there's a method called evaluate on the page object.", 'start': 457.1, 'duration': 10.687}, {'end': 470.069, 'text': "so let's say we want to get the title.", 'start': 467.787, 'duration': 2.282}, {'end': 472.91, 'text': 'so what we could do is say page.evaluate.', 'start': 470.069, 'duration': 2.841}, {'end': 476.793, 'text': 'oops, evaluate, and this is a high order function.', 'start': 472.91, 'duration': 3.883}, {'end': 485.077, 'text': 'so we can pass in a function here and then we have access to the document object.', 'start': 476.793, 'duration': 8.284}], 'summary': 'Using evaluate method to access document object for title retrieval.', 'duration': 34.627, 'max_score': 450.45, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI450450.jpg'}], 'start': 301.575, 'title': 'Web page manipulation and scraping methods with puppeteer', 'summary': 'Demonstrates using puppeteer to manipulate web pages, including capturing screenshots, generating pdfs, and accessing html, as well as using evaluate method to extract html content, title, text, and links from a webpage, with examples of console logged data.', 'chapters': [{'end': 425.417, 'start': 301.575, 'title': 'Web page manipulation with puppeteer', 'summary': 'Demonstrates using puppeteer to create screenshots and pdfs of web pages, including how to capture a full page screenshot and generate a pdf of a specific web page, as well as how to access and manipulate the entire html of a page.', 'duration': 123.842, 'highlights': ['The chapter showcases creating a full page screenshot with Puppeteer, illustrating how to generate a screenshot and set it to a specific size, including generating a full page screenshot and overwriting the initial image, providing practical demonstrations of creating screenshots in Puppeteer.', 'The chapter explains how to generate a PDF of a specific web page using Puppeteer, demonstrating the process of creating a PDF and specifying the format, such as A4, to generate a PDF of a specific web page, providing a practical example of generating PDFs with Puppeteer.', 'The chapter introduces accessing and manipulating the entire HTML of a web page, showcasing how to get the entire HTML of a page using Puppeteer, providing insights into accessing and manipulating the content of web pages.']}, {'end': 568.024, 'start': 425.417, 'title': 'Web scraping methods', 'summary': 'Demonstrates how to use evaluate method to extract html content, title, text, and links from a webpage using puppeteer, providing examples of console logged html, title, text, and all the links on the page.', 'duration': 142.607, 'highlights': ['The chapter demonstrates how to use evaluate method to extract HTML content, title, text, and links from a webpage using Puppeteer. Puppeteer is used to extract information from a webpage. The evaluate method is utilized for this purpose.', 'The example shows console logged HTML, title, text, and all the links on the page. Console logging examples include HTML content, title, text, and all the links on the page.']}], 'duration': 266.449, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI301575.jpg', 'highlights': ['The chapter demonstrates how to use evaluate method to extract HTML content, title, text, and links from a webpage using Puppeteer.', 'The chapter showcases creating a full page screenshot with Puppeteer, illustrating how to generate a screenshot and set it to a specific size.', 'The chapter explains how to generate a PDF of a specific web page using Puppeteer, demonstrating the process of creating a PDF and specifying the format, such as A4.']}, {'end': 772.725, 'segs': [{'end': 598.285, 'src': 'embed', 'start': 568.245, 'weight': 2, 'content': [{'end': 570.307, 'text': "And again, we're going to use page.evaluate.", 'start': 568.245, 'duration': 2.062}, {'end': 579.932, 'text': "Now where we're getting all the links, we're getting multiple elements right?", 'start': 574.488, 'duration': 5.444}, {'end': 587.057, 'text': 'So, instead of using like, we could do query selector and get any single element, but we want to get all the links.', 'start': 580.072, 'duration': 6.985}, {'end': 589.379, 'text': "So we're going to be using query selector all.", 'start': 587.117, 'duration': 2.262}, {'end': 593.602, 'text': "It's the same way you would access the DOM in front end JavaScript.", 'start': 589.419, 'duration': 4.183}, {'end': 598.285, 'text': 'Now query selector all will give us something called a node list.', 'start': 594.202, 'duration': 4.083}], 'summary': 'Using page.evaluate to retrieve multiple links using queryselectorall.', 'duration': 30.04, 'max_score': 568.245, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI568245.jpg'}, {'end': 655.577, 'src': 'embed', 'start': 626.428, 'weight': 0, 'content': [{'end': 627.448, 'text': "Let's close this up.", 'start': 626.428, 'duration': 1.02}, {'end': 630.008, 'text': "So we'll pass in here A.", 'start': 627.968, 'duration': 2.04}, {'end': 634.209, 'text': 'Now array.from takes in a second argument, which is going to be a function.', 'start': 630.008, 'duration': 4.201}, {'end': 640.851, 'text': 'And we can pass in an argument here that represents the element.', 'start': 636.03, 'duration': 4.821}, {'end': 645.832, 'text': 'And we want to say for each one, we want to get the href or the URL.', 'start': 641.751, 'duration': 4.081}, {'end': 648.872, 'text': 'So we can take that element and get href.', 'start': 645.912, 'duration': 2.96}, {'end': 652.093, 'text': 'So that should just give us all the links.', 'start': 650.132, 'duration': 1.961}, {'end': 655.577, 'text': "And then let's just do a console log here of links.", 'start': 652.975, 'duration': 2.602}], 'summary': 'Using array.from to extract links from element a.', 'duration': 29.149, 'max_score': 626.428, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI626428.jpg'}, {'end': 746.523, 'src': 'embed', 'start': 720.02, 'weight': 3, 'content': [{'end': 730.074, 'text': 'And then if we look further into it, every course has a, and down here as well where it says more, every course has a class of card around it.', 'start': 720.02, 'duration': 10.054}, {'end': 735.917, 'text': 'And if we look into that, every card has a class of card body and card footer.', 'start': 730.555, 'duration': 5.362}, {'end': 739.159, 'text': 'Now the data we want is in both of these divs.', 'start': 736.397, 'duration': 2.762}, {'end': 746.523, 'text': 'So in the card body, we have the title, which is in an H3, and we have the level, which is in a div with the class of level.', 'start': 739.199, 'duration': 7.324}], 'summary': 'The data we want is in the card body, including the title in h3 and the level in a div with the class of level.', 'duration': 26.503, 'max_score': 720.02, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI720020.jpg'}], 'start': 568.245, 'title': 'Web scraping techniques', 'summary': 'Covers techniques for web scraping including using queryselectorall to extract links and course data, demonstrating the extraction of all links from a page and analyzing website structure to locate specific elements such as courses, titles, levels, links, and promo codes.', 'chapters': [{'end': 655.577, 'start': 568.245, 'title': 'Web scraping: using queryselectorall to extract links', 'summary': 'Covers using page.evaluate to extract all links using queryselectorall, converting the node list to an array, and accessing the href attribute for each link, resulting in the extraction of all the links from the page.', 'duration': 87.332, 'highlights': ['Using querySelectorAll to obtain all the links from a webpage, providing a practical approach to web scraping.', 'Demonstrating the array.from method to convert the node list obtained from querySelectorAll into an array, facilitating easier manipulation and access to the links.', 'Accessing the href attribute for each link using a function within array.from, showcasing a method to extract specific information from the array of links.']}, {'end': 772.725, 'start': 657.439, 'title': 'Web scraping for course data', 'summary': "Demonstrates extracting course data from a website using query selector all and analyzing the website's structure to locate specific elements, such as courses, titles, levels, links, and promo codes.", 'duration': 115.286, 'highlights': ['By using query selector all, the speaker extracts all the links on the page, providing insight into the process of gathering data from a website.', "Analyzing the website's structure, the speaker identifies the section with the ID of 'courses' and the class of 'card' that contains the data for each course, emphasizing the importance of understanding the website's layout for effective scraping.", "The speaker explains the specific elements within the 'card' class, such as the title in an H3 tag, the level in a div with the class of 'level', the Udemy link in the footer, and the promo code in a div with the class of 'promo code', demonstrating the detailed approach required for extracting relevant data from a website."]}], 'duration': 204.48, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI568245.jpg', 'highlights': ['Demonstrating the array.from method to convert the node list obtained from querySelectorAll into an array, facilitating easier manipulation and access to the links.', 'Accessing the href attribute for each link using a function within array.from, showcasing a method to extract specific information from the array of links.', 'By using query selector all, the speaker extracts all the links on the page, providing insight into the process of gathering data from a website.', "Analyzing the website's structure, the speaker identifies the section with the ID of 'courses' and the class of 'card' that contains the data for each course, emphasizing the importance of understanding the website's layout for effective scraping."]}, {'end': 1283.349, 'segs': [{'end': 803.719, 'src': 'embed', 'start': 773.945, 'weight': 1, 'content': [{'end': 776.826, 'text': "So now that we know that, let's jump back in here.", 'start': 773.945, 'duration': 2.881}, {'end': 783.287, 'text': "And instead of just all the eight tags, let's let's get the let's say courses.", 'start': 777.006, 'duration': 6.281}, {'end': 787.087, 'text': 'So the idea of courses, remember that section wraps around all of them.', 'start': 783.407, 'duration': 3.68}, {'end': 790.268, 'text': 'And then we want to get all the cards that are in that.', 'start': 787.567, 'duration': 2.701}, {'end': 798.335, 'text': 'OK And then here instead of just returning all the href I want to return an object of courses.', 'start': 791.129, 'duration': 7.206}, {'end': 803.719, 'text': "So since I want to return an object I'm going to put some parentheses here.", 'start': 799.075, 'duration': 4.644}], 'summary': 'Refactoring to retrieve courses and cards instead of tags.', 'duration': 29.774, 'max_score': 773.945, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI773945.jpg'}, {'end': 854.431, 'src': 'embed', 'start': 825.875, 'weight': 0, 'content': [{'end': 828.256, 'text': 'we have a card body and a card footer.', 'start': 825.875, 'duration': 2.381}, {'end': 838.821, 'text': "so here what i'm going to do is take the element which is going to be the card right, so say e, and then we want to then use query selector on that,", 'start': 828.256, 'duration': 10.565}, {'end': 844.044, 'text': 'because we want to go into the card body, So card dash body.', 'start': 838.821, 'duration': 5.223}, {'end': 848.267, 'text': 'And then we know the title is in an H3 in the card body.', 'start': 844.705, 'duration': 3.562}, {'end': 849.708, 'text': 'So I want to get H3.', 'start': 848.327, 'duration': 1.381}, {'end': 851.649, 'text': 'And then finally, I want the text.', 'start': 850.128, 'duration': 1.521}, {'end': 853.17, 'text': "So I'm going to use inner text.", 'start': 851.689, 'duration': 1.481}, {'end': 854.431, 'text': 'All right.', 'start': 853.19, 'duration': 1.241}], 'summary': "Using query selector to extract text from card body's h3 element.", 'duration': 28.556, 'max_score': 825.875, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI825875.jpg'}, {'end': 1170.989, 'src': 'embed', 'start': 1140.427, 'weight': 4, 'content': [{'end': 1142.088, 'text': "And let's come back down here.", 'start': 1140.427, 'duration': 1.661}, {'end': 1145.751, 'text': "And then let's say save.", 'start': 1142.108, 'duration': 3.643}, {'end': 1150.232, 'text': 'data to a JSON file.', 'start': 1147.35, 'duration': 2.882}, {'end': 1159.82, 'text': "So we can do that by saying fs, and then we're going to use the write file method and pass in a couple of things.", 'start': 1150.813, 'duration': 9.007}, {'end': 1162.943, 'text': 'We want the name of our file, so courses.json.', 'start': 1159.86, 'duration': 3.083}, {'end': 1167.807, 'text': 'And then we need to, it needs to be valid JSON before we save it.', 'start': 1163.743, 'duration': 4.064}, {'end': 1170.989, 'text': "So we're going to run it through JSON.stringify.", 'start': 1167.847, 'duration': 3.142}], 'summary': 'Save data to a json file using fs.writefile method', 'duration': 30.562, 'max_score': 1140.427, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI1140427.jpg'}, {'end': 1272.907, 'src': 'embed', 'start': 1238.457, 'weight': 2, 'content': [{'end': 1240.099, 'text': 'My codes are pretty easy to remember.', 'start': 1238.457, 'duration': 1.642}, {'end': 1244.623, 'text': "They're just like the course name or topic and then the month and year.", 'start': 1240.139, 'duration': 4.484}, {'end': 1251.369, 'text': 'But if these were like crazy codes, you might want to just be able to run a command and fetch all the codes.', 'start': 1245.243, 'duration': 6.126}, {'end': 1254.692, 'text': "There's just a million reasons why you might want to scrape data.", 'start': 1251.71, 'duration': 2.982}, {'end': 1257.403, 'text': "All right, so that's pretty much it.", 'start': 1255.643, 'duration': 1.76}, {'end': 1259.864, 'text': "There's a lot more to Puppeteer.", 'start': 1257.443, 'duration': 2.421}, {'end': 1261.024, 'text': "There's a lot more you can do.", 'start': 1259.884, 'duration': 1.14}, {'end': 1263.125, 'text': 'You can fire off events and stuff like that.', 'start': 1261.084, 'duration': 2.041}, {'end': 1264.585, 'text': 'Like I said.', 'start': 1263.885, 'duration': 0.7}, {'end': 1272.907, 'text': 'I think I said that I may do a more in-depth course, but I wanted this to just be kind of an introduction to data scraping,', 'start': 1264.585, 'duration': 8.322}], 'summary': 'Introduction to data scraping and puppeteer capabilities.', 'duration': 34.45, 'max_score': 1238.457, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI1238457.jpg'}], 'start': 773.945, 'title': 'Javascript dom manipulation and data scraping using puppeteer', 'summary': 'Covers javascript dom manipulation, including selecting elements, accessing properties, and returning specific data, as well as demonstrates data scraping using puppeteer to extract course titles, levels, urls, and promo codes, and save the data to a json file.', 'chapters': [{'end': 853.17, 'start': 773.945, 'title': 'Javascript dom manipulation', 'summary': 'Explains how to manipulate the dom using javascript, including selecting elements, accessing properties, and returning specific data, such as titles and text.', 'duration': 79.225, 'highlights': ['Using query selector to access elements like card body and card footer, and then selecting specific elements like H3 and retrieving the text, allows for targeted data extraction.', 'By returning an object of courses instead of just all the href, the code provides a more structured and organized output for the specified data.']}, {'end': 1283.349, 'start': 853.19, 'title': 'Data scraping using puppeteer', 'summary': 'Demonstrates using puppeteer to scrape data from web pages, extracting course titles, levels, urls, and promo codes, and saving the data to a json file, providing an overview of data scraping with puppeteer.', 'duration': 430.159, 'highlights': ['Using Puppeteer to scrape data from web pages The chapter showcases the utilization of Puppeteer to extract data from web pages, demonstrating its capability to automate web scraping tasks.', 'Extracting course titles, levels, URLs, and promo codes The tutorial illustrates the process of extracting course titles, levels, URLs, and promo codes from web pages, showcasing the practical application of data extraction using Puppeteer.', 'Saving the data to a JSON file The tutorial provides a demonstration of saving the extracted data to a JSON file using the FS module in Node.js, enabling users to persist the scraped data for future use or analysis.']}], 'duration': 509.404, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/S67gyqnYHmI/pics/S67gyqnYHmI773945.jpg', 'highlights': ['Using query selector to access elements like card body and card footer, and then selecting specific elements like H3 and retrieving the text, allows for targeted data extraction.', 'By returning an object of courses instead of just all the href, the code provides a more structured and organized output for the specified data.', 'Using Puppeteer to scrape data from web pages The chapter showcases the utilization of Puppeteer to extract data from web pages, demonstrating its capability to automate web scraping tasks.', 'Extracting course titles, levels, URLs, and promo codes The tutorial illustrates the process of extracting course titles, levels, URLs, and promo codes from web pages, showcasing the practical application of data extraction using Puppeteer.', 'Saving the data to a JSON file The tutorial provides a demonstration of saving the extracted data to a JSON file using the FS module in Node.js, enabling users to persist the scraped data for future use or analysis.']}], 'highlights': ['Puppeteer is a powerful tool used for web scraping and more, functioning as a headless Chrome browser with capabilities to access the DOM, fire events, parse JavaScript, and create screenshots and PDFs of websites programmatically.', 'Demonstrates scraping course data from a website (traversimedia.com) to retrieve the title, Udemy link, course level, and promo code for all the courses and saving it to a JSON array and then to a file.', 'Using Puppeteer to scrape data from web pages The chapter showcases the utilization of Puppeteer to extract data from web pages, demonstrating its capability to automate web scraping tasks.', 'Extracting course titles, levels, URLs, and promo codes The tutorial illustrates the process of extracting course titles, levels, URLs, and promo codes from web pages, showcasing the practical application of data extraction using Puppeteer.', 'Saving the data to a JSON file The tutorial provides a demonstration of saving the extracted data to a JSON file using the FS module in Node.js, enabling users to persist the scraped data for future use or analysis.', 'Demonstrates the process of initializing a package.json file and installing the puppeteer dependency using npm, streamlining the setup process for the project.', 'Illustrates the creation of an entry point file and the setup of a start script in package.json, providing a convenient way to run the Node.js application.', 'Details the use of puppeteer to launch a browser programmatically and access a specific web page using page.goto, enabling automated interaction with web content.', 'The chapter demonstrates how to use evaluate method to extract HTML content, title, text, and links from a webpage using Puppeteer.', 'The chapter showcases creating a full page screenshot with Puppeteer, illustrating how to generate a screenshot and set it to a specific size.', 'The chapter explains how to generate a PDF of a specific web page using Puppeteer, demonstrating the process of creating a PDF and specifying the format, such as A4.', 'Demonstrating the array.from method to convert the node list obtained from querySelectorAll into an array, facilitating easier manipulation and access to the links.', 'Accessing the href attribute for each link using a function within array.from, showcasing a method to extract specific information from the array of links.', 'By using query selector all, the speaker extracts all the links on the page, providing insight into the process of gathering data from a website.', "Analyzing the website's structure, the speaker identifies the section with the ID of 'courses' and the class of 'card' that contains the data for each course, emphasizing the importance of understanding the website's layout for effective scraping.", 'Using query selector to access elements like card body and card footer, and then selecting specific elements like H3 and retrieving the text, allows for targeted data extraction.', 'By returning an object of courses instead of just all the href, the code provides a more structured and organized output for the specified data.', 'Emphasizes the use of asynchronous functions and methods to handle browser and page operations, ensuring proper execution and control flow in the Node.js application.', 'Encourages viewers to explore Puppeteer further by referencing the official website (pptr.dev) for documentation and recommends following along with the demonstration.']}