title
Intro To Web Crawlers & Scraping With Scrapy
description
In this video, we will look at Python Scrapy and how to create a spider that crawls websites to scrape and structure data.
Download Kite free:
https://kite.com/download/?utm_medium=referral&utm_source=youtube&utm_campaign=TechGuyWeb&utm_content=scrapy-tutorial
Code & Commands:
https://gist.github.com/bradtraversy/94df3dd4168b82a204273d0ca80e17f8
💖 Become a Patron: Show support & get perks!
http://www.patreon.com/traversymedia
Website & Udemy Course Links:
https://www.traversymedia.com
Follow Traversy Media:
https://www.twitter.com/traversymedia
https://www.instagram.com/traversymedia
https://www.facebook.com/traversymedia
detail
{'title': 'Intro To Web Crawlers & Scraping With Scrapy', 'heatmap': [{'end': 245.831, 'start': 173.55, 'weight': 0.843}, {'end': 539.014, 'start': 518.461, 'weight': 1}], 'summary': "Titled 'Intro To Web Crawlers & Scraping With Scrapy', this video covers web scraping using Scrapy, focusing on ethical considerations and extracting post titles, dates, and authors from a blog. It includes setting up Scrapy, HTML scraping, using XPath selectors, creating a JSON file, and scraping websites, with examples demonstrating the process and functionality.", 'chapters': [{'end': 143.771, 'segs': [{'end': 63.983, 'src': 'embed', 'start': 25.779, 'weight': 0, 'content': [{'end': 33.224, 'text': "What's going on, guys? In this video, we're going to look at Scrapy, which is a Python framework for crawling websites and extracting data.", 'start': 25.779, 'duration': 7.445}, {'end': 41.29, 'text': "And there's a bunch of reasons why you might want to use something like this: for data analysis, for data mining, information processing.", 'start': 33.764, 'duration': 7.526}, {'end': 46.133, 'text': 'A lot of services and websites give you data APIs to work with, but not all of them do.', 'start': 41.67, 'duration': 4.463}, {'end': 50.536, 'text': "So there might be a website where you want some data, but there's no API available.", 'start': 46.233, 'duration': 4.303}, {'end': 52.437, 'text': 'So you can scrape the data yourself.', 'start': 50.856, 'duration': 1.581}, {'end': 58.4, 'text': "Now, you have to keep in mind that there's a lot of ethics and even legality that goes into web scraping.", 'start': 52.557, 'duration': 5.843}, {'end': 63.983, 'text': "So if you're using it in a professional sense for a product or your company or something, you really want to look at that.", 'start': 58.66, 'duration': 5.323}], 'summary': 'Introduction to Scrapy, a Python framework for web scraping and data extraction, highlighting its relevance for data analysis and the ethical considerations involved.', 'duration': 38.204, 'max_score': 25.779, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ25779.jpg'}, {'end': 107.447, 'src': 'embed', 'start': 81.973, 'weight': 2, 'content': [{'end': 87.135, 'text': "it's called the Scrapinghub blog, and I've seen this in a bunch of tutorials.", 'start': 81.973, 'duration': 5.162}, {'end': 90.997, 'text': "so I figured it's fine for us to use in this video,", 'start': 87.135, 'duration': 3.862}, {'end': 97.84, 'text': "and it's just a regular blog and you can see that it has a bunch of posts and then it has other pages of posts.", 'start': 90.997, 'duration': 6.843}, {'end': 101.642, 'text': "so what I'd like to do is scrape every single post and get the title.", 'start': 97.84, 'duration': 3.802}, {'end': 105.685, 'text': 'So I want the title, the date and the author.', 'start': 102.862, 'duration': 2.823}, {'end': 107.447, 'text': 'And of course, you could get other stuff as well.', 'start': 105.745, 'duration': 1.702}], 'summary': 'Scrape the Scrapinghub blog for post titles, dates, and authors.', 'duration': 25.474, 'max_score': 81.973, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ81973.jpg'}], 'start': 7.096, 'title': 'Web scraping with scrapy', 'summary': 'Introduces Scrapy, a Python framework for web scraping, focusing on ethical considerations and extracting post titles, dates, and authors from a blog.', 'chapters': [{'end': 143.771, 'start': 7.096, 
'title': 'Introduction to web scraping with scrapy', 'summary': 'Discusses the use of Scrapy, a Python framework for web scraping, to extract data from a website, emphasizing ethical considerations and the goal of extracting post titles, dates, and authors from a blog.', 'duration': 136.675, 'highlights': ['Scrapy is a Python framework for crawling websites and extracting data.', 'Ethical and legal considerations are important when web scraping for professional use.', 'The goal is to scrape post titles, dates, and authors from the specified blog and generate a JSON file.']}], 'duration': 136.675, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ7096.jpg', 'highlights': ['Scrapy is a Python framework for crawling websites and extracting data.', 'Ethical and legal considerations are important when web scraping for professional use.', 'The goal is to scrape post titles, dates, and authors from the specified blog and generate a JSON file.']}, {'end': 418.463, 'segs': [{'end': 245.831, 'src': 'heatmap', 'start': 166.328, 'weight': 3, 'content': [{'end': 172.93, 'text': 'So if we do python3 -m venv and create a folder called venv.', 'start': 166.328, 'duration': 6.602}, {'end': 175.31, 'text': 'So that will be our virtual environment.', 'start': 173.55, 'duration': 1.76}, {'end': 178.371, 'text': "And you can see inside the bin folder there's an activate script.", 'start': 175.33, 'duration': 3.041}, {'end': 185.514, 'text': 'So we just want to call that with source venv/bin/activate.', 'start': 178.852, 'duration': 6.662}, {'end': 188.491, 'text': 'Okay, so activate our virtual environment.', 'start': 186.55, 'duration': 1.941}, {'end': 195.236, 'text': "And if you're in VS Code, you want to just Command+Shift+P or Ctrl+Shift+P, search for Python,", 'start': 188.792, 'duration': 6.444}, {'end': 198.518, 'text': 'select interpreter and just select your virtual environment.', 'start': 195.236, 'duration': 3.282}, {'end': 201.28, 'text': 'Mine is called venv.', 'start': 198.778, 'duration': 2.502}, {'end': 202.721, 'text': 'Okay, so now we should be all set.', 'start': 201.28, 'duration': 1.441}, {'end': 207.364, 'text': "So I'm going to install Scrapy, and we want to use pip for that, or Pipenv if you're using that.", 'start': 202.781, 'duration': 4.583}, {'end': 209.786, 'text': "So let's install Scrapy.", 'start': 207.985, 'duration': 1.801}, {'end': 220.387, 'text': "Okay, so once Scrapy's installed, we can go ahead and create a project by saying scrapy startproject.", 'start': 211.761, 'duration': 8.626}, {'end': 226.851, 'text': "And then I'm just going to call this, we'll call it postscrape or postcrawl, whatever you want to call it.", 'start': 220.547, 'duration': 6.304}, {'end': 230.674, 'text': "And then we're just going to cd into postscrape.", 'start': 227.011, 'duration': 3.663}, {'end': 234.744, 'text': "And then let's take a look at the folder up here that was created.", 'start': 232.102, 'duration': 2.642}, {'end': 240.728, 'text': "Now there's another folder called postscrape inside of it, along with this scrapy.cfg file.", 'start': 235.024, 'duration': 5.704}, {'end': 245.831, 'text': 'And inside this folder, we have a file for middlewares, for pipelines.', 'start': 241.428, 'duration': 4.403}], 'summary': 'Creating a Python virtual environment, installing Scrapy, and setting up a project for web scraping.', 'duration': 54.059, 'max_score': 166.328, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ166328.jpg'}, {'end': 289.827, 'src': 'embed', 'start': 264.362, 'weight': 0, 'content': [{'end': 271.324, 'text': "So inside the spiders folder, we're going to create a new file called posts_spider.py.", 'start': 264.362, 'duration': 6.962}, {'end': 275.004, 'text': 'OK, so this is going to be our main file that we work with.', 'start': 271.344, 'duration': 3.66}, {'end': 279.825, 'text': "And the first thing we're going to do is import scrapy so that we can use it.", 'start': 275.664, 'duration': 4.161}, {'end': 283.026, 'text': 'And then we need to create a spider class.', 'start': 280.685, 'duration': 2.341}, {'end': 284.426, 'text': "So let's say class.", 'start': 283.346, 'duration': 1.08}, {'end': 287.766, 'text': "We'll call it PostsSpider.", 'start': 284.466, 'duration': 3.3}, {'end': 289.827, 'text': 'And this needs to extend.', 'start': 288.147, 'duration': 1.68}], 'summary': "Creating the 'posts_spider.py' file in the 'spiders' folder to import and extend scrapy.", 'duration': 25.465, 'max_score': 264.362, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ264362.jpg'}, {'end': 365.914, 'src': 'embed', 'start': 342.325, 'weight': 1, 'content': [{'end': 353.029, 'text': "And we're going to do https://blog.scrapinghub.com and we can do a slash.", 'start': 342.325, 'duration': 10.704}, {'end': 356.13, 'text': "Let's do /page/1/.", 'start': 353.049, 'duration': 3.081}, {'end': 358.671, 'text': "And then I'm also going to get page two.", 'start': 356.81, 'duration': 1.861}, {'end': 365.914, 'text': "Now, in the end, I'm going to show you how we can just take the root URL and get all of the data from all the pages.", 'start': 359.171, 'duration': 6.743}], 'summary': 'Demonstrating how to scrape data from multiple pages on https://blog.scrapinghub.com', 'duration': 23.589, 'max_score': 342.325, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ342325.jpg'}, {'end': 418.463, 'src': 'embed', 'start': 390.064, 'weight': 2, 'content': [{'end': 397.711, 'text': "This is the default callback used by Scrapy to process download responses when their requests don't specify a callback.", 'start': 390.064, 'duration': 7.647}, {'end': 402.495, 'text': "So it's basically in charge of processing the response and returning the scraped data.", 'start': 397.751, 'duration': 4.744}, {'end': 412.6, 'text': "so let's go ahead and define parse, and we want to pass in self, since it's a method of this class, and then it takes in a response.", 'start': 403.055, 'duration': 9.545}, {'end': 418.463, 'text': 'okay, so the response is basically the data that we scrape.', 'start': 412.6, 'duration': 5.863}], 'summary': 'The default callback in Scrapy processes download responses, returning the scraped data.', 'duration': 28.399, 'max_score': 390.064, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ390064.jpg'}], 'start': 143.771, 'title': 'Web scraping with scrapy', 'summary': 'Covers setting up Scrapy for web scraping and creating a spider, including steps like folder structure, class creation, importing scrapy, adding the name and start URLs, and defining the parse method.', 'chapters': [{'end': 220.387, 'start': 143.771, 'title': 'Using scrapy for web scraping', 'summary': 'Demonstrates setting up a virtual environment for a Python project and installing Scrapy for web scraping using pip, with a demonstration of creating a project using scrapy startproject.', 'duration': 76.616, 'highlights': ['Setting up a virtual environment using python3 -m venv and activating it using source venv/bin/activate.', 'Installing Scrapy using pip and creating a project with scrapy startproject.']}, {'end': 418.463, 'start': 220.547, 'title': 'Creating a spider with scrapy', 'summary': 'Explains the process of creating a spider using Scrapy, including steps like folder structure, creating a spider class, importing scrapy, adding the name and start URLs, and defining the parse method for processing scraped data.', 'duration': 197.916, 'highlights': ['The chapter explains the process of creating a spider using Scrapy, including steps like folder structure, creating a spider class, importing scrapy, adding the name and start URLs, and defining the parse method for processing scraped data.', "The start URLs are set to 'https://blog.scrapinghub.com/page/1/' and 'https://blog.scrapinghub.com/page/2/', indicating the pages to be crawled initially.", "The 'parse' method is defined to process the scraped data, taking in the response and returning the scraped data. It is the default callback used by Scrapy to process download responses when their requests don't specify a callback."]}], 'duration': 274.692, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ143771.jpg', 'highlights': ['The chapter explains the process of creating a spider using Scrapy, including steps like folder structure, creating a spider class, importing scrapy, adding the name and start URLs, and defining the parse method for processing scraped data.', "The start URLs are set to 'https://blog.scrapinghub.com/page/1/' and 'https://blog.scrapinghub.com/page/2/', indicating the pages to be crawled initially.", "The 'parse' method is defined to process the scraped data, taking in the response and returning the scraped data.", 'Setting up a virtual environment using python3 -m venv and activating it using source venv/bin/activate.', 'Installing Scrapy using pip and creating a project with scrapy startproject.']},
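For reference, a minimal sketch of the setup commands and the spider skeleton these chapters describe, assuming a Unix-like shell and Scrapy installed inside the virtual environment (the project name postscrape follows the video; adjust to taste):

    # Shell setup described above:
    #   python3 -m venv venv
    #   source venv/bin/activate
    #   pip install scrapy
    #   scrapy startproject postscrape

    # postscrape/spiders/posts_spider.py
    import scrapy

    class PostsSpider(scrapy.Spider):
        name = 'posts'  # the name used later with 'scrapy crawl posts'
        start_urls = [
            'https://blog.scrapinghub.com/page/1/',
            'https://blog.scrapinghub.com/page/2/',
        ]

        def parse(self, response):
            # Default callback: receives the downloaded response for each URL.
            pass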
{'end': 938.704, 'segs': [{'end': 441.695, 'src': 'embed', 'start': 418.463, 'weight': 0, 'content': [{'end': 426.627, 'text': "in this case, we're just going to basically copy both of these pages and create two new HTML files with the same exact HTML.", 'start': 418.463, 'duration': 8.164}, {'end': 428.689, 'text': "so we're scraping the entire page.", 'start': 426.627, 'duration': 2.062}, {'end': 435.912, 'text': "later on we're going to target certain elements of the page using selectors and put them into a JSON file.", 'start': 428.689, 'duration': 7.223}, {'end': 441.695, 'text': "So I'm going to create a variable called page and set that to response.url.", 'start': 436.532, 'duration': 5.163}], 'summary': 'Scraping two pages to create new HTML files and extracting specific elements into a JSON file.', 'duration': 23.232, 'max_score': 418.463, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ418463.jpg'}, {'end': 541.996, 'src': 'heatmap', 'start': 518.461, 'weight': 1, 'content': [{'end': 527.843, 'text': "so let's save that, and we can run this with scrapy crawl and then whatever the name is, in this case, posts.", 'start': 518.461, 'duration': 9.382}, {'end': 530.624, 'text': 'okay, so whatever we put here is what we want to put here.', 'start': 527.843, 'duration': 2.781}, {'end': 534.251, 'text': 'OK, so we ran it.', 'start': 533.37, 'duration': 0.881}, {'end': 539.014, 'text': 'Now you can see over here we have post-1.html and post-2.html.', 'start': 534.311, 'duration': 4.703}, {'end': 541.996, 'text': 'So post-1 is going to be this.', 'start': 539.494, 'duration': 2.502}], 'summary': 'The transcript discusses running scrapy crawl and obtaining the post-1 and post-2 HTML files.', 'duration': 23.535, 'max_score': 518.461, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ518461.jpg'},
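As a sketch, the parse method that saves each page's HTML to disk, as described here (the post-1.html / post-2.html filename scheme follows the video; run it with scrapy crawl posts):

    def parse(self, response):
        # 'https://blog.scrapinghub.com/page/1/'.split('/')[-2] -> '1'
        page = response.url.split('/')[-2]
        filename = f'post-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)  # write the raw HTML bytes of the page
        self.log(f'Saved file {filename}')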
{'end': 598.399, 'src': 'embed', 'start': 571.879, 'weight': 2, 'content': [{'end': 577.922, 'text': "because we're going to work with the shell, just to kind of show you how to select things, how to use methods and so on.", 'start': 571.879, 'duration': 6.043}, {'end': 587.249, 'text': 'Um, so the way we go into our shell is we call scrapy shell and then the domain or the URL that we want to crawl.', 'start': 578.702, 'duration': 8.547}, {'end': 593.975, 'text': 'So in this case, https://blog.scraping...', 'start': 587.329, 'duration': 6.646}, {'end': 598.399, 'text': "I can't remember the URL exactly: scrapinghub.com.", 'start': 594.075, 'duration': 4.324}], 'summary': 'Demonstrating how to use scrapy shell for web crawling.', 'duration': 26.52, 'max_score': 571.879, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ571879.jpg'}, {'end': 649.188, 'src': 'embed', 'start': 621.793, 'weight': 1, 'content': [{'end': 627.417, 'text': "So to use a CSS selector, we just do .css and let's say we want to get the title.", 'start': 621.793, 'duration': 5.624}, {'end': 636.863, 'text': 'So what this returns is something called a SelectorList, which represents a list of selector objects that wrap around HTML elements.', 'start': 628.88, 'duration': 7.983}, {'end': 638.503, 'text': 'So you can see Selector.', 'start': 637.203, 'duration': 1.3}, {'end': 641.404, 'text': "It has the XPath, which I'll talk about in a little bit.", 'start': 638.843, 'duration': 2.561}, {'end': 646.006, 'text': 'And data is going to be the actual element, in this case the title,', 'start': 641.804, 'duration': 4.202}, {'end': 649.188, 'text': 'with the tags and the text inside of it.', 'start': 646.426, 'duration': 2.762}], 'summary': 'Using a CSS selector with .css returns a SelectorList, including the XPath and data for each element.', 'duration': 27.395, 'max_score': 621.793, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ621793.jpg'}, {'end': 887.723, 'src': 'embed', 'start': 856.481, 'weight': 4, 'content': [{'end': 863.046, 'text': 'if we want to get the second one, we could do that, which is the date, all right.', 'start': 856.481, 'duration': 6.565}, {'end': 864.226, 'text': 'so pretty easy.', 'start': 863.046, 'duration': 1.18}, {'end': 869.21, 'text': 'um, we can also use regular expressions.', 'start': 864.226, 'duration': 4.984}, {'end': 872.912, 'text': "okay, so there's a method called re for regular expressions.", 'start': 869.21, 'duration': 3.702}, {'end': 883.96, 'text': "so if I say response.css and let's say we want to get all the paragraph text, and let's say .re,", 'start': 872.912, 'duration': 11.048}, {'end': 887.723, 'text': 'and in here we have to format this with an r-string, like this.', 'start': 884.72, 'duration': 3.003}], 'summary': 'Demonstrating the use of regular expressions for text extraction.', 'duration': 31.242, 'max_score': 856.481, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ856481.jpg'}], 'start': 418.463, 'title': 'Html scraping and website content with scrapy', 'summary': 'Covers the process of scraping entire HTML pages and creating new HTML files, targeting specific elements and putting them into a JSON file. It also covers using Scrapy to scrape website content, including selecting elements using CSS selectors, extracting text, and using regular expressions, with concrete examples.', 'chapters': [{'end': 518.461, 'start': 418.463, 'title': 'Html scraping and file creation', 'summary': 'Covers the process of scraping entire HTML pages and creating new HTML files, then targeting specific elements using selectors and putting them into a JSON file, with a detailed explanation of the code involved.', 'duration': 99.998, 'highlights': ['The process involves copying both pages and creating two new HTML files with the entire HTML content, followed by targeting specific elements using selectors and putting them into a JSON file.', "The code includes creating a variable 'page' set to response.url, using the split method to extract the page number, setting a file name with the page number, and creating the file by writing the entire HTML content to it."]}, {'end': 938.704, 'start': 518.461, 'title': 'Scraping website content with scrapy', 'summary': 'Covers using Scrapy to scrape website content, including selecting elements using CSS selectors, extracting text, and using regular expressions, demonstrating the process with concrete examples.', 'duration': 420.243, 'highlights': ['Using CSS selectors to select elements and extract text: the speaker demonstrates using CSS selectors to select elements like the title, h3s, and paragraphs, and methods like get and getall to extract specific content.', 'Demonstrating the use of regular expressions for text extraction: the speaker shows how to use regular expressions to extract specific instances from the text, such as finding instances of a word or pattern in the content.', 'Using Scrapy to scrape website content: the speaker showcases the process of using Scrapy to scrape and save website content, emphasizing its usefulness for offline viewing or scraping entire websites.']}], 'duration': 520.241, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ418463.jpg', 'highlights': ['The process involves copying both pages and creating two new HTML files with the entire HTML content, followed by targeting specific elements using selectors and putting them into a JSON file.', 'Using CSS selectors to select elements and extract text: the speaker demonstrates using CSS selectors to select elements like the title, h3s, and paragraphs, and methods like get and getall to extract specific content.', 'Using Scrapy to scrape website content: the speaker showcases the process of using Scrapy to scrape and save website content, emphasizing its usefulness for offline viewing or scraping entire websites.', "The code includes creating a variable 'page' set to response.url, using the split method to extract the page number, setting a file name with the page number, and creating the file by writing the entire HTML content to it.", 'Demonstrating the use of regular expressions for text extraction: the speaker shows how to use regular expressions to extract specific instances from the text, such as finding instances of a word or pattern in the content.']},
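A sketch of the scrapy shell session these chapters describe; the selector calls are standard Scrapy, though the exact pattern passed to .re() in the video may differ from the assumed one below:

    # scrapy shell 'https://blog.scrapinghub.com'
    response.css('title')                    # SelectorList wrapping the <title> element
    response.css('title::text').get()        # text of the first match
    response.css('h3::text').getall()        # text of every h3, as a list of strings
    response.css('p::text').re(r'scrap\w+')  # regex matches within the paragraph text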
{'end': 1351.401, 'segs': [{'end': 968.957, 'src': 'embed', 'start': 938.744, 'weight': 0, 'content': [{'end': 939.904, 'text': 'You can get data that way.', 'start': 938.744, 'duration': 1.16}, {'end': 946.888, 'text': 'So now what I want to do is take a look at using XPath selectors.', 'start': 941.085, 'duration': 5.803}, {'end': 954.671, 'text': 'So XPath is a language for selecting nodes in XML documents, and it can also be used with HTML.', 'start': 947.488, 'duration': 7.183}, {'end': 958.152, 'text': "It's really difficult for me, actually.", 'start': 954.931, 'duration': 3.221}, {'end': 959.233, 'text': "It's kind of confusing.", 'start': 958.212, 'duration': 1.021}, {'end': 964.955, 'text': 'But the CSS selectors are kind of like syntactic sugar for the XPath selectors.', 'start': 959.713, 'duration': 5.242}, {'end': 966.996, 'text': "The XPath is what's happening under the hood.", 'start': 965.015, 'duration': 1.981}, {'end': 968.957, 'text': 'But you can use them directly.', 'start': 967.336, 'duration': 1.621}], 'summary': 'Introduction to using XPath and CSS selectors for selecting nodes in XML and HTML documents.', 'duration': 30.213, 'max_score': 938.744, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ938744.jpg'}, {'end': 1094.808, 'src': 'embed', 'start': 1060.828, 'weight': 2, 'content': [{'end': 1063.591, 'text': 'the next thing I want to do is what I said.', 'start': 1060.828, 'duration': 2.763}, {'end': 1068.195, 'text': 'I want to get the title, the date and the author,', 'start': 1063.591, 'duration': 4.604}, {'end': 1073.059, 'text': "and we're first going to do that in the terminal here and we're going to do it with just the first post,", 'start': 1068.195, 'duration': 4.864}, {'end': 1078.545, 'text': "and then I'll show you how we can kind of loop through the posts and get each set of data.", 'start': 1073.059, 'duration': 5.486}, {'end': 1087.738, 'text': "So let's set a variable so we can set variables here, and we're going to set this post to response.css.", 'start': 1079.185, 'duration': 8.553}, {'end': 1094.808, 'text': 'OK, so these queries, we can put them inside of variables, and we want to grab the div.', 'start': 1088.459, 'duration': 6.349}], 'summary': 'Demonstrating how to extract the title, date, and author from the first post and loop through the posts in the terminal.', 'duration': 33.98, 'max_score': 1060.828, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ1060828.jpg'}, {'end': 1351.401, 'src': 'embed', 'start': 1322.862, 'weight': 1, 'content': [{'end': 1332.007, 'text': "so I'm going to use the dict function here and then pass in title is going to be the title variable,", 'start': 1322.862, 'duration': 9.145}, {'end': 1339.593, 'text': 'date will be the date variable and the author will be the author variable.', 'start': 1332.007, 'duration': 7.586}, {'end': 1348.499, 'text': "OK, so we'll run that and you can see that now we have a bunch of dictionaries that have the title, the date and the author.", 'start': 1339.613, 'duration': 8.886}, {'end': 1351.401, 'text': "So we've looped through all the posts and outputted that data.", 'start': 1348.559, 'duration': 2.842}], 'summary': 'Using the dict function to create dictionaries from the title, date, and author variables, outputting the data after looping through the posts.', 'duration': 28.539, 'max_score': 1322.862, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ1322862.jpg'}], 'start': 938.744, 'title': 'Data extraction and web scraping in python', 'summary': 'Covers using XPath selectors for data extraction in Python, focusing on their syntax and usage with HTML, as well as scraping post data using Python and Scrapy to extract titles, dates, and authors from websites, including techniques for looping through multiple posts.', 'chapters': [{'end': 1060.828, 'start': 938.744, 'title': 'Using xpath selectors for data extraction', 'summary': 'Explains how to use XPath selectors for data extraction in Python, highlighting their syntax and usage with HTML. It also touches on using Chrome tools to obtain the XPath for elements, demonstrating its utility in targeting and extracting specific data.', 'duration': 122.084, 'highlights': ["The chapter explains the usage of XPath selectors for data extraction, demonstrating how to select nodes in XML documents and HTML. It also discusses the relationship between CSS selectors and XPath, emphasizing the latter's role as the underlying mechanism.", 'The transcript details the process of obtaining the XPath for elements using Chrome tools, showcasing how it can be used to target and extract specific data, such as obtaining the XPath for an author link and retrieving the associated text.', 'The speaker highlights the challenges of using XPath selectors, expressing difficulty and confusion in working with them, while acknowledging their potential for more precise targeting compared to CSS selectors.']}, {'end': 1351.401, 'start': 1060.828, 'title': 'Scraping post data in python', 'summary': 'Demonstrates how to scrape the title, date, and author from a website using Python and Scrapy, including techniques for extracting data from individual posts and looping through multiple posts.', 'duration': 290.573, 'highlights': ['Demonstrates how to scrape the title, date, and author from a website using Python and Scrapy.', 'Techniques for extracting data from individual posts and looping through multiple posts, allowing for efficient scraping of multiple pieces of data.', 'Illustrates the use of CSS queries and variables to extract specific elements, such as the title, date, and author, from the website.']}], 'duration': 412.657, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ938744.jpg', 'highlights': ['Covers using XPath selectors for data extraction in Python, focusing on their syntax and usage with HTML, as well as scraping post data using Python and Scrapy to extract titles, dates, and authors from websites, including techniques for looping through multiple posts.', 'Demonstrates how to scrape the title, date, and author from a website using Python and Scrapy.', 'Techniques for extracting data from individual posts and looping through multiple posts.', 'The chapter explains the usage of XPath selectors for data extraction, demonstrating how to select nodes in XML documents and HTML, and highlighting its application in targeting specific elements.']},
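Sketched in the shell, the per-post extraction these chapters walk through. The CSS classes below (post-item, post-header) are assumptions based on the blog's markup at the time of the video and may have changed since:

    # XPath equivalent of response.css('title::text').get():
    response.xpath('//title/text()').get()

    # Grab the first post, then pull out its fields:
    post = response.css('div.post-item')[0]
    title = post.css('.post-header h2 a::text').get()
    date = post.css('.post-header .date a::text').get()
    author = post.css('.post-header .author a::text').get()
    dict(title=title, date=date, author=author)  # -> one dictionary for the post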
{'end': 1531.239, 'segs': [{'end': 1410.594, 'src': 'embed', 'start': 1385.621, 'weight': 1, 'content': [{'end': 1390.924, 'text': "because again I'm going to show you how we can go through all the pages without having to actually manually add them.", 'start': 1385.621, 'duration': 5.303}, {'end': 1394.486, 'text': 'So we want that there.', 'start': 1391.664, 'duration': 2.822}, {'end': 1397.767, 'text': "And then in the parse, we don't need any of this.", 'start': 1394.626, 'duration': 3.141}, {'end': 1400.928, 'text': "We're not just going to copy the pages like we did before.", 'start': 1397.847, 'duration': 3.081}, {'end': 1404.21, 'text': 'We want to loop through like I just showed you.', 'start': 1401.969, 'duration': 2.241}, {'end': 1410.594, 'text': "We're going to say for post in response.css.", 'start': 1404.25, 'duration': 6.344}], 'summary': 'Demonstrating automatic page crawling and looping through response pages.', 'duration': 24.973, 'max_score': 1385.621, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ1385621.jpg'}, {'end': 1531.239, 'src': 'embed', 'start': 1476.565, 'weight': 0, 'content': [{'end': 1481.049, 'text': 'And again, this could be a class if there was an actual class on the link, or an ID or something.', 'start': 1476.565, 'duration': 4.484}, {'end': 1482.174, 'text': 'All right.', 'start': 1481.874, 'duration': 0.3}, {'end': 1486.237, 'text': 'So just doing that, I think we should be good.', 'start': 1482.234, 'duration': 4.003}, {'end': 1488.559, 'text': "So I'm going to save and then I'm going to go down here.", 'start': 1486.277, 'duration': 2.282}, {'end': 1494.823, 'text': 'And if I just run scrapy crawl and then posts.', 'start': 1488.579, 'duration': 6.244}, {'end': 1502.543, 'text': "All it's going to do really is just show me down here in the console.", 'start': 1498.922, 'duration': 3.621}, {'end': 1504.724, 'text': 'You can see the data, the title, and so on.', 'start': 1502.583, 'duration': 2.141}, {'end': 1508.205, 'text': 'So we want to actually put this into a JSON file.', 'start': 1505.164, 'duration': 3.041}, {'end': 1514.748, 'text': 'So we just want to add on to this the output flag and say post.json.', 'start': 1508.265, 'duration': 6.483}, {'end': 1518.829, 'text': 'And you can do a .jl format, which is JSON lines.', 'start': 1515.148, 'duration': 3.681}, {'end': 1519.749, 'text': 'You can do CSV.', 'start': 1518.849, 'duration': 0.9}, {'end': 1521.01, 'text': 'You can do all kinds of stuff.', 'start': 1519.829, 'duration': 1.181}, {'end': 1525.013, 'text': "We're going to do a JSON file and check it out.", 'start': 1521.89, 'duration': 3.123}, {'end': 1529.257, 'text': 'So now we have a JSON array with all of the posts.', 'start': 1525.734, 'duration': 3.523}, {'end': 1531.239, 'text': "But notice it's only on the first page.", 'start': 1529.337, 'duration': 1.902}], 'summary': 'Demonstrating data scraping and saving it as a JSON file.', 'duration': 54.674, 'max_score': 1476.565, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ1476565.jpg'}], 'start': 1352.392, 'title': 'Creating a json file from web scraping', 'summary': 'Covers the process of extracting data and saving it into a JSON file using Scrapy, resulting in a JSON array with all the posts. 
It includes examples of code snippets and demonstrates the functionality.', 'chapters': [{'end': 1531.239, 'start': 1352.392, 'title': 'Creating json file from web scraping', 'summary': 'Covers creating a JSON file from web scraping to store data, with examples of code snippets, demonstrating the process of extracting data and saving it into a JSON file using Scrapy, resulting in a JSON array with all the posts.', 'duration': 178.847, 'highlights': ['The chapter demonstrates the process of extracting data from web scraping and saving it into a JSON file using Scrapy, resulting in a JSON array with all the posts.', 'The speaker explains the process of looping through the posts and yielding a dictionary with the title, date, and author from the response, showing how to automate the data extraction without manually adding the pages.', "The chapter also covers adding the output flag 'post.json' to save the extracted data into a JSON file using Scrapy, providing options for different output formats such as JSON lines and CSV."]}], 'duration': 178.847, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ1352392.jpg', 'highlights': ['The chapter demonstrates the process of extracting data from web scraping and saving it into a JSON file using Scrapy, resulting in a JSON array with all the posts.', 'The speaker explains the process of looping through the posts and yielding a dictionary with the title, date, and author from the response, showing how to automate the data extraction without manually adding the pages.', "The chapter also covers adding the output flag 'post.json' to save the extracted data into a JSON file using Scrapy, providing options for different output formats such as JSON lines and CSV."]},
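Putting that into the spider, a sketch of the parse method that yields one dictionary per post (same hedged class names as above), followed by the output flag described in this chapter:

    def parse(self, response):
        for post in response.css('div.post-item'):
            yield {
                'title': post.css('.post-header h2 a::text').get(),
                'date': post.css('.post-header .date a::text').get(),
                'author': post.css('.post-header .author a::text').get(),
            }

    # Write the results to a JSON array (use .jl or .csv for other formats):
    #   scrapy crawl posts -o post.json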
{'end': 1734.652, 'segs': [{'end': 1590.058, 'src': 'embed', 'start': 1557.66, 'weight': 2, 'content': [{'end': 1563.485, 'text': 'So remember that, and then it has an href attribute that goes to the next page that we want to scrape.', 'start': 1557.66, 'duration': 5.825}, {'end': 1573.337, 'text': 'So, back here, we want to go on the same level as our for loop here and create a next page variable.', 'start': 1564.325, 'duration': 9.012}, {'end': 1577.542, 'text': "And we're going to set that to response.css.", 'start': 1573.978, 'duration': 3.564}, {'end': 1583.491, 'text': "And remember, it's a link with a class of next.", 'start': 1580.607, 'duration': 2.884}, {'end': 1587.275, 'text': 'What was it? next-posts-link.', 'start': 1584.532, 'duration': 2.743}, {'end': 1590.058, 'text': 'But we want the actual attribute, so we can do this.', 'start': 1587.375, 'duration': 2.683}], 'summary': 'Create a next page variable to scrape the href attribute for the next page.', 'duration': 32.398, 'max_score': 1557.66, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ1557660.jpg'}, {'end': 1649.795, 'src': 'embed', 'start': 1620.28, 'weight': 3, 'content': [{'end': 1627.663, 'text': "so we want to say if it's not None, then let's say next page is going to equal response.", 'start': 1620.28, 'duration': 7.383}, {'end': 1630.585, 'text': "And then there's a method called urljoin.", 'start': 1627.723, 'duration': 2.862}, {'end': 1633.386, 'text': 'And we want to join in the next page.', 'start': 1631.505, 'duration': 1.881}, {'end': 1637.188, 'text': "OK, so basically we're scraping the next page as well.", 'start': 1633.406, 'duration': 3.782}, {'end': 1644.092, 'text': 'And then the last thing we have to do is just call yield scrapy.Request.', 'start': 1637.588, 'duration': 6.504}, {'end': 1649.795, 'text': 'And that takes in our next page and a callback.', 'start': 1645.032, 'duration': 4.763}], 'summary': 'Scraping the next page using yield scrapy.Request.', 'duration': 29.515, 'max_score': 1620.28, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ1620280.jpg'},
{'end': 1734.652, 'src': 'embed', 'start': 1668.723, 'weight': 0, 'content': [{'end': 1674.59, 'text': "So output post.json, and it's going to take a little longer because there's more data to go through.", 'start': 1668.723, 'duration': 5.867}, {'end': 1679.902, 'text': "So right now it's just scraping the entire blog.", 'start': 1676.778, 'duration': 3.124}, {'end': 1683.848, 'text': 'And if I go to my post.json, check that out.', 'start': 1680.824, 'duration': 3.024}, {'end': 1691.959, 'text': "So now there's tons more data because it went through every single page and it took the title, the date, and the author.", 'start': 1683.928, 'duration': 8.031}, {'end': 1696.285, 'text': "right. so I mean, that's pretty much it.", 'start': 1693.564, 'duration': 2.721}, {'end': 1698.265, 'text': "there's a lot more you can do.", 'start': 1696.285, 'duration': 1.98}, {'end': 1703.547, 'text': "that's much more advanced, but I think that just for the amount of code that we wrote here, what is this?", 'start': 1698.265, 'duration': 5.282}, {'end': 1706.267, 'text': "it's 21 lines, counting the spaces here.", 'start': 1703.547, 'duration': 2.72}, {'end': 1715.25, 'text': "um, so you know, roughly 20 lines of code, and we're able to scrape an entire website and get certain pieces of data.", 'start': 1706.267, 'duration': 8.983}, {'end': 1722.776, 'text': "I don't know how useful some blog fields are, but if you go to like an e-commerce site for a certain category,", 'start': 1715.25, 'duration': 7.526}, {'end': 1726.621, 'text': 'maybe you want to have a list of all the products or something like that.', 'start': 1722.776, 'duration': 3.845}, {'end': 1729.665, 'text': 'Scrapy is really good for stuff like that.', 'start': 1727.542, 'duration': 2.123}, {'end': 1732.469, 'text': 'So hopefully you learned something here and you enjoyed it.', 'start': 1730.206, 'duration': 2.263}, {'end': 1733.27, 'text': "And that's it.", 'start': 1732.749, 'duration': 0.521}, {'end': 1734.652, 'text': 'I will see you in the next video.', 'start': 1733.45, 'duration': 1.202}], 'summary': 'Scraping an entire website using roughly 20 lines of code to extract specific data, and demonstrating its potential for e-commerce sites.', 'duration': 65.929, 'max_score': 1668.723, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ1668723.jpg'}],
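A sketch of the pagination logic this section describes, added at the same level as the for loop inside parse (the next-posts-link class is the one named in the video; verify it against the live markup):

    next_page = response.css('a.next-posts-link::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)  # resolve a relative href to a full URL
        yield scrapy.Request(next_page, callback=self.parse)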
'start': 1531.9, 'title': 'Scraping websites with scrapy', 'summary': "Covers scraping pagination, following links, and utilizing CSS selectors for data extraction, enabling the scraping of entire sites. It also highlights Scrapy's capability to scrape entire websites with minimal code, which is particularly useful for e-commerce sites.", 'chapters': [{'end': 1668.703, 'start': 1531.9, 'title': 'Scraping pagination and following links', 'summary': 'Explains how to scrape data from multiple pages by following links, utilizing CSS selectors and checking for the existence of next pages, enabling the scraping of the entire site.', 'duration': 136.803, 'highlights': ['The process involves identifying the next page link by its class and extracting the href attribute, allowing for the scraping of subsequent pages.', 'The script checks for the existence of a next page by ensuring that the next page link is not None, ensuring the complete scraping of the site.', "The next page link is then used to initiate the scraping of the subsequent page by calling scrapy.Request with a callback to the 'parse' method."]}, {'end': 1734.652, 'start': 1668.723, 'title': 'Scraping data from websites', 'summary': 'Demonstrates how to use Scrapy to scrape an entire website and gather specific pieces of data with just 21 lines of code, showcasing its potential for more advanced tasks and its usefulness for e-commerce sites.', 'duration': 65.929, 'highlights': ['Scrapy can scrape an entire website and gather specific data with roughly 20 lines of code, as demonstrated in the 21-line example provided, showcasing its efficiency and simplicity.', 'The process of scraping the entire blog resulted in gathering more data, including the title, date, and author, demonstrating the comprehensive nature of the scraping.', "Scrapy's potential for more advanced tasks is highlighted, indicating that there's much more that can be done beyond the demonstrated example, showcasing its versatility and capability for complex scraping tasks.", 'The usefulness of Scrapy for e-commerce sites is mentioned, suggesting its effectiveness in extracting data such as a list of products, showcasing its practical application in different industry contexts.']}], 'duration': 202.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ALizgnSFTwQ/pics/ALizgnSFTwQ1531900.jpg'}], 'highlights': ['Scrapy can scrape an entire website and gather specific data with roughly 20 lines of code, showcasing its efficiency and simplicity.', 'The process of scraping the entire blog resulted in gathering more data, including the title, date, and author, demonstrating the comprehensive nature of the scraping.', 'The process involves identifying the next page link by its class and extracting the href attribute, allowing for the scraping of subsequent pages.', 'The script checks for the existence of a next page by ensuring that the next page link is not None, ensuring the complete scraping of the site.', "The next page link is then used to initiate the scraping of the subsequent page by calling scrapy.Request with a callback to the 'parse' method.", 'The usefulness of Scrapy for e-commerce sites is mentioned, suggesting its effectiveness in extracting data such as a list of products, showcasing its practical application in different industry contexts.', "Scrapy's potential for more advanced tasks is highlighted, indicating that there's much more that can be done beyond the demonstrated example, showcasing its versatility and capability for complex scraping tasks.", 'The chapter demonstrates the process of extracting data from web scraping and saving it into a JSON file using Scrapy, resulting in a JSON array with all the posts.', 'The speaker explains the process of looping through the posts and yielding a dictionary with the title, date, and author from the response, showing how to automate the data extraction without manually adding the pages.', "The chapter also covers adding the output flag 'post.json' to save the extracted data into a JSON file using Scrapy, providing options for different output formats such as JSON lines and CSV.", 'Covers using XPath selectors for data extraction in Python, focusing on their syntax and usage with HTML, as well as scraping post data using Python and Scrapy to extract titles, dates, and authors from websites, including techniques for looping through multiple posts.', 'Demonstrates how to scrape the title, date, and author from a website using Python and Scrapy.', 'Techniques for extracting data from individual posts and looping through multiple posts.', 'The chapter explains the usage of XPath selectors for data extraction, demonstrating how to select nodes in XML documents and HTML, and highlighting its application in targeting specific elements.', 'The process involves copying both pages and creating two new HTML files with the entire HTML content, followed by targeting specific elements using selectors and putting them into a JSON file.', 'Using CSS selectors to select elements and extract text: the speaker demonstrates using CSS selectors to select elements like the title, h3s, and paragraphs, and methods like get and getall to extract specific content.', 'Using Scrapy to scrape website content: the speaker showcases the process of using Scrapy to scrape and save website content, emphasizing its usefulness for offline viewing or scraping entire websites.', "The code includes creating a variable 'page' set to response.url, using the split method to extract the page number, setting a file name with the page number, and creating the file by writing the entire HTML content to it.", 'Demonstrating the use of regular expressions for text extraction: the speaker shows how to use regular expressions to extract specific instances from the text, such as finding instances of a word or pattern in the content.', 'The chapter explains the process of creating a spider using Scrapy, including steps like folder structure, creating a spider class, importing scrapy, adding the name and start URLs, and defining the parse method for processing scraped data.', "The start URLs are set to 'https://blog.scrapinghub.com/page/1/' and 'https://blog.scrapinghub.com/page/2/', indicating the pages to be crawled initially.", "The 'parse' method is defined to process the scraped data, taking in the response and returning the scraped data.", 'Setting up a virtual environment using python3 -m venv and activating it using source venv/bin/activate.', 'Installing Scrapy using pip and creating a project with scrapy startproject.', 'Scrapy is a Python framework for crawling websites and extracting data.', 'Ethical and legal considerations are important when web scraping for professional use.', 'The goal is to scrape post titles, dates, and authors from the specified blog and generate a JSON file.']}
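Assembled from the chapters above, the complete spider comes out to roughly the 20-odd lines mentioned at the end of the video (with the same hedged class names as before):

    import scrapy

    class PostsSpider(scrapy.Spider):
        name = 'posts'
        start_urls = ['https://blog.scrapinghub.com']

        def parse(self, response):
            # One dictionary per post on the current page.
            for post in response.css('div.post-item'):
                yield {
                    'title': post.css('.post-header h2 a::text').get(),
                    'date': post.css('.post-header .date a::text').get(),
                    'author': post.css('.post-header .author a::text').get(),
                }
            # Follow the pagination link until there is no next page.
            next_page = response.css('a.next-posts-link::attr(href)').get()
            if next_page is not None:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    # Run with: scrapy crawl posts -o post.json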