title
Python Tutorial: Web Scraping with Requests-HTML

description
In this Python Programming Tutorial, we will be learning how to scrape websites using the Requests-HTML library. Requests-HTML is an excellent tool for parsing HTML code and grabbing exactly the information you need. So whether you're pulling down headlines from news sites, scores from sports websites, or prices from an online store... Requests-HTML and Python will help you get this done quickly and easily. Let's get started... The code from this video can be found at: https://github.com/CoreyMSchafer/code_snippets/tree/master/Python/Requests-HTML File Objects Tutorial - https://youtu.be/Uh2ebFW8OYM Requests Tutorial - https://youtu.be/tb8gHvYlCFs F-Strings Tutorial - https://youtu.be/nghuHvKLhJA Try/Except Tutorial - https://youtu.be/NIWwJbo-9_8 CSV Tutorial - https://youtu.be/q5uM4VKywbA

detail
{'title': 'Python Tutorial: Web Scraping with Requests-HTML', 'heatmap': [{'end': 241.124, 'start': 202.508, 'weight': 0.714}, {'end': 1388.924, 'start': 1313.681, 'weight': 0.826}, {'end': 1458.064, 'start': 1414.789, 'weight': 0.709}, {'end': 1571.268, 'start': 1486.395, 'weight': 0.74}], 'summary': 'Learn web scraping and html parsing with python using the requests-html library, covering topics such as using css selectors, accessing html content, working with attributes, parsing video links, handling parsing errors, and making asynchronous requests for efficient data collection, reducing execution time from over 30 seconds to around 3 seconds for 10 different apis.', 'chapters': [{'end': 785.733, 'segs': [{'end': 25.641, 'src': 'embed', 'start': 0.229, 'weight': 0, 'content': [{'end': 5.752, 'text': "Hey there, how's it going everybody? In this video we're going to be learning how to scrape websites using the Requests-HTML library.", 'start': 0.229, 'duration': 5.523}, {'end': 11.534, 'text': "Now I've done a video on web scraping before using Beautiful Soup, which is one of the more popular tools out there.", 'start': 6.092, 'duration': 5.442}, {'end': 17.337, 'text': "but Requests-HTML is a newer project written by Kenneth Reitz, and he's the same person who wrote the Requests library.", 'start': 11.534, 'duration': 5.803}, {'end': 21.539, 'text': 'And he has a history of writing libraries that are easy to use and pretty intuitive.', 'start': 17.757, 'duration': 3.782}, {'end': 25.641, 'text': "so I figured we'd give this library a look to see how we can scrape some web data.", 'start': 21.539, 'duration': 4.102}], 'summary': 'Learn how to use the requests-html library for web scraping.', 'duration': 25.412, 'max_score': 0.229, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g229.jpg'}, {'end': 105.221, 'src': 'embed', 'start': 72.381, 'weight': 2, 'content': [{'end': 76.484, 'text': 'And every post that I have 
has a title with this big heading tag here.', 'start': 72.381, 'duration': 4.103}, {'end': 81.308, 'text': 'And then I have some text of the summary of the video, so a description of the video.', 'start': 76.944, 'duration': 4.364}, {'end': 84.25, 'text': 'And then I have the embedded YouTube video here.', 'start': 81.768, 'duration': 2.482}, {'end': 88.733, 'text': "So let's say that we wanted to write a scraper that would go out and grab this information.", 'start': 84.67, 'duration': 4.063}, {'end': 93.693, 'text': 'So we wanted to grab the post titles, the summaries, and the links to the videos.', 'start': 89.11, 'duration': 4.583}, {'end': 96.535, 'text': 'And we just wanted to ignore all this other stuff here.', 'start': 94.013, 'duration': 2.522}, {'end': 98.176, 'text': 'So, to show you what this would look like,', 'start': 96.915, 'duration': 1.261}, {'end': 105.221, 'text': "let me run the finished script that we'll be writing in this video so that you can see what something like this can do and what it's capable of.", 'start': 98.176, 'duration': 7.045}], 'summary': 'The transcript discusses scraping post titles, summaries, and video links from a webpage, emphasizing the goal of ignoring irrelevant content.', 'duration': 32.84, 'max_score': 72.381, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g72381.jpg'}, {'end': 241.124, 'src': 'heatmap', 'start': 202.508, 'weight': 0.714, 'content': [{'end': 206.43, 'text': 'So first we need to install requests-html.', 'start': 202.508, 'duration': 3.922}, {'end': 208.211, 'text': "So I'm going to pull my terminal back up here.", 'start': 206.53, 'duration': 1.681}, {'end': 211.012, 'text': 'Now we can do this with a simple pip install.', 'start': 208.231, 'duration': 2.781}, {'end': 216.174, 'text': "It's just pip install and that is requests-html.", 'start': 211.232, 'duration': 4.942}, {'end': 219.776, 'text': 'Now I already have mine installed here, but yours 
would install there.', 'start': 216.555, 'duration': 3.221}, {'end': 225.459, 'text': "And once we have that installed, let's look at a very basic example to get us started and then we'll work up from there.", 'start': 219.796, 'duration': 5.663}, {'end': 231.381, 'text': "Now, you don't have to be extremely familiar with HTML in order to scrape websites, but it definitely helps.", 'start': 225.959, 'duration': 5.422}, {'end': 236.962, 'text': 'Basically, HTML is structured in a way where all of the information is contained within certain tags.', 'start': 231.961, 'duration': 5.001}, {'end': 241.124, 'text': "So if you're familiar with XML, then it's very similar to that.", 'start': 237.262, 'duration': 3.862}], 'summary': 'Install request-html using pip install and learn basic html structure for web scraping.', 'duration': 38.616, 'max_score': 202.508, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g202508.jpg'}, {'end': 458.238, 'src': 'embed', 'start': 418.915, 'weight': 3, 'content': [{'end': 425.94, 'text': "so let's say that we wanted to parse out the article headlines and also the summaries for our website and nothing else,", 'start': 418.915, 'duration': 7.025}, {'end': 428.141, 'text': 'just the article headlines and the summaries.', 'start': 425.94, 'duration': 2.201}, {'end': 436.407, 'text': "so in this example it's just gonna be article 1 headline and its summary text and then the article 2 headline and its summary text.", 'start': 428.141, 'duration': 8.266}, {'end': 446.092, 'text': "so I'm going to open up a blank file here and I just called this file rhtml-demo, and now we can parse the HTML within this script.", 'start': 436.407, 'duration': 9.685}, {'end': 448.673, 'text': 'Now we can parse HTML in multiple ways.', 'start': 446.552, 'duration': 2.121}, {'end': 455.937, 'text': "So we can either use an HTML session to go out and pull HTML from a website, and we'll see how to do that in just 
a minute,", 'start': 449.014, 'duration': 6.923}, {'end': 458.238, 'text': 'but we can also just parse HTML directly.', 'start': 455.937, 'duration': 2.301}], 'summary': 'Parsing article headlines and summaries from html for website content.', 'duration': 39.323, 'max_score': 418.915, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g418915.jpg'}, {'end': 677.743, 'src': 'embed', 'start': 626.509, 'weight': 5, 'content': [{'end': 631.314, 'text': "Now we're going to take a look at that further in just a bit, but let's not worry about it right now.", 'start': 626.509, 'duration': 4.805}, {'end': 637.041, 'text': "Okay, so now let's find out how we can parse out that information that we want from this HTML.", 'start': 632.275, 'duration': 4.766}, {'end': 641.624, 'text': "So let's say that we wanted to grab the title of our HTML page.", 'start': 637.481, 'duration': 4.143}, {'end': 652.19, 'text': 'So if I look at the title tag of this HTML up here at the top, then we should get this here where it says test dash a sample website.', 'start': 641.964, 'duration': 10.226}, {'end': 657.493, 'text': 'So in order to get the title, we can simply do something like this.', 'start': 652.67, 'duration': 4.823}, {'end': 663.197, 'text': "So I'm going to say match is equal to and I'm going to do HTML dot find.", 'start': 657.873, 'duration': 5.324}, {'end': 672.521, 'text': 'And with this find method we just want to find title okay, and now we can print out that match.', 'start': 663.697, 'duration': 8.824}, {'end': 677.743, 'text': 'so if I save that and run it, then we can see that it prints out this list of elements here.', 'start': 672.521, 'duration': 5.222}], 'summary': 'Demonstrating how to parse html to extract the title using the find method.', 'duration': 51.234, 'max_score': 626.509, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g626509.jpg'}, {'end': 
747.359, 'src': 'embed', 'start': 720.587, 'weight': 7, 'content': [{'end': 728.03, 'text': 'and now I just have this element, and with these elements we have access to a lot of the same attributes and methods that we had before.', 'start': 720.587, 'duration': 7.443}, {'end': 730.532, 'text': 'so we could find additional elements within here.', 'start': 728.03, 'duration': 2.502}, {'end': 735.614, 'text': 'if this were nested, we can view the HTML or we can simply view the text.', 'start': 730.532, 'duration': 5.082}, {'end': 742.177, 'text': 'so if I wanted the HTML, then I could just say print the HTML of that element.', 'start': 735.614, 'duration': 6.563}, {'end': 747.359, 'text': 'if I save that and run it, then we can see that we get the HTML of that title tag.', 'start': 742.177, 'duration': 5.182}], 'summary': 'Access attributes and methods of html elements, view html/text, print element html.', 'duration': 26.772, 'max_score': 720.587, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g720587.jpg'}], 'start': 0.229, 'title': 'Web scraping and html parsing with python', 'summary': 'Explains how to use the requests-html library to scrape websites and parse html using python. it covers scraping specific information, parsing dynamic data, making asynchronous requests, creating csv files, and extracting post titles, summaries, and video links. 
additionally, it demonstrates opening and parsing html files, accessing html and text content, locating elements, and accessing attributes and methods of html elements.', 'chapters': [{'end': 436.407, 'start': 0.229, 'title': 'Web scraping with requests-html', 'summary': 'The chapter explains how to use the requests-html library to scrape websites, showcasing its ability to pull specific information from a webpage and mentioning the capability to parse dynamic data generated by javascript, make asynchronous requests, and create csv files, with the example of extracting post titles, summaries, and video links from a personal website, and the process of parsing html for article headlines and summaries.', 'duration': 436.178, 'highlights': ['The Requests-HTML library is used to scrape websites, showcasing its ability to pull specific information from a webpage. The chapter explains how to use the Requests-HTML library to scrape websites, showcasing its ability to pull specific information from a webpage.', 'The Requests-HTML library can parse dynamic data generated by JavaScript, make asynchronous requests, and create CSV files. The chapter mentions the capability of the Requests-HTML library to parse dynamic data generated by JavaScript, make asynchronous requests, and create CSV files.', 'An example of extracting post titles, summaries, and video links from a personal website is demonstrated. The chapter demonstrates the example of extracting post titles, summaries, and video links from a personal website.', 'The process of parsing HTML for article headlines and summaries is explained. 
The chapter explains the process of parsing HTML for article headlines and summaries.']}, {'end': 785.733, 'start': 436.407, 'title': 'Parsing html with python', 'summary': 'Demonstrates parsing html using python, covering opening and parsing html files, accessing html and text content, using find method to locate elements, and accessing attributes and methods of html elements.', 'duration': 349.326, 'highlights': ['The chapter covers opening and parsing HTML files, demonstrating how to access and parse the HTML content.', 'Demonstrates accessing HTML and text content from the parsed HTML file, showcasing the extraction of text without tags.', 'Explains the usage of the find method to locate specific elements within the HTML, providing an example of finding the title of the HTML page.', 'Illustrates accessing attributes and methods of HTML elements, showcasing the retrieval of HTML and text of a specific element.']}], 'duration': 785.504, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g229.jpg', 'highlights': ['The Requests-HTML library is used to scrape websites, showcasing its ability to pull specific information from a webpage.', 'The Requests-HTML library can parse dynamic data generated by JavaScript, make asynchronous requests, and create CSV files.', 'An example of extracting post titles, summaries, and video links from a personal website is demonstrated.', 'The process of parsing HTML for article headlines and summaries is explained.', 'The chapter covers opening and parsing HTML files, demonstrating how to access and parse the HTML content.', 'Demonstrates accessing HTML and text content from the parsed HTML file, showcasing the extraction of text without tags.', 'Explains the usage of the find method to locate specific elements within the HTML, providing an example of finding the title of the HTML page.', 'Illustrates accessing attributes and methods of HTML elements, showcasing the retrieval of HTML 
and text of a specific element.']}, {'end': 1230.858, 'segs': [{'end': 816.621, 'src': 'embed', 'start': 786.033, 'weight': 0, 'content': [{'end': 789.616, 'text': "It'll just be the first element that gets found with that search.", 'start': 786.033, 'duration': 3.583}, {'end': 793.579, 'text': 'So if I save that and run it, then you can see that we get the exact same thing.', 'start': 789.916, 'duration': 3.663}, {'end': 797.202, 'text': 'Now, like I said, the find method uses CSS selectors.', 'start': 793.979, 'duration': 3.223}, {'end': 802.468, 'text': 'So if we wanted to get an element by a certain ID, then we can use the pound sign.', 'start': 797.603, 'duration': 4.865}, {'end': 811.617, 'text': 'So for example, if we wanted to grab the div with the ID of footer, then I could simply say pound sign footer.', 'start': 802.788, 'duration': 8.829}, {'end': 816.621, 'text': "and now it's going to return the element that has the ID of footer.", 'start': 812.157, 'duration': 4.464}], 'summary': 'Using css selectors, the find method returns the first element found. 
pound sign can be used to select elements by id.', 'duration': 30.588, 'max_score': 786.033, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g786033.jpg'}, {'end': 883.387, 'src': 'embed', 'start': 857.572, 'weight': 3, 'content': [{'end': 863.315, 'text': "And now I'm going to click on inspect and that will allow us to inspect this element.", 'start': 857.572, 'duration': 5.743}, {'end': 866.978, 'text': 'So this inspection popped up here on the right side.', 'start': 863.796, 'duration': 3.182}, {'end': 871.38, 'text': 'Let me see if I can make this text a little bit larger here so that we can see.', 'start': 867.038, 'duration': 4.342}, {'end': 878.144, 'text': "So now we can see that we have all of our HTML here, but the one that's highlighted is the one that we right clicked and inspected.", 'start': 871.78, 'duration': 6.364}, {'end': 883.387, 'text': 'So we can see whenever I hover over this, it actually highlights that in the browser.', 'start': 878.504, 'duration': 4.883}], 'summary': 'Demonstrating how to inspect elements, highlighting html in browser.', 'duration': 25.815, 'max_score': 857.572, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g857572.jpg'}, {'end': 930.607, 'src': 'embed', 'start': 898.624, 'weight': 1, 'content': [{'end': 903.627, 'text': 'So like we saw before, our article headlines are within a div with a class of article.', 'start': 898.624, 'duration': 5.003}, {'end': 907.129, 'text': 'And then we have an h2 tag and then an anchor tag.', 'start': 904.148, 'duration': 2.981}, {'end': 909.951, 'text': "So first let's grab the article div.", 'start': 907.53, 'duration': 2.421}, {'end': 916.716, 'text': "So I'm going to close this and go back to the HTML or the scraper here.", 'start': 910.231, 'duration': 6.485}, {'end': 921.679, 'text': "And now instead of match, I'm going to call this article instead.", 'start': 917.156, 
'duration': 4.523}, {'end': 930.607, 'text': "and now in the find method here I'm going to find a div with a class of article.", 'start': 922.96, 'duration': 7.647}], 'summary': 'The transcript discusses locating article divs in html using specific class and tag names.', 'duration': 31.983, 'max_score': 898.624, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g898624.jpg'}, {'end': 1041.26, 'src': 'embed', 'start': 1010.72, 'weight': 6, 'content': [{'end': 1014.783, 'text': 'So if I save that and run it, then you can see that we have those matched elements.', 'start': 1010.72, 'duration': 4.063}, {'end': 1023.248, 'text': 'If I just wanted the text, then I could either say headline.text here, or I could even just add it onto the end of my query here.', 'start': 1015.383, 'duration': 7.865}, {'end': 1024.949, 'text': "So I'll say .", 'start': 1023.388, 'duration': 1.561}, {'end': 1027.011, 'text': 'text after that find, .', 'start': 1024.949, 'duration': 2.062}, {'end': 1029.212, 'text': 'text after that find, save that and run it.', 'start': 1027.011, 'duration': 2.201}, {'end': 1034.457, 'text': 'And now we can see we got the Article 1 headline and the summary for Article 1.', 'start': 1029.571, 'duration': 4.886}, {'end': 1041.26, 'text': 'Okay, so now that we have this information from one article, we can most likely reuse this to parse the information from all of the articles.', 'start': 1034.457, 'duration': 6.803}], 'summary': 'Parsing and saving matched elements, extracting article 1 headline and summary.', 'duration': 30.54, 'max_score': 1010.72, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1010720.jpg'}, {'end': 1086.071, 'src': 'embed', 'start': 1058.027, 'weight': 8, 'content': [{'end': 1060.308, 'text': "I'm going to take away that first equals true.", 'start': 1058.027, 'duration': 2.281}, {'end': 1062.89, 'text': 'So now that should return 
a list of articles.', 'start': 1060.788, 'duration': 2.102}, {'end': 1070.438, 'text': "So now let's loop over those articles and reuse the same code that we had before to access the headline and the summary.", 'start': 1063.591, 'duration': 6.847}, {'end': 1086.071, 'text': "So right underneath here I'm just going to say for article in articles, and then we will use this same code here to find the headline and the summary.", 'start': 1070.818, 'duration': 15.253}], 'summary': 'The code has been modified to return a list of articles and loop over them to access the headline and summary.', 'duration': 28.044, 'max_score': 1058.027, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1058027.jpg'}, {'end': 1135.048, 'src': 'embed', 'start': 1108.421, 'weight': 2, 'content': [{'end': 1113.943, 'text': 'okay. so now we got the headline and the summary for every article in our simple.html file here.', 'start': 1108.421, 'duration': 5.522}, {'end': 1114.884, 'text': "so that's good.", 'start': 1113.943, 'duration': 0.941}, {'end': 1119.405, 'text': "so we're starting to see how this would be useful for getting information from websites.", 'start': 1114.884, 'duration': 4.521}, {'end': 1123.145, 'text': "so now let's do something similar, but with an actual website.", 'start': 1119.405, 'duration': 3.74}, {'end': 1128.066, 'text': 'so I have my personal website pulled up here in the browser that we saw before.', 'start': 1123.145, 'duration': 4.921}, {'end': 1135.048, 'text': 'let me make this a little larger here and, like I said, I have a list of posts here and all of these posts have a title,', 'start': 1128.066, 'duration': 6.982}], 'summary': 'Demonstrating extraction of headline and summary from simple.html file, moving on to extracting information from personal website.', 'duration': 26.627, 'max_score': 1108.421, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1108421.jpg'}, {'end': 1177.455, 'src': 'embed', 'start': 1151.36, 'weight': 7, 'content': [{'end': 1156.346, 'text': "When we're grabbing data from a URL, we need to use something like an HTML session.", 'start': 1151.36, 'duration': 4.986}, {'end': 1160.01, 'text': 'So let me make this a little larger here.', 'start': 1156.686, 'duration': 3.324}, {'end': 1166.532, 'text': "And up at the top, instead of just importing HTML, I'm also going to import HTML session.", 'start': 1160.45, 'duration': 6.082}, {'end': 1170.513, 'text': "And now let's get the source code from my website.", 'start': 1167.632, 'duration': 2.881}, {'end': 1177.455, 'text': "So I'm just going to comment out this with open that we have here before because I'm going to do one more thing with this file later.", 'start': 1170.833, 'duration': 6.622}], 'summary': 'Using html session to grab data from url and comment out file operation for later use.', 'duration': 26.095, 'max_score': 1151.36, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1151360.jpg'}], 'start': 786.033, 'title': 'Using css selectors and web scraping with python', 'summary': 'Highlights the use of css selectors to locate elements and web scraping with python, including finding article headlines and summaries from html, using find methods and css selectors, and accessing information from a website using python.', 'chapters': [{'end': 878.144, 'start': 786.033, 'title': 'Using css selectors to locate elements', 'summary': 'Highlights the use of css selectors to locate elements by id, finding article headlines and summaries from html, and using the inspect feature in chrome to locate specific elements for web scraping.', 'duration': 92.111, 'highlights': ['The find method uses CSS selectors to locate elements by ID, demonstrated by using the pound sign to grab a div with the ID of 
footer.', 'Demonstrated the method of using the inspect feature in Chrome to locate specific elements for web scraping by right-clicking on the desired element and selecting inspect.', 'Explained the necessity of avoiding sifting through all the HTML code when scraping larger sites due to potential complexity.']}, {'end': 1034.457, 'start': 878.504, 'title': 'Web scraping with python', 'summary': 'Explains how to use python to scrape and extract specific elements from a webpage, such as grabbing headlines and summaries within a div with a class of article, using find methods and css selectors, and printing out the matched elements.', 'duration': 155.953, 'highlights': ['The chapter explains how to use Python to scrape and extract specific elements from a webpage It provides a practical demonstration of using Python to scrape and extract specific elements from a webpage, demonstrating the practical application of the topic.', 'grabbing headlines and summaries within a div with a class of article The tutorial demonstrates the process of grabbing headlines and summaries within a div with a class of article, showcasing practical implementation.', 'using find methods and CSS selectors The tutorial illustrates the usage of find methods and CSS selectors to locate specific elements within the webpage, providing a practical example of these techniques.', 'printing out the matched elements It demonstrates the process of printing out the matched elements, showcasing how to access and display the extracted information from the webpage.']}, {'end': 1230.858, 'start': 1034.457, 'title': 'Parsing website data with python', 'summary': 'Demonstrates how to use python to parse information from a website, reusing code to access headlines and summaries from articles, and utilizing the request library to obtain source code from a personal website.', 'duration': 196.401, 'highlights': ['The chapter demonstrates how to use Python to parse information from a website It shows the process 
of extracting information from a website using Python, demonstrating practical application.', 'Reusing code to access headlines and summaries from articles The code is modified to retrieve headlines and summaries from all articles, instead of just the first one, improving efficiency and scalability.', "Utilizing the request library to obtain source code from a personal website The HTML session and 'session.get' method are used to fetch the source code from the author's personal website, showcasing a real-world application of web scraping."]}], 'duration': 444.825, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g786033.jpg', 'highlights': ['The find method uses CSS selectors to locate elements by ID, demonstrated by using the pound sign to grab a div with the ID of footer.', 'The chapter explains how to use Python to scrape and extract specific elements from a webpage.', 'The chapter demonstrates how to use Python to parse information from a website.', 'Demonstrated the method of using the inspect feature in Chrome to locate specific elements for web scraping by right-clicking on the desired element and selecting inspect.', 'grabbing headlines and summaries within a div with a class of article The tutorial demonstrates the process of grabbing headlines and summaries within a div with a class of article, showcasing practical implementation.', 'using find methods and CSS selectors The tutorial illustrates the usage of find methods and CSS selectors to locate specific elements within the webpage, providing a practical example of these techniques.', 'printing out the matched elements It demonstrates the process of printing out the matched elements, showcasing how to access and display the extracted information from the webpage.', 'Explained the necessity of avoiding sifting through all the HTML code when scraping larger sites due to potential complexity.', 'Reusing code to access headlines and summaries from 
articles The code is modified to retrieve headlines and summaries from all articles, instead of just the first one, improving efficiency and scalability.', "Utilizing the request library to obtain source code from a personal website The HTML session and 'session.get' method are used to fetch the source code from the author's personal website, showcasing a real-world application of web scraping."]}, {'end': 1631.143, 'segs': [{'end': 1259.792, 'src': 'embed', 'start': 1230.858, 'weight': 0, 'content': [{'end': 1236.18, 'text': 'then you can watch my video where I go into more detail about these request and response objects.', 'start': 1230.858, 'duration': 5.322}, {'end': 1239.861, 'text': "So I'll leave a link to that video in the description section below if anyone is interested.", 'start': 1236.48, 'duration': 3.381}, {'end': 1244.102, 'text': "But what we're interested in for this video is the HTML attribute.", 'start': 1240.281, 'duration': 3.821}, {'end': 1249.945, 'text': "So I'm going to print r.html.", 'start': 1244.442, 'duration': 5.503}, {'end': 1259.792, 'text': 'if I save that and run it, then, when I printed that out, that HTML attribute gives us access to the HTML object for that website.', 'start': 1249.945, 'duration': 9.847}], 'summary': "The video discusses accessing the html object using the 'r.html' attribute.", 'duration': 28.934, 'max_score': 1230.858, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1230858.jpg'}, {'end': 1388.924, 'src': 'heatmap', 'start': 1294.998, 'weight': 4, 'content': [{'end': 1301.106, 'text': "So just like I did in the simple example, I'm going to right click on this headline and then go to inspect.", 'start': 1294.998, 'duration': 6.108}, {'end': 1310.277, 'text': 'And now we can see here our article headline is in an H2 tag with a class of entry title.', 'start': 1301.907, 'duration': 8.37}, {'end': 1313.221, 'text': "OK, so that's how we're going to find 
that.", 'start': 1310.577, 'duration': 2.644}, {'end': 1324.929, 'text': 'if I right click on the description here and inspect that, then we can see that this is a paragraph tag inside of a div with the class entry content.', 'start': 1313.681, 'duration': 11.248}, {'end': 1332.775, 'text': 'now both the heading and the summary are both inside of this article tag here.', 'start': 1324.929, 'duration': 7.846}, {'end': 1336.097, 'text': 'so this article tag is for one post.', 'start': 1332.775, 'duration': 3.322}, {'end': 1341.16, 'text': 'so if I hover over that, then you can see it just highlights that first post, but not the second one.', 'start': 1336.097, 'duration': 5.063}, {'end': 1343.781, 'text': 'If I hover over the second one, then it highlights that post.', 'start': 1341.28, 'duration': 2.501}, {'end': 1351.002, 'text': "So first, let's just grab that entire first article that contains all of the information that we want.", 'start': 1344.221, 'duration': 6.781}, {'end': 1356.863, 'text': 'So to grab that first article in the source code, let me go back to our script here.', 'start': 1351.402, 'duration': 5.461}, {'end': 1362.264, 'text': "We can simply just say I'm going to overwrite this print statement here.", 'start': 1357.523, 'duration': 4.741}, {'end': 1371.728, 'text': "I'm going to say article is equal to r.html.find, And we are going to find that article tag.", 'start': 1362.264, 'duration': 9.464}, {'end': 1375.472, 'text': "And I just want the first one for now while we're messing with this.", 'start': 1372.329, 'duration': 3.143}, {'end': 1377.754, 'text': 'So I will say first is equal to true.', 'start': 1375.872, 'duration': 1.882}, {'end': 1382.798, 'text': 'And now I can print out that article.html.', 'start': 1378.234, 'duration': 4.564}, {'end': 1384.6, 'text': "So I'll save that and run it.", 'start': 1382.998, 'duration': 1.602}, {'end': 1388.924, 'text': 'Okay, and this gives us all of the HTML of that first article.', 'start': 
1385.04, 'duration': 3.884}], 'summary': 'Demonstrates using python to extract specific html elements from a web page.', 'duration': 29.931, 'max_score': 1294.998, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1294998.jpg'}, {'end': 1473.771, 'src': 'heatmap', 'start': 1414.789, 'weight': 2, 'content': [{'end': 1422.831, 'text': "So this isn't the neatest HTML here, but if we read through this, then we will be able to see that it's the HTML from that first article.", 'start': 1414.789, 'duration': 8.042}, {'end': 1428.053, 'text': 'so just to see this a bit better, I will go ahead and format this so that we can read it.', 'start': 1422.831, 'duration': 5.222}, {'end': 1435.514, 'text': "I have an online formatter pulled up here in my browser and I'm just going to use this to pretty up our HTML.", 'start': 1428.053, 'duration': 7.461}, {'end': 1441.696, 'text': "so I'm going to paste that into the HTML input part and then click on beautify and you can see that over here.", 'start': 1435.514, 'duration': 6.182}, {'end': 1443.416, 'text': 'it formats it nicely.', 'start': 1441.696, 'duration': 1.72}, {'end': 1458.064, 'text': "so now I'm going to copy that prettied up HTML and paste this in here to sublime and also let me set the syntax as HTML.", 'start': 1443.416, 'duration': 14.648}, {'end': 1461.546, 'text': 'okay, so now we have our pretty printed HTML.', 'start': 1458.064, 'duration': 3.482}, {'end': 1467.868, 'text': 'okay, so now that we have the HTML for this article now we can figure out how we want to grab what we want to grab.', 'start': 1461.546, 'duration': 6.322}, {'end': 1473.771, 'text': 'so we want to grab the headline summary and YouTube video link from this article here.', 'start': 1467.868, 'duration': 5.903}], 'summary': 'Formatting html using online tool for better readability and syntax, to extract headline, summary, and youtube video link.', 'duration': 30.355, 'max_score': 1414.789, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1414789.jpg'}, {'end': 1571.268, 'src': 'heatmap', 'start': 1486.395, 'weight': 0.74, 'content': [{'end': 1490.797, 'text': 'actually, this would probably be more readable if I turn on word wrap.', 'start': 1486.395, 'duration': 4.402}, {'end': 1498.5, 'text': 'okay, but we can see that this is the link here and the text of that link is the article headline.', 'start': 1490.797, 'duration': 7.703}, {'end': 1500.741, 'text': 'and actually this link has its own class.', 'start': 1498.5, 'duration': 2.241}, {'end': 1502.562, 'text': 'so that makes it a little easier on us.', 'start': 1500.741, 'duration': 1.821}, {'end': 1504.743, 'text': 'it says entry title link.', 'start': 1502.562, 'duration': 2.181}, {'end': 1507.824, 'text': "so let's just grab the text of that class.", 'start': 1504.743, 'duration': 3.081}, {'end': 1516.668, 'text': "so I'm going to go back to our script here and now, instead of printing out that article.html, I'm going to use it to find our headline.", 'start': 1507.824, 'duration': 8.844}, {'end': 1527.514, 'text': "so I'm going to say headline is equal to article dot find and within that article we want to find a class of entry.", 'start': 1516.668, 'duration': 10.846}, {'end': 1530.655, 'text': "I can actually just go back to the HTML here, so I don't screw it up.", 'start': 1527.514, 'duration': 3.141}, {'end': 1532.056, 'text': 'entry title link.', 'start': 1530.655, 'duration': 1.401}, {'end': 1538.02, 'text': 'I will copy that and paste that in and we just want to grab the first result.', 'start': 1532.056, 'duration': 5.964}, {'end': 1545.104, 'text': "so we'll say first is equal to true and we just want to grab the text from that match.", 'start': 1538.02, 'duration': 7.084}, {'end': 1554.092, 'text': 'so now if I print out headline, if I save that and run it, then we can see that we got the first headline from the first post.', 
'start': 1545.104, 'duration': 8.988}, {'end': 1560.541, 'text': "okay, so now that we got the title of my latest post, now let's get the summary text for that post.", 'start': 1554.092, 'duration': 6.449}, {'end': 1571.268, 'text': "so let's go back to the HTML for our article here and let me scroll down until I see what looks like the description, and this is it right here.", 'start': 1560.541, 'duration': 10.727}], 'summary': 'Extracted headline and summary text from html using python script.', 'duration': 84.873, 'max_score': 1486.395, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1486395.jpg'}, {'end': 1631.143, 'src': 'embed', 'start': 1609.184, 'weight': 3, 'content': [{'end': 1618.171, 'text': "so now, if I save that and run it, then now we have our headline here and then right below this, it's a bit bunched together,", 'start': 1609.184, 'duration': 8.987}, {'end': 1621.314, 'text': 'but this is the description of that first post.', 'start': 1618.171, 'duration': 3.143}, {'end': 1628.04, 'text': "now again, the syntax that we're using inside of this find method is the same syntax that you would use in CSS.", 'start': 1621.314, 'duration': 6.726}, {'end': 1631.143, 'text': "so if you're not familiar with how that works, then that's where that comes from.", 'start': 1628.04, 'duration': 3.103}], 'summary': 'Using css syntax for find method to display headline and post description.', 'duration': 21.959, 'max_score': 1609.184, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1609184.jpg'}], 'start': 1230.858, 'title': 'Web scraping with python', 'summary': "Covers working with html attributes in python using the 'r.html' attribute and web scraping for video information, emphasizing the use of python's requests-html library and css syntax to extract specific elements and text content.", 'chapters': [{'end': 1275.904, 'start': 1230.858, 'title': 
'Working with html attributes in python', 'summary': "Explains how to access the html object using the 'r.html' attribute in python, providing a way to interact with and search for elements on a website.", 'duration': 45.046, 'highlights': ["The 'r.html' attribute in Python provides access to the HTML object for a website, allowing interaction and element search.", "Accessing the 'r.html' attribute in Python is demonstrated as a means to work with the HTML object on a website."]}, {'end': 1631.143, 'start': 1276.225, 'title': 'Web scraping for video information', 'summary': "Discusses the process of extracting headline and summary information from a website's html structure using python's requests-html library, emphasizing the use of css syntax to locate specific elements and retrieve their text content.", 'duration': 354.918, 'highlights': ["The chapter discusses the process of extracting headline and summary information from a website's HTML structure. The speaker explains the process of using Python's requests-html library to extract headline and summary information from a website's HTML structure.", 'Emphasizes the use of CSS syntax to locate specific elements and retrieve their text content. The speaker emphasizes the importance of using CSS syntax to locate specific elements and extract their text content, demonstrating how to use CSS selectors within the requests-html library.', 'Demonstrates accessing specific elements such as headline and summary using CSS class and tag selectors. 
The speaker demonstrates how to access specific elements like headline and summary by using CSS class and tag selectors, providing examples of the code used for this purpose.']}], 'duration': 400.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1230858.jpg', 'highlights': ["The 'r.html' attribute in Python provides access to the HTML object for a website, allowing interaction and element search.", "Accessing the 'r.html' attribute in Python is demonstrated as a means to work with the HTML object on a website.", "The chapter discusses the process of extracting headline and summary information from a website's HTML structure.", 'Emphasizes the use of CSS syntax to locate specific elements and retrieve their text content.', 'Demonstrates accessing specific elements such as headline and summary using CSS class and tag selectors.']}, {'end': 2080.672, 'segs': [{'end': 1660.476, 'src': 'embed', 'start': 1631.463, 'weight': 0, 'content': [{'end': 1636.045, 'text': 'Okay. 
so, lastly, we need to get the link to the video for that post.', 'start': 1631.463, 'duration': 4.582}, {'end': 1638.786, 'text': 'Now this one is going to be a little more difficult,', 'start': 1636.405, 'duration': 2.381}, {'end': 1647.53, 'text': 'but I wanted to show you this because sometimes parsing information can be a bit ugly and require you to take several steps before getting your final desired result.', 'start': 1638.786, 'duration': 8.744}, {'end': 1655.654, 'text': "So if we look back at the HTML of the article, then let's see if we can find where this video is.", 'start': 1647.93, 'duration': 7.724}, {'end': 1660.476, 'text': "So it's actually down here within this iframe.", 'start': 1656.014, 'duration': 4.462}], 'summary': 'The task involves finding the video link in the html, which may require multiple steps.', 'duration': 29.013, 'max_score': 1631.463, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1631463.jpg'}, {'end': 1731.232, 'src': 'embed', 'start': 1703.693, 'weight': 1, 'content': [{'end': 1714.199, 'text': 'If we inspect this iframe here, if we look at the source attribute here, this source has a link to the embedded version of the video.', 'start': 1703.693, 'duration': 10.506}, {'end': 1717.181, 'text': "But it's not a direct link to the video itself.", 'start': 1714.659, 'duration': 2.522}, {'end': 1726.788, 'text': 'We can see it goes to youtube.com forward slash embed, forward slash this video ID here, and then it has a long URL after that.', 'start': 1717.541, 'duration': 9.247}, {'end': 1731.232, 'text': 'But if you know how YouTube videos work, they all have an ID for the video.', 'start': 1727.109, 'duration': 4.123}], 'summary': 'The iframe source links to the embedded video on youtube.com with a unique video id.', 'duration': 27.539, 'max_score': 1703.693, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1703693.jpg'}, {'end': 1786.956, 'src': 'embed', 'start': 1757.886, 'weight': 4, 'content': [{'end': 1759.287, 'text': 'So this is pretty simple to do.', 'start': 1757.886, 'duration': 1.401}, {'end': 1772.285, 'text': 'So instead of using this HTML of this vid source element here, we can instead access the attributes by saying vid source dot ATTRS for attributes.', 'start': 1759.767, 'duration': 12.518}, {'end': 1779.912, 'text': 'So if I run that, then we get a dictionary of the attributes for that iframe element.', 'start': 1772.645, 'duration': 7.267}, {'end': 1786.956, 'text': 'So to grab the source, we can just access that like any other Python dictionary.', 'start': 1780.292, 'duration': 6.664}], 'summary': 'Access vid source attributes using python dictionary for iframes.', 'duration': 29.07, 'max_score': 1757.886, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1757886.jpg'}, {'end': 1899.777, 'src': 'embed', 'start': 1874.292, 'weight': 2, 'content': [{'end': 1884.441, 'text': 'so if I save that and run it, then now we have a list of values from our string that were split on that forward slash.', 'start': 1874.292, 'duration': 10.149}, {'end': 1888.846, 'text': "Now, if you've never used the split method on a string, basically like I said,", 'start': 1884.741, 'duration': 4.105}, {'end': 1893.851, 'text': 'it just splits the string into a list of values based on the character that you specify.', 'start': 1888.846, 'duration': 5.005}, {'end': 1899.777, 'text': 'So now we can see that our URL is broken into several parts based on where the forward slashes were.', 'start': 1894.171, 'duration': 5.606}], 'summary': 'Using the split method, the url string is broken into a list of values based on the forward slash.', 'duration': 25.485, 'max_score': 1874.292, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1874292.jpg'}, {'end': 2036.813, 'src': 'embed', 'start': 2006.57, 'weight': 3, 'content': [{'end': 2009.974, 'text': 'so I wanted to show you how you might go about getting the data that you want.', 'start': 2006.57, 'duration': 3.404}, {'end': 2018.08, 'text': 'Okay, so now that we have that YouTube ID now we can create our own YouTube link using that video ID.', 'start': 2010.595, 'duration': 7.485}, {'end': 2022.183, 'text': 'So the way that YouTube links are formatted are like this.', 'start': 2018.481, 'duration': 3.702}, {'end': 2026.206, 'text': 'Let me make this a little smaller here so we have some more room.', 'start': 2022.223, 'duration': 3.983}, {'end': 2029.969, 'text': "So I'm going to remove that print statement there.", 'start': 2026.746, 'duration': 3.223}, {'end': 2032.691, 'text': "And now I'm going to say YouTube link.", 'start': 2030.409, 'duration': 2.282}, {'end': 2036.813, 'text': "is equal to, and I'm just going to make this an f-string.", 'start': 2033.431, 'duration': 3.382}], 'summary': 'Demonstrating how to create a youtube link using a video id.', 'duration': 30.243, 'max_score': 2006.57, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g2006570.jpg'}], 'start': 1631.463, 'title': 'Extracting and parsing video links and html attributes', 'summary': 'Covers parsing and extracting video links from html iframes, accessing and manipulating html attributes using python, and extracting youtube video ids from urls, showcasing the complexity of information parsing and demonstrating practical python coding techniques.', 'chapters': [{'end': 1757.666, 'start': 1631.463, 'title': 'Extracting video link from html', 'summary': 'Explains the process of parsing and extracting the embedded video link from an html iframe, including identifying the source link to the video and extracting the video id from 
the url, showcasing the complexity of information parsing.', 'duration': 126.203, 'highlights': ['Identifying the source link to the video The chapter demonstrates the process of identifying the source link to the embedded video within an HTML iframe, showcasing the initial step in extracting the video link.', 'Extracting the video ID from the URL It explains the process of extracting the video ID from the URL, emphasizing the importance of understanding URL parameters and specifying the method to extract the ID before the query string.', 'Showcasing the complexity of information parsing The chapter underscores the complexity of information parsing by illustrating the multi-step process required for extracting the desired video link, highlighting the challenges of parsing information from HTML.']}, {'end': 1899.777, 'start': 1757.886, 'title': 'Accessing and parsing html attributes', 'summary': 'Demonstrates how to access and manipulate html attributes using python code, including splitting a url string based on forward slashes and using the split method to obtain a list of values.', 'duration': 141.891, 'highlights': ['The chapter showcases accessing HTML attributes using Python code, obtaining a dictionary of attributes for an iframe element, and extracting the source key to retrieve a YouTube link.', 'Demonstrating the process of parsing a URL string to obtain the ID of a video, including breaking the URL into several parts based on forward slashes and using the split method to create a list of values.', 'Explaining the functionality of the split method on a string, which divides the string into a list of values based on the specified character.']}, {'end': 2080.672, 'start': 1900.317, 'title': 'Extracting youtube video id and creating youtube link', 'summary': 'Explains the process of extracting a youtube video id from a url by splitting the url and accessing specific indexes, then creating a youtube link using the extracted video id. 
it also mentions the use of f-strings in python for formatting the youtube link.', 'duration': 180.355, 'highlights': ['The process involves extracting a YouTube video ID by accessing specific indexes after splitting the URL, followed by creating a YouTube link using the extracted video ID.', 'It demonstrates the use of f-strings in Python for formatting the YouTube link, which is particularly useful in Python 3.6 and above.', 'The explanation emphasizes the importance of parsing website source code to obtain desired information, highlighting the complexity involved in accessing specific data from web sources.']}], 'duration': 449.209, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g1631463.jpg', 'highlights': ['Covers parsing and extracting video links from html iframes, showcasing the complexity of information parsing and demonstrating practical python coding techniques.', 'The chapter demonstrates the process of identifying the source link to the embedded video within an HTML iframe, showcasing the initial step in extracting the video link.', 'Explaining the functionality of the split method on a string, which divides the string into a list of values based on the specified character.', 'The process involves extracting a YouTube video ID by accessing specific indexes after splitting the URL, followed by creating a YouTube link using the extracted video ID.', 'The chapter showcases accessing HTML attributes using Python code, obtaining a dictionary of attributes for an iframe element, and extracting the source key to retrieve a YouTube link.']}, {'end': 2976.175, 'segs': [{'end': 2181.409, 'src': 'embed', 'start': 2152.315, 'weight': 0, 'content': [{'end': 2157.958, 'text': 'So now for each article that we found, we are parsing out the headline, and let me print that out.', 'start': 2152.315, 'duration': 5.643}, {'end': 2162.32, 'text': 'We are parsing out the summary, and I will uncomment out that print 
statement.', 'start': 2158.678, 'duration': 3.642}, {'end': 2167.682, 'text': "And then we're doing all of this parsing here to also grab a YouTube link.", 'start': 2162.7, 'duration': 4.982}, {'end': 2172.985, 'text': 'So now, and also let me put a blank print statement here at the bottom.', 'start': 2168.123, 'duration': 4.862}, {'end': 2176.227, 'text': 'So that we have some separation between these articles.', 'start': 2173.605, 'duration': 2.622}, {'end': 2181.409, 'text': 'Okay, so now if I run this, then we should be able to scroll up here.', 'start': 2176.627, 'duration': 4.782}], 'summary': 'Parsing articles to extract headline, summary, and youtube link.', 'duration': 29.094, 'max_score': 2152.315, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g2152315.jpg'}, {'end': 2338.742, 'src': 'embed', 'start': 2312.855, 'weight': 1, 'content': [{'end': 2318.581, 'text': 'And if it does run into an error parsing a video, then it will just skip that part of the post.', 'start': 2312.855, 'duration': 5.726}, {'end': 2323.906, 'text': "So to do this, I'm going to make my output a little smaller there.", 'start': 2318.981, 'duration': 4.925}, {'end': 2328.33, 'text': "So to do this, I'm just going to create a try/except block where we are.", 'start': 2324.286, 'duration': 4.044}, {'end': 2338.742, 'text': "trying to parse out the video and in the try section of this I'm going to take all of our code that parses the video and creates the link,", 'start': 2329.992, 'duration': 8.75}], 'summary': 'Wrapping the video parsing in a try/except block lets the script skip that part of a post when parsing fails.', 'duration': 25.887, 'max_score': 2312.855, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g2312855.jpg'}, {'end': 2703.576, 'src': 'embed', 'start': 2676.477, 'weight': 2, 'content': [{'end': 2682.039, 'text': "So first of all, it's a common thing to just want to get all of the links on a site.", 
'start': 2676.477, 'duration': 5.562}, {'end': 2687.26, 'text': "So perhaps you're writing a crawler and want to visit each page on a site or something like that.", 'start': 2682.399, 'duration': 4.861}, {'end': 2696.389, 'text': "Well, that's so common that there's actually a links attribute in the HTML object that has a set of all the links on a page.", 'start': 2687.74, 'duration': 8.649}, {'end': 2700.252, 'text': "So let's go back to our script here and see what this would look like.", 'start': 2696.749, 'duration': 3.503}, {'end': 2703.576, 'text': "Now I'm going to close out our output there.", 'start': 2700.653, 'duration': 2.923}], 'summary': 'Common to want to get all links on a site. html object has a links attribute with a set of all the links.', 'duration': 27.099, 'max_score': 2676.477, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g2676477.jpg'}], 'start': 2080.672, 'title': 'Web scraping and video parsing', 'summary': 'Covers scraping websites for articles, handling video parsing errors, and web scraping using request html, with practical examples and references to additional learning resources. it details the process of scraping a website for articles, looping through each article to extract headline, summary, and youtube link, handling errors when parsing videos, and showcasing web scraping using the request html library for data collection and report generation.', 'chapters': [{'end': 2288.226, 'start': 2080.672, 'title': 'Scraping website for articles', 'summary': 'Details the process of scraping a website for articles, looping through each article to extract headline, summary, and youtube link, and handling situations where articles are missing data, with a demonstration of parsing a page without a youtube video.', 'duration': 207.554, 'highlights': ['Demonstrates the process of looping through each article on the website to extract headline, summary, and YouTube link. 
Extracted headline, summary, and YouTube link for each article.', 'Addresses the issue of missing data by showcasing a demonstration of parsing a page without a YouTube video. Demonstrated parsing a page without a YouTube video associated with it.']}, {'end': 2635.82, 'start': 2288.826, 'title': 'Handling video parsing errors', 'summary': 'Discusses handling errors when parsing videos, implementing try-except blocks, and saving scraped information to a CSV file, with practical examples and references to additional learning resources.', 'duration': 346.994, 'highlights': ['Explaining try-except blocks to handle video parsing errors and demonstrating practical implementation with code examples. Handling errors when parsing videos, implementing try-except blocks, and providing a detailed example of the process.', 'Demonstrating how to save scraped information to a CSV file and providing a brief explanation of the process. Explanation and demonstration of saving scraped information to a CSV file, with practical coding examples.', 'Referencing additional learning resources for those unfamiliar with try-except blocks and CSV file handling. Providing references to additional learning resources for try-except blocks and CSV file handling.']}, {'end': 2976.175, 'start': 2638.721, 'title': 'Web scraping with Requests-HTML', 'summary': 'Demonstrates web scraping using the Requests-HTML library, showcasing how to extract data from websites, including obtaining all links on a site and grabbing text generated dynamically by JavaScript. 
It also emphasizes the utility of the library for data collection and report generation.', 'duration': 337.454, 'highlights': ['The chapter demonstrates how to extract all links on a website using the links attribute in the HTML object, making it easy to access and print the set of links, providing a useful feature for crawlers or site analysis.', 'It showcases the capability of Requests-HTML to grab text generated dynamically by JavaScript, enabling the extraction of data not readily accessible by other libraries such as Beautiful Soup, highlighting the unique functionality of Requests-HTML.', 'The utility of web scraping for data collection and report compilation is emphasized, showcasing the practical application of the techniques and tools demonstrated in the video, underlining its significance for various purposes.']}], 'duration': 895.503, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g2080672.jpg', 'highlights': ['Demonstrates the process of looping through each article on the website to extract headline, summary, and YouTube link. Extracted headline, summary, and YouTube link for each article.', 'Explaining try-except blocks to handle video parsing errors and demonstrating practical implementation with code examples. 
Handling errors when parsing videos, implementing try-except blocks, and providing a detailed example of the process.', 'The chapter demonstrates how to extract all links on a website using the links attribute in the HTML object, making it easy to access and print the set of links, providing a useful feature for crawlers or site analysis.']}, {'end': 3373.813, 'segs': [{'end': 3100.836, 'src': 'embed', 'start': 3075.431, 'weight': 1, 'content': [{'end': 3080.974, 'text': "Now, since this is running synchronously, it's going to go make the first request and wait for a response,", 'start': 3075.431, 'duration': 5.543}, {'end': 3086.177, 'text': 'then make the second request and wait for a response, and then make the third request and wait for the response.', 'start': 3080.974, 'duration': 5.203}, {'end': 3094.187, 'text': 'So we can imagine that this should probably take a little over six seconds since one plus two plus three is equal to six.', 'start': 3086.537, 'duration': 7.65}, {'end': 3096.11, 'text': 'And those are all of our delays.', 'start': 3094.528, 'duration': 1.582}, {'end': 3100.836, 'text': 'So if I run this, then we can see that we got the first website back.', 'start': 3096.49, 'duration': 4.346}], 'summary': 'Three synchronous requests with delays of 1, 2, and 3 seconds should take a little over six seconds in total.', 'duration': 25.405, 'max_score': 3075.431, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g3075431.jpg'}, {'end': 3262.84, 'src': 'embed', 'start': 3234.975, 'weight': 0, 'content': [{'end': 3238.216, 'text': "So that's about half of the time that it took to run synchronously.", 'start': 3234.975, 'duration': 3.241}, {'end': 3244.342, 'text': 'So if you have a lot of websites to parse, then doing it asynchronously could save you a ton of time.', 'start': 3238.676, 'duration': 5.666}, {'end': 3253.411, 'text': 'So if you can imagine you had to crawl 10 different APIs that took three seconds each to 
compute a response, then if you did that synchronously,', 'start': 3244.942, 'duration': 8.469}, {'end': 3255.232, 'text': 'then that could take over 30 seconds.', 'start': 3253.411, 'duration': 1.821}, {'end': 3258.515, 'text': 'But if you did it asynchronously, then it would take around three seconds.', 'start': 3255.553, 'duration': 2.962}, {'end': 3262.84, 'text': "So it's definitely something to think about if you're doing something like that.", 'start': 3258.716, 'duration': 4.124}], 'summary': 'Asynchronous parsing can save time, e.g., 10 apis in 3 secs.', 'duration': 27.865, 'max_score': 3234.975, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g3234975.jpg'}, {'end': 3295.882, 'src': 'embed', 'start': 3271.285, 'weight': 2, 'content': [{'end': 3279.231, 'text': 'Now, one thing I do want to mention is that if you want data from a large website like Twitter or Facebook or YouTube, or something like that,', 'start': 3271.285, 'duration': 7.946}, {'end': 3282.773, 'text': 'then it may be beneficial for you to see whether or not they have a public API.', 'start': 3279.231, 'duration': 3.542}, {'end': 3288.557, 'text': 'So public APIs allow those sites to serve up data to you in a more efficient way.', 'start': 3283.413, 'duration': 5.144}, {'end': 3292.94, 'text': "And sometimes they don't appreciate it if you try to scrape their data manually.", 'start': 3289.017, 'duration': 3.923}, {'end': 3295.882, 'text': "They'd rather you use an API instead.", 'start': 3293.42, 'duration': 2.462}], 'summary': 'Use public apis for efficient access to data from large websites like twitter, facebook, and youtube.', 'duration': 24.597, 'max_score': 3271.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g3271285.jpg'}, {'end': 3345.28, 'src': 'embed', 'start': 3312.937, 'weight': 3, 'content': [{'end': 3316.501, 'text': "So be aware that you might be bogging down 
someone's server if you aren't careful.", 'start': 3312.937, 'duration': 3.564}, {'end': 3318.322, 'text': 'So try to keep that in mind.', 'start': 3317.001, 'duration': 1.321}, {'end': 3324.747, 'text': 'So you know, after this tutorial, try not to go out and hammer my website with a ton of different requests through your program.', 'start': 3318.903, 'duration': 5.844}, {'end': 3327.208, 'text': 'And that goes for other websites as well.', 'start': 3325.607, 'duration': 1.601}, {'end': 3334.253, 'text': "And some websites will even monitor if they're getting hit quickly and can block your program or IP address if you're hitting them too fast.", 'start': 3327.609, 'duration': 6.644}, {'end': 3337.495, 'text': 'Some websites will actually try to block bots completely.', 'start': 3335.093, 'duration': 2.402}, {'end': 3345.28, 'text': "But that's another good thing about Requests-HTML is that it spoofs a user agent for us to make it seem like a real web browser.", 'start': 3338.215, 'duration': 7.065}], 'summary': 'Caution against overwhelming servers with requests; websites may block bots and IP addresses.', 'duration': 32.343, 'max_score': 3312.937, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g3312937.jpg'}], 'start': 2976.535, 'title': 'Requests-HTML and asynchronous requests', 'summary': 'Explores the capabilities of the Requests-HTML library, showcasing synchronous and asynchronous requests. The synchronous version takes around 6.6 seconds to complete due to sequential processing of delayed responses. Additionally, the chapter discusses the benefits of asynchronous web scraping with a hypothetical example: 10 different APIs that each take 3 seconds to respond would take over 30 seconds synchronously but around 3 seconds asynchronously. 
It also emphasizes the importance of checking for public APIs for larger websites while being considerate to avoid overloading servers.', 'chapters': [{'end': 3121.512, 'start': 2976.535, 'title': 'Requests-HTML and asynchronous requests', 'summary': 'Explores the capabilities of the Requests-HTML library, showcasing synchronous and asynchronous requests, with the synchronous version taking around 6.6 seconds to complete due to sequential processing of delayed responses.', 'duration': 144.977, 'highlights': ['The synchronous version of making requests to specific URLs took around 6.6 seconds to complete due to sequential processing of delayed responses. The synchronous requests to URLs with delayed responses of 1, 2, and 3 seconds took a little over 6 seconds to complete.', 'The chapter explores the capabilities of the Requests-HTML library and its ability to perform asynchronous requests, allowing for continued script execution while waiting for responses. The Requests-HTML library allows for asynchronous requests, enabling script execution to continue while waiting for responses, as opposed to synchronous requests that require waiting for each response before proceeding.']}, {'end': 3373.813, 'start': 3122.253, 'title': 'Asynchronous web scraping benefits', 'summary': 'Discusses the benefits of asynchronous web scraping, showing how it can save time by processing requests from multiple websites concurrently (hypothetically, 10 different APIs that each take 3 seconds would take over 30 seconds synchronously but around 3 seconds asynchronously), and highlights the importance of checking for public APIs for larger websites while being considerate to avoid overloading servers.', 'duration': 251.56, 'highlights': ['Asynchronous web scraping could reduce execution time from over 30 seconds to around 3 seconds for 10 different APIs that each take 3 seconds. By processing requests concurrently and not waiting for each response, the asynchronous web scraping approach significantly reduces the execution time, 
demonstrating a substantial time-saving benefit.', 'Importance of checking for public APIs for larger websites. The chapter emphasizes the significance of exploring public APIs for larger websites like Twitter, Facebook, or YouTube as they can serve data more efficiently, encouraging users to use APIs instead of manual scraping.', 'Consideration and caution when scraping websites to avoid overloading servers. The chapter advises being considerate and cautious while scraping websites, highlighting the potential of overloading servers and the adverse consequences such as IP address blocking, and recommends being mindful of not overloading websites with a high volume of requests.']}], 'duration': 397.278, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/a6fIbtFB46g/pics/a6fIbtFB46g2976535.jpg', 'highlights': ['Asynchronous web scraping could reduce execution time from over 30 seconds to around 3 seconds for 10 different APIs that each take 3 seconds', 'The synchronous version of making requests to specific URLs took around 6.6 seconds to complete due to sequential processing of delayed responses', 'The chapter emphasizes the significance of exploring public APIs for larger websites like Twitter, Facebook, or YouTube as they can serve data more efficiently', 'The chapter advises being considerate and cautious while scraping websites, highlighting the potential of overloading servers and the adverse consequences such as IP address blocking']}], 'highlights': ['Asynchronous web scraping could reduce execution time from over 30 seconds to around 3 seconds for 10 different APIs that each take 3 seconds', 'The Requests-HTML library is used to scrape websites, showcasing its ability to pull specific information from a webpage.', 'The Requests-HTML library can parse dynamic data generated by JavaScript and make asynchronous requests; the scraped results are then saved to a CSV file.', 'The chapter emphasizes the significance of exploring public APIs for larger websites like Twitter, Facebook, or YouTube 
as they can serve data more efficiently', 'The chapter advises being considerate and cautious while scraping websites, highlighting the potential of overloading servers and the adverse consequences such as IP address blocking', 'The chapter demonstrates how to extract all links on a website using the links attribute in the HTML object, making it easy to access and print the set of links, providing a useful feature for crawlers or site analysis.', 'The synchronous version of making requests to specific URLs took around 6.6 seconds to complete due to sequential processing of delayed responses', 'Explaining try/except blocks to handle video parsing errors and demonstrating practical implementation with code examples. Handling errors when parsing videos, implementing try/except blocks, and providing a detailed example of the process.', 'Covers parsing and extracting video links from HTML iframes, showcasing the complexity of information parsing and demonstrating practical Python coding techniques.', 'The process involves extracting a YouTube video ID by accessing specific indexes after splitting the URL, followed by creating a YouTube link using the extracted video ID.', 'The chapter showcases accessing HTML attributes using Python code, obtaining a dictionary of attributes for an iframe element, and extracting the src key to retrieve a YouTube link.', 'The chapter demonstrates the process of identifying the source link to the embedded video within an HTML iframe, showcasing the initial step in extracting the video link.', 'Demonstrates the process of looping through each article on the website to extract headline, summary, and YouTube link. 
Extracted headline, summary, and YouTube link for each article.', "The chapter discusses the process of extracting headline and summary information from a website's HTML structure.", 'Emphasizes the use of CSS syntax to locate specific elements and retrieve their text content.', 'Demonstrates accessing specific elements such as headline and summary using CSS class and tag selectors.', "Accessing the 'r.html' attribute in Python is demonstrated as a means to work with the HTML object on a website.", "The 'r.html' attribute in Python provides access to the HTML object for a website, allowing interaction and element search.", "Utilizing the Requests-HTML library to obtain source code from a personal website: An HTMLSession and its 'session.get' method are used to fetch the source code from the author's personal website, showcasing a real-world application of web scraping.", 'Reusing code to access headlines and summaries from articles: The code is modified to retrieve headlines and summaries from all articles, instead of just the first one, improving efficiency and scalability.', 'Explains why, on larger sites, you should avoid sifting through all of the HTML by hand, since the markup can be complex.', 'Printing out the matched elements: It demonstrates the process of printing out the matched elements, showcasing how to access and display the extracted information from the webpage.', 'Using find methods and CSS selectors: The tutorial illustrates the usage of find methods and CSS selectors to locate specific elements within the webpage, providing a practical example of these techniques.', 'Grabbing headlines and summaries within a div with a class of article: The tutorial demonstrates the process of grabbing headlines and summaries within a div with a class of article, showcasing practical implementation.', 'Demonstrated the method of using the inspect feature in Chrome to locate specific elements for web scraping by right-clicking on the desired element and selecting 
inspect.', 'The chapter demonstrates how to use Python to parse information from a website.', 'The chapter explains how to use Python to scrape and extract specific elements from a webpage.', 'The find method uses CSS selectors to locate elements by ID, demonstrated by using the pound sign to grab a div with the ID of footer.', 'Illustrates accessing attributes and methods of HTML elements, showcasing the retrieval of HTML and text of a specific element.', 'Explains the usage of the find method to locate specific elements within the HTML, providing an example of finding the title of the HTML page.', 'Demonstrates accessing HTML and text content from the parsed HTML file, showcasing the extraction of text without tags.', 'The process of parsing HTML for article headlines and summaries is explained.', 'An example of extracting post titles, summaries, and video links from a personal website is demonstrated.']}
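
The highlights above mention pulling a YouTube video ID out of an iframe's src by splitting the URL, building a watch link from it, and wrapping the step in try/except for posts that have no video. A minimal sketch of that idea, using only the standard library, might look like the following. The function name `youtube_link_from_embed`, the sample URL, and the assumption that the embed URL has the form `https://www.youtube.com/embed/<id>?...` are illustrative and not taken from the video's code.

```python
from typing import Optional


def youtube_link_from_embed(src: Optional[str]) -> Optional[str]:
    """Build a watch link from an embed URL (hypothetical helper, for illustration)."""
    try:
        # Splitting 'https://www.youtube.com/embed/<id>?...' on '/' gives
        # ['https:', '', 'www.youtube.com', 'embed', '<id>?...'],
        # so under the assumed URL shape the ID sits at index 4.
        vid_id = src.split('/')[4]
        # Drop any query string that follows the ID.
        vid_id = vid_id.split('?')[0]
    except (AttributeError, IndexError):
        # AttributeError: src is None (no iframe was found for this post).
        # IndexError: the URL does not have the assumed shape.
        return None
    return f'https://youtube.com/watch?v={vid_id}'


if __name__ == '__main__':
    sample = 'https://www.youtube.com/embed/abc123XYZ?version=3&rel=1'
    print(youtube_link_from_embed(sample))
    print(youtube_link_from_embed(None))
```

With Requests-HTML, the `src` value would typically come from the `attrs` dictionary of the iframe element; here it is passed in as a plain string so the sketch runs without a network connection. Returning `None` on failure is one possible design choice; a scraper could equally skip the post or log it.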