title
Dynamic Javascript Scraping - Web scraping with Beautiful Soup 4 p.4

description
Welcome to part 4 of the web scraping with Beautiful Soup 4 tutorial mini-series. Here, we're going to discuss how to parse data that is dynamically updated via JavaScript. Many websites supply data that is loaded dynamically by JavaScript after the initial page arrives. In Python, you can use Jinja templating to populate a page on the server without JavaScript, but many websites use JavaScript to populate data in the browser. To simulate this, I have added some JavaScript to the sample page: https://pythonprogramming.net/parsememcparseface/ https://pythonprogramming.net https://twitter.com/sentdex https://www.facebook.com/pythonprogramming.net/ https://plus.google.com/+sentdex
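
To make the problem concrete, here is a minimal sketch of why a plain parse misses the dynamic text. The markup, the element id `js-test`, and the helper names are assumptions made up for illustration, not the sample page's real source. The standard library's `html.parser` stands in for Beautiful Soup so the sketch runs with no extra installs. The point is the same either way: a parser reads the HTML the server sent and never executes the script.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the served markup. The id "js-test" and the exact
# script are assumptions for illustration, not the real page source.
STATIC_HTML = """
<html><body>
  <p id="js-test">no, no, no</p>
  <script>
    document.getElementById('js-test').innerHTML = 'look at you, shining';
  </script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text inside the element that carries a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.capturing = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Start capturing once the element with the wanted id opens.
        if dict(attrs).get("id") == self.target_id:
            self.capturing = True

    def handle_endtag(self, tag):
        # The target holds only text here, so any closing tag ends the capture.
        self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)


def static_text(html, element_id):
    """Return the text a non-browser client sees for element_id."""
    parser = TextById(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


if __name__ == "__main__":
    # The script is never run, so only the placeholder text comes back.
    print(static_text(STATIC_HTML, "js-test"))
```

The browser shows the updated text because it runs the script. A parser only reports what was in the response body, which is why the tutorial goes on to mimic a client.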

detail
{'title': 'Dynamic Javascript Scraping - Web scraping with Beautiful Soup 4 p.4', 'heatmap': [{'end': 710.382, 'start': 687.225, 'weight': 0.997}], 'summary': 'Tutorial series covers web scraping and parsing with beautifulsoup4, using pyqt4 for windows installation, and emphasizes efficiency in handling asynchronous loading of web pages and processing javascript through multiprocessing and threading.', 'chapters': [{'end': 225.853, 'segs': [{'end': 50.255, 'src': 'embed', 'start': 2.082, 'weight': 0, 'content': [{'end': 6.789, 'text': 'What is going on everybody? Welcome to part four of our web scraping with BeautifulSoup4 tutorial series.', 'start': 2.082, 'duration': 4.707}, {'end': 14.62, 'text': "In this tutorial, what we're gonna be talking about is how to scrape dynamically updated information from a webpage.", 'start': 7.209, 'duration': 7.411}, {'end': 21.414, 'text': 'So to begin, I have added some information to the ParseMe McParseface page.', 'start': 15.689, 'duration': 5.725}, {'end': 28.881, 'text': 'Underneath this picture, you can see this JavaScript dynamic data test, and it just says, look at you shining.', 'start': 21.734, 'duration': 7.147}, {'end': 33.685, 'text': "It says that because we're viewing it with a client in a browser.", 'start': 29.301, 'duration': 4.384}, {'end': 37.907, 'text': 'And the browser is actually doing something that makes that show up.', 'start': 34.125, 'duration': 3.782}, {'end': 39.508, 'text': "Let's look further.", 'start': 38.188, 'duration': 1.32}, {'end': 47.493, 'text': "So viewing the source code, zooming in, scrolling down, here is what we're looking for.", 'start': 40.109, 'duration': 7.384}, {'end': 50.255, 'text': 'So this is what I was just showing you.', 'start': 47.513, 'duration': 2.742}], 'summary': 'Tutorial on scraping dynamic data from a webpage using beautifulsoup4.', 'duration': 48.173, 'max_score': 2.082, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU2082.jpg'}, {'end': 208.184, 'src': 'embed', 'start': 156.613, 'weight': 2, 'content': [{'end': 163.201, 'text': "Like, what the hell? The problem is you aren't a client, you're not a browser.", 'start': 156.613, 'duration': 6.588}, {'end': 170.564, 'text': 'So what we have to do is mimic being a client or a browser, and actually run that JavaScript,', 'start': 163.861, 'duration': 6.703}, {'end': 172.865, 'text': 'which is actually a little bit more involved than you might think.', 'start': 170.564, 'duration': 2.301}, {'end': 175.646, 'text': 'Or maybe you are thinking that and realizing, oh no.', 'start': 173.826, 'duration': 1.82}, {'end': 181.429, 'text': "So there's a whole lot of options at our disposal for how we're gonna do this.", 'start': 177.387, 'duration': 4.042}, {'end': 188.412, 'text': "I think the easiest way to do this is to use PyQt. Specifically, we'll be using Qt4.", 'start': 182.129, 'duration': 6.283}, {'end': 191.294, 'text': "I'm sure you can do it in Qt5, but I'm just going to use Qt4.", 'start': 188.432, 'duration': 2.862}, {'end': 194.917, 'text': 'So, I do have a tutorial on PyQt4.', 'start': 192.375, 'duration': 2.542}, {'end': 201.661, 'text': "You don't need to follow this entire tutorial, but you should go to the first step, the first page, and you'll need to get Qt4.", 'start': 194.957, 'duration': 6.704}, {'end': 208.184, 'text': "If you're on Windows, go to this URL here and download the wheel for PyQt4.", 'start': 203.562, 'duration': 4.622}], 'summary': 'To run JavaScript, mimic a client or browser using PyQt4, which is more involved than expected.', 'duration': 51.571, 'max_score': 156.613, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU156613.jpg'}], 'start': 2.082, 'title': 'Web scraping and parsing with PyQt4', 'summary': 'Covers web scraping with BeautifulSoup4 for dynamic 
data, parsing web page tags, and using pyqt4 for windows installation, providing insights on browser interaction, unexpected results, and installation guidance.', 'chapters': [{'end': 122.243, 'start': 2.082, 'title': 'Web scraping with beautifulsoup4: dynamic data', 'summary': 'Discusses how to scrape dynamically updated information from a webpage using beautifulsoup4, with a focus on extracting javascript dynamic data and understanding the process of browser interaction and server communication.', 'duration': 120.161, 'highlights': ['The tutorial series focuses on part four of web scraping with BeautifulSoup4, emphasizing the extraction of dynamically updated information from a webpage.', "The JavaScript dynamic data test on the ParseMe McParseface page displays the text 'look at you shining' due to browser interaction.", "The script in the webpage finds the element by an id and modifies the HTML content to display the text 'look at you, shining' when the page is browsed.", "The initial information retrieved from the server was 'no, no, no,' but the script executed by the browser updated the content to display 'look at you, shining.'"]}, {'end': 181.429, 'start': 122.243, 'title': 'Parsing web page tags', 'summary': 'Discusses parsing web page tags using code, encountering unexpected results, and the need to mimic a client or browser to run javascript.', 'duration': 59.186, 'highlights': ['The need to mimic being a client or a browser to run JavaScript is emphasized, revealing the complexity of the task involved.', 'Encountering unexpected results when parsing web page tags is highlighted, indicating potential challenges in data extraction.', 'Discussing the process of parsing web page tags using code and the implications of encountering unexpected results.']}, {'end': 225.853, 'start': 182.129, 'title': 'Using pyqt4 for windows installation', 'summary': 'Discusses using pyqt4 for windows installation, emphasizing the need to download and install qt4 and offering 
assistance to users encountering issues, with a mention of a tutorial on pyqt4 and guidance on obtaining the required software.', 'duration': 43.724, 'highlights': ['Emphasizes the importance of obtaining Qt4 for Windows installation.', 'Mentions the availability of a tutorial on PyQt4.', 'Offers assistance to users encountering issues with the installation.']}], 'duration': 223.771, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU2082.jpg', 'highlights': ['The tutorial series focuses on part four of web scraping with BeautifulSoup4, emphasizing the extraction of dynamically updated information from a webpage.', "The JavaScript dynamic data test on the ParseMe McParseface page displays the text 'look at you shining' due to browser interaction.", 'The need to mimic being a client or a browser to run JavaScript is emphasized, revealing the complexity of the task involved.', 'Emphasizes the importance of obtaining Qt4 for Windows installation.']}, {'end': 732.729, 'segs': [{'end': 327.681, 'src': 'embed', 'start': 274.044, 'weight': 0, 'content': [{'end': 278.706, 'text': "QApplication is probably easy enough, but it's the thing for making applications.", 'start': 274.044, 'duration': 4.662}, {'end': 282.609, 'text': "QURL, this is how we're actually going to read the URL, basically.", 'start': 278.766, 'duration': 3.843}, {'end': 294.156, 'text': "And then finally, QTWebKit, actually that's a capital K, WebKit, import QWebPage.", 'start': 283.069, 'duration': 11.087}, {'end': 302.941, 'text': 'Lovely So this is going to let us actually load the page and act like a browser, act like a client.', 'start': 295.316, 'duration': 7.625}, {'end': 305.244, 'text': 'and run that JavaScript.', 'start': 303.362, 'duration': 1.882}, {'end': 310.548, 'text': "So it's saving us a ton of programming that would be involved there.", 'start': 305.384, 'duration': 5.164}, {'end': 313.171, 'text': 'And in theory, you could actually 
make the page show up even.', 'start': 310.588, 'duration': 2.583}, {'end': 318.516, 'text': 'You can make your own web browsers in Qt 4 as you might be able to surmise at this point.', 'start': 313.531, 'duration': 4.985}, {'end': 321.297, 'text': 'Anyway, cool.', 'start': 319.156, 'duration': 2.141}, {'end': 327.681, 'text': "So we have all the imports we need, and now what we need to do is we're going to write a client class.", 'start': 321.577, 'duration': 6.104}], 'summary': 'Qt allows making applications, reading urls, acting like a browser, and running javascript, saving programming effort.', 'duration': 53.637, 'max_score': 274.044, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU274044.jpg'}, {'end': 577.735, 'src': 'embed', 'start': 544.136, 'weight': 3, 'content': [{'end': 550.682, 'text': "we're at least letting qt load the page and then now we're going to grab the source code from q web page.", 'start': 544.136, 'duration': 6.546}, {'end': 563.564, 'text': 'basically. 
So the source is going to be the clientResponse, because this clientResponse is a Client object, which inherits from QWebPage.', 'start': 550.682, 'duration': 12.882}, {'end': 566.347, 'text': 'So we can use the QWebPage methods now.', 'start': 564.005, 'duration': 2.342}, {'end': 577.735, 'text': 'So clientResponse.mainFrame(), the main frame, dot, shoot, is it, I think, yeah, toHtml().', 'start': 566.767, 'duration': 10.968}], 'summary': 'Using Qt to load and access webpage source code.', 'duration': 33.599, 'max_score': 544.136, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU544136.jpg'}, {'end': 713.785, 'src': 'heatmap', 'start': 687.225, 'weight': 0.997, 'content': [{'end': 693.811, 'text': "And then the app will exit because we don't actually care that much.", 'start': 687.225, 'duration': 6.586}, {'end': 698.234, 'text': "OK. Let's try it one more time.", 'start': 696.653, 'duration': 1.581}, {'end': 703.579, 'text': 'Boom. We got what we wanted.', 'start': 701.577, 'duration': 2.002}, {'end': 710.382, 'text': 'All right, so that is how you can scrape dynamic data.', 'start': 704.178, 'duration': 6.204}, {'end': 713.785, 'text': 'Just for kicks, I want to see if I can actually show..', 'start': 710.783, 'duration': 3.002}], 'summary': 'Demonstrating dynamic data scraping successfully.', 'duration': 26.56, 'max_score': 687.225, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU687225.jpg'}], 'start': 225.853, 'title': 'Setting up Qt4 and scraping dynamic data', 'summary': 'Covers importing necessary modules to set up Qt4 for web browsing and creating a web browser, as well as scraping dynamic data using PyQt4 to handle asynchronous loading of web pages and extract HTML source code.', 'chapters': [{'end': 345.73, 'start': 225.853, 'title': 'Importing and setting up qt4 for web browsing', 'summary': 'Covers the process of importing necessary modules 
like sys, qapplication, qurl, and qwebpage in order to set up qt4 for web browsing and client-side interactions, enabling the creation of a web browser in qt 4.', 'duration': 119.877, 'highlights': ['Importing necessary modules like sys, QApplication, QUrl, and QWebPage is essential for setting up Qt4 for web browsing and client-side interactions, saving significant programming effort and enabling the creation of a web browser in Qt 4.', 'Understanding the purpose of each imported module, such as QApplication for making applications and QUrl for reading URLs, is crucial for effective utilization of Qt4 in web browsing.', 'The process of writing a client class, which is essential for setting up Qt4 for web browsing, is discussed, serving as an introduction to object-oriented programming for those unfamiliar with the concept.']}, {'end': 732.729, 'start': 345.73, 'title': 'Scraping dynamic data with qt', 'summary': 'Discusses how to scrape dynamic data using pyqt4 to handle asynchronous loading of web pages, connecting methods for page load handling, and using qwebpage methods to extract html source code.', 'duration': 386.999, 'highlights': ['The chapter explains the process of using PyQt4 to handle asynchronous loading of web pages and scrape dynamic data, showcasing the method of connecting page load handling methods and utilizing QWebPage methods to extract HTML source code.', 'The process involves defining a client class that inherits from the QWebPage, initializing the client with a URL, connecting the onPageLoad method to be executed when the page load is finished, and using QWebPage methods to extract the source code from the loaded page.', 'The detailed process includes initializing the client class with a URL, connecting the onPageLoad method to be executed upon page load completion, and using QWebPage methods to extract the source code from the loaded page, demonstrating the handling of asynchronous loading and extraction of dynamic data.']}], 'duration': 
506.876, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU225853.jpg', 'highlights': ['Importing necessary modules like sys, QApplication, QUrl, and QWebPage is essential for setting up Qt4 for web browsing and client-side interactions, saving significant programming effort and enabling the creation of a web browser in Qt 4.', 'The chapter explains the process of using PyQt4 to handle asynchronous loading of web pages and scrape dynamic data, showcasing the method of connecting page load handling methods and utilizing QWebPage methods to extract HTML source code.', 'Understanding the purpose of each imported module, such as QApplication for making applications and QUrl for reading URLs, is crucial for effective utilization of Qt4 in web browsing.', 'The process involves defining a client class that inherits from the QWebPage, initializing the client with a URL, connecting the onPageLoad method to be executed when the page load is finished, and using QWebPage methods to extract the source code from the loaded page.', 'The detailed process includes initializing the client class with a URL, connecting the onPageLoad method to be executed upon page load completion, and using QWebPage methods to extract the source code from the loaded page, demonstrating the handling of asynchronous loading and extraction of dynamic data.', 'The process of writing a client class, which is essential for setting up Qt4 for web browsing, is discussed, serving as an introduction to object-oriented programming for those unfamiliar with the concept.']}, {'end': 905.165, 'segs': [{'end': 843.945, 'src': 'embed', 'start': 780.658, 'weight': 1, 'content': [{'end': 785.76, 'text': "set all that stuff up, ok, and then we're also going to process whatever javascript is there.", 'start': 780.658, 'duration': 5.102}, {'end': 787.581, 'text': 'we have to process that javascript.', 'start': 785.76, 'duration': 1.821}, {'end': 788.681, 'text': 
"that's going to take time.", 'start': 787.581, 'duration': 1.1}, {'end': 790.642, 'text': "Okay. so that's one thing.", 'start': 789.261, 'duration': 1.381}, {'end': 796.446, 'text': "Also, when you're parsing websites, the other thing is just latency and response time of the server.", 'start': 791.042, 'duration': 5.404}, {'end': 805.172, 'text': 'So a lot of people have asked me, as I was releasing this series what do we do about the fact that Beautiful Soup is slow?', 'start': 796.927, 'duration': 8.245}, {'end': 807.594, 'text': 'Beautiful Soup is not really that slow.', 'start': 806.153, 'duration': 1.441}, {'end': 811.137, 'text': "I mean, it's a fairly efficient framework.", 'start': 807.674, 'duration': 3.463}, {'end': 818.615, 'text': 'The problem is a server request and response time is probably going to be like 500 milliseconds or more.', 'start': 811.477, 'duration': 7.138}, {'end': 820.717, 'text': "Okay And so it's not instant.", 'start': 818.995, 'duration': 1.722}, {'end': 825.161, 'text': "So if you're trying to crawl 500 URLs, that 500 milliseconds is suddenly 250 seconds for all 500 URLs.", 'start': 820.757, 'duration': 4.404}, {'end': 829.977, 'text': "that's a really long time to wait.", 'start': 828.356, 'duration': 1.621}, {'end': 835.4, 'text': "so the two things you need to think about is if, say, you're, you're mimicking a client,", 'start': 829.977, 'duration': 5.423}, {'end': 843.945, 'text': "for that you're going to need to utilize multi processing and, just like I don't have yet a tutorial on object-oriented programming,", 'start': 835.4, 'duration': 8.545}], 'summary': 'Processing javascript and server latency impact web scraping efficiency.', 'duration': 63.287, 'max_score': 780.658, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU780658.jpg'}, {'end': 888.779, 'src': 'embed', 'start': 861.886, 'weight': 0, 'content': [{'end': 867.327, 'text': "Now, if it's a latency issue 
and you're just simply waiting for that response, that means your CPU is actually idle.", 'start': 861.886, 'duration': 5.441}, {'end': 868.547, 'text': "You've got idle threads.", 'start': 867.387, 'duration': 1.16}, {'end': 874.749, 'text': 'Thus, you can use the threading module and that will speed up the whole process.', 'start': 869.028, 'duration': 5.721}, {'end': 878.696, 'text': "And in reality, you're going to probably need to use both.", 'start': 875.655, 'duration': 3.041}, {'end': 884.758, 'text': "You're going to want to make full use because you're going to be waiting on other people's processing sometimes.", 'start': 879.896, 'duration': 4.862}, {'end': 888.779, 'text': "And then many times it's just going to be a bottleneck of your own processing.", 'start': 885.078, 'duration': 3.701}], 'summary': 'Using the threading module can speed up processing by making full use of cpu and overcoming latency issues.', 'duration': 26.893, 'max_score': 861.886, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU861886.jpg'}], 'start': 732.729, 'title': 'Web scraping efficiency', 'summary': 'Discusses the efficiency of web scraping with pyqt 4, emphasizing time-consuming processes of loading web content and processing javascript and the use of multiprocessing and threading to address latency and response time issues.', 'chapters': [{'end': 905.165, 'start': 732.729, 'title': 'Web scraping efficiency', 'summary': 'Discusses the efficiency of web scraping with pyqt 4, highlighting the time-consuming processes of loading web content and processing javascript, as well as the use of multiprocessing and threading to address latency and response time issues.', 'duration': 172.436, 'highlights': ['Loading web content and processing JavaScript are time-consuming processes in web scraping, leading to latency and response time issues.', 'The server response time for crawling 500 URLs could result in a wait of 250 seconds 
or more, emphasizing the need for efficient processing techniques.', 'Utilizing multi-processing and threading can address latency issues and optimize processing efficiency in web scraping.', 'The efficiency of Beautiful Soup framework is affected by server response time, requiring the use of multi-processing and threading for improved performance.']}], 'duration': 172.436, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FSH77vnOGqU/pics/FSH77vnOGqU732729.jpg', 'highlights': ['Utilizing multi-processing and threading can address latency issues and optimize processing efficiency in web scraping.', 'The efficiency of Beautiful Soup framework is affected by server response time, requiring the use of multi-processing and threading for improved performance.', 'Loading web content and processing JavaScript are time-consuming processes in web scraping, leading to latency and response time issues.', 'The server response time for crawling 500 URLs could result in a wait of 250 seconds or more, emphasizing the need for efficient processing techniques.']}], 'highlights': ['The tutorial series focuses on part four of web scraping with BeautifulSoup4, emphasizing the extraction of dynamically updated information from a webpage.', "The JavaScript dynamic data test on the ParseMe McParseface page displays the text 'look at you shining' due to browser interaction.", 'The need to mimic being a client or a browser to run JavaScript is emphasized, revealing the complexity of the task involved.', 'Emphasizes the importance of obtaining Qt4 for Windows installation.', 'Importing necessary modules like sys, QApplication, QUrl, and QWebPage is essential for setting up Qt4 for web browsing and client-side interactions, saving significant programming effort and enabling the creation of a web browser in Qt 4.', 'The chapter explains the process of using PyQt4 to handle asynchronous loading of web pages and scrape dynamic data, showcasing the method of connecting 
page load handling methods and utilizing QWebPage methods to extract HTML source code.', 'Understanding the purpose of each imported module, such as QApplication for making applications and QUrl for reading URLs, is crucial for effective utilization of Qt4 in web browsing.', 'The process involves defining a client class that inherits from the QWebPage, initializing the client with a URL, connecting the onPageLoad method to be executed when the page load is finished, and using QWebPage methods to extract the source code from the loaded page.', 'The detailed process includes initializing the client class with a URL, connecting the onPageLoad method to be executed upon page load completion, and using QWebPage methods to extract the source code from the loaded page, demonstrating the handling of asynchronous loading and extraction of dynamic data.', 'The process of writing a client class, which is essential for setting up Qt4 for web browsing, is discussed, serving as an introduction to object-oriented programming for those unfamiliar with the concept.', 'Utilizing multi-processing and threading can address latency issues and optimize processing efficiency in web scraping.', 'The efficiency of Beautiful Soup framework is affected by server response time, requiring the use of multi-processing and threading for improved performance.', 'Loading web content and processing JavaScript are time-consuming processes in web scraping, leading to latency and response time issues.', 'The server response time for crawling 500 URLs could result in a wait of 250 seconds or more, emphasizing the need for efficient processing techniques.']}
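
The latency point in the transcript (about 500 ms per request, so roughly 250 seconds for 500 URLs fetched one after another) can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the video's code. `fake_fetch` is a hypothetical stand-in that only sleeps, the URLs are made up, and the simulated latency is shortened to 50 ms so the demo finishes quickly. It covers only the idle-waiting case, where threads help. CPU-bound work such as processing JavaScript is where the transcript suggests multiprocessing instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumption: a shortened stand-in for the ~500 ms round trip in the transcript.
SIMULATED_LATENCY = 0.05  # seconds


def estimated_sequential_seconds(n_urls, latency_seconds):
    """Total wait if every request is made one after another."""
    return n_urls * latency_seconds


def fake_fetch(url):
    """Hypothetical fetch: sleeps to imitate waiting on a server response."""
    time.sleep(SIMULATED_LATENCY)
    return url, len(url)


def crawl_sequential(urls):
    """Fetch each URL in turn; the CPU sits idle during every wait."""
    return [fake_fetch(url) for url in urls]


def crawl_threaded(urls, workers=10):
    """Overlap the waits by running fetches in a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the same order as the input URLs.
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    # The transcript's arithmetic: 500 URLs at 0.5 s each, fetched in sequence.
    print(estimated_sequential_seconds(500, 0.5), "seconds")

    urls = ["https://example.com/page/%d" % i for i in range(20)]  # made-up URLs

    start = time.perf_counter()
    crawl_sequential(urls)
    print("sequential: %.2f s" % (time.perf_counter() - start))

    start = time.perf_counter()
    crawl_threaded(urls)
    print("threaded:   %.2f s" % (time.perf_counter() - start))
```

Threads suit this case because each one spends nearly all its time blocked on the simulated response, so the waits overlap instead of adding up.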