title
Scraping Data from a Real Website | Web Scraping in Python

description
Take my Full Python Course Here: https://bit.ly/48O581R In this Web Scraping tutorial we are going to be scraping data from a real website! GitHub Code: https://bit.ly/442kIVi

detail
{'title': 'Scraping Data from a Real Website | Web Scraping in Python', 'heatmap': [{'end': 231.229, 'start': 196.572, 'weight': 0.811}, {'end': 642.538, 'start': 548.177, 'weight': 0.792}, {'end': 762.927, 'start': 730.115, 'weight': 1}, {'end': 1312.87, 'start': 1279.146, 'weight': 0.822}], 'summary': 'Demonstrates web data scraping for a pandas dataframe from a real website, including handling complex tables, using beautifulsoup for table data extraction and data cleaning, and extracting and processing table data using python, resulting in csv export and addressing user errors.', 'chapters': [{'end': 224.945, 'segs': [{'end': 92.14, 'src': 'embed', 'start': 30.163, 'weight': 0, 'content': [{'end': 34.847, 'text': "We're going to be going on to Wikipedia and looking at the list of the largest companies in the United States by revenue.", 'start': 30.163, 'duration': 4.684}, {'end': 37.81, 'text': "And we're going to be pulling all of this information.", 'start': 35.407, 'duration': 2.403}, {'end': 45.577, 'text': "So if you thought this was going to be easy and a little mini project, it's now a full project because why not? So let's get started.", 'start': 37.85, 'duration': 7.727}, {'end': 49.121, 'text': "What we're gonna do is we're gonna import Beautiful Soup and requests.", 'start': 46.098, 'duration': 3.023}, {'end': 53.527, 'text': "We're gonna get this information and we're gonna see how we can do this.", 'start': 49.562, 'duration': 3.965}, {'end': 56.971, 'text': "And it's gonna get a little bit more complicated, a little bit more tricky.", 'start': 53.987, 'duration': 2.984}, {'end': 63.739, 'text': "We're gonna have to format things properly to get it into our Pandas DataFrame to make it looking good and making it more usable.", 'start': 56.991, 'duration': 6.748}, {'end': 66.141, 'text': "So let's go ahead and get rid of this easy table.", 'start': 64.119, 'duration': 2.022}, {'end': 67.001, 'text': "We don't want that one.", 'start': 66.181, 'duration': 0.82}, {'end': 69.963, 'text': "And we're going to come in here and we're just going to start off.", 'start': 67.361, 'duration': 2.602}, {'end': 72.205, 'text': 'This should look really familiar by now.', 'start': 70.003, 'duration': 2.202}, {'end': 79.11, 'text': "We're going to say from BS4 import beautiful soup.", 'start': 72.725, 'duration': 6.385}, {'end': 83.333, 'text': "I don't know if you've noticed, but I've messed up spelling beautiful soup in every single video.", 'start': 79.13, 'duration': 4.203}, {'end': 84.714, 'text': "I've noticed.", 'start': 84.194, 'duration': 0.52}, {'end': 85.915, 'text': "Let's run this.", 'start': 85.295, 'duration': 0.62}, {'end': 89.318, 'text': 'And now we need to go ahead and get our URL.', 'start': 87.016, 'duration': 2.302}, {'end': 90.278, 'text': "So let's come up here.", 'start': 89.478, 'duration': 0.8}, {'end': 92.14, 'text': "Let's get our URL.", 'start': 90.298, 'duration': 1.842}], 'summary': 'Scraping wikipedia for the largest us companies and formatting into pandas dataframe.', 'duration': 61.977, 'max_score': 30.163, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH030163.jpg'}, {'end': 167.544, 'src': 'embed', 'start': 138.49, 'weight': 1, 'content': [{'end': 141.443, 'text': 'Am I right? 
So we got a lot of things going for us.', 'start': 138.49, 'duration': 2.953}, {'end': 144.626, 'text': 'The stuff was imported properly.', 'start': 142.124, 'duration': 2.502}, {'end': 146.047, 'text': 'We got our URL.', 'start': 145.186, 'duration': 0.861}, {'end': 152.332, 'text': "We got our soup, which is not beautiful in my opinion, but let's keep on rolling.", 'start': 146.468, 'duration': 5.864}, {'end': 153.113, 'text': "Let's come right down here.", 'start': 152.352, 'duration': 0.761}, {'end': 157.196, 'text': "Now, what we need to do is we need to specify what data we're looking for.", 'start': 153.633, 'duration': 3.563}, {'end': 160.038, 'text': "So let's come and let's inspect this webpage.", 'start': 157.676, 'duration': 2.362}, {'end': 164.142, 'text': "Now, the only information that we're gonna want is right in here.", 'start': 160.359, 'duration': 3.783}, {'end': 167.544, 'text': "We're gonna want these titles or these headers, whoops.", 'start': 164.182, 'duration': 3.362}], 'summary': 'Imported stuff, url and soup ready. need to specify data from webpage for titles/headers.', 'duration': 29.054, 'max_score': 138.49, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0138490.jpg'}], 'start': 0.109, 'title': 'Web data scraping', 'summary': 'Covers scraping data for a pandas dataframe from a real website, including switching to a more complex table from wikipedia, and web scraping basics using beautiful soup to extract specific information from a webpage.', 'chapters': [{'end': 67.001, 'start': 0.109, 'title': 'Scraping data for pandas dataframe', 'summary': 'Discusses scraping data from a real website to create a pandas dataframe, including the switch from a previously easy table to a more complex one from wikipedia, aiming to create a full project.', 'duration': 66.892, 'highlights': ['The chapter discusses scraping data from a real website to create a pandas dataframe. It involves scraping data from Wikipedia to obtain the list of the largest companies in the United States by revenue and then formatting it to be usable in a Pandas DataFrame.', "The switch from a previously easy table to a more complex one from Wikipedia, aiming to create a full project. The transition from working with an easy table to a more complex one from Wikipedia's list of the largest companies in the United States by revenue turns the project into a full project, challenging the learners.", 'The need to format the data properly for it to be usable in a Pandas DataFrame. 
The process involves formatting the obtained data from Wikipedia properly to ensure it can be effectively used in a Pandas DataFrame, indicating a more complex and challenging task.']}, {'end': 224.945, 'start': 67.361, 'title': 'Web scraping with beautiful soup', 'summary': 'Covers the basics of web scraping using beautiful soup to extract specific information from a webpage, such as obtaining the url, parsing the html, and identifying the desired data.', 'duration': 157.584, 'highlights': ['The process involves importing the Beautiful Soup library, obtaining the URL, making a request to access the information, and parsing the HTML to extract the desired data.', 'Identifying and inspecting the specific elements on a webpage, such as tables, to determine the required information for extraction.', 'Understanding the structure of the webpage and recognizing multiple tables present, which may impact the data extraction process.']}], 'duration': 224.836, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0109.jpg', 'highlights': ['The process involves importing the Beautiful Soup library, obtaining the URL, making a request to access the information, and parsing the HTML to extract the desired data.', 'Identifying and inspecting the specific elements on a webpage, such as tables, to determine the required information for extraction.', 'The need to format the data properly for it to be usable in a Pandas DataFrame. The process involves formatting the obtained data from Wikipedia properly to ensure it can be effectively used in a Pandas DataFrame, indicating a more complex and challenging task.', "The switch from a previously easy table to a more complex one from Wikipedia, aiming to create a full project. The transition from working with an easy table to a more complex one from Wikipedia's list of the largest companies in the United States by revenue turns the project into a full project, challenging the learners.", 'The chapter discusses scraping data from a real website to create a pandas dataframe. It involves scraping data from Wikipedia to obtain the list of the largest companies in the United States by revenue and then formatting it to be usable in a Pandas DataFrame.']}, {'end': 421.538, 'segs': [{'end': 277.148, 'src': 'embed', 'start': 253.382, 'weight': 1, 'content': [{'end': 263.106, 'text': "so it looks like there are two tables with the same class, which shouldn't be a problem if we're using find to get our text,", 'start': 253.382, 'duration': 9.724}, {'end': 266.948, 'text': 'because we should be taking the first one, which will be this table, and this is the table we want.', 'start': 263.106, 'duration': 3.842}, {'end': 277.148, 'text': "And if we wanted this one, we could just use find all and since it's a list, we could use indexing to pull this table right?", 'start': 268.275, 'duration': 8.873}], 'summary': 'Identified issue: two tables with the same class. 
solution: use find for first, find all for list.', 'duration': 23.766, 'max_score': 253.382, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0253382.jpg'}, {'end': 383.144, 'src': 'embed', 'start': 356.095, 'weight': 0, 'content': [{'end': 361.121, 'text': 'And then we have rank, name, industry, all the ones that we were hoping to see.', 'start': 356.095, 'duration': 5.026}, {'end': 370.832, 'text': "And I guarantee you, if we scroll all the way to the bottom, we're gonna see potentially Wells Fargo, Goldman Sachs.", 'start': 361.141, 'duration': 9.691}, {'end': 371.613, 'text': "I'm pretty sure those are..", 'start': 370.852, 'duration': 0.761}, {'end': 375.121, 'text': "Let's see.", 'start': 374.721, 'duration': 0.4}, {'end': 375.981, 'text': 'Yeah, here we go.', 'start': 375.201, 'duration': 0.78}, {'end': 378.442, 'text': 'Like Ford Motor, Wells Fargo, Goldman Sachs.', 'start': 376.001, 'duration': 2.441}, {'end': 379.603, 'text': "That's this table right here.", 'start': 378.462, 'duration': 1.141}, {'end': 383.144, 'text': "So now we're looking at the third table, but again, this is a list.", 'start': 379.923, 'duration': 3.221}], 'summary': 'The transcript discusses a list of companies including ford motor, wells fargo, and goldman sachs.', 'duration': 27.049, 'max_score': 356.095, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0356095.jpg'}], 'start': 225.766, 'title': 'Web scraping table data', 'summary': 'Discusses using beautifulsoup to identify and select tables, encountering issues with multiple tables of the same class, experimenting with find and find all methods, and demonstrates web scraping to extract specific table data, including the rank, name, and industry of companies, identifying wells fargo and goldman sachs as potential data points.', 'chapters': [{'end': 305.36, 'start': 225.766, 'title': 'Identifying and selecting tables with beautifulsoup', 'summary': 'Discusses using beautifulsoup to identify and select tables, encountering issues with multiple tables of the same class, and experimenting with find and find all methods to retrieve the desired table.', 'duration': 79.594, 'highlights': ['The chapter discusses using BeautifulSoup to identify and select tables, encountering issues with multiple tables of the same class, and experimenting with find and find all methods to retrieve the desired table.', 'The author encounters a situation where there are two tables with the same class, leading to ambiguity in table selection.', 'The author experiments with the find and find all methods in BeautifulSoup to retrieve the desired table, expressing uncertainty about the correctness of the result obtained.']}, {'end': 356.075, 'start': 305.36, 'title': 'Identifying and modifying a table', 'summary': "Involves identifying and modifying a table, using the 'find_all' method to locate multiple instances of a specific element, and encountering a 'weird one' before finding the desired elements.", 'duration': 50.715, 'highlights': ["Using the 'find_all' method to locate multiple instances of a specific element The chapter involves using the 'find_all' method to locate multiple instances of a specific element within the code.", "Encountering a 'weird one' before finding the desired elements The chapter describes encountering a 'weird one' before finding the desired elements in the code.", 'Identifying and modifying a table The chapter focuses on identifying and modifying 
a table within the code.']}, {'end': 421.538, 'start': 356.095, 'title': 'Web scraping table data', 'summary': 'Demonstrates using web scraping to extract specific table data, including the rank, name, and industry of companies, and identifies wells fargo and goldman sachs as potential data points.', 'duration': 65.443, 'highlights': ['The chapter demonstrates using web scraping to extract specific table data, including the rank, name, and industry of companies, and identifies Wells Fargo and Goldman Sachs as potential data points.', 'The process involves using indexing to select the desired data from the table, and specifies using find all to extract the required information.', 'The speaker emphasizes the importance of selecting the relevant information and discarding unnecessary data when scraping web tables.']}], 'duration': 195.772, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0225766.jpg', 'highlights': ['The chapter demonstrates using web scraping to extract specific table data, including the rank, name, and industry of companies, and identifies Wells Fargo and Goldman Sachs as potential data points.', 'The chapter discusses using BeautifulSoup to identify and select tables, encountering issues with multiple tables of the same class, and experimenting with find and find all methods to retrieve the desired table.', "Using the 'find_all' method to locate multiple instances of a specific element The chapter involves using the 'find_all' method to locate multiple instances of a specific element within the code."]}, {'end': 768.548, 'segs': [{'end': 642.538, 'src': 'heatmap', 'start': 498.394, 'weight': 1, 'content': [{'end': 502.977, 'text': 'So it says wiki table sortable jQuery dash table sorter right here.', 'start': 498.394, 'duration': 4.583}, {'end': 512.602, 'text': "But in our actual Python script that we're running, it was only pulling in the wiki table sortable.", 'start': 503.717, 'duration': 8.885}, {'end': 515.89, 'text': "So it wasn't pulling in the jQuery-tableSorter.", 'start': 513.629, 'duration': 2.261}, {'end': 524.279, 'text': "Why? 
I'm not 100% sure, but all things that we're working through and we were able to figure out.", 'start': 516.472, 'duration': 7.807}, {'end': 527.041, 'text': "So we're gonna make this our table.", 'start': 525.059, 'duration': 1.982}, {'end': 531.625, 'text': "We're gonna say tables equal to soup.findAll.", 'start': 527.842, 'duration': 3.783}, {'end': 532.927, 'text': "And let's run this.", 'start': 532.206, 'duration': 0.721}, {'end': 536.664, 'text': 'And if we print out our table, we have this table.', 'start': 533.687, 'duration': 2.977}, {'end': 539.227, 'text': 'Now, this is our only data that we are looking at.', 'start': 536.704, 'duration': 2.523}, {'end': 544.953, 'text': 'Now, the first thing that I want to get is I wanna get these titles or these headers right here.', 'start': 539.467, 'duration': 5.486}, {'end': 546.395, 'text': "That's what we're gonna get first.", 'start': 545.334, 'duration': 1.061}, {'end': 548.137, 'text': "So let's go in here.", 'start': 546.976, 'duration': 1.161}, {'end': 549.378, 'text': 'We can just look in this information.', 'start': 548.177, 'duration': 1.201}, {'end': 552.141, 'text': 'You can see that these are with these TH tags.', 'start': 549.398, 'duration': 2.743}, {'end': 556.174, 'text': 'and we can pull out those th tags really easily.', 'start': 553.012, 'duration': 3.162}, {'end': 557.735, 'text': "Let's come right down here.", 'start': 556.194, 'duration': 1.541}, {'end': 563.459, 'text': "We're just gonna say th, and we can get rid of this, and let's run this.", 'start': 557.755, 'duration': 5.704}, {'end': 570.204, 'text': 'Now these are our only th tags because everything else is a tr tag for these rows of data.', 'start': 564.28, 'duration': 5.924}, {'end': 575.007, 'text': 'So these th tags are pretty unique, which makes it really easy, which is really great,', 'start': 570.624, 'duration': 4.383}, {'end': 578.949, 'text': 'because then we can just do world underscore titles is equal to.', 'start': 575.007, 'duration': 3.942}, {'end': 582.352, 'text': "Now we have these titles, but they're not perfect.", 'start': 578.969, 'duration': 3.383}, {'end': 585.473, 'text': "But what we're going to do is we're going to loop through it.", 'start': 582.992, 'duration': 2.481}, {'end': 589.216, 'text': "So I'm going to say world underscore titles, and I'll kind of walk through what I'm talking about.", 'start': 585.494, 'duration': 3.722}, {'end': 593.758, 'text': 'This is in a list, and each one is within these th tags.', 'start': 589.236, 'duration': 4.522}, {'end': 597.18, 'text': "So th, and then there's our string that we're trying to get.", 'start': 593.778, 'duration': 3.402}, {'end': 601.283, 'text': 'So we can easily take this list and use..', 'start': 597.761, 'duration': 3.522}, {'end': 605.158, 'text': 'list comprehension and we can do that right down here.', 'start': 602.156, 'duration': 3.002}, {'end': 607.3, 'text': "so i'm going to keep this where we can see it.", 'start': 605.158, 'duration': 2.142}, {'end': 612.184, 'text': "um, we'll do world underscore table, underscore titles.", 'start': 607.3, 'duration': 4.884}, {'end': 613.925, 'text': "that's equal to now.", 'start': 612.184, 'duration': 1.741}, {'end': 614.546, 'text': "we'll do our list.", 'start': 613.925, 'duration': 0.621}, {'end': 616.367, 'text': 'comprehension should be super easy.', 'start': 614.546, 'duration': 1.821}, {'end': 621.211, 'text': "uh, we'll just say for title in world underscore titles.", 'start': 616.367, 'duration': 4.844}, {'end': 
622.532, 'text': 'and then what do we want?', 'start': 621.211, 'duration': 1.321}, {'end': 624.614, 'text': 'we want title dot text.', 'start': 622.532, 'duration': 2.082}, {'end': 626.134, 'text': "that's it.", 'start': 625.154, 'duration': 0.98}, {'end': 631.056, 'text': "um, because we're just taking the text from each of these, we're just looping through and we're getting rank.", 'start': 626.134, 'duration': 4.922}, {'end': 634.237, 'text': "then we're looping through, getting name, looping through, getting industry.", 'start': 631.056, 'duration': 3.181}, {'end': 635.278, 'text': "that's it.", 'start': 634.237, 'duration': 1.041}, {'end': 642.538, 'text': "so let's go and print our world table titles and see if it worked And it did.", 'start': 635.278, 'duration': 7.26}], 'summary': 'Python script extracts table data, finds titles using beautifulsoup, and prints them successfully.', 'duration': 50.984, 'max_score': 498.394, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0498394.jpg'}, {'end': 768.548, 'src': 'heatmap', 'start': 730.115, 'weight': 0, 'content': [{'end': 731.435, 'text': 'Jeez, we were not thinking here.', 'start': 730.115, 'duration': 1.32}, {'end': 738.839, 'text': 'So now we need to do findall on the table, not the soup, because now we were looking at all of them.', 'start': 732.456, 'duration': 6.383}, {'end': 740.499, 'text': 'Oh, what a rookie mistake.', 'start': 739.539, 'duration': 0.96}, {'end': 742.16, 'text': "Okay Let's go back.", 'start': 740.599, 'duration': 1.561}, {'end': 743.401, 'text': "Now let's look at this.", 'start': 742.58, 'duration': 0.821}, {'end': 746.758, 'text': "now it's just down to headquarters.", 'start': 744.375, 'duration': 2.383}, {'end': 749.601, 'text': "okay, okay, let's go ahead and run this.", 'start': 746.758, 'duration': 2.843}, {'end': 750.822, 'text': "let's run this now.", 'start': 749.601, 'duration': 1.221}, {'end': 752.864, 'text': 'we just have headquarters now.', 'start': 750.822, 'duration': 2.042}, {'end': 755.265, 'text': "let's run this Now.", 'start': 752.864, 'duration': 2.401}, {'end': 756.665, 'text': 'we are sitting pretty okay.', 'start': 755.265, 'duration': 1.4}, {'end': 758.186, 'text': 'Excuse my mistakes.', 'start': 757.226, 'duration': 0.96}, {'end': 760.846, 'text': 'Hey, listen, you know, if it happens to me, it happens to you.', 'start': 758.226, 'duration': 2.62}, {'end': 762.927, 'text': 'I promise you this is, you know, this is a project.', 'start': 760.906, 'duration': 2.021}, {'end': 764.847, 'text': "There's a little, a little project we're creating here.", 'start': 762.947, 'duration': 1.9}, {'end': 766.867, 'text': "So we're going to run the issues and that's okay.", 'start': 764.887, 'duration': 1.98}, {'end': 768.548, 'text': "We're figuring it out as we go.", 'start': 767.027, 'duration': 1.521}], 'summary': 'Fixing code errors, now focusing on headquarters, feeling confident.', 'duration': 43.815, 'max_score': 730.115, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0730115.jpg'}], 'start': 421.778, 'title': 'Web scraping and data cleaning', 'summary': 'Demonstrates web scraping and data cleaning, including parsing html table data in python, using beautifulsoup and troubleshooting discrepancies in the output, and resolving errors in the scraping process.', 'chapters': [{'end': 527.041, 'start': 421.778, 'title': 'Parsing html table data in python', 'summary': "Demonstrates 
parsing html table data in python using the soup.find method, with emphasis on extracting the 'wiki table sortable' class and troubleshooting discrepancies in the output.", 'duration': 105.263, 'highlights': ["The chapter demonstrates using the soup.find method to extract the 'wiki table sortable' class from an HTML table, showcasing the process of parsing HTML table data in Python.", "The narrator encounters a discrepancy in the output, as the script only pulls in 'wiki table sortable' instead of 'jQuery-tableSorter', highlighting the troubleshooting process involved in resolving the issue.", 'The demonstration involves finding and extracting specific HTML elements using Python, providing a practical example of parsing and manipulating HTML content.', 'The narrator resolves the issue by analyzing and adjusting the script, showcasing problem-solving skills in Python programming.', 'The chapter emphasizes the practical application of the soup.find method in extracting relevant HTML classes, contributing to the understanding of web scraping and data extraction in Python.']}, {'end': 768.548, 'start': 527.842, 'title': 'Web scraping and data cleaning', 'summary': 'Covers the process of scraping data from a website using beautifulsoup, identifying and cleaning specific data points, such as table headers, and resolving errors in the scraping process.', 'duration': 240.706, 'highlights': ['Using BeautifulSoup to scrape data from a website The process involves utilizing BeautifulSoup to extract data from a website, such as retrieving table headers and specific data points.', 'Identifying and cleaning specific data points The chapter demonstrates the identification and cleaning of specific data points, such as table headers, by looping through the data and applying list comprehension to extract the text.', 'Resolving errors in the scraping process The highlighted section showcases the process of identifying and resolving errors in the scraping process, such as mistakenly pulling in secondary tables and cleaning up unwanted data points.']}], 'duration': 346.77, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0421778.jpg', 'highlights': ['The chapter emphasizes the practical application of the soup.find method in extracting relevant HTML classes, contributing to the understanding of web scraping and data extraction in Python.', 'The demonstration involves finding and extracting specific HTML elements using Python, providing a practical example of parsing and manipulating HTML content.', 'The narrator resolves the issue by analyzing and adjusting the script, showcasing problem-solving skills in Python programming.', 'Using BeautifulSoup to scrape data from a website The process involves utilizing BeautifulSoup to extract data from a website, such as retrieving table headers and specific data points.', 'Identifying and cleaning specific data points The chapter demonstrates the identification and cleaning of specific data points, such as table headers, by looping through the data and applying list comprehension to extract the text.', 'Resolving errors in the scraping process The highlighted section showcases the process of identifying and resolving errors in the scraping process, such as mistakenly pulling in secondary tables and cleaning up unwanted data points.', "The chapter demonstrates using the soup.find method to extract the 'wiki table sortable' class from an HTML table, showcasing the process of parsing HTML table data in Python.", "The 
narrator encounters a discrepancy in the output, as the script only pulls in 'wiki table sortable' instead of 'jQuery-tableSorter', highlighting the troubleshooting process involved in resolving the issue."]}, {'end': 1063.193, 'segs': [{'end': 812.885, 'src': 'embed', 'start': 768.988, 'weight': 0, 'content': [{'end': 774.129, 'text': 'Now, what I want to do before we start pulling in all the data is I want to put this into our pandas data frame.', 'start': 768.988, 'duration': 5.141}, {'end': 777.029, 'text': "We'll have the, you know, headers there for us to go.", 'start': 774.149, 'duration': 2.88}, {'end': 778.829, 'text': "So we won't have to get that later.", 'start': 777.449, 'duration': 1.38}, {'end': 780.95, 'text': 'And it just makes it easier in general.', 'start': 779.09, 'duration': 1.86}, {'end': 781.31, 'text': 'Trust me.', 'start': 780.99, 'duration': 0.32}, {'end': 784.231, 'text': "So we're going to import pandas as PD.", 'start': 781.69, 'duration': 2.541}, {'end': 785.291, 'text': "Let's go ahead and run this.", 'start': 784.251, 'duration': 1.04}, {'end': 787.471, 'text': "And now we're going to create our data frame.", 'start': 785.891, 'duration': 1.58}, {'end': 788.911, 'text': "So we'll say PD dot.", 'start': 787.491, 'duration': 1.42}, {'end': 792.592, 'text': 'Now we have these world table titles.', 'start': 790.032, 'duration': 2.56}, {'end': 795.933, 'text': "So what we're going to do is PD dot data frame.", 'start': 792.612, 'duration': 3.321}, {'end': 801.514, 'text': "And then in here for our columns, we'll say that's equal to the world table titles.", 'start': 796.513, 'duration': 5.001}, {'end': 806.435, 'text': "And let's just go ahead and say that's our data frame and call our data frame right here.", 'start': 802.194, 'duration': 4.241}, {'end': 806.835, 'text': "Let's run it.", 'start': 806.455, 'duration': 0.38}, {'end': 808.062, 'text': 'There we go.', 'start': 807.642, 'duration': 0.42}, {'end': 812.885, 'text': 'So we were able to pull out and extract those headers and those titles of these columns.', 'start': 808.703, 'duration': 4.182}], 'summary': 'Using pandas, the data was put into a data frame with headers, making it easier to work with. 
the headers and titles of the columns were successfully extracted.', 'duration': 43.897, 'max_score': 768.988, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0768988.jpg'}, {'end': 1037.946, 'src': 'embed', 'start': 963.603, 'weight': 1, 'content': [{'end': 968.104, 'text': 'Instead of printing this off, because, again, this is all in a list.', 'start': 963.603, 'duration': 4.501}, {'end': 968.945, 'text': "We're using find all.", 'start': 968.125, 'duration': 0.82}, {'end': 972.566, 'text': "So we're printing off another list, which isn't actually super helpful.", 'start': 968.985, 'duration': 3.581}, {'end': 981.689, 'text': "For each of all these data that we're pulling in, what we can do is we can call this the row underscore data.", 'start': 974.507, 'duration': 7.182}, {'end': 984.698, 'text': "And then we'll put the row data in here.", 'start': 982.677, 'duration': 2.021}, {'end': 988.379, 'text': "So we'll say for, and we'll say in row data.", 'start': 984.858, 'duration': 3.521}, {'end': 994.402, 'text': "So we'll just say for the data in row data, and we'll take the data, we'll exchange that.", 'start': 988.459, 'duration': 5.943}, {'end': 1007.447, 'text': "And now instead of world table titles, we can change this into individual row data, right? And now let's print off the individual row data.", 'start': 994.762, 'duration': 12.685}, {'end': 1015.314, 'text': "so it's the exact same process that we were doing up here, and that's how we cleaned it up and got this, and we may not need to strip,", 'start': 1008.047, 'duration': 7.267}, {'end': 1017.516, 'text': "but let's just run this and see what we get there.", 'start': 1015.314, 'duration': 2.202}, {'end': 1019.058, 'text': 'we go um in.', 'start': 1017.516, 'duration': 1.542}, {'end': 1020.119, 'text': "strip, i'm sure was helpful.", 'start': 1019.058, 'duration': 1.061}, {'end': 1022.802, 'text': "let's actually get rid of this.", 'start': 1020.119, 'duration': 2.683}, {'end': 1024.143, 'text': 'yeah, strip was helpful.', 'start': 1022.802, 'duration': 1.341}, {'end': 1026.945, 'text': "it's the exact same thing that happened on the last one.", 'start': 1024.143, 'duration': 2.802}, {'end': 1029.637, 'text': "so let's keep that actually, Let's run this.", 'start': 1026.945, 'duration': 2.692}, {'end': 1032.68, 'text': "And now let's just kind of glance at this information.", 'start': 1030.297, 'duration': 2.383}, {'end': 1033.962, 'text': "Let's look through it.", 'start': 1033.32, 'duration': 0.642}, {'end': 1037.946, 'text': "This looks exactly like the information that's in the table.", 'start': 1034.182, 'duration': 3.764}], 'summary': 'Using find all to extract and print individual row data from a list.', 'duration': 74.343, 'max_score': 963.603, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0963603.jpg'}], 'start': 768.988, 'title': 'Data extraction and processing', 'summary': 'Details the process of putting data into a pandas dataframe, extracting headers and titles, and preparing to pull in the data. 
it also explains the process of extracting and processing table data using python, resulting in cleaned and formatted data for a table.', 'chapters': [{'end': 886.422, 'start': 768.988, 'title': 'Data extraction and dataframe setup', 'summary': 'Details the process of putting the data into a pandas dataframe, extracting headers and titles, and preparing to pull in the data, demonstrating the use of pandas data frame and the process of identifying and extracting specific data from the html table.', 'duration': 117.434, 'highlights': ['The chapter details the process of putting the data into a pandas dataframe, highlighting the use of pandas data frame and the process of identifying and extracting specific data from the HTML table.', 'The process involves extracting headers and titles of columns into the dataframe, enabling ease of data manipulation and analysis.', 'The speaker emphasizes the significance of using pandas dataframe for easier data manipulation and ensures that the headers are readily available for future use.']}, {'end': 1063.193, 'start': 886.842, 'title': 'Extracting and processing table data', 'summary': 'Explains the process of extracting and processing table data using python, including looping through the data and utilizing find_all function, resulting in cleaned and formatted data for a table.', 'duration': 176.351, 'highlights': ['The process involves looping through the data and using the find_all function to extract and process the table data in Python. The chapter discusses looping through the data and utilizing find_all function to process the table data in Python.', 'The data is cleaned and formatted for a table, resulting in accurate and organized information. The cleaned and formatted data is obtained, ensuring accuracy and organization for the table.', 'The chapter demonstrates the use of the strip function to further clean and format the data. The strip function is used to clean and format the data, ensuring accuracy and cleanliness.']}], 'duration': 294.205, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH0768988.jpg', 'highlights': ['The chapter details the process of putting the data into a pandas dataframe, highlighting the use of pandas data frame and the process of identifying and extracting specific data from the HTML table.', 'The process involves looping through the data and using the find_all function to extract and process the table data in Python. The chapter discusses looping through the data and utilizing find_all function to process the table data in Python.', 'The process involves extracting headers and titles of columns into the dataframe, enabling ease of data manipulation and analysis.', 'The data is cleaned and formatted for a table, resulting in accurate and organized information. The cleaned and formatted data is obtained, ensuring accuracy and organization for the table.', 'The speaker emphasizes the significance of using pandas dataframe for easier data manipulation and ensures that the headers are readily available for future use.', 'The chapter demonstrates the use of the strip function to further clean and format the data. 
The strip function is used to clean and format the data, ensuring accuracy and cleanliness.']}, {'end': 1509.531, 'segs': [{'end': 1091.28, 'src': 'embed', 'start': 1063.293, 'weight': 1, 'content': [{'end': 1070.74, 'text': "We can't just take the entire table and plop it into the data frame, we need a way to kind of put this in one at a time.", 'start': 1063.293, 'duration': 7.447}, {'end': 1075.445, 'text': "Now, if you're just here for web scraping and you haven't taken like my Pandas series, that's totally fine.", 'start': 1071.18, 'duration': 4.265}, {'end': 1076.546, 'text': "That's not what we're here for anyways.", 'start': 1075.485, 'duration': 1.061}, {'end': 1083.873, 'text': "But what we can do, we'll have our individual row data and we're going to put it in kind of one at a time.", 'start': 1077.507, 'duration': 6.366}, {'end': 1091.28, 'text': "Now, the reason we have to do that is because when we have it like this and let's go back, we had it like this it's printing out all of it,", 'start': 1084.293, 'duration': 6.987}], 'summary': 'Data needs to be added to the data frame one row at a time.', 'duration': 27.987, 'max_score': 1063.293, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH01063293.jpg'}, {'end': 1265.082, 'src': 'embed', 'start': 1234.473, 'weight': 2, 'content': [{'end': 1237.774, 'text': "it's working, it's thinking, and it looks like we got an issue.", 'start': 1234.473, 'duration': 3.301}, {'end': 1241.335, 'text': 'cannot set a row with mismatched columns.', 'start': 1237.774, 'duration': 3.561}, {'end': 1246.536, 'text': "now we're encountering an issue, not one that I got earlier, but we're gonna cancel this out.", 'start': 1241.335, 'duration': 5.201}, {'end': 1248.398, 'text': "We're going to figure this out together.", 'start': 1247.196, 'duration': 1.202}, {'end': 1251.624, 'text': "So let's print off our individual row data.", 'start': 1248.438, 'duration': 3.186}, {'end': 1252.726, 'text': "Let's look at this.", 'start': 1252.085, 'duration': 0.641}, {'end': 1254.369, 'text': 'This one is empty.', 'start': 1253.387, 'duration': 0.982}, {'end': 1257.935, 'text': "This is I'm almost certain is probably the issue.", 'start': 1255.271, 'duration': 2.664}, {'end': 1265.082, 'text': "I didn't encounter this issue when I wrote this lesson, but I'm almost certain that this is the issue right here.", 'start': 1259.256, 'duration': 5.826}], 'summary': 'Encountered an issue with mismatched columns while working on row data.', 'duration': 30.609, 'max_score': 1234.473, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH01234473.jpg'}, {'end': 1312.87, 'src': 'heatmap', 'start': 1279.146, 'weight': 0.822, 'content': [{'end': 1280.467, 'text': "So now that first one's gone.", 'start': 1279.146, 'duration': 1.321}, {'end': 1282.628, 'text': 'So now we just have the information.', 'start': 1280.487, 'duration': 2.141}, {'end': 1288.832, 'text': "I didn't even think about that just a second ago, but I'm glad we're running into it in case you ran into that issue.", 'start': 1282.648, 'duration': 6.184}, {'end': 1291.033, 'text': "Let's go ahead and try this again.", 'start': 1289.912, 'duration': 1.121}, {'end': 1293.474, 'text': 'And it looked like it worked.', 'start': 1291.053, 'duration': 2.421}, {'end': 1295.315, 'text': "So let's pull our data frame down.", 'start': 1293.815, 'duration': 1.5}, {'end': 1296.636, 'text': 'I could have just wrote 
DF.', 'start': 1295.335, 'duration': 1.301}, {'end': 1297.977, 'text': "Let's pull our data frame down.", 'start': 1297.036, 'duration': 0.941}, {'end': 1301.841, 'text': 'and now this is looking fantastic.', 'start': 1299.139, 'duration': 2.702}, {'end': 1305.464, 'text': "now these three dots just mean there's information in there.", 'start': 1301.841, 'duration': 3.623}, {'end': 1312.87, 'text': "just doesn't want to display it, but it looks like we have our rank, we have our name, we have the industry, revenue, revenue, growth,", 'start': 1305.464, 'duration': 7.406}], 'summary': 'Successfully retrieved and displayed data frame with rank, name, industry, revenue, and revenue growth.', 'duration': 33.724, 'max_score': 1279.146, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH01279146.jpg'}, {'end': 1503.486, 'src': 'embed', 'start': 1457.987, 'weight': 0, 'content': [{'end': 1463.152, 'text': 'So we specify this is our table and we worked with just our table going forward.', 'start': 1457.987, 'duration': 5.165}, {'end': 1474.122, 'text': 'Of course we encountered some small issues user errors on my end but we were able to get our world titles and we put those into our data frame right here using pandas.', 'start': 1463.192, 'duration': 10.93}, {'end': 1482.789, 'text': 'Then next we went back and we got all the row data and the individual data from those rows and we put it into our pandas data frame.', 'start': 1474.642, 'duration': 8.147}, {'end': 1487.373, 'text': 'Then we came below and we exported this into an actual CSV file.', 'start': 1483.45, 'duration': 3.923}, {'end': 1493.998, 'text': 'So that is how we can use web scraping to get data from something like a table and put it into a pandas data frame.', 'start': 1487.713, 'duration': 6.285}, {'end': 1495.6, 'text': 'I hope that this lesson was helpful.', 'start': 1494.359, 'duration': 1.241}, {'end': 1496.961, 'text': 'I know we encountered some issues.', 'start': 1495.64, 'duration': 1.321}, {'end': 1498.842, 'text': "That's on my end and I apologize.", 'start': 1497.241, 'duration': 1.601}, {'end': 1501.704, 'text': 'But if you run into the same issues, hopefully that helped.', 'start': 1498.962, 'duration': 2.742}, {'end': 1503.486, 'text': 'But I hope this was helpful.', 'start': 1502.305, 'duration': 1.181}], 'summary': 'Web scraping used to extract table data, converted to pandas dataframe and exported as csv. 
some issues encountered.', 'duration': 45.499, 'max_score': 1457.987, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH01457987.jpg'}], 'start': 1063.293, 'title': 'Web scraping data and csv export', 'summary': 'Covers web scraping data into a data frame using pandas, addressing issues, and successfully populating the data frame, as well as the process of web scraping for csv export, encountering user errors along the way.', 'chapters': [{'end': 1324.997, 'start': 1063.293, 'title': 'Web scraping data into data frame', 'summary': 'Explains how to loop through and append individual row data into a data frame using pandas, encountering and addressing issues along the way, resulting in successfully populating the data frame with the desired information.', 'duration': 261.704, 'highlights': ['Explaining the process of putting individual row data into the data frame one at a time The speaker explains the need to insert individual row data into the data frame one at a time, emphasizing the process of appending the information onto the data frame as it loops through, resulting in the successful population of the data frame.', 'Encountering and addressing the issue of mismatched columns when populating the data frame The speaker encounters an issue of mismatched columns when populating the data frame, identifies the cause as an empty column, and resolves it by adjusting the position to start from, ensuring the successful population of the data frame.', 'Successful population of the data frame with the desired information After encountering and addressing the issue of mismatched columns, the speaker successfully populates the data frame with the desired information, displaying the rank, name, industry, revenue, revenue growth, employees, and headquarters for every entry.']}, {'end': 1509.531, 'start': 1325.438, 'title': 'Web scraping for csv export', 'summary': 'Demonstrates the process of web scraping to extract data from a table, putting it into a pandas data frame, and exporting it into a csv file, encountering some user errors along the way.', 'duration': 184.093, 'highlights': ['The process of web scraping to extract data from a table and export it into a CSV file is demonstrated. Demonstration of web scraping process, data extraction from a table, and exporting to a CSV file.', 'Encountering user errors during the process of web scraping and data manipulation. Mention of encountering user errors and issues during the web scraping and data manipulation process.', 'Use of pandas data frame for organizing and manipulating the extracted data. 
Utilization of pandas data frame for organizing and manipulating the extracted data.']}], 'duration': 446.238, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8dTpNajxaH0/pics/8dTpNajxaH01063293.jpg', 'highlights': ['Successful population of the data frame with the desired information', 'Explaining the process of putting individual row data into the data frame one at a time', 'Encountering and addressing the issue of mismatched columns when populating the data frame', 'The process of web scraping to extract data from a table and export it into a CSV file is demonstrated', 'Use of pandas data frame for organizing and manipulating the extracted data', 'Encountering user errors during the process of web scraping and data manipulation']}], 'highlights': ['Demonstrates web data scraping for a pandas dataframe from a real website, including handling complex tables, using beautifulsoup for table data extraction and data cleaning, and extracting and processing table data using python, resulting in csv export and addressing user errors.', 'The process involves importing the Beautiful Soup library, obtaining the URL, making a request to access the information, and parsing the HTML to extract the desired data.', 'The chapter discusses scraping data from a real website to create a pandas dataframe. It involves scraping data from Wikipedia to obtain the list of the largest companies in the United States by revenue and then formatting it to be usable in a Pandas DataFrame.', 'The need to format the data properly for it to be usable in a Pandas DataFrame. The process involves formatting the obtained data from Wikipedia properly to ensure it can be effectively used in a Pandas DataFrame, indicating a more complex and challenging task.', 'The process involves looping through the data and using the find_all function to extract and process the table data in Python. The chapter discusses looping through the data and utilizing find_all function to process the table data in Python.', 'The chapter details the process of putting the data into a pandas dataframe, highlighting the use of pandas data frame and the process of identifying and extracting specific data from the HTML table.', 'The narrator resolves the issue by analyzing and adjusting the script, showcasing problem-solving skills in Python programming.', 'The chapter emphasizes the practical application of the soup.find method in extracting relevant HTML classes, contributing to the understanding of web scraping and data extraction in Python.', "The chapter demonstrates using the soup.find method to extract the 'wiki table sortable' class from an HTML table, showcasing the process of parsing HTML table data in Python.", 'The chapter demonstrates the use of the strip function to further clean and format the data. The strip function is used to clean and format the data, ensuring accuracy and cleanliness.']}
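
code sketch
The steps summarized in the detail above can be reconstructed as a short Python sketch. This is a minimal reading of the transcript, not the video's verbatim code: the URL is assumed from the article title named in the video, and the built-in 'html.parser' is an assumption, since the transcript never names a parser. First, the setup: import Beautiful Soup and requests, fetch the page, and parse it.

    from bs4 import BeautifulSoup
    import requests

    # Page named in the transcript; URL assumed from the article title.
    url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue'

    page = requests.get(url)                        # fetch the raw HTML
    soup = BeautifulSoup(page.text, 'html.parser')  # parse into a navigable tree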
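
Selecting the table. The class the narrator reads as 'wiki table sortable' is written 'wikitable sortable' in the page source, and the missing 'jQuery-tableSorter' token he puzzles over in DevTools is added by the browser's JavaScript after the page loads, so requests, which only sees the served HTML, never receives it. find() returns the first match, while find_all() returns a list that can be indexed; the index below is illustrative, since the transcript picks the right table by printing candidates and checking for rows like Ford Motor, Wells Fargo, and Goldman Sachs.

    # First match by class, or pick a specific table out of the full list.
    table = soup.find('table', class_='wikitable sortable')  # first such table
    tables = soup.find_all('table')
    # table = tables[1]   # illustrative index; verify by printing the candidates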
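
Extracting the headers. The th tags are searched on the selected table rather than on the whole soup (the transcript's "rookie mistake" was calling find_all on the soup, which swept in header cells from the other tables). A list comprehension with .strip() then removes the whitespace each cell's text carries.

    world_titles = table.find_all('th')   # header cells of this table only
    world_table_titles = [title.text.strip() for title in world_titles]
    print(world_table_titles)  # expect headers like Rank, Name, Industry, ...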
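
Those titles become the columns of an empty pandas DataFrame, so the headers are in place before any row data arrives.

    import pandas as pd

    df = pd.DataFrame(columns=world_table_titles)  # empty frame with the scraped headers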
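
Filling the rows one at a time, as described above. Each tr holds td cells; the header row contains none, so it yields an empty list, which is what triggered the "cannot set a row with mismatched columns" error in the transcript. Slicing with [1:] skips it, and assigning to df.loc at the current length appends one row per pass.

    column_data = table.find_all('tr')         # every row of the table
    for row in column_data[1:]:                # [1:] skips the header-only row
        row_data = row.find_all('td')
        individual_row_data = [data.text.strip() for data in row_data]
        df.loc[len(df)] = individual_row_data  # append one cleaned row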
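
Finally, the CSV export described at the end of the section. The filename here is illustrative; index=False keeps pandas' row index out of the file, leaving just the scraped columns.

    df.to_csv('largest_us_companies.csv', index=False)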