title
13.7: Manual Parsing - Processing Tutorial
description
This video covers how to manually parse data from a URL when no API or standardized format is available. The example used here is IMDB.
Book: Learning Processing: A Beginner's Guide to Programming Images, Animation, and Interaction
Chapter: 18
Official book website: http://learningprocessing.com/
Twitter: https://twitter.com/shiffman
📄 Code of Conduct: https://github.com/CodingTrain/Code-of-Conduct
detail
{'title': '13.7: Manual Parsing - Processing Tutorial', 'heatmap': [], 'summary': 'The tutorial explores challenges of unstandardized web data and techniques for extracting specific values using string manipulation (indexOf and substring) in Processing, with regular expressions mentioned only as an alternative, and a focus on web data mining and custom function creation.', 'chapters': [{'end': 184.283, 'segs': [{'end': 43.976, 'src': 'embed', 'start': 2.113, 'weight': 0, 'content': [{'end': 5.315, 'text': 'In this video, I want to look at the worst case scenario.', 'start': 2.113, 'duration': 3.202}, {'end': 7.095, 'text': 'You found some data online.', 'start': 5.615, 'duration': 1.48}, {'end': 11.958, 'text': "You want to use it, but it's not available in some nice standardized format.", 'start': 8.036, 'duration': 3.922}, {'end': 13.518, 'text': "There's no CSV to download.", 'start': 11.998, 'duration': 1.52}, {'end': 14.559, 'text': "There's no XML feed.", 'start': 13.538, 'duration': 1.021}, {'end': 15.359, 'text': "There's no API.", 'start': 14.579, 'duration': 0.78}, {'end': 17.5, 'text': "There's no processing library that takes care of it for you.", 'start': 15.379, 'duration': 2.121}, {'end': 20.422, 'text': "There's nothing but the web page itself.", 'start': 17.76, 'duration': 2.662}, {'end': 25.124, 'text': 'And this can apply to other scenarios where the data is in some strange, unrecognizable format.', 'start': 20.882, 'duration': 4.242}, {'end': 29.567, 'text': "So let's look at a kind of scenario here.", 'start': 25.484, 'duration': 4.083}, {'end': 34.39, 'text': "So let's say you want to make some type of data visualization about movies.", 'start': 30.187, 'duration': 4.203}, {'end': 43.976, 'text': 'And you look at imdb.com and you say, aha, everything I ever needed to know, every piece of data that I need is here somehow in this website.', 'start': 35.53, 'duration': 8.446}], 'summary': 'Addressing challenges of using unstructured data for visualization and analysis from imdb.com',
'duration': 41.863, 'max_score': 2.113, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Pchg6F_koOw/pics/Pchg6F_koOw2113.jpg'}, {'end': 142.345, 'src': 'embed', 'start': 93.552, 'weight': 2, 'content': [{'end': 99.478, 'text': 'Regular expressions is a bit beyond the scope of this particular set of videos about data and processing.', 'start': 93.552, 'duration': 5.926}, {'end': 102.56, 'text': "So I'm just going to leave that aside, just mentioning it.", 'start': 99.938, 'duration': 2.622}, {'end': 109.606, 'text': 'You might look at the match function or the match all function in processing, which uses regular expressions to match a particular pattern,', 'start': 102.821, 'duration': 6.785}, {'end': 112.069, 'text': 'to search for a pattern in a body of text.', 'start': 109.606, 'duration': 2.463}, {'end': 117.153, 'text': 'And perhaps someday I will add a video or some materials about regular expressions in processing.', 'start': 112.649, 'duration': 4.504}, {'end': 125.337, 'text': "But for now, we're going to do it in a bit more of a rudimentary way using two string functions in processing that I don't believe we've seen yet.", 'start': 117.253, 'duration': 8.084}, {'end': 130.94, 'text': 'One is index of, and the other is substring.', 'start': 126.237, 'duration': 4.703}, {'end': 142.345, 'text': "So let's say we have a particular piece of text, string s equals I have 21 apples.", 'start': 132.44, 'duration': 9.905}], 'summary': 'Regular expressions are not covered; instead, index of and substring functions in processing will be used for text processing.', 'duration': 48.793, 'max_score': 93.552, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Pchg6F_koOw/pics/Pchg6F_koOw93552.jpg'}], 'start': 2.113, 'title': 'Data extraction challenges and techniques', 'summary': 'Examines challenges of unstandardized data from web pages with no available csv, xml, or api, using the example of extracting movie data 
from imdb.com. it also covers techniques for extracting specific data from a string using regular expressions, index of, and substring functions in processing, with the goal of identifying and retrieving specific patterns from a body of text.', 'chapters': [{'end': 61.348, 'start': 2.113, 'title': 'Handling unavailable data formats', 'summary': 'Examines the challenges of using unstandardized data from web pages with no available csv, xml, or api, using the example of extracting movie data from imdb.com.', 'duration': 59.235, 'highlights': ['The data is not available in some nice standardized format, such as CSV, XML, or API, and may only exist on the web page itself.', 'The example of extracting movie data from IMDb.com illustrates the challenge of finding data on a web page without an available API or XML feed.', 'The need for specific data like the year of a movie or its length, despite the absence of a convenient data format, demonstrates the difficulties of working with unstandardized data.']}, {'end': 184.283, 'start': 66.59, 'title': 'Extracting data from text in processing', 'summary': 'Covers techniques for extracting specific data from a string using regular expressions, index of, and substring functions in processing, with the goal of identifying and retrieving specific patterns from a body of text.', 'duration': 117.693, 'highlights': ['Using regular expressions to match a particular pattern in a body of text, such as the match function or the match all function in processing, is a valuable technique for searching within a string.', "Exploring the use of two string functions in processing, index of, and substring, to extract specific data from a given text, exemplified by the scenario of retrieving a number from a string containing the phrase 'I have 21 apples'."]}], 'duration': 182.17, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Pchg6F_koOw/pics/Pchg6F_koOw2113.jpg', 'highlights': ['The need for specific data like the 
year of a movie or its length, despite the absence of a convenient data format, demonstrates the difficulties of working with unstandardized data.', 'The example of extracting movie data from IMDb.com illustrates the challenge of finding data on a web page without an available API or XML feed.', 'Using regular expressions to match a particular pattern in a body of text, such as the match function or the match all function in processing, is a valuable technique for searching within a string.', "Exploring the use of two string functions in processing, index of, and substring, to extract specific data from a given text, exemplified by the scenario of retrieving a number from a string containing the phrase 'I have 21 apples'.", 'The data is not available in some nice standardized format, such as CSV, XML, or API, and may only exist on the web page itself.']}, {'end': 612.501, 'segs': [{'end': 223.797, 'src': 'embed', 'start': 184.723, 'weight': 2, 'content': [{'end': 187.604, 'text': 'So what if I could find wherever have is?', 'start': 184.723, 'duration': 2.881}, {'end': 189.465, 'text': 'What if I could find wherever apples is?', 'start': 187.664, 'duration': 1.801}, {'end': 199.089, 'text': 'index of will search for the index of a particular string inside of a string and then substring will pull out a substring.', 'start': 190.305, 'duration': 8.784}, {'end': 202.77, 'text': 'What if I could then pull out all of the characters in between these two points?', 'start': 199.109, 'duration': 3.661}, {'end': 223.797, 'text': "In other words, I'm saying what if I were to say int begin equals s.index of have and int end equals s.index of apples,", 'start': 203.15, 'duration': 20.647}], 'summary': 'Exploring finding indexes and substrings in a string.', 'duration': 39.074, 'max_score': 184.723, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Pchg6F_koOw/pics/Pchg6F_koOw184723.jpg'}, {'end': 293.419, 'src': 'embed', 'start': 267.656, 'weight': 0, 
'content': [{'end': 273.38, 'text': 'certainly I want to look at how we could go and pull out a piece of data by requesting this information.', 'start': 267.656, 'duration': 5.724}, {'end': 276.824, 'text': 'So what do we do?', 'start': 273.98, 'duration': 2.844}, {'end': 278.307, 'text': "What's the actual data we're getting into Processing?", 'start': 276.865, 'duration': 1.442}, {'end': 282.994, 'text': "Let's say we're looking for the length of the movie Mary Poppins, 139 minutes.", 'start': 278.327, 'duration': 4.667}, {'end': 291.459, 'text': "So if I go up here, view developer view source, we can now see this is actually what we're going to get into Processing.", 'start': 283.274, 'duration': 8.185}, {'end': 293.419, 'text': 'And this is just a big mess of stuff.', 'start': 291.659, 'duration': 1.76}], 'summary': 'Analyzing data retrieval and processing, such as the 139-minute length of the movie Mary Poppins.', 'duration': 25.763, 'max_score': 267.656, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Pchg6F_koOw/pics/Pchg6F_koOw267656.jpg'}, {'end': 557.982, 'src': 'embed', 'start': 531.046, 'weight': 1, 'content': [{'end': 535.409, 'text': "I want the index of this, and I want the index of that, and I want the stuff that's in between.", 'start': 531.046, 'duration': 4.363}, {'end': 537.27, 'text': "But look, there's this function, give me text between.", 'start': 535.429, 'duration': 1.841}, {'end': 540.331, 'text': "Oh, why do we do any of this? 
There's just a function called give me text between.", 'start': 537.43, 'duration': 2.901}, {'end': 541.272, 'text': "There isn't.", 'start': 540.892, 'duration': 0.38}, {'end': 546.095, 'text': 'So this is a function that I wrote for this particular example,', 'start': 541.812, 'duration': 4.283}, {'end': 551.938, 'text': 'which will pull out a chunk of data from a particular string in between a beginning and an end.', 'start': 546.095, 'duration': 5.843}, {'end': 557.982, 'text': "And if we look down at that function, you'll see find the index of before,", 'start': 552.259, 'duration': 5.723}], 'summary': 'A custom function extracts data between specific indexes.', 'duration': 26.936, 'max_score': 531.046, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Pchg6F_koOw/pics/Pchg6F_koOw531046.jpg'}], 'start': 184.723, 'title': 'String manipulation and web data mining in Processing', 'summary': 'Covers the usage of the indexOf and substring methods to manipulate strings in Processing, aiming to extract specific substrings, and discusses web data mining techniques, including locating specific data within raw HTML sources and creating custom functions for extracting data from non-standardized web pages.', 'chapters': [{'end': 244.577, 'start': 184.723, 'title': 'String manipulation in Processing', 'summary': 'The chapter discusses the usage of the indexOf and substring methods to manipulate strings in Processing, aiming to extract specific substrings based on the index of certain strings.', 'duration': 59.854, 'highlights': ['The chapter explores the usage of the indexOf and substring methods to extract specific substrings based on the index of certain strings, such as finding the index of a particular string inside another string and pulling out a substring, showcasing the practical application of string manipulation in Processing.', "The speaker discusses the potential application of the 'index of' method to locate the position of a specific string within another string, followed by 
utilizing the 'substring' method to extract the desired substring, demonstrating the practical implementation of these string manipulation methods.", "The chapter highlights the process of utilizing the 'index of' method to find the position of a particular string within a given string and then extracting the substring between the obtained start and end positions, emphasizing the practical utility of these string manipulation techniques in Processing."]}, {'end': 612.501, 'start': 246.041, 'title': 'Web data mining techniques', 'summary': 'The chapter discusses web data mining techniques, including locating specific data within raw HTML sources, handling string manipulation, and creating a custom function for extracting data from non-standardized web pages.', 'duration': 366.46, 'highlights': ['The chapter covers techniques for locating specific data within raw HTML sources, such as finding the length of a movie (e.g., Mary Poppins is 139 minutes), and the process of searching for and extracting data from non-standardized web pages.', 'Explanation of string manipulation techniques, including understanding the index and length of strings, and using substring to extract the desired data from a larger piece of text.', "The detailed explanation of a custom function 'give me text between' for extracting data from non-standardized web pages, handling instances where the desired text cannot be found, and returning an empty string to prevent function failure."]}], 'duration': 427.778, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Pchg6F_koOw/pics/Pchg6F_koOw184723.jpg', 'highlights': ['The chapter covers techniques for locating specific data within raw HTML sources, such as finding the length of a movie (e.g., Mary Poppins is 139 minutes), and the process of searching for and extracting data from non-standardized web pages.', "The detailed explanation of a custom function 'give me text between' for extracting data from non-standardized web pages, handling 
instances where the desired text cannot be found, and returning an empty string to prevent function failure.", 'The chapter explores the usage of the indexOf and substring methods to extract specific substrings based on the index of certain strings, such as finding the index of a particular string inside another string and pulling out a substring, showcasing the practical application of string manipulation in Processing.']}], 'highlights': ['Using regular expressions to match a particular pattern in a body of text is a valuable technique for searching within a string.', 'The example of extracting movie data from IMDb.com illustrates the challenge of finding data on a web page without an available API or XML feed.', 'The need for specific data like the year of a movie or its length, despite the absence of a convenient data format, demonstrates the difficulties of working with unstandardized data.', 'The chapter covers techniques for locating specific data within raw HTML sources, such as finding the length of a movie and the process of searching for and extracting data from non-standardized web pages.', "The detailed explanation of a custom function 'give me text between' for extracting data from non-standardized web pages, handling instances where the desired text cannot be found, and returning an empty string to prevent function failure.", "Exploring the use of two string functions in Processing, indexOf and substring, to extract specific data from a given text, exemplified by the scenario of retrieving a number from a string containing the phrase 'I have 21 apples'.", 'The data is not available in some nice standardized format, such as CSV, XML, or API, and may only exist on the web page itself.', 'The chapter explores the usage of the indexOf and substring methods to extract specific substrings based on the index of certain strings, showcasing the practical application of string manipulation in Processing.']}
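The transcript describes finding two markers with indexOf, pulling out the characters between them with substring, and wrapping that in a helper the video calls "give me text between" that returns an empty string when a marker is missing. What follows is a minimal sketch of that idea in plain Java (Processing's String is Java's String), not the code shown in the video. The class name ManualParsing, the parameter names, and the HTML fragment are assumptions made for illustration; the real IMDb markup differs and changes over time.

```java
public class ManualParsing {

    // Returns the text between the first occurrence of `before` and the
    // next occurrence of `after`. Returns "" if either marker is missing.
    public static String giveMeTextBetween(String s, String before, String after) {
        int start = s.indexOf(before);
        if (start == -1) {
            return "";
        }
        // Skip past the marker itself so it is not part of the result.
        start += before.length();
        // Search for the closing marker only after the opening one.
        int end = s.indexOf(after, start);
        if (end == -1) {
            return "";
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // The example sentence from the transcript.
        String s = "I have 21 apples";
        int begin = s.indexOf("have ") + "have ".length();
        int end = s.indexOf(" apples");
        System.out.println(s.substring(begin, end)); // 21

        // Hypothetical page fragment, invented for this sketch.
        String html = "<div class=\"runtime\">139 min</div>";
        String minutes = giveMeTextBetween(html, "<div class=\"runtime\">", " min");
        System.out.println(Integer.parseInt(minutes)); // 139

        // A missing marker yields an empty string instead of an exception.
        System.out.println(giveMeTextBetween(html, "<span>", "</span>").isEmpty()); // true
    }
}
```

In a Processing sketch the same function body can be pasted in without the class wrapper or the `public static` modifiers. Note that indexOf returns the position where the marker starts, which is why the marker's length is added before calling substring.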