title
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Hadoop Training | Edureka

description
🔥 Edureka Hadoop Training (Use Code "YOUTUBE20"): https://www.edureka.co/big-data-hadoop-training-certification This Edureka "Hadoop Tutorial For Beginners" ( Hadoop Blog series: https://goo.gl/LFesy8 ) will help you understand the problems with traditional systems while processing Big Data and how Hadoop solves them. This tutorial will give you a comprehensive understanding of HDFS and YARN, along with their architectures, explained in a simple manner using examples and a practical demonstration. At the end, you will learn how to analyze the Olympic data set using Hadoop and gain useful insights. Below are the topics covered in this tutorial: 1. Big Data Growth Drivers 2. What is Big Data? 3. Hadoop Introduction 4. Hadoop Master/Slave Architecture 5. Hadoop Core Components 6. HDFS Data Blocks 7. HDFS Read/Write Mechanism 8. What is MapReduce 9. MapReduce Program 10. MapReduce Job Workflow 11. Hadoop Ecosystem 12. Hadoop Use Case: Analyzing Olympic Dataset
Subscribe to our channel to get video updates. Hit the subscribe button above. Check out our complete Hadoop playlist here: https://goo.gl/ExJdZs
--------------------Edureka Big Data Training and Certifications------------------------ 🔵 Edureka Hadoop Training: http://bit.ly/2YBlw29 🔵 Edureka Spark Training: http://bit.ly/2PeHvc9 🔵 Edureka Kafka Training: http://bit.ly/34e7Riy 🔵 Edureka Cassandra Training: http://bit.ly/2E9AK54 🔵 Edureka Talend Training: http://bit.ly/2YzYIjg 🔵 Edureka Hadoop Administration Training: http://bit.ly/2YE8Nf9 PG in Big Data Engineering with NIT Rourkela: https://www.edureka.co/post-graduate/big-data-engineering (450+ Hrs || 9 Months || 20+ Projects & 100+ Case studies) Facebook: https://www.facebook.com/edurekaIN/ Twitter: https://twitter.com/edurekain LinkedIn: https://www.linkedin.com/company/edureka Telegram: https://t.me/edurekaupdates #edureka #edurekaHadoop #HadoopTutorial #Hadoop #HadoopTutorialForBeginners #HadoopArchitecture #LearnHadoop #HadoopTraining #HadoopCertification
How it Works? 1. This is a 5-week instructor-led online course with 40 hours of assignments and 30 hours of project work. 2. We have 24x7 One-on-One LIVE Technical Support to help you with any problems you might face or any clarifications you may require during the course. 3. At the end of the training you will have to undergo a 2-hour LIVE Practical Exam, based on which we will provide you with a Grade and a Verifiable Certificate!
- - - - - - - - - - - - - - About the Course Edureka's Big Data and Hadoop online training is designed to help you become a top Hadoop developer. During this course, our expert Hadoop instructors will help you: 1. Master the concepts of the HDFS and MapReduce framework 2. Understand Hadoop 2.x Architecture 3. Set up a Hadoop Cluster and write Complex MapReduce programs 4. Learn data loading techniques using Sqoop and Flume 5. Perform data analytics using Pig, Hive and YARN 6. Implement HBase and MapReduce integration 7. Implement Advanced Usage and Indexing 8. Schedule jobs using Oozie 9. Implement best practices for Hadoop development 10. Work on a real-life Project on Big Data Analytics 11. Understand Spark and its Ecosystem 12. Learn how to work with RDDs in Spark
- - - - - - - - - - - - - - Who should go for this course? If you belong to any of the following groups, knowledge of Big Data and Hadoop is crucial for you if you want to progress in your career: 1. Analytics professionals 2. BI/ETL/DW professionals 3. Project managers 4. Testing professionals 5. Mainframe professionals 6. Software developers and architects 7. Recent graduates passionate about building a successful career in Big Data
- - - - - - - - - - - - - - Why Learn Hadoop? Big Data! A Worldwide Problem? According to Wikipedia, "Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." In simpler terms, Big Data is a term given to the large volumes of data that organizations store and process. However, it is becoming very difficult for companies to store, retrieve and process this ever-increasing data. If a company manages its data well, nothing can stop it from becoming the next BIG success! The good news is that Hadoop has become an integral part of storing, handling, evaluating and retrieving hundreds of terabytes, and even petabytes, of data.
- - - - - - - - - - - - - - Opportunities for Hadoopers! Opportunities for Hadoopers are infinite: from a Hadoop Developer to a Hadoop Tester or a Hadoop Architect, and so on. If cracking and managing BIG Data is your passion in life, then think no more: join Edureka's Hadoop Online course and carve a niche for yourself! For more information, please write back to us at sales@edureka.in or call us at IND: 9606058406 / US: 18338555775 (toll-free).
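
The "MapReduce Program" topic listed above (and the word-count walkthrough summarized in the detail section below, with its mapper code, reducer code, and driver code) lends itself to a concrete example. Here is a minimal sketch of the classic word-count job using the standard Hadoop Java MapReduce API; the class name, jar name, and HDFS paths are illustrative placeholders and are not taken from the video.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: reads the input line by line and emits (word, 1) pairs as intermediate output.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // e.g. ("Julius", 1)
      }
    }
  }

  // Reducer: after shuffle and sort, sums the counts for each word to get its total frequency.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // e.g. ("Julius", 45)
    }
  }

  // Driver: configures the job - name, mapper/reducer classes, output types, input/output paths.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A typical run, assuming the class is packaged into wordcount.jar: create an input directory and copy the file into HDFS with hdfs dfs -mkdir /wordcount_input and hdfs dfs -put sample.txt /wordcount_input, submit the job with hadoop jar wordcount.jar WordCount /wordcount_input /wordcount_output, and view the result with hdfs dfs -cat /wordcount_output/part-r-00000.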

detail
{'title': 'Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Hadoop Training | Edureka', 'heatmap': [{'end': 1775.279, 'start': 1643.778, 'weight': 0.974}, {'end': 4271.728, 'start': 4085.172, 'weight': 0.924}], 'summary': "Tutorial covers an introduction to hadoop and big data, the challenges of big data, the 5 v's of big data, hadoop architecture and data storage, hdfs write mechanism, operations, and mapreduce for word count, mapreduce in hadoop, mapreduce workflow and hadoop architecture, and pig and olympic dataset analysis, with quantifiable data such as 217 new mobile users every 60 seconds and a reduction in word count time from four hours to one hour.", 'chapters': [{'end': 66.669, 'segs': [{'end': 75.791, 'src': 'embed', 'start': 47.04, 'weight': 0, 'content': [{'end': 52.022, 'text': "So we'll also see the master-slave architecture of Hadoop and the different Hadoop core components.", 'start': 47.04, 'duration': 4.982}, {'end': 58.985, 'text': "We'll also study how HDFS stores data into data blocks and how the read-write mechanism works in HDFS.", 'start': 52.582, 'duration': 6.403}, {'end': 66.669, 'text': "Then we'll understand the programming part of Hadoop, which is known as MapReduce, and we'll understand this with a MapReduce program.", 'start': 59.566, 'duration': 7.103}, {'end': 75.791, 'text': "We'll understand the entire MapReduce job workflow and we'll see the Hadoop ecosystem, the different tools that the Hadoop ecosystem comprises of.", 'start': 67.389, 'duration': 8.402}], 'summary': "Learn hadoop's master-slave architecture, hdfs storage, mapreduce programming, and ecosystem tools.", 'duration': 28.751, 'max_score': 47.04, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA47040.jpg'}], 'start': 0.089, 'title': 'Introduction to hadoop and big data', 'summary': 'Provides an introduction to hadoop, reasons for big data growth, an overview of hadoop and big data, master-slave architecture, hadoop core components, hdfs data storage mechanism, and the mapreduce programming model.', 'chapters': [{'end': 66.669, 'start': 0.089, 'title': 'Introduction to hadoop and big data', 'summary': 'Covers an introduction to hadoop, including the reasons for the growth of big data, an overview of big data and hadoop, master-slave architecture, hadoop core components, hdfs data storage mechanism, and the mapreduce programming model.', 'duration': 66.58, 'highlights': ['The chapter covers an introduction to Hadoop, including the reasons for the growth of big data, an overview of big data and Hadoop, master-slave architecture, Hadoop core components, HDFS data storage mechanism, and the MapReduce programming model.', 'The session begins with a confirmation from attendees Kanika, Neha, Keshav, and Jason Sebastian before diving into the topics of big data growth drivers, Hadoop solution, master-slave architecture, Hadoop core components, HDFS data storage, and MapReduce programming.', 'The tutorial will focus on the big data growth drivers, reasons for the conversion of data into big data, an overview of big data, Hadoop solution, master-slave architecture, Hadoop core components, HDFS data storage mechanism, and the MapReduce programming model.']}], 'duration': 66.58, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA89.jpg', 'highlights': ['The tutorial will focus on the big data growth drivers, reasons for the conversion of data into big data, an 
overview of big data, Hadoop solution, master-slave architecture, Hadoop core components, HDFS data storage mechanism, and the MapReduce programming model.', 'The chapter covers an introduction to Hadoop, including the reasons for the growth of big data, an overview of big data and Hadoop, master-slave architecture, Hadoop core components, HDFS data storage mechanism, and the MapReduce programming model.', 'The session begins with a confirmation from attendees Kanika, Neha, Keshav, and Jason Sebastian before diving into the topics of big data growth drivers, Hadoop solution, master-slave architecture, Hadoop core components, HDFS data storage, and MapReduce programming.']}, {'end': 524.903, 'segs': [{'end': 163.768, 'src': 'embed', 'start': 129.735, 'weight': 0, 'content': [{'end': 135.858, 'text': 'we have smart appliances that are interconnected and they form a network of things, which is nothing but Internet of Things.', 'start': 129.735, 'duration': 6.123}, {'end': 141.381, 'text': "So these smart appliances are also generating data when they're trying to communicate with each other.", 'start': 136.38, 'duration': 5.001}, {'end': 146.783, 'text': 'And one prominent factor behind the rise of big data that comes to our mind is social media.', 'start': 141.822, 'duration': 4.961}, {'end': 153.805, 'text': 'We have billions of people on social media because we human, we are social animals and we love to interact.', 'start': 147.303, 'duration': 6.502}, {'end': 160.627, 'text': 'We love to share our thoughts and feelings and social media website provides us just the platform that we need.', 'start': 154.245, 'duration': 6.382}, {'end': 163.768, 'text': 'and we have been using it extensively every day.', 'start': 161.087, 'duration': 2.681}], 'summary': 'Interconnected smart appliances generate data forming internet of things. 
social media drives rise of big data with billions of users sharing thoughts and feelings daily.', 'duration': 34.033, 'max_score': 129.735, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA129735.jpg'}, {'end': 538.095, 'src': 'embed', 'start': 508.774, 'weight': 5, 'content': [{'end': 514.498, 'text': "because in Facebook you can see that when you're browsing onto your newsfeed, on the right hand side, there are certain ads popping up.", 'start': 508.774, 'duration': 5.724}, {'end': 518.299, 'text': "And you'll find out that those ads are also user-specific.", 'start': 514.898, 'duration': 3.401}, {'end': 524.903, 'text': 'They know what kind of things you like because you have browsed through different pages on Facebook, on Google, or many other websites.', 'start': 518.659, 'duration': 6.244}, {'end': 531.766, 'text': 'So that is why these unstructured data, which comprises of the 90% of data, is very, very important.', 'start': 525.263, 'duration': 6.503}, {'end': 538.095, 'text': 'And this is also a problem because our traditional systems are incapable of processing this unstructured data.', 'start': 532.614, 'duration': 5.481}], 'summary': 'User-specific ads on facebook are generated from unstructured data, comprising 90% of the total data, posing a challenge for traditional systems.', 'duration': 29.321, 'max_score': 508.774, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA508774.jpg'}], 'start': 67.389, 'title': 'Big data challenges', 'summary': 'Delves into the challenges of big data, including the growing volume and variety of data sources, and highlights the importance of unstructured data. it also presents statistics, such as 217 new mobile users every 60 seconds, reflecting the widespread usage of mobile phones.', 'chapters': [{'end': 328.208, 'start': 67.389, 'title': 'Big data growth and hadoop ecosystem', 'summary': "Explores the drivers behind the growth of big data, including the impact of technology, internet of things, and social media, and provides statistics from platforms like facebook, twitter, reddit, instagram, and youtube. 
it also highlights cisco's prediction of dealing with 30.6 exabytes of data by 2020 and the reasons behind the rise of data, including adapting to smarter mobile devices and defining cell network advances.", 'duration': 260.819, 'highlights': ["Cisco's prediction of dealing with 30.6 exabytes of data by 2020 Cisco's white paper predicts dealing with 30.6 exabytes of data by 2020, showcasing the exponential rise in data from 3.7 exabytes in 2015, highlighting the significant growth in data volume.", 'Statistics from social media platforms like Facebook, Twitter, Reddit, Instagram, and YouTube The transcript provides statistics indicating the massive amount of data generated every 60 seconds on platforms like Facebook, Twitter, Reddit, Instagram, and YouTube, demonstrating the scale of data accumulation due to social media usage.', "Reasons behind the rise of data, including adapting to smarter mobile devices and defining cell network advances The transcript highlights Cisco's reasons for the rise of data, including adapting to smarter mobile devices and defining cell network advances, emphasizing the impact of technology advancements on data generation and transmission.", 'Impact of technology, internet of things, and social media on the growth of big data The chapter explores the impact of technology, internet of things, and social media on the growth of big data, emphasizing the exponential rise in data volume due to the widespread use of gadgets, smart appliances, and social media platforms.']}, {'end': 524.903, 'start': 329.048, 'title': 'Big data and its challenges', 'summary': 'Discusses the challenges of big data, including the increasing volume of data, the variety of data sources, and the importance of utilizing unstructured data, with the statistic of 217 new mobile users every 60 seconds as a testament to the widespread usage of mobile phones.', 'duration': 195.855, 'highlights': ['The statistic of 217 new users every 60 seconds demonstrates the widespread usage of mobile phones. The mention of 217 new users every 60 seconds showcases the rapid growth and widespread usage of mobile phones, reflecting the significant impact of mobile technology on data generation and consumption.', 'The challenge of processing and storing large volumes of data is exemplified by the need for alternative solutions due to the incapabilities of traditional systems. The discussion on the incapabilities of traditional systems to process and store the increasing volume of data highlights the pressing need for alternative solutions to effectively manage and utilize large datasets, emphasizing the challenges posed by the volume of data.', 'The significance of unstructured data, such as photos and videos shared on social media platforms, is emphasized as crucial for gaining insights and profiling customers. 
The emphasis on the importance of unstructured data, particularly in social media platforms, underscores its value in providing insights for businesses, showcasing the importance of utilizing unstructured data for customer profiling and targeted advertising, with a focus on its potential impact on business decisions.']}], 'duration': 457.514, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA67389.jpg', 'highlights': ["Cisco's white paper predicts dealing with 30.6 exabytes of data by 2020, showcasing the exponential rise in data from 3.7 exabytes in 2015, highlighting the significant growth in data volume.", 'The transcript provides statistics indicating the massive amount of data generated every 60 seconds on platforms like Facebook, Twitter, Reddit, Instagram, and YouTube, demonstrating the scale of data accumulation due to social media usage.', 'The chapter explores the impact of technology, internet of things, and social media on the growth of big data, emphasizing the exponential rise in data volume due to the widespread use of gadgets, smart appliances, and social media platforms.', 'The mention of 217 new users every 60 seconds showcases the rapid growth and widespread usage of mobile phones, reflecting the significant impact of mobile technology on data generation and consumption.', 'The discussion on the incapabilities of traditional systems to process and store the increasing volume of data highlights the pressing need for alternative solutions to effectively manage and utilize large datasets, emphasizing the challenges posed by the volume of data.', 'The emphasis on the importance of unstructured data, particularly in social media platforms, underscores its value in providing insights for businesses, showcasing the importance of utilizing unstructured data for customer profiling and targeted advertising, with a focus on its potential impact on business decisions.']}, {'end': 1474.278, 'segs': [{'end': 592.15, 'src': 'embed', 'start': 567.65, 'weight': 0, 'content': [{'end': 574.618, 'text': 'So you suppose that your company has a threshold of 500 transactions at a point of time and that is your upper limit.', 'start': 567.65, 'duration': 6.968}, {'end': 578.722, 'text': 'But today you cannot have the amount of number in the big data world.', 'start': 575.098, 'duration': 3.624}, {'end': 580.383, 'text': 'You talk about sensors.', 'start': 579.122, 'duration': 1.261}, {'end': 581.464, 'text': 'you talk about machines.', 'start': 580.383, 'duration': 1.081}, {'end': 587.107, 'text': 'that is continuously sending you information, like GPS is continuously sending you the information to somebody.', 'start': 581.464, 'duration': 5.643}, {'end': 592.15, 'text': "You're talking about millions and billions of events track per second on real time.", 'start': 587.467, 'duration': 4.683}], 'summary': 'Company faces challenge processing millions of events per second in big data world.', 'duration': 24.5, 'max_score': 567.65, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA567650.jpg'}, {'end': 658.72, 'src': 'embed', 'start': 627.58, 'weight': 2, 'content': [{'end': 632.182, 'text': 'Now there might be unnecessary data lying around in your data set that is unnecessary for you.', 'start': 627.58, 'duration': 4.602}, {'end': 639.165, 'text': "Now you'll also have to be able to identify which data set will give you the value that you need in order to develop your business.", 'start': 
632.802, 'duration': 6.363}, {'end': 645.808, 'text': 'So that is again a problem in order to identify the valuable data and hence it is again a big data problem.', 'start': 639.445, 'duration': 6.363}, {'end': 648.409, 'text': "And finally we'll talk about veracity.", 'start': 646.328, 'duration': 2.081}, {'end': 652.038, 'text': 'So veracity talks about the sparseness of data.', 'start': 649.437, 'duration': 2.601}, {'end': 658.72, 'text': "So in simple words, veracity says that you cannot expect the data to be always correct or reliable in today's world.", 'start': 652.458, 'duration': 6.262}], 'summary': 'Identifying valuable data is key in big data, and veracity emphasizes data sparseness and unreliability.', 'duration': 31.14, 'max_score': 627.58, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA627580.jpg'}, {'end': 724.021, 'src': 'embed', 'start': 695.916, 'weight': 3, 'content': [{'end': 698.497, 'text': 'and then you can find the approach for a solution for it.', 'start': 695.916, 'duration': 2.581}, {'end': 701.658, 'text': 'So this was an introduction to big data.', 'start': 699.437, 'duration': 2.221}, {'end': 710.001, 'text': "So now we'll understand the problems of big data and how you should approach for a solution for it with a story that you can relate to.", 'start': 702.318, 'duration': 7.683}, {'end': 713.222, 'text': "So I hope that you'll find this part very interesting.", 'start': 710.621, 'duration': 2.601}, {'end': 717.357, 'text': 'So this is a very typical scenario.', 'start': 714.915, 'duration': 2.442}, {'end': 724.021, 'text': 'So this is Bob and he has opened up a very small restaurant in a city and he has hired a waiter for taking up orders.', 'start': 717.377, 'duration': 6.644}], 'summary': 'Introduction to big data problems and solutions with a relatable story.', 'duration': 28.105, 'max_score': 695.916, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA695916.jpg'}, {'end': 888.136, 'src': 'embed', 'start': 856.179, 'weight': 6, 'content': [{'end': 859.222, 'text': 'So Sebastian is saying that Bob should hire more cooks.', 'start': 856.179, 'duration': 3.043}, {'end': 862.504, 'text': 'And exactly Sebastian, you are correct.', 'start': 859.782, 'duration': 2.722}, {'end': 866.247, 'text': 'So the issue was that there were too many orders per hour.', 'start': 862.924, 'duration': 3.323}, {'end': 868.949, 'text': 'So the solution would be hire multiple cooks.', 'start': 866.727, 'duration': 2.222}, {'end': 872.251, 'text': 'And that is exactly what Bob did.', 'start': 869.95, 'duration': 2.301}, {'end': 876.695, 'text': 'So he hired four more cooks and now he has five cooks.', 'start': 872.732, 'duration': 3.963}, {'end': 882.509, 'text': 'And all the cooks have access to the food shelf, this is where they all get their ingredients from.', 'start': 877.195, 'duration': 5.314}, {'end': 888.136, 'text': 'So now there are multiple cooks cooking food, even though there are ten orders per hour.', 'start': 883.29, 'duration': 4.846}], 'summary': 'Bob hired four more cooks to handle the high volume of ten orders per hour.', 'duration': 31.957, 'max_score': 856.179, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA856179.jpg'}, {'end': 1118.272, 'src': 'embed', 'start': 1093.991, 'weight': 7, 'content': [{'end': 1101.399, 'text': "So again we don't have to worry, since there are three more shelves 
and at that time of disaster we have a backup of three more shelves,", 'start': 1093.991, 'duration': 7.408}, {'end': 1105.524, 'text': 'so he can go ahead and use the ingredients from any of the shelves over here.', 'start': 1101.399, 'duration': 4.125}, {'end': 1115.37, 'text': 'So, basically, we have distributed and made parallel the whole process or task into smaller tasks and now there is no problem in boss restaurant.', 'start': 1106.124, 'duration': 9.246}, {'end': 1118.272, 'text': 'he is able to serve his customers happily.', 'start': 1115.37, 'duration': 2.902}], 'summary': 'Three backup shelves ensure no problem in restaurant operations.', 'duration': 24.281, 'max_score': 1093.991, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1093991.jpg'}, {'end': 1231.156, 'src': 'embed', 'start': 1199.358, 'weight': 9, 'content': [{'end': 1201.299, 'text': 'but have we solved all the problems??', 'start': 1199.358, 'duration': 1.941}, {'end': 1207.544, 'text': 'Do we have a framework like that who can solve all the big data problems of storing it and processing it??', 'start': 1201.379, 'duration': 6.165}, {'end': 1210.166, 'text': 'Well, the answer is yes.', 'start': 1208.384, 'duration': 1.782}, {'end': 1216.47, 'text': 'We have something called Apache Hadoop and this is the framework to process big data.', 'start': 1210.946, 'duration': 5.524}, {'end': 1220.068, 'text': 'So let us go ahead and see Apache Hadoop in detail.', 'start': 1217.014, 'duration': 3.054}, {'end': 1226.813, 'text': 'So Hadoop is a framework that allows us to store and process large data sets in parallel and distributed fashion.', 'start': 1220.888, 'duration': 5.925}, {'end': 1231.156, 'text': 'Now you know that there were two major problems in dealing with big data.', 'start': 1227.293, 'duration': 3.863}], 'summary': 'Apache hadoop is a framework solving big data problems by storing and processing large data sets in parallel and distributed fashion.', 'duration': 31.798, 'max_score': 1199.358, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1199358.jpg'}, {'end': 1283.475, 'src': 'embed', 'start': 1256.439, 'weight': 4, 'content': [{'end': 1263.145, 'text': 'And these machines are interconnected on which our data is getting distributed, and in Hadoop terms, it is called a Hadoop cluster.', 'start': 1256.439, 'duration': 6.706}, {'end': 1270.259, 'text': 'And again like how Bob has managed to divided the task among his chefs and made the serving process quite quicker.', 'start': 1263.837, 'duration': 6.422}, {'end': 1277.642, 'text': 'Similarly in order to process big data we have something called MapReduce and this is the programming unit of Hadoop.', 'start': 1270.699, 'duration': 6.943}, {'end': 1283.475, 'text': 'So this allows a parallel and distributed processing of data that is lying across our Hadoop cluster.', 'start': 1277.932, 'duration': 5.543}], 'summary': "Hadoop cluster enables parallel processing of big data, like bob's chefs handling tasks efficiently.", 'duration': 27.036, 'max_score': 1256.439, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1256439.jpg'}, {'end': 1321.319, 'src': 'embed', 'start': 1298.536, 'weight': 1, 'content': [{'end': 1307.809, 'text': "So now let us understand the Hadoop architecture, which is a master-slave architecture, and we'll understand it by taking a very simple scenario,", 'start': 1298.536, 
'duration': 9.273}, {'end': 1310.673, 'text': "which I'm very sure that you'll all relate to very closely.", 'start': 1307.809, 'duration': 2.864}, {'end': 1315.876, 'text': 'So this is the scenario which is usually found in every other company.', 'start': 1311.454, 'duration': 4.422}, {'end': 1321.319, 'text': 'So we have a project manager here and this project manager handles a team of four people.', 'start': 1316.456, 'duration': 4.863}], 'summary': 'Hadoop architecture is a master-slave architecture, illustrated with a project manager and a team of four people.', 'duration': 22.783, 'max_score': 1298.536, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1298536.jpg'}, {'end': 1417.845, 'src': 'embed', 'start': 1388.386, 'weight': 8, 'content': [{'end': 1391.468, 'text': 'So it thinks of a plan because he is a very clever fellow.', 'start': 1388.386, 'duration': 3.082}, {'end': 1398.32, 'text': 'So, in order to tackle this problem, what the project manager does, he goes to John and he tells him hey, John, how are you doing?', 'start': 1392.378, 'duration': 5.942}, {'end': 1400.12, 'text': "And John says yeah, I'm doing great.", 'start': 1398.76, 'duration': 1.36}, {'end': 1405.122, 'text': "Yeah, I heard that you were doing really great and you're doing excellent in your project.", 'start': 1400.98, 'duration': 4.142}, {'end': 1408.182, 'text': "But John thinks that something's fishy.", 'start': 1405.542, 'duration': 2.64}, {'end': 1410.003, 'text': 'Why is he appreciating me so much today?', 'start': 1408.282, 'duration': 1.721}, {'end': 1417.845, 'text': "Then the project manager goes ahead and tells him John, since you're doing so well, why don't you take up the project C as well?", 'start': 1410.803, 'duration': 7.042}], 'summary': 'Project manager assigns project c to john for excelling in current project', 'duration': 29.459, 'max_score': 1388.386, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1388386.jpg'}], 'start': 525.263, 'title': 'Big data and hadoop', 'summary': "Covers the 5 v's of big data, emphasizing the challenges and importance of handling large volumes, high velocity, diverse variety, valuable data, and data veracity. it also explains how a restaurant scaled its business using a distributed approach similar to hadoop's mapreduce, resulting in enhanced efficiency. additionally, it introduces apache hadoop, highlighting its components and architecture through a relatable scenario.", 'chapters': [{'end': 717.357, 'start': 525.263, 'title': "Understanding the 5 v's of big data", 'summary': "Introduces the 5 v's of big data - volume, velocity, variety, value, and veracity - highlighting the challenges and importance of processing unstructured data, handling high velocity of data, identifying valuable data, and dealing with data veracity.", 'duration': 192.094, 'highlights': ['Unstructured data comprising 90% of data is important, but traditional systems are incapable of processing it. Unstructured data, which comprises 90% of data, is essential, but traditional systems cannot process it.', 'Challenges of handling high velocity of data, with examples of millions and billions of events tracked per second. The challenge of handling high velocity of data is illustrated with examples of millions and billions of events tracked per second.', "Identifying valuable data is crucial for gaining insights and developing business, but it's a challenge in big data. 
Identifying valuable data is crucial for gaining insights and developing business, but it's a challenge in big data.", 'Dealing with data veracity, including the challenge of working with incorrect or unreliable data. The challenge of dealing with data veracity is outlined, including the issue of working with incorrect or unreliable data.']}, {'end': 1199.358, 'start': 717.377, 'title': 'Scaling a restaurant business', 'summary': "Illustrates a restaurant's transition from a traditional setup to handling a surge in orders by hiring more cooks and implementing a distributed and parallel approach, akin to hadoop's mapreduce, resulting in enhanced scalability and efficiency.", 'duration': 481.981, 'highlights': ['Bob hired more cooks to handle the surge in orders, increasing from 2 to 10 orders per hour. Bob hired four more cooks, resulting in a total of five cooks to handle the increased demand of 10 orders per hour.', "The transition to a distributed and parallel approach, akin to Hadoop's MapReduce, allowed the restaurant to efficiently handle the surge in orders and scale as needed. The restaurant implemented a distributed and parallel approach, akin to Hadoop's MapReduce, to efficiently handle the surge in orders and scale up or down as needed, showcasing enhanced scalability and efficiency.", 'The implementation of a scalable system enabled the restaurant to handle even higher order volumes during peak times like Christmas or New Year. The scalable system allowed the restaurant to handle more than 10 orders per hour, showcasing its ability to scale up and down as needed, ensuring efficient service even during peak times like Christmas or New Year.']}, {'end': 1474.278, 'start': 1199.358, 'title': 'Apache hadoop: solving big data problems', 'summary': 'Introduces apache hadoop, a framework for processing big data, which includes hdfs for storage and mapreduce for parallel and distributed processing, and explains its architecture using a relatable scenario involving a project manager and his team.', 'duration': 274.92, 'highlights': ['Apache Hadoop is a framework that allows storing and processing large data sets in a parallel and distributed fashion, with specific components such as HDFS for storage and MapReduce for processing. Introduces the core functionalities of Apache Hadoop, emphasizing its ability to handle large data sets and the specific components HDFS and MapReduce.', 'HDFS, which stands for Hadoop Distributed File System, solves the storage problem of big data by distributing data over different machines in a Hadoop cluster. Explains the role of HDFS in solving the storage problem of big data, highlighting its distributed nature and its function within a Hadoop cluster.', "MapReduce allows parallel and distributed processing of data across the Hadoop cluster, with 'map' processing data on individual machines and 'reduce' combining intermediary outputs for the final output. Describes the functionality of MapReduce in parallel and distributed processing, illustrating the 'map' and 'reduce' phases in data processing within a Hadoop cluster.", 'The chapter explains the master-slave architecture of Hadoop using a relatable scenario involving a project manager and his team, providing a clear analogy for understanding the architecture. 
Utilizes a relatable scenario to clarify the master-slave architecture of Hadoop, enhancing understanding through a familiar analogy involving a project manager and his team.']}], 'duration': 949.015, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA525263.jpg', 'highlights': ['Unstructured data, comprising 90% of data, is essential but traditional systems cannot process it.', 'Challenges of handling high velocity of data, with examples of millions and billions of events tracked per second.', "Identifying valuable data is crucial for gaining insights and developing business, but it's a challenge in big data.", 'Dealing with data veracity, including the challenge of working with incorrect or unreliable data.', "The transition to a distributed and parallel approach, akin to Hadoop's MapReduce, allowed the restaurant to efficiently handle the surge in orders and scale as needed.", 'The implementation of a scalable system enabled the restaurant to handle even higher order volumes during peak times like Christmas or New Year.', 'Apache Hadoop is a framework that allows storing and processing large data sets in a parallel and distributed fashion, with specific components such as HDFS for storage and MapReduce for processing.', 'HDFS, which stands for Hadoop Distributed File System, solves the storage problem of big data by distributing data over different machines in a Hadoop cluster.', "MapReduce allows parallel and distributed processing of data across the Hadoop cluster, with 'map' processing data on individual machines and 'reduce' combining intermediary outputs for the final output.", 'The chapter explains the master-slave architecture of Hadoop using a relatable scenario involving a project manager and his team, providing a clear analogy for understanding the architecture.']}, {'end': 2324.339, 'segs': [{'end': 1563.44, 'src': 'embed', 'start': 1516.092, 'weight': 1, 'content': [{'end': 1522.014, 'text': 'And in case of disaster, if any of the slave node goes down, the master node has always got a backup.', 'start': 1516.092, 'duration': 5.922}, {'end': 1528.097, 'text': 'Now if we compare this whole office situation to our Hadoop cluster, this is what it looks like.', 'start': 1522.614, 'duration': 5.483}, {'end': 1532.64, 'text': 'So this is the master node, this is the project manager in case of our office,', 'start': 1528.418, 'duration': 4.222}, {'end': 1536.322, 'text': 'and these are the processing units where the work is getting carried out.', 'start': 1532.64, 'duration': 3.682}, {'end': 1544.447, 'text': 'So this is exactly how Hadoop processes and Hadoop manages big data using the master-slave architecture.', 'start': 1536.743, 'duration': 7.704}, {'end': 1550.231, 'text': 'So understand more about the master node and the slave nodes in detail later on in this tutorial.', 'start': 1545.628, 'duration': 4.603}, {'end': 1552.252, 'text': 'So any doubts till now?', 'start': 1551.051, 'duration': 1.201}, {'end': 1563.44, 'text': "Alright, so now we'll move ahead and we'll take a look at the Hadoop core components and we're going to take a look at HDFS first,", 'start': 1554.614, 'duration': 8.826}], 'summary': 'Hadoop cluster uses master-slave architecture for big data processing.', 'duration': 47.348, 'max_score': 1516.092, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1516092.jpg'}, {'end': 1613.056, 'src': 'embed', 'start': 1584.657, 'weight': 0, 'content': [{'end': 
1589.44, 'text': 'so the master node is known as the name node and the slave nodes are known as data node.', 'start': 1584.657, 'duration': 4.783}, {'end': 1591.061, 'text': 'So the name node over here.', 'start': 1589.821, 'duration': 1.24}, {'end': 1596.005, 'text': 'this maintains and manages all the different data nodes, which are slave nodes.', 'start': 1591.061, 'duration': 4.944}, {'end': 1603.39, 'text': 'just like our project manager manages a team, and like how you guys report to your manager about your work, progress and everything,', 'start': 1596.005, 'duration': 7.385}, {'end': 1608.013, 'text': 'the data nodes also do the same thing by sending signals which are known as heartbeats.', 'start': 1603.39, 'duration': 4.623}, {'end': 1613.056, 'text': 'Now this is just a signal to tell the name node that the data node is alive and working fine.', 'start': 1608.493, 'duration': 4.563}], 'summary': 'Name node manages data nodes, receives heartbeats to monitor their status.', 'duration': 28.399, 'max_score': 1584.657, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1584657.jpg'}, {'end': 1775.279, 'src': 'heatmap', 'start': 1643.778, 'weight': 0.974, 'content': [{'end': 1649.983, 'text': 'and by the name you might be guessing that this is just a backup for the name node, like when the name node might crash.', 'start': 1643.778, 'duration': 6.205}, {'end': 1651.524, 'text': 'so this will take over.', 'start': 1649.983, 'duration': 1.541}, {'end': 1654.947, 'text': 'but actually this is not the purpose of secondary name node.', 'start': 1651.524, 'duration': 3.423}, {'end': 1659.03, 'text': "the purpose is entirely different and I'll tell you what is that.", 'start': 1654.947, 'duration': 4.083}, {'end': 1666.513, 'text': "so you just have to keep patience for a while and And I'm very sure that you'll be intrigued to know about how important the secondary name node is.", 'start': 1659.03, 'duration': 7.483}, {'end': 1670.354, 'text': 'So now let me tell you about the secondary name node.', 'start': 1667.453, 'duration': 2.901}, {'end': 1676.355, 'text': "Well, since we're talking about metadata, which is nothing but information about our data,", 'start': 1671.334, 'duration': 5.021}, {'end': 1682.497, 'text': 'it contains all the modifications that had took place across the Hadoop cluster or our HDFS namespace.', 'start': 1676.355, 'duration': 6.142}, {'end': 1686.698, 'text': 'And this metadata is maintained by HDFS using two files.', 'start': 1683.117, 'duration': 3.581}, {'end': 1690.456, 'text': 'And those two files are FSImage and EditLog.', 'start': 1687.813, 'duration': 2.643}, {'end': 1692.458, 'text': 'Now let me tell you what are those.', 'start': 1690.996, 'duration': 1.462}, {'end': 1695.28, 'text': 'So FSImage this file over here.', 'start': 1693.058, 'duration': 2.222}, {'end': 1702.527, 'text': 'this contains all the modifications that have been made across your Hadoop cluster ever since the name node was started.', 'start': 1695.28, 'duration': 7.247}, {'end': 1711.634, 'text': "So let's say the name node was started 20 days back, so my FS image will contain all the details of all the changes that happened in the past 20 days.", 'start': 1703.268, 'duration': 8.366}, {'end': 1719.72, 'text': 'So obviously you can imagine that there will be a lot of data contained in this file over here, and that is why we stored the FS image on our disk.', 'start': 1712.255, 'duration': 7.465}, {'end': 1724.464, 'text': "So 
you'll find this FS image file in the local disk of your name node machine.", 'start': 1720.461, 'duration': 4.003}, {'end': 1734.146, 'text': 'Now coming to edit log, so this file also contains metadata, that is the data about your modifications, but it only contains the most recent changes.', 'start': 1725.543, 'duration': 8.603}, {'end': 1738.007, 'text': "Let's say whatever modifications that took place in the past one hour.", 'start': 1734.546, 'duration': 3.461}, {'end': 1743.249, 'text': 'And this file is small and this file resides in the RAM of your name node machine.', 'start': 1738.827, 'duration': 4.422}, {'end': 1748.136, 'text': 'So we have the secondary name node here which performs a task known as checkpointing.', 'start': 1744.252, 'duration': 3.884}, {'end': 1753.301, 'text': 'Now what is checkpointing? It is the process of combining the edit log with the FS image.', 'start': 1748.576, 'duration': 4.725}, {'end': 1754.722, 'text': 'And how is it done?', 'start': 1753.881, 'duration': 0.841}, {'end': 1764.691, 'text': 'So the secondary name node over here has got a copy of the edit log and the FS image from the name node and then it adds them up in order to get a new FS image.', 'start': 1755.242, 'duration': 9.449}, {'end': 1775.279, 'text': 'So why do we need a new FS image? We need an updated file of the FS image in order to incorporate all the recent changes also into our FS image file.', 'start': 1765.232, 'duration': 10.047}], 'summary': 'Secondary name node performs checkpointing by combining fsimage and editlog to incorporate recent changes in hadoop cluster metadata.', 'duration': 131.501, 'max_score': 1643.778, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1643778.jpg'}, {'end': 1711.634, 'src': 'embed', 'start': 1683.117, 'weight': 2, 'content': [{'end': 1686.698, 'text': 'And this metadata is maintained by HDFS using two files.', 'start': 1683.117, 'duration': 3.581}, {'end': 1690.456, 'text': 'And those two files are FSImage and EditLog.', 'start': 1687.813, 'duration': 2.643}, {'end': 1692.458, 'text': 'Now let me tell you what are those.', 'start': 1690.996, 'duration': 1.462}, {'end': 1695.28, 'text': 'So FSImage this file over here.', 'start': 1693.058, 'duration': 2.222}, {'end': 1702.527, 'text': 'this contains all the modifications that have been made across your Hadoop cluster ever since the name node was started.', 'start': 1695.28, 'duration': 7.247}, {'end': 1711.634, 'text': "So let's say the name node was started 20 days back, so my FS image will contain all the details of all the changes that happened in the past 20 days.", 'start': 1703.268, 'duration': 8.366}], 'summary': 'Hdfs maintains metadata using fsimage and editlog files, capturing all modifications since name node start.', 'duration': 28.517, 'max_score': 1683.117, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1683117.jpg'}, {'end': 1748.136, 'src': 'embed', 'start': 1725.543, 'weight': 3, 'content': [{'end': 1734.146, 'text': 'Now coming to edit log, so this file also contains metadata, that is the data about your modifications, but it only contains the most recent changes.', 'start': 1725.543, 'duration': 8.603}, {'end': 1738.007, 'text': "Let's say whatever modifications that took place in the past one hour.", 'start': 1734.546, 'duration': 3.461}, {'end': 1743.249, 'text': 'And this file is small and this file resides in the RAM of your name node machine.', 'start': 
1738.827, 'duration': 4.422}, {'end': 1748.136, 'text': 'So we have the secondary name node here which performs a task known as checkpointing.', 'start': 1744.252, 'duration': 3.884}], 'summary': 'Edit log contains metadata of most recent changes, small file residing in ram, secondary name node performs checkpointing.', 'duration': 22.593, 'max_score': 1725.543, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1725543.jpg'}], 'start': 1474.278, 'title': 'Hadoop architecture and data storage', 'summary': "Discusses hadoop's master-slave architecture, with the master node supervising the slave nodes, and hdfs data storage, emphasizing fault tolerance through data replication.", 'chapters': [{'end': 1876.824, 'start': 1474.278, 'title': 'Hadoop master-slave architecture', 'summary': 'Discusses how a project manager ensures backup for all team members, similar to the master node supervising the slave nodes in hadoop, with the name node managing data nodes in hdfs and the role of the secondary name node in maintaining metadata and facilitating checkpointing.', 'duration': 402.546, 'highlights': ['The project manager ensures backup for all team members, similar to the master node supervising the slave nodes in Hadoop, ensuring the completion of tasks and client satisfaction. The project manager ensures backup for all team members, ensuring the completion of tasks and client satisfaction.', 'The master node in Hadoop supervises the different slave nodes and keeps track of all processing, ensuring a backup in case of disaster, facilitating efficient management of big data using the master-slave architecture. The master node in Hadoop supervises the different slave nodes and keeps track of all processing, ensuring a backup in case of disaster, facilitating efficient management of big data using the master-slave architecture.', 'The name node in HDFS manages all the different data nodes, while the data nodes are responsible for managing data across data blocks, sending heartbeats to signal their status. The name node in HDFS manages all the different data nodes, while the data nodes are responsible for managing data across data blocks, sending heartbeats to signal their status.', 'The secondary name node in HDFS maintains metadata using FSImage and EditLog, performing checkpointing to combine these files and ensure an updated FSImage, facilitating efficient failure recovery and reducing data loss and setup time for a new name node. The secondary name node in HDFS maintains metadata using FSImage and EditLog, performing checkpointing to combine these files and ensure an updated FSImage, facilitating efficient failure recovery and reducing data loss and setup time for a new name node.']}, {'end': 2324.339, 'start': 1877.284, 'title': 'Hdfs data storage and fault tolerance', 'summary': 'Covers how hdfs stores files in data blocks, the benefits of using a distributed file system, and how hadoop ensures fault tolerance through data replication.', 'duration': 447.055, 'highlights': ['HDFS stores files in blocks of 128 MB by default, distributing them across data nodes. HDFS divides files into blocks of 128 MB by default and distributes them across data nodes, optimizing storage space.', "HDFS's distributed file system provides scalability, abstraction, and parallel processing benefits. 
HDFS's distributed file system offers scalability, abstraction, and parallel processing benefits, allowing for efficient management of large files and resources.", 'HDFS ensures fault tolerance through a replication factor of three, creating multiple copies of data blocks across the cluster. HDFS maintains fault tolerance through a replication factor of three, ensuring multiple copies of data blocks across the cluster to mitigate data loss in case of node failures.']}], 'duration': 850.061, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA1474278.jpg', 'highlights': ['The master node in Hadoop supervises the different slave nodes and keeps track of all processing, ensuring a backup in case of disaster, facilitating efficient management of big data using the master-slave architecture.', 'The secondary name node in HDFS maintains metadata using FSImage and EditLog, performing checkpointing to combine these files and ensure an updated FSImage, facilitating efficient failure recovery and reducing data loss and setup time for a new name node.', "HDFS's distributed file system offers scalability, abstraction, and parallel processing benefits, allowing for efficient management of large files and resources.", 'HDFS ensures fault tolerance through a replication factor of three, creating multiple copies of data blocks across the cluster to mitigate data loss in case of node failures.', 'The name node in HDFS manages all the different data nodes, while the data nodes are responsible for managing data across data blocks, sending heartbeats to signal their status.', 'The project manager ensures backup for all team members, similar to the master node supervising the slave nodes in Hadoop, ensuring the completion of tasks and client satisfaction.']}, {'end': 2965.321, 'segs': [{'end': 2853.194, 'src': 'embed', 'start': 2824.102, 'weight': 0, 'content': [{'end': 2829.627, 'text': "So let us understand MapReduce with another story which we'll find amusing again, I'm very sure about that.", 'start': 2824.102, 'duration': 5.525}, {'end': 2836.409, 'text': 'So let us consider a situation where we have a professor and there are four students in the class.', 'start': 2830.387, 'duration': 6.022}, {'end': 2839.63, 'text': 'So they are reading a Julius Caesar book.', 'start': 2837.109, 'duration': 2.521}, {'end': 2844.552, 'text': 'So now the professor wants to know how many times the word Julius occurs in the book.', 'start': 2839.87, 'duration': 4.682}, {'end': 2853.194, 'text': 'So for that he asked his students that go ahead Read the entire book and tell me how many times the word Julius is there on the book.', 'start': 2845.192, 'duration': 8.002}], 'summary': "Using mapreduce, the professor asked 4 students to count the occurrence of 'julius' in a book.", 'duration': 29.092, 'max_score': 2824.102, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA2824102.jpg'}], 'start': 2324.379, 'title': 'Hdfs write mechanism, operations, and mapreduce for word count', 'summary': 'Covers the hdfs write mechanism, emphasizing the replication factor of three and the sequence of steps involved, explains hdfs mechanisms and operations including the simultaneous writing and sequential copying of blocks, and illustrates the efficiency of mapreduce through a story of reducing word count time from four hours to one hour.', 'chapters': [{'end': 2618.57, 'start': 2324.379, 'title': 'Hdfs write mechanism', 'summary': 
'Discusses the hdfs write mechanism, detailing the pipeline setup, actual writing process, and acknowledgements, emphasizing the replication factor of three and the sequence of steps involved in writing a file into hdfs.', 'duration': 294.191, 'highlights': ['The pipeline setup involves obtaining the IP addresses of three data nodes for copying the file, and in case of unavailability of data nodes, the client node requests new IP addresses from the name node, ensuring the pipeline is set up. During the pipeline setup, the client node obtains the IP addresses of three data nodes for copying the file, and requests new IP addresses from the name node if the initially provided data nodes are unavailable, ensuring the pipeline is set up for writing the file.', 'The actual writing process involves the client contacting data nodes sequentially to copy the file block, ensuring that each data node receives the block and the replication factor of three is achieved. During the actual writing process, the client contacts data nodes sequentially to copy the file block, ensuring each data node receives the block and the replication factor of three is achieved, as required in HDFS.', 'The acknowledgement phase involves a series of acknowledgements in reverse order, from the last data node to the client node, culminating in the client node informing the name node of the successful writing of all blocks. The acknowledgement phase involves a series of acknowledgements in reverse order, from the last data node to the client node, culminating in the client node informing the name node of the successful writing of all blocks, thus completing the write mechanism.']}, {'end': 2823.362, 'start': 2619.05, 'title': 'Hdfs mechanism and operations', 'summary': 'Explains the writing and reading mechanisms in hdfs, including the simultaneous writing of block a and block b, the sequential copying of blocks onto data nodes, and the simultaneous fetching and reading of data blocks by the client, providing insights into the distributed file system and the read and write mechanisms in hdfs.', 'duration': 204.312, 'highlights': ['The writing process of block A and block B will happen at the same time, with the blocks being copied at the same time, and the writing mechanism taking place in three steps sequentially, and as many blocks the file contains, all the blocks will be copied at the same time in sequential steps onto data nodes.', 'Reading a file from different data nodes in HDFS involves the client requesting the name node for the IP addresses of the data blocks, contacting the data nodes, fetching the data blocks, and reading block A and block B simultaneously.', "Understanding the advantages of using a distributed file system, the role of name node and data nodes, data storage, division of files into data blocks, Hadoop's handling of data node failures through replication factor, and the read and write mechanisms in Hadoop distributed file system."]}, {'end': 2965.321, 'start': 2824.102, 'title': 'Mapreduce for word count', 'summary': "Explains mapreduce with an amusing story of a professor and four students who initially took four hours to count the word 'julius' in a book, but then reduced the time to one hour by distributing the chapters to each student, showcasing the efficiency of mapreduce.", 'duration': 141.219, 'highlights': ["The professor and four students initially took four hours to count the word 'Julius' in the entire book.", 'The professor applied a MapReduce-like method by assigning each student a 
chapter, reducing the time to one hour with each student finishing their assigned chapter at the same time.', "The word 'Julius' was found 12 times in chapter 1, 14 times in chapter 2, 8 times in chapter 3, and 11 times in chapter 4, showcasing the efficiency of the distributed method."]}], 'duration': 640.942, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA2324379.jpg', 'highlights': ['The writing process of block A and block B will happen at the same time, with the blocks being copied at the same time, and the writing mechanism taking place in three steps sequentially, and as many blocks the file contains, all the blocks will be copied at the same time in sequential steps onto data nodes.', 'The acknowledgement phase involves a series of acknowledgements in reverse order, from the last data node to the client node, culminating in the client node informing the name node of the successful writing of all blocks, thus completing the write mechanism.', 'The pipeline setup involves obtaining the IP addresses of three data nodes for copying the file, and in case of unavailability of data nodes, the client node requests new IP addresses from the name node, ensuring the pipeline is set up.']}, {'end': 4234.426, 'segs': [{'end': 4189.036, 'src': 'embed', 'start': 4144.283, 'weight': 0, 'content': [{'end': 4146.506, 'text': 'And there we have got an app master.', 'start': 4144.283, 'duration': 2.223}, {'end': 4152.33, 'text': 'So the app master is assigned whenever the resource manager receives a request for a MapReduce job.', 'start': 4146.926, 'duration': 5.404}, {'end': 4166.439, 'text': 'So then only an app master is launched which monitors if the MapReduce job is going on fine and reports and negotiates with the resource manager to ask for resources which might be needed in order to perform that particular MapReduce job.', 'start': 4152.729, 'duration': 13.71}, {'end': 4169.502, 'text': 'So this is again a master-slave architecture,', 'start': 4166.979, 'duration': 2.523}, {'end': 4176.352, 'text': 'where the resource manager is the master and the node manager is a slave which is responsible for looking after the app master and the container.', 'start': 4169.502, 'duration': 6.85}, {'end': 4178.194, 'text': 'So this is YARN.', 'start': 4176.832, 'duration': 1.362}, {'end': 4183.261, 'text': 'Now let us go ahead and take a look at the entire MapReduce job workflow.', 'start': 4179.095, 'duration': 4.166}, {'end': 4189.036, 'text': 'So what happens, the client node submits a MapReduce job to the resource manager.', 'start': 4184.133, 'duration': 4.903}], 'summary': 'Yarn architecture includes a resource manager, app master, and node manager in a master-slave model, with the client node submitting mapreduce jobs.', 'duration': 44.753, 'max_score': 4144.283, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA4144283.jpg'}], 'start': 2965.962, 'title': 'Mapreduce in hadoop', 'summary': 'Delves into mapreduce in hadoop, showcasing a significant reduction in processing time from 4 hours to 1 hour and 2 minutes through parallel task distribution and map-reduce phases, explaining steps, code overview, program execution, and yarn components.', 'chapters': [{'end': 3195.742, 'start': 2965.962, 'title': 'Understanding mapreduce in hadoop', 'summary': 'Illustrates the concept of mapreduce in hadoop, where a professor demonstrates a significant reduction in processing time from four hours to one 
hour and two minutes by distributing tasks among students and using the map and reduce phases to process data in parallel, ultimately achieving an effective solution for processing large data sets.', 'duration': 229.78, 'highlights': ['The professor demonstrates a significant reduction in processing time from four hours to one hour and two minutes by using the map and reduce phases to process data in parallel. By distributing tasks among students and utilizing the map and reduce phases, the processing time is significantly reduced from four hours to one hour and two minutes.', 'The concept of MapReduce in Hadoop involves dividing the processing of a single file into parts and processing them simultaneously, with the reducer phase combining all the intermediate results to provide the final output in a parallel and efficient manner. MapReduce in Hadoop involves dividing the processing of a single file into parts, processing them simultaneously, and using the reducer phase to combine all the intermediate results, resulting in a parallel and efficient processing approach.', 'The map phase involves reading and processing data to produce key value pairs as intermediate output, while the reduce phase aggregates the intermediate results from the map jobs to produce the final output in the form of key value pairs. The map phase reads and processes data to produce key value pairs as intermediate output, and the reduce phase aggregates the intermediate results from the map jobs to generate the final output in the form of key value pairs.']}, {'end': 3521.412, 'start': 3196.162, 'title': 'Mapreduce in hadoop', 'summary': 'Explains the mapreduce process in hadoop, detailing the steps of mapping, sorting, shuffling, reducing, and producing the final output, demonstrating a word count program and outlining the major parts of a mapreduce program.', 'duration': 325.25, 'highlights': ['The chapter explains the MapReduce process in Hadoop, detailing the steps of mapping, sorting, shuffling, reducing, and producing the final output. It provides a clear understanding of the MapReduce process and its key components.', 'Demonstrating a word count program and outlining the major parts of a MapReduce program. It includes running a word count program and outlining the major parts of a MapReduce program: mapper code, reducer code, and driver code.', 'Detailing the steps of mapping, sorting, shuffling, reducing, and producing the final output. It provides a step-by-step explanation of the MapReduce process, including mapping, sorting, shuffling, reducing, and final output generation.', 'Outlining the major parts of a MapReduce program: mapper code, reducer code, and driver code. It explains the major parts of a MapReduce program, including the mapper code, reducer code, and driver code.', 'Running a word count program. It involves running a word count program as part of the MapReduce demonstration.']}, {'end': 3716.662, 'start': 3521.412, 'title': 'Mapreduce code overview', 'summary': 'Discusses the mapper, reducer, and driver code in a mapreduce program, highlighting the process of shuffling, sorting, and the output format, with emphasis on word count and configuration details.', 'duration': 195.25, 'highlights': ['The reducer code processes the output of shuffling and sorting, summing up the frequency of each word and producing the final word count output. 
The reducer code processes the output of shuffling and sorting, summing up the frequency of each word and producing the final word count output.', 'The driver code contains configuration details of the MapReduce job, specifying the job name, input and output data types, mapper and reducer classes, and input/output formats. The driver code contains configuration details of the MapReduce job, specifying the job name, input and output data types, mapper and reducer classes, and input/output formats.', 'The mapper code handles the initial mapping of words to their frequencies, preparing the input for the reducer. The mapper code handles the initial mapping of words to their frequencies, preparing the input for the reducer.']}, {'end': 4234.426, 'start': 3717.142, 'title': 'Mapreduce program execution and yarn components', 'summary': 'Explains the execution of a mapreduce program practically, including setting up hdfs, creating directories, moving files, running the mapreduce program, and reviewing the output. it also covers the components of yarn, including resource manager, node manager, app master, and container, and the workflow of a mapreduce job.', 'duration': 517.284, 'highlights': ['The chapter explains the execution of a MapReduce program practically, including setting up HDFS, creating directories, moving files, running the MapReduce program, and reviewing the output.', 'The YARN components, including resource manager, node manager, app master, and container, are described, along with their responsibilities in the MapReduce process.', 'The workflow of a MapReduce job, from client node submission to resource manager interaction, container launch, app master negotiation, and task execution, is detailed.']}], 'duration': 1268.464, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA2965962.jpg', 'highlights': ['Significant reduction in processing time from 4 hours to 1 hour and 2 minutes through parallel task distribution and map-reduce phases.', 'MapReduce in Hadoop involves dividing the processing of a single file into parts and processing them simultaneously, with the reducer phase combining all the intermediate results to provide the final output in a parallel and efficient manner.', 'Map phase reads and processes data to produce key value pairs as intermediate output, while the reduce phase aggregates the intermediate results from the map jobs to produce the final output in the form of key value pairs.', 'Clear understanding of the MapReduce process and its key components.', 'Step-by-step explanation of the MapReduce process, including mapping, sorting, shuffling, reducing, and final output generation.', 'Running a word count program as part of the MapReduce demonstration.', 'Reducer code processes the output of shuffling and sorting, summing up the frequency of each word and producing the final word count output.', 'Driver code contains configuration details of the MapReduce job, specifying the job name, input and output data types, mapper and reducer classes, and input/output formats.', 'Mapper code handles the initial mapping of words to their frequencies, preparing the input for the reducer.', 'Explanation of the execution of a MapReduce program practically, including setting up HDFS, creating directories, moving files, running the MapReduce program, and reviewing the output.', 'Description of YARN components, including resource manager, node manager, app master, and container, along with their responsibilities in the 
MapReduce process.', 'Detailed workflow of a MapReduce job, from client node submission to resource manager interaction, container launch, app master negotiation, and task execution.']}, {'end': 4949.22, 'segs': [{'end': 4450.951, 'src': 'embed', 'start': 4411.627, 'weight': 0, 'content': [{'end': 4418.173, 'text': 'So what the node manager does is that it sends the node status or how each of the node is performing a single MapReduce job,', 'start': 4411.627, 'duration': 6.546}, {'end': 4420.316, 'text': 'and it sends a report to the resource manager.', 'start': 4418.173, 'duration': 2.143}, {'end': 4427.158, 'text': 'And when a resource manager receives a job request or a MapReduce job request from a client,', 'start': 4420.956, 'duration': 6.202}, {'end': 4430.98, 'text': 'what it does it asks a node manager to launch an app master.', 'start': 4427.158, 'duration': 3.822}, {'end': 4434.861, 'text': 'Now there is only one single app master for each application.', 'start': 4431.16, 'duration': 3.701}, {'end': 4443.385, 'text': 'So it is only launched when it gets a request or a MapReduce job from the client and it is terminated as soon as the MapReduce job is completed.', 'start': 4435.262, 'duration': 8.123}, {'end': 4450.951, 'text': 'So the app master is responsible for collecting all the resources that is needed in order to perform that MapReduce job from the resource manager.', 'start': 4443.925, 'duration': 7.026}], 'summary': 'Node manager reports node status to resource manager for mapreduce jobs.', 'duration': 39.324, 'max_score': 4411.627, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA4411627.jpg'}, {'end': 4858.516, 'src': 'embed', 'start': 4815.161, 'weight': 1, 'content': [{'end': 4822.205, 'text': 'So, obviously, when you run that PIG command, that one line PIG command, the compiler implicitly converts it into a MapReduce code.', 'start': 4815.161, 'duration': 7.044}, {'end': 4827.668, 'text': 'but you have to only write one single PIG command and it will perform analytics on your data.', 'start': 4822.205, 'duration': 5.463}, {'end': 4831.937, 'text': "So we've got Spark over here, which is used for near real-time processing.", 'start': 4828.408, 'duration': 3.529}, {'end': 4836.589, 'text': "And for machine learning, we've got two more tools, the Spark MLlib and Mahout.", 'start': 4832.619, 'duration': 3.97}, {'end': 4842.491, 'text': "So again, we've got tools like Zookeeper and Ambari, which is used for management and coordination.", 'start': 4837.469, 'duration': 5.022}, {'end': 4848.793, 'text': 'So Apache Ambari is a tool for provisioning, managing, and monitoring the Apache Hadoop clusters.', 'start': 4842.951, 'duration': 5.842}, {'end': 4854.415, 'text': 'And over here, Oozie is a workflow scheduler system in order to manage Apache Hadoop jobs.', 'start': 4849.533, 'duration': 4.882}, {'end': 4858.516, 'text': 'And this is very scalable, reliable, and an extensible system.', 'start': 4854.955, 'duration': 3.561}], 'summary': 'Pig command converts to mapreduce, spark for real-time processing, ambari for hadoop cluster management.', 'duration': 43.355, 'max_score': 4815.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA4815161.jpg'}, {'end': 4934.132, 'src': 'embed', 'start': 4904.552, 'weight': 6, 'content': [{'end': 4909.015, 'text': "And we'll understand it by taking into account and analyzing an Olympic data set.", 'start': 4904.552, 'duration': 
4.463}, {'end': 4913.619, 'text': "So let us see what we're going to do with this data set and how this data set looks like.", 'start': 4909.556, 'duration': 4.063}, {'end': 4922.789, 'text': "So we have an Olympic dataset and we're going to use a Hadoop tool which is known as PIG in order to make some analysis about this dataset.", 'start': 4914.687, 'duration': 8.102}, {'end': 4926.95, 'text': 'Now let me tell you a little bit about PIG before going ahead with this use case.', 'start': 4923.569, 'duration': 3.381}, {'end': 4934.132, 'text': 'So PIG is a very powerful and a very popular tool that has been widely used for big data analytics.', 'start': 4927.29, 'duration': 6.842}], 'summary': 'Analyzing an olympic dataset using the hadoop tool pig for big data analytics.', 'duration': 29.58, 'max_score': 4904.552, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA4904552.jpg'}], 'start': 4234.606, 'title': 'Mapreduce workflow and hadoop architecture', 'summary': 'Delves into the mapreduce workflow, covering the roles of app master and yarn child, along with buffer management involving data writing, spill to disk, and merging. it also explains hadoop and yarn architecture, cluster modes, and the benefits of using multi-node cluster mode for production.', 'chapters': [{'end': 4366.132, 'start': 4234.606, 'title': 'Mapreduce workflow and buffer management', 'summary': 'Explains the workflow of a mapreduce job, including the role of the app master, yarn child, and the buffer management process, where data is written, spilled to disk, and merged, with key details such as buffer size, spill threshold, and sort factor.', 'duration': 131.526, 'highlights': ['The app master receives resources from the resource manager to complete the MapReduce job and launches a container, which then triggers the actual MapReduce task, resulting in the final output.', 'The circular memory buffer in each map task has a default size of 100 MB, and the buffer size can be changed by modifying the mapreduce.task.io.sort.mb property.', 'When the buffer fills up to 80%, a background thread starts to spill the contents to the disk, and the map task may block until the spill is complete if the buffer fills up during this time.', 'The spill files created during the process are merged into a single partitioned and sorted output file, with the maximum number of streams or spill files to merge at once controlled by the mapreduce.task.io.sort.factor property, where the default is 10.']}, {'end': 4949.22, 'start': 4366.512, 'title': 'Understanding hadoop and yarn architecture', 'summary': 'Explains the mapreduce process, yarn architecture, hadoop architecture, cluster modes, hadoop ecosystem tools, and a use case analysis, emphasizing the benefits of using multi-node cluster mode for production.', 'duration': 582.708, 'highlights': ['Hadoop ecosystem tools such as Flume and Sqoop are used for ingesting data into HDFS to cope with data velocity, while Hive and PIG are utilized for analytics, and Spark for near real-time processing and machine learning.', 'The chapter emphasizes the benefits of using multi-node cluster mode for production, highlighting the importance of distributing tasks and performing them in parallel to maximize the benefits of big data.
', 'The YARN architecture is explained, detailing its components such as the resource manager, node manager, app master, and container, highlighting their roles in managing and processing MapReduce jobs.', 'The MapReduce process is outlined, illustrating the merging of intermediate results from different maps and the role of the reducer in providing the final result.', 'The different modes of Hadoop cluster, including multi-node, pseudo-distributed, standalone, and serial distributed, are described, with an emphasis on the suitability of multi-node cluster mode for production purposes.']}], 'duration': 714.614, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA4234606.jpg', 'highlights': ['The app master receives resources from the resource manager to complete the MapReduce job and launches a container, triggering the actual MapReduce task.', 'The circular memory buffer in each map task has a default size of 100 MB, and the buffer size can be changed by modifying the mapreduce.task.io.sort.mb property.', 'When the buffer fills up to 80%, a background thread starts to spill the contents to the disk, and the map task may block until the spill is complete if the buffer fills up during this time.', 'The spill files created during the process are merged into a single partitioned and sorted output file, with the maximum number of streams or spill files to merge at once controlled by the mapreduce.task.io.sort.factor property, where the default is 10.', 'The YARN architecture is explained, detailing its components such as the resource manager, node manager, app master, and container, highlighting their roles in managing and processing MapReduce jobs.', 'The MapReduce process is outlined, illustrating the merging of intermediate results from different maps and the role of the reducer in providing the final result.', 'The chapter emphasizes the benefits of using multi-node cluster mode for production, highlighting the importance of distributing tasks and performing them in parallel to maximize the benefits of big data.', 'The different modes of Hadoop cluster, including multi-node, pseudo-distributed, standalone, and serial distributed, are described, with an emphasis on the suitability of multi-node cluster mode for production purposes.']}, {'end': 6091.419, 'segs': [{'end': 5062.174, 'src': 'embed', 'start': 5035.695, 'weight': 1, 'content': [{'end': 5041.199, 'text': 'Then we have got the age of the athlete, the country which an athlete belongs to.', 'start': 5035.695, 'duration': 5.504}, {'end': 5044.662, 'text': 'the year of Olympics when the athlete played.', 'start': 5041.199, 'duration': 3.463}, {'end': 5050.747, 'text': 'the closing date is the date when the ending ceremony was held for that 
particular Olympic year.', 'start': 5044.662, 'duration': 6.085}, {'end': 5054.189, 'text': 'the sport which an athlete is associated to.', 'start': 5050.747, 'duration': 3.442}, {'end': 5062.174, 'text': 'the number of gold medals won by him or her, number of silver medals, number of bronze medals, the total medals won by a particular athlete.', 'start': 5054.189, 'duration': 7.985}], 'summary': "Collects athlete's age, country, year, closing date, sport, gold, silver, bronze, total medals.", 'duration': 26.479, 'max_score': 5035.695, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA5035695.jpg'}, {'end': 5316.886, 'src': 'embed', 'start': 5291.095, 'weight': 6, 'content': [{'end': 5295.657, 'text': 'So let me go back to our data set and let me show you why I have mentioned 2 and 9 here.', 'start': 5291.095, 'duration': 4.562}, {'end': 5301.04, 'text': 'So this is our data set and the index of all the fields it starts from 0.', 'start': 5296.297, 'duration': 4.743}, {'end': 5302.58, 'text': 'So athlete is at 0th index.', 'start': 5301.04, 'duration': 1.54}, {'end': 5308.883, 'text': 'Age is at one, country is at two, and total medals is at six.', 'start': 5304.541, 'duration': 4.342}, {'end': 5316.886, 'text': "So we only need the country and the total medals, and that's why we've mentioned the indexes of the country field and the total medals field only.", 'start': 5309.003, 'duration': 7.883}], 'summary': 'Data set indexes: country (index 2) and total medals (index 6) are mentioned for extraction.', 'duration': 25.791, 'max_score': 5291.095, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA5291095.jpg'}, {'end': 6091.419, 'src': 'embed', 'start': 6067.28, 'weight': 0, 'content': [{'end': 6069.301, 'text': "I hope that you've all learned about Hadoop.", 'start': 6067.28, 'duration': 2.021}, {'end': 6073.565, 'text': 'But if you have any queries or any doubts kindly leave it on the comment section below.', 'start': 6069.682, 'duration': 3.883}, {'end': 6077.608, 'text': "This video will be uploaded on your elements and I'll see you next time.", 'start': 6074.165, 'duration': 3.443}, {'end': 6078.809, 'text': 'Till then happy learning.', 'start': 6077.688, 'duration': 1.121}, {'end': 6081.351, 'text': 'I hope you enjoyed listening to this video.', 'start': 6079.71, 'duration': 1.641}, {'end': 6086.615, 'text': 'Please be kind enough to like it and you can comment any of your doubts and queries and we will reply to them at the earliest.', 'start': 6081.731, 'duration': 4.884}, {'end': 6090.478, 'text': 'Do look out for more videos in our playlist and subscribe to our Edureka channel to learn more.', 'start': 6087.096, 'duration': 3.382}, {'end': 6091.419, 'text': 'Happy learning.', 'start': 6090.959, 'duration': 0.46}], 'summary': 'Introduction to hadoop with a request for interaction and engagement with the video content.', 'duration': 24.139, 'max_score': 6067.28, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA6067280.jpg'}], 'start': 4949.38, 'title': 'Pig and olympic dataset analysis', 'summary': 'Introduces pig, highlighting its ease of use compared to mapreduce with 10 lines of pig code being equal to 200 lines of mapreduce code, and outlines the analysis to be conducted on the olympic dataset including finding the top 10 countries with the highest medals, total gold medals by country, and identifying countries with 
most medals in swimming. it also demonstrates using pig to analyze a dataset of olympic medalists, filtering relevant fields, grouping and counting the total medals for each country, and finally sorting and selecting the top 10 countries based on the total medals achieved, showcasing the ease and power of pig for data analysis.', 'chapters': [{'end': 5095.251, 'start': 4949.38, 'title': 'Introduction to pig and olympic dataset analysis', 'summary': 'Introduces pig, a tool using pig latin for data processing, highlighting its ease of use compared to mapreduce with 10 lines of pig code being equal to 200 lines of mapreduce code, and outlines the analysis to be conducted on the olympic dataset including finding the top 10 countries with the highest medals, total gold medals by country, and identifying countries with most medals in swimming.', 'duration': 145.871, 'highlights': ["PIG's efficiency: 10 lines of PIG code is almost equal to 200 lines of MapReduce code. PIG code's efficiency is highlighted by the fact that 10 lines of PIG code is equivalent to 200 lines of MapReduce code, showcasing the significant reduction in code length for data processing.", 'Ease of use of PIG: PIG is popular for being easy to learn and efficient in dealing with large datasets. The popularity of PIG is attributed to its ease of learning and its efficacy in handling large datasets, making it a favorable tool for data processing.', 'Analysis plan for Olympic dataset: Identifying top 10 countries with the highest medals, total gold medals by country, and countries with the most medals in swimming. The planned analysis for the Olympic dataset includes identifying the top 10 countries with the highest medals, determining the total gold medals by country, and identifying the countries with the most medals in swimming, showcasing the comprehensive nature of the dataset analysis.']}, {'end': 5637.368, 'start': 5095.672, 'title': 'Analyzing top 10 countries with highest medals', 'summary': 'Demonstrates using pig to analyze a dataset of olympic medalists, filtering relevant fields, grouping and counting the total medals for each country, and finally sorting and selecting the top 10 countries based on the total medals achieved.', 'duration': 541.696, 'highlights': ['The chapter demonstrates using PIG to analyze a dataset of Olympic medalists, filtering relevant fields, grouping and counting the total medals for each country, and finally sorting and selecting the top 10 countries based on the total medals achieved.', 'The speaker shows the process of loading the dataset into PIG, using delimiter sign T for fields separated by tabs, and confirming the successful loading of the dataset.', 'A code is written to filter and retain only the country and total medals fields from the dataset, using respective indexes, and the resulting intermediate data is displayed.', "The process of grouping all the same countries together using a variable and the command 'group.country_final by country' is demonstrated, and the resulting intermediate data is displayed.", "The speaker explains that each PIG code implicitly gets transformed into a MapReduce code, and proceeds to count the grouped countries using the built-in function 'count', displaying the final result.", 'The final result of the top 10 countries with the total number of medals is displayed, and the speaker demonstrates sorting the countries in descending order based on total medals achieved, and selecting the top 10 countries.', 'The final list of the top 10 countries with 
their respective total number of medals is displayed, and the speaker successfully stores this result in the output directory.', 'The chapter concludes by mentioning the possibility of finding answers to other questions using a similar approach.']}, {'end': 6091.419, 'start': 5637.808, 'title': 'Pig analysis use case', 'summary': 'Demonstrates using pig to analyze olympic data, including finding the top 10 countries with the most gold medals, identifying the countries with the highest number of swimming medals, and storing the results, showcasing the ease and power of pig for data analysis.', 'duration': 453.611, 'highlights': ['PIG analysis to find top 10 countries with the most gold medals The chapter demonstrates using PIG to find the top 10 countries that won the highest number of gold medals, showcasing the power of PIG for data analysis.', 'PIG analysis to identify countries with the most swimming medals The chapter also showcases using PIG to identify the countries that have won the most number of medals in swimming, demonstrating the ease of performing complex analytics using PIG.', 'Storing the analysis results using PIG The chapter illustrates how to store the final analysis results in the output directory using PIG, emphasizing the simplicity and efficiency of PIG for data storage.']}], 'duration': 1142.039, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/mafw2-CVYnA/pics/mafw2-CVYnA4949380.jpg', 'highlights': ["PIG's efficiency: 10 lines of PIG code is almost equal to 200 lines of MapReduce code.", 'Ease of use of PIG: PIG is popular for being easy to learn and efficient in dealing with large datasets.', 'Analysis plan for Olympic dataset: Identifying top 10 countries with the highest medals, total gold medals by country, and countries with the most medals in swimming.', 'Using PIG to analyze a dataset of Olympic medalists, filtering relevant fields, grouping and counting the total medals for each country, and finally sorting and selecting the top 10 countries based on the total medals achieved.', 'PIG analysis to find top 10 countries with the most gold medals.', 'PIG analysis to identify countries with the most swimming medals.', 'Storing the analysis results using PIG.']}], 'highlights': ['The tutorial will focus on the big data growth drivers, reasons for the conversion of data into big data, an overview of big data, Hadoop solution, master-slave architecture, Hadoop core components, HDFS data storage mechanism, and the MapReduce programming model.', 'Significant reduction in processing time from 4 hours to 1 hour and 2 minutes through parallel task distribution and map-reduce phases.', 'The writing process of block A and block B will happen at the same time, with the blocks being copied at the same time, and the writing mechanism taking place in three steps sequentially, and as many blocks the file contains, all the blocks will be copied at the same time in sequential steps onto data nodes.', 'Unstructured data, comprising 90% of data, is essential but traditional systems cannot process it.', 'The chapter covers an introduction to Hadoop, including the reasons for the growth of big data, an overview of big data and Hadoop, master-slave architecture, Hadoop core components, HDFS data storage mechanism, and the MapReduce programming model.', 'The chapter explains the master-slave architecture of Hadoop using a relatable scenario involving a project manager and his team, providing a clear analogy for understanding the architecture.', 'The chapter 
explores the impact of technology, internet of things, and social media on the growth of big data, emphasizing the exponential rise in data volume due to the widespread use of gadgets, smart appliances, and social media platforms.', 'The mention of 217 new users every 60 seconds showcases the rapid growth and widespread usage of mobile phones, reflecting the significant impact of mobile technology on data generation and consumption.', 'The discussion on the incapabilities of traditional systems to process and store the increasing volume of data highlights the pressing need for alternative solutions to effectively manage and utilize large datasets, emphasizing the challenges posed by the volume of data.', 'The emphasis on the importance of unstructured data, particularly in social media platforms, underscores its value in providing insights for businesses, showcasing the importance of utilizing unstructured data for customer profiling and targeted advertising, with a focus on its potential impact on business decisions.']}
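
Editor's note: the sketches below illustrate the mechanisms described in the chapter summaries above. They are not the code shown in the video; all class names, paths, and cluster addresses are invented placeholders.

The HDFS chapters describe the write pipeline (NameNode hands out three DataNode addresses, blocks are copied with a replication factor of three, acknowledgements flow back in reverse order) and the read path (fetch block locations, then read the blocks). A minimal Java sketch of the client side, assuming the standard org.apache.hadoop.fs.FileSystem API and a hypothetical NameNode address:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; point this at your own cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        // Replication factor of three, as described in the write-mechanism chapters.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/edureka/example.txt"); // hypothetical path

        // create() asks the NameNode for target DataNodes and sets up the write
        // pipeline; the block copies and the reverse-order acknowledgements are
        // handled inside the HDFS client library, not in user code.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }
        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());

        // open() asks the NameNode for block locations and then fetches the
        // blocks from the DataNodes, mirroring the read mechanism described above.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
```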
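The MapReduce chapters describe a word count program with three parts: mapper code that emits (word, 1) pairs, reducer code that sums the frequencies after shuffling and sorting, and driver code that holds the job configuration. A condensed, self-contained Java sketch of that structure (class and path names are illustrative, not taken from the video):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: reads one line per call and emits (word, 1) as intermediate output.
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) after shuffle/sort and sums the counts.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: job name, mapper/reducer classes, key/value types, input/output paths.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, this would be submitted with `hadoop jar wordcount.jar WordCount <input dir> <output dir>`, which matches the job-workflow chapter: the client submits the job to the resource manager, an app master is launched, and containers run the map and reduce tasks.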
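The buffer-management chapter quotes the 100 MB circular map-side buffer, the 80% spill threshold, and the merge factor of 10. A small Java sketch of how those knobs could be set on a job, assuming the standard MRv2 property names (the spill-threshold property name, mapreduce.map.sort.spill.percent, is the usual one but is an assumption here, as the summary only mentions the 80% figure):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Size (in MB) of the circular in-memory buffer used by each map task;
        // the summary above quotes the 100 MB default.
        conf.setInt("mapreduce.task.io.sort.mb", 200);

        // Fraction of the buffer at which the background thread starts spilling
        // to disk (80% by default, as described above).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);

        // Maximum number of spill files merged at once (default 10).
        conf.setInt("mapreduce.task.io.sort.factor", 10);

        Job job = Job.getInstance(conf, "spill tuning demo");
        // ... set mapper, reducer, and input/output paths as in the word count driver ...
    }
}
```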
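The Pig use-case chapters walk through loading the tab-separated Olympic dataset, keeping only the country field (index 2) and the total-medals field (index 6), grouping rows by country, aggregating, ordering in descending order, and keeping the top 10 countries before storing the result. A hedged Pig Latin sketch of those steps is below; the paths and relation names are invented, and SUM over the total-medals column is used as the aggregate (the video itself uses a built-in aggregate function at this step):

```pig
-- Load the tab-separated Olympic dataset (placeholder path).
olympics = LOAD '/user/edureka/olympic_data.csv' USING PigStorage('\t');

-- Keep only country ($2) and total medals ($6), as in the walkthrough.
country_medals = FOREACH olympics GENERATE $2 AS country:chararray,
                                           (int)$6 AS total_medals;

-- Group all rows for the same country together.
grouped = GROUP country_medals BY country;

-- Aggregate the medals per country.
medal_totals = FOREACH grouped GENERATE group AS country,
                                        SUM(country_medals.total_medals) AS medals;

-- Sort in descending order and keep the top 10 countries.
ordered = ORDER medal_totals BY medals DESC;
top10 = LIMIT ordered 10;

-- Store the result in an output directory (placeholder path).
STORE top10 INTO '/user/edureka/output/top10_medals';
```

Each of these statements is implicitly compiled into MapReduce jobs, which is the point the chapter makes about a few lines of Pig Latin replacing a much longer MapReduce program; the gold-medal and swimming variants described above would only change the filtered and aggregated columns.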