title
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoop Tutorial | Edureka
description
🔥 Edureka Hadoop Training: https://www.edureka.co/big-data-hadoop-training-certification
This Hadoop Tutorial on Hadoop Interview Questions and Answers (Hadoop Interview Blog series: https://goo.gl/LRyX6W) will help you prepare for Big Data and Hadoop interviews. Learn the most important Hadoop interview questions and answers and know what will set you apart in the interview process. Below are the topics covered in this Hadoop Interview Questions and Answers Tutorial:
Hadoop Interview Questions on:
1) Big Data & Hadoop
2) HDFS
3) MapReduce
4) Apache Hive
5) Apache Pig
6) Apache HBase and Sqoop
Subscribe to our channel to get video updates. Hit the subscribe button above.
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
--------------------Edureka Big Data Training and Certifications------------------------
🔵 Edureka Hadoop Training: http://bit.ly/2YBlw29
🔵 Edureka Spark Training: http://bit.ly/2PeHvc9
🔵 Edureka Kafka Training: http://bit.ly/34e7Riy
🔵 Edureka Cassandra Training: http://bit.ly/2E9AK54
🔵 Edureka Talend Training: http://bit.ly/2YzYIjg
🔵 Edureka Hadoop Administration Training: http://bit.ly/2YE8Nf9
PG in Big Data Engineering with NIT Rourkela: https://www.edureka.co/post-graduate/big-data-engineering (450+ Hrs || 9 Months || 20+ Projects & 100+ Case studies)
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
#HadoopInterviewQuestions #BigDataInterviewQuestions #HadoopInterview
How it Works?
1. This is a 5-Week Instructor-led Online Course with 40 hours of assignments and 30 hours of project work
2. We provide 24x7 One-on-One LIVE Technical Support to help you with any problems you might face or any clarifications you may require during the course.
3. At the end of the training you will undergo a 2-hour LIVE Practical Exam, based on which we will provide you with a Grade and a Verifiable Certificate!
- - - - - - - - - - - - - -
About the Course
Edureka’s Big Data and Hadoop online training is designed to help you become a top Hadoop developer. During this course, our expert Hadoop instructors will help you:
1. Master the concepts of HDFS and MapReduce framework
2. Understand Hadoop 2.x Architecture
3. Set up a Hadoop cluster and write complex MapReduce programs
4. Learn data loading techniques using Sqoop and Flume
5. Perform data analytics using Pig, Hive and YARN
6. Implement HBase and MapReduce integration
7. Implement Advanced Usage and Indexing
8. Schedule jobs using Oozie
9. Implement best practices for Hadoop development
10. Work on a real life Project on Big Data Analytics
11. Understand Spark and its Ecosystem
12. Learn how to work with RDDs in Spark
- - - - - - - - - - - - - -
Who should go for this course?
If you belong to any of the following groups, knowledge of Big Data and Hadoop is crucial for progressing in your career:
1. Analytics professionals
2. BI /ETL/DW professionals
3. Project managers
4. Testing professionals
5. Mainframe professionals
6. Software developers and architects
7. Recent graduates passionate about building a successful career in Big Data
- - - - - - - - - - - - - -
Why Learn Hadoop?
Big Data! A Worldwide Problem?
According to Wikipedia, "Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." In simpler terms, Big Data is the term for the large volumes of data that organizations store and process. However, it is becoming very difficult for companies to store, retrieve and process this ever-increasing data. If a company manages its data well, nothing can stop it from becoming the next BIG success!
The problem lies in using traditional systems to store enormous amounts of data. Though these systems were a success a few years ago, they are fast becoming obsolete as the volume and complexity of data grow. The good news is Hadoop, which has become nothing less than a panacea for companies working with BIG DATA in a variety of applications, and is now an integral part of storing, handling, evaluating and retrieving hundreds of terabytes, and even petabytes, of data.
For more information, please write back to us at sales@edureka.in or call us at IND: 9606058406 / US: 18338555775 (toll-free).
Customer Review:
Michael Harkins, System Architect, Hortonworks says: “The courses are top rate. The best part is live instruction, with playback. But my favorite feature is viewing a previous class. Also, they are always there to answer questions, and prompt when you open an issue if you are having any trouble. Added bonus ~ you get lifetime access to the course you took!!! Edureka lets you go back later, when your boss says "I want this ASAP!" ~ This is the killer education app... I've taken two courses, and I'm taking two more.”
detail
{'title': 'Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoop Tutorial | Edureka', 'heatmap': [{'end': 2062.893, 'start': 1897.374, 'weight': 0.912}, {'end': 2316.955, 'start': 2227.191, 'weight': 0.717}, {'end': 4209.691, 'start': 4038.303, 'weight': 0.709}, {'end': 5033.941, 'start': 4781.193, 'weight': 0.841}, {'end': 5453.57, 'start': 5360.734, 'weight': 1}, {'end': 5941.269, 'start': 5683.052, 'weight': 0.77}, {'end': 7014.265, 'start': 6930.672, 'weight': 0.872}], 'summary': "Covers hadoop interview questions, emphasizing big data's rapid growth and job opportunities. it discusses data types, hadoop vs rdbms, components, architecture, archiving, mapreduce internals, hive performance, and career growth, with examples of growth in job opportunities and the importance of cloud data certification.", 'chapters': [{'end': 605.229, 'segs': [{'end': 157.609, 'src': 'embed', 'start': 123.921, 'weight': 2, 'content': [{'end': 125.743, 'text': 'You will make best of the session in that case.', 'start': 123.921, 'duration': 1.822}, {'end': 127.164, 'text': "But it's completely fine.", 'start': 126.083, 'duration': 1.081}, {'end': 128.406, 'text': 'just listen to everything.', 'start': 127.164, 'duration': 1.242}, {'end': 130.507, 'text': "So let's move further.", 'start': 128.925, 'duration': 1.582}, {'end': 135.69, 'text': 'so starting with few basic stuff, like when this Hadoop started right.', 'start': 130.507, 'duration': 5.183}, {'end': 146.637, 'text': "so it's not a very old technology, it's pretty new and basically, in last five years you are seeing a tremendous growth happening in this area.", 'start': 135.69, 'duration': 10.947}, {'end': 150.363, 'text': 'I can tell you whenever I teach in my batches,', 'start': 147.341, 'duration': 3.022}, {'end': 157.609, 'text': "I clearly mention that this you are all sitting on a time bomb because you're learning at the very right time.", 'start': 150.363, 'duration': 7.246}], 'summary': 'Hadoop has seen tremendous growth in the last 5 years, making it a valuable skill to learn.', 'duration': 33.688, 'max_score': 123.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg123921.jpg'}, {'end': 281.87, 'src': 'embed', 'start': 231.93, 'weight': 0, 'content': [{'end': 243.854, 'text': "So if we notice the percentage of basically jobs which have increased in from 2012 to 2017 I'm just taking the time frame of, let's say, four,", 'start': 231.93, 'duration': 11.924}, {'end': 250.956, 'text': 'five years you can notice for Hadoop developer it has grown for more than around 247%.', 'start': 243.854, 'duration': 7.102}, {'end': 254.64, 'text': 'Hadoop the administrator.', 'start': 250.956, 'duration': 3.684}, {'end': 259.144, 'text': '245%, and definitely 297% is the highest for Hadoop architect.', 'start': 254.64, 'duration': 4.504}, {'end': 264.268, 'text': 'and the reason is very obvious, because the reason is Apache Spark has picked up in the market.', 'start': 259.144, 'duration': 5.124}, {'end': 265.289, 'text': 'like anything.', 'start': 264.268, 'duration': 1.021}, {'end': 272.476, 'text': "it is the next hot cake going on in the market, and that's the reason with lot of companies are migrating towards Apache Spark.", 'start': 265.289, 'duration': 7.187}, {'end': 275.579, 'text': 'So this is definitely one of the thing which is happening.', 'start': 272.916, 'duration': 2.663}, {'end': 281.87, 'text': "Moving further, can I exclude MapReduce from a resume because 
I'm not good at code Java right now.", 'start': 276.303, 'duration': 5.567}], 'summary': 'From 2012 to 2017, jobs for hadoop developer increased by 247%, hadoop administrator by 245%, and hadoop architect by 297%, largely due to the rising popularity of apache spark.', 'duration': 49.94, 'max_score': 231.93, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg231930.jpg'}, {'end': 371.678, 'src': 'embed', 'start': 343.327, 'weight': 5, 'content': [{'end': 351.156, 'text': "Apache Spark is a separate domain in itself that's the reason we have not included it but you can expect maybe some separate session for that as well.", 'start': 343.327, 'duration': 7.829}, {'end': 354.1, 'text': 'To start with interview.', 'start': 352.578, 'duration': 1.522}, {'end': 362.215, 'text': 'Now, when we talk about big data, right the very basic questions, lot of time pops up.', 'start': 354.873, 'duration': 7.342}, {'end': 367.397, 'text': "what are five B's available in big data?", 'start': 362.215, 'duration': 5.182}, {'end': 369.558, 'text': 'can anybody answer that?', 'start': 367.397, 'duration': 2.161}, {'end': 371.678, 'text': 'okay, so Ravi want to answer this?', 'start': 369.558, 'duration': 2.12}], 'summary': "Discussion about apache spark and big data with emphasis on the five b's in big data.", 'duration': 28.351, 'max_score': 343.327, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg343327.jpg'}, {'end': 427.467, 'src': 'embed', 'start': 398.788, 'weight': 3, 'content': [{'end': 402.149, 'text': 'Velocity is how much fast is growing.', 'start': 398.788, 'duration': 3.361}, {'end': 405.4, 'text': 'Okay, very much.', 'start': 403.499, 'duration': 1.901}, {'end': 406.9, 'text': 'one good answer.', 'start': 405.4, 'duration': 1.5}, {'end': 416.843, 'text': 'when big data started, IBM gave a definition with just easy the three ways for volume variety, velocity.', 'start': 406.9, 'duration': 9.943}, {'end': 417.984, 'text': 'so what was in volume?', 'start': 416.843, 'duration': 1.141}, {'end': 423.306, 'text': 'Volume was, when we talk about like, in terms of amount of data, what we are dealing with.', 'start': 418.204, 'duration': 5.102}, {'end': 427.467, 'text': "right, for example, today's Facebook is dealing with very huge amount of data.", 'start': 423.306, 'duration': 4.161}], 'summary': "Velocity is a key aspect of big data, exemplified by facebook's substantial data volume.", 'duration': 28.679, 'max_score': 398.788, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg398788.jpg'}, {'end': 583.515, 'src': 'embed', 'start': 553.446, 'weight': 4, 'content': [{'end': 556.289, 'text': 'now, somebody said that I want to visualize the data.', 'start': 553.446, 'duration': 2.843}, {'end': 559.652, 'text': 'so visualization is should also be one of the B factor.', 'start': 556.289, 'duration': 3.363}, {'end': 565.457, 'text': 'somebody started saying that I want to see the vocabulary of the data or the validity of the data.', 'start': 559.652, 'duration': 5.805}, {'end': 572.223, 'text': "now they started keep on adding their B's, but majorly, if you talk about there are four B's, which carries a good value, okay.", 'start': 565.457, 'duration': 6.766}, {'end': 577.488, 'text': 'and usually in any interview, they will not expect you to know all the weeks, basically, if you know it all, good,', 'start': 572.583, 'duration': 4.905}, {'end': 
580.492, 'text': 'but they will be just expecting you to kind of understand that.', 'start': 577.488, 'duration': 3.004}, {'end': 583.515, 'text': 'okay, do you know at least four weeks which are important?', 'start': 580.492, 'duration': 3.023}], 'summary': "In data visualization, understanding four key 'b factors' is important for interviews.", 'duration': 30.069, 'max_score': 553.446, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg553446.jpg'}], 'start': 0.169, 'title': 'Hadoop interview questions and big data fundamentals', 'summary': "Provides insights into hadoop interview questions and emphasizes the rapid growth of the big data domain, with job opportunities for hadoop developers, administrators, and architects increasing by 297%, 245%, and 247% respectively from 2012 to 2017. it also covers the fundamentals of big data including the 5 vs and the importance of understanding at least the 4 major components with examples from facebook and twitter's data growth.", 'chapters': [{'end': 322.129, 'start': 0.169, 'title': 'Hadoop interview questions', 'summary': 'Provides insights into hadoop interview questions, emphasizing the rapid growth of the big data domain, with job opportunities for hadoop developers, administrators, and architects increasing by 297%, 245%, and 247% respectively from 2012 to 2017.', 'duration': 321.96, 'highlights': ['The big data domain is moving up, with job market increasing by 297% for Hadoop architects, 245% for administrators, and 247% for developers from 2012 to 2017. The job market for Hadoop developers, administrators, and architects has shown significant growth, with job opportunities increasing by 247%, 245%, and 297% respectively from 2012 to 2017.', "Apache Spark's popularity has led to a significant migration, creating job opportunities in the market. Apache Spark's rising popularity has resulted in a notable migration by companies, leading to increased job opportunities in the market.", 'The session emphasizes the importance of Hadoop knowledge, highlighting its relevance for both beginners and experienced individuals. The session stresses the significance of Hadoop knowledge, underscoring its relevance for individuals at varying levels of experience, including beginners and experienced professionals.']}, {'end': 605.229, 'start': 322.209, 'title': 'Big data fundamentals', 'summary': "Covers the fundamentals of big data including the 5 vs (volume, variety, velocity, veracity, value), emphasizing the importance of understanding at least the 4 major components, with examples from facebook and twitter's data growth.", 'duration': 283.02, 'highlights': ['The 5 Vs of big data (Volume, Variety, Velocity, Veracity, Value) are important to understand, with Volume referring to the amount of data, Variety encompassing different types of data such as structured, unstructured, and semi-structured, and Velocity indicating the speed of data growth with examples from companies like Facebook and Twitter. The explanation of the 5 Vs of big data, emphasizing the importance of understanding each component and providing examples of data growth from companies like Facebook and Twitter.', 'The importance of understanding at least the 4 major components of big data (Volume, Variety, Velocity, Veracity) is emphasized for interviews, with a focus on recognizing their significance in the context of data analysis and decision-making. 
Emphasizing the importance of understanding at least the 4 major components of big data for interviews and recognizing their significance in data analysis and decision-making.', 'Interview preparation for big data may involve understanding the 5 Vs and being able to discuss at least 4 key components, demonstrating a solid grasp of the fundamental concepts. Highlighting the importance of understanding the 5 Vs and at least 4 key components for interview preparation and demonstrating a solid grasp of fundamental concepts.']}], 'duration': 605.06, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg169.jpg', 'highlights': ['Job market for Hadoop architects, administrators, and developers increased by 297%, 245%, and 247% from 2012 to 2017.', "Apache Spark's rising popularity resulted in notable migration by companies, creating increased job opportunities.", 'The session stresses the significance of Hadoop knowledge for individuals at varying levels of experience.', 'Understanding the 5 Vs of big data is important, with examples from companies like Facebook and Twitter.', 'Emphasizing the importance of understanding at least the 4 major components of big data for interviews and decision-making.', 'Interview preparation for big data may involve understanding the 5 Vs and at least 4 key components.']}, {'end': 1420.531, 'segs': [{'end': 893.785, 'src': 'embed', 'start': 861.556, 'weight': 2, 'content': [{'end': 865.357, 'text': "that's where they created a third category called as semi-structure,", 'start': 861.556, 'duration': 3.801}, {'end': 872.62, 'text': 'so that they can keep it in the middle the data which is sounding something of the structure type, or as a unstructured type.', 'start': 865.357, 'duration': 7.263}, {'end': 876.822, 'text': 'they started if they created a new category called as semi-structure data.', 'start': 872.62, 'duration': 4.202}, {'end': 883.481, 'text': 'everybody here on this part what basically is structured data semi-structured data unstructured data okay moving further.', 'start': 877.378, 'duration': 6.103}, {'end': 893.785, 'text': 'Now I have another question for you how Hadoop differs from your traditional processing system using RDBMS?', 'start': 883.961, 'duration': 9.824}], 'summary': 'A new semi-structured data category was created to manage data falling between structured and unstructured types, while also discussing the difference between hadoop and traditional rdbms.', 'duration': 32.229, 'max_score': 861.556, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg861556.jpg'}, {'end': 1186.487, 'src': 'embed', 'start': 1161.073, 'weight': 0, 'content': [{'end': 1168.159, 'text': 'so lot of companies actually came up like this kind of idea, that they are expert in Hadoop and if you require any support,', 'start': 1161.073, 'duration': 7.086}, {'end': 1169.86, 'text': 'we will help you with that.', 'start': 1168.679, 'duration': 1.181}, {'end': 1172.161, 'text': 'so that is one thing which is happening now.', 'start': 1169.86, 'duration': 2.301}, {'end': 1174.662, 'text': 'RDBMS can only deal with structured data.', 'start': 1172.161, 'duration': 2.501}, {'end': 1179.864, 'text': 'right, basically, when we talk about RDBMS, we cannot deal with unstructured kind of data.', 'start': 1174.662, 'duration': 5.202}, {'end': 1186.487, 'text': 'so you can write, not deal with it as well, because, like somebody just argued with me, he said that you know,', 
'start': 1179.864, 'duration': 6.623}], 'summary': 'Companies provide hadoop expertise for support, while rdbms handles only structured data.', 'duration': 25.414, 'max_score': 1161.073, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1161073.jpg'}, {'end': 1266.826, 'src': 'embed', 'start': 1221.109, 'weight': 1, 'content': [{'end': 1230.372, 'text': 'but when it comes to your Hadoop system, it can be unstructured, semi-structured, as well as structured data right.', 'start': 1221.109, 'duration': 9.263}, {'end': 1235.794, 'text': 'because if we take an example of high, you know they deal with structured data right.', 'start': 1230.372, 'duration': 5.422}, {'end': 1237.335, 'text': 'so that is one thing.', 'start': 1235.794, 'duration': 1.541}, {'end': 1241.256, 'text': 'which is there RDBMS, you work just on a single machine, right.', 'start': 1237.335, 'duration': 3.921}, {'end': 1247.399, 'text': "so let's say, you work on a single laptop where you have RDBMS installed and working, but when it comes to Hadoop,", 'start': 1241.256, 'duration': 6.143}, {'end': 1250.3, 'text': 'you are working in a distributed fashion, right.', 'start': 1247.399, 'duration': 2.901}, {'end': 1253.481, 'text': 'there will be multiple machines which can be involved.', 'start': 1250.3, 'duration': 3.181}, {'end': 1263.065, 'text': 'in this case, right, and as I said, in RDBMS, mostly when the data is small, your speed will be very fast.', 'start': 1253.481, 'duration': 9.584}, {'end': 1266.826, 'text': 'your computation is going to be very quick.', 'start': 1263.065, 'duration': 3.761}], 'summary': 'Hadoop system can handle unstructured, semi-structured, and structured data, enabling distributed computation for faster processing.', 'duration': 45.717, 'max_score': 1221.109, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1221109.jpg'}], 'start': 606.01, 'title': 'Data types and hadoop vs rdbms', 'summary': 'Covers structured, unstructured, and semi-structured data, including historical context, challenges, and impact of modern platforms. it also discusses hadoop vs rdbms, highlighting differences, data handling, processing methods, cost implications, and key components of hadoop.', 'chapters': [{'end': 883.481, 'start': 606.01, 'title': 'Structured, unstructured, and semi-structured data', 'summary': 'Discusses the concepts of structured, unstructured, and semi-structured data, highlighting the historical context, challenges, and the emergence of different data types, with examples from companies like oracle and ibm, and the impact of modern platforms like facebook. it also explains the characteristics and limitations of each data type, leading to the creation of the third category of semi-structured data.', 'duration': 277.471, 'highlights': ['Companies like Oracle and IBM emerged in the 1970s and 1980s to address the challenges of dealing with structured data, which was relatively small but still posed difficulties in storage and management. During the 1970s and 1980s, companies like Oracle and IBM entered the market to tackle the challenges associated with structured data, which, despite being relatively small, presented storage and management difficulties.', 'Modern platforms like Facebook introduced unstructured data types such as videos, audios, and pictures, posing challenges for traditional RDBMS systems due to the absence of a structured format. 
The advent of modern platforms like Facebook brought about the introduction of unstructured data types like videos, audios, and pictures, creating challenges for traditional RDBMS systems due to the lack of a structured format.', 'The emergence of semi-structured data was necessitated by the need to categorize data that exhibited characteristics of both structured and unstructured data, such as XML and JSON files, which possess patterns but do not fully align with the capabilities of RDBMS systems. The introduction of semi-structured data stemmed from the necessity to classify data displaying traits of both structured and unstructured data, exemplified by XML and JSON files, which demonstrate patterns but do not entirely align with the functionalities of RDBMS systems.']}, {'end': 1420.531, 'start': 883.961, 'title': 'Hadoop vs rdbms: key differences and components', 'summary': 'Discusses the key differences between hadoop and traditional rdbms, including the ability to handle different types of data, processing methods, and cost implications. it also covers the components of hadoop, such as hdfs, yarn, and namenode, and their respective services.', 'duration': 536.57, 'highlights': ["Hadoop can store and process any type of data, while RDBMS can store only relational data, making it suitable for distributed storage and processing, parallel processing, and large data. Hadoop's capability to handle diverse data types and enable distributed storage and processing makes it suitable for large-scale parallel processing, which is a significant difference from RDBMS.", 'Hadoop cannot replace RDBMS due to limitations in supporting asset properties and operations like create, delete, and update, especially for small data and structured data efficiency. The inability of Hadoop to replace RDBMS is attributed to limitations in supporting asset properties and operations, particularly for small and structured data, which affects efficiency and data management.', 'RDBMS is a licensed software, while Hadoop is open source, leading to different support models and cost implications for users. The distinction in cost models between RDBMS and Hadoop, with the former being licensed and the latter being open source, results in different support and cost implications for users, with Hadoop being supported by companies in the open source community.', 'Hadoop can handle unstructured, semi-structured, and structured data, while RDBMS primarily deals with structured data, highlighting the versatility of Hadoop in data management. The versatility of Hadoop in handling unstructured, semi-structured, and structured data sets it apart from RDBMS, which primarily focuses on structured data, showcasing the broader data management capabilities of Hadoop.', 'Hadoop operates in a distributed fashion across multiple machines, while RDBMS typically operates on a single machine, impacting computation speed and scalability. 
The distributed nature of Hadoop, working across multiple machines, contrasts with the single-machine operation of RDBMS, resulting in differences in computation speed and scalability.']}], 'duration': 814.521, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg606010.jpg', 'highlights': ['Modern platforms introduced unstructured data types, posing challenges for traditional RDBMS systems.', "Hadoop's capability to handle diverse data types and enable distributed storage and processing makes it suitable for large-scale parallel processing.", 'The introduction of semi-structured data stemmed from the necessity to classify data displaying traits of both structured and unstructured data.', 'The versatility of Hadoop in handling unstructured, semi-structured, and structured data sets it apart from RDBMS.', 'The distributed nature of Hadoop contrasts with the single-machine operation of RDBMS, impacting computation speed and scalability.', 'The distinction in cost models between RDBMS and Hadoop results in different support and cost implications for users.']}, {'end': 1897.374, 'segs': [{'end': 1529.27, 'src': 'embed', 'start': 1503.259, 'weight': 0, 'content': [{'end': 1508.904, 'text': 'but you know behind that, you know, since there might be lot of cleverness, I think, right behind the scenes.', 'start': 1503.259, 'duration': 5.645}, {'end': 1514.526, 'text': "so let's say these three people are reporting to this clever boss.", 'start': 1508.904, 'duration': 5.622}, {'end': 1515.006, 'text': 'what happens?', 'start': 1514.526, 'duration': 0.48}, {'end': 1517.227, 'text': 'Boss, get products.', 'start': 1515.427, 'duration': 1.8}, {'end': 1518.347, 'text': 'boss, get product.', 'start': 1517.227, 'duration': 1.12}, {'end': 1519.728, 'text': "let's give the project.", 'start': 1518.347, 'duration': 1.381}, {'end': 1520.328, 'text': 'now what happens?', 'start': 1519.728, 'duration': 0.6}, {'end': 1522.228, 'text': 'Boss will be getting a project.', 'start': 1520.768, 'duration': 1.46}, {'end': 1525.389, 'text': 'these people are reporting to this boss.', 'start': 1522.228, 'duration': 3.161}, {'end': 1526.229, 'text': 'so what they will do?', 'start': 1525.389, 'duration': 0.84}, {'end': 1529.27, 'text': 'So boss usually will distribute the project.', 'start': 1526.469, 'duration': 2.801}], 'summary': 'Three employees report to a clever boss who distributes projects efficiently.', 'duration': 26.011, 'max_score': 1503.259, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1503259.jpg'}, {'end': 1639.999, 'src': 'embed', 'start': 1613.157, 'weight': 1, 'content': [{'end': 1617.279, 'text': 'so as soon as she will not hear anything, she will just hear that.', 'start': 1613.157, 'duration': 4.122}, {'end': 1619.96, 'text': 'you know, boss is telling me promotion word.', 'start': 1617.279, 'duration': 2.681}, {'end': 1623.341, 'text': 'okay, he will just hear promotion word and he will be very happy in his mindset.', 'start': 1619.96, 'duration': 3.381}, {'end': 1627.025, 'text': 'suddenly his boss will throw a time bomb because boss is clever.', 'start': 1623.721, 'duration': 3.304}, {'end': 1628.907, 'text': 'so boss, usually what he will say.', 'start': 1627.025, 'duration': 1.882}, {'end': 1633.752, 'text': 'actually, we just discussed, right, that you are going to take up some senior responsibilities.', 'start': 1628.907, 'duration': 4.845}, {'end': 1635.254, 'text': 'so can you do one 
thing now?', 'start': 1633.752, 'duration': 1.502}, {'end': 1638.237, 'text': 'can you basically take the backup of late project?', 'start': 1635.254, 'duration': 2.983}, {'end': 1639.999, 'text': 'now, what happened in this case?', 'start': 1638.237, 'duration': 1.762}], 'summary': 'Employee hears promotion, but boss gives extra work instead.', 'duration': 26.842, 'max_score': 1613.157, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1613157.jpg'}, {'end': 1781.892, 'src': 'embed', 'start': 1749.731, 'weight': 3, 'content': [{'end': 1756.355, 'text': 'when we talk about boss, the first thing is now in Hadoop, what every human is going to be replaced by machines.', 'start': 1749.731, 'duration': 6.624}, {'end': 1757.596, 'text': 'now the first component.', 'start': 1756.355, 'duration': 1.241}, {'end': 1758.736, 'text': 'what we were talking about?', 'start': 1757.596, 'duration': 1.14}, {'end': 1762.238, 'text': 'right. so if we talk about the first component, it was name.', 'start': 1758.736, 'duration': 3.502}, {'end': 1766.281, 'text': 'the boss is representing name node.', 'start': 1762.238, 'duration': 4.043}, {'end': 1768.742, 'text': 'second component was data node.', 'start': 1766.281, 'duration': 2.461}, {'end': 1772.184, 'text': 'these employees who are working basically for that.', 'start': 1768.742, 'duration': 3.442}, {'end': 1781.892, 'text': 'boss, you can represent them as data node, the node where you are doing all the processing, because employee do the work right, employee do the work,', 'start': 1772.184, 'duration': 9.708}], 'summary': 'In hadoop, machines replace humans. components: name node and data node.', 'duration': 32.161, 'max_score': 1749.731, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1749731.jpg'}, {'end': 1827.177, 'src': 'embed', 'start': 1797.182, 'weight': 4, 'content': [{'end': 1799.843, 'text': 'basically secondary name node and all those stuff.', 'start': 1797.182, 'duration': 2.661}, {'end': 1805.885, 'text': 'so basically, we require some like this is the backup for the data part right for this data file?', 'start': 1799.843, 'duration': 6.042}, {'end': 1808.827, 'text': 'p1, p2, p3, but what about this boss back?', 'start': 1805.885, 'duration': 2.942}, {'end': 1811.648, 'text': 'so basically, we want to create some backups for that.', 'start': 1808.827, 'duration': 2.821}, {'end': 1814.869, 'text': "so that's the reason we keep like passing name, node and all those stuff here.", 'start': 1811.648, 'duration': 3.221}, {'end': 1818.311, 'text': "So basically that is the backup part that's one of the component.", 'start': 1815.609, 'duration': 2.702}, {'end': 1823.555, 'text': "Now that is means like, let's say, in your company they have also backup a boss like.", 'start': 1818.611, 'duration': 4.944}, {'end': 1827.177, 'text': 'in case of this boss, leave then I should have all the details.', 'start': 1823.555, 'duration': 3.622}], 'summary': 'Implementing backups for data components like p1, p2, p3 to ensure reliability and redundancy.', 'duration': 29.995, 'max_score': 1797.182, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1797182.jpg'}, {'end': 1882.787, 'src': 'embed', 'start': 1855.886, 'weight': 5, 'content': [{'end': 1860.047, 'text': 'similarly, in your Hadoop, what your resource manager is going to schedule everything up?', 'start': 1855.886, 'duration': 4.161}, 
{'end': 1862.288, 'text': 'Now the last part.', 'start': 1860.508, 'duration': 1.78}, {'end': 1863.828, 'text': 'right node manager.', 'start': 1862.288, 'duration': 1.54}, {'end': 1867.909, 'text': 'now do you think you will be able to walk on this project without any skill set?', 'start': 1863.828, 'duration': 4.081}, {'end': 1871.23, 'text': 'no, right, you require some skill set to work on that project.', 'start': 1867.909, 'duration': 3.321}, {'end': 1874.451, 'text': 'right. so that skill set you can relate it like a node manager.', 'start': 1871.23, 'duration': 3.221}, {'end': 1882.787, 'text': 'which is managing your own load, means you can delete it like your skill set, which is helping you to solve the project right.', 'start': 1874.98, 'duration': 7.807}], 'summary': 'Hadoop resource manager schedules tasks, requiring skill set like a node manager.', 'duration': 26.901, 'max_score': 1855.886, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1855886.jpg'}], 'start': 1420.871, 'title': 'Dealing with a cunning boss and understanding hadoop components', 'summary': 'Discusses the challenges of working under a cunning boss who exploits vulnerabilities for project completion and explains hadoop components like name node, data node, metadata, backup, resource manager, and node manager.', 'chapters': [{'end': 1691.956, 'start': 1420.871, 'title': 'Dealing with a cunning boss', 'summary': 'Illustrates the challenges of working under a cunning boss who manipulates employees by leveraging promotions and responsibilities to exploit their vulnerabilities and ensure project completion.', 'duration': 271.085, 'highlights': ['The boss manipulates employees by leveraging promotions and responsibilities to ensure project completion. The boss manipulates employees by leveraging promotions and responsibilities to ensure project completion.', 'Employees are manipulated using promotions and responsibilities to exploit their vulnerabilities. Employees are manipulated using promotions and responsibilities to exploit their vulnerabilities.', 'Employees are coerced into taking on additional work by using promotions and senior responsibilities as leverage. Employees are coerced into taking on additional work by using promotions and senior responsibilities as leverage.', 'The boss employs clever tactics to manipulate employees and ensure project delivery. 
The boss employs clever tactics to manipulate employees and ensure project delivery.']}, {'end': 1897.374, 'start': 1692.592, 'title': 'Understanding hadoop components', 'summary': 'Explains the components of hadoop, drawing parallels between the roles of a boss and the components of hadoop, such as name node, data node, metadata, backup, resource manager, and node manager.', 'duration': 204.782, 'highlights': ['The components of Hadoop are explained by drawing parallels with the roles and responsibilities of a boss, with the name node representing the boss, data node as the employees doing the work, and metadata as the data about the data, similar to what a boss keeps (Relevance: 5)', 'The need for backups in Hadoop is highlighted, with the secondary name node and other components serving as backups for the data, akin to how a company would require a backup in case the boss is absent (Relevance: 4)', "The role of the resource manager in Hadoop is likened to the boss's skill set in scheduling jobs, while the node manager is compared to the skill set required by employees to work on projects, emphasizing the importance of managing workload and skills in both scenarios (Relevance: 3)"]}], 'duration': 476.503, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1420871.jpg', 'highlights': ['The boss manipulates employees by leveraging promotions and responsibilities to ensure project completion.', 'Employees are coerced into taking on additional work by using promotions and senior responsibilities as leverage.', 'The boss employs clever tactics to manipulate employees and ensure project delivery.', 'The components of Hadoop are explained by drawing parallels with the roles and responsibilities of a boss, with the name node representing the boss, data node as the employees doing the work, and metadata as the data about the data, similar to what a boss keeps (Relevance: 5)', 'The need for backups in Hadoop is highlighted, with the secondary name node and other components serving as backups for the data, akin to how a company would require a backup in case the boss is absent (Relevance: 4)', "The role of the resource manager in Hadoop is likened to the boss's skill set in scheduling jobs, while the node manager is compared to the skill set required by employees to work on projects, emphasizing the importance of managing workload and skills in both scenarios (Relevance: 3)"]}, {'end': 2904.355, 'segs': [{'end': 2062.893, 'src': 'heatmap', 'start': 1897.374, 'weight': 0.912, 'content': [{'end': 1904.099, 'text': 'I hope with this example now it will make your life easy to remember all these components, because this is a very important questions.', 'start': 1897.374, 'duration': 6.725}, {'end': 1909.102, 'text': 'in basically interview questions they generally ask this question what are the configuration files?', 'start': 1904.099, 'duration': 5.003}, {'end': 1911.444, 'text': 'what basically are the main components?', 'start': 1909.102, 'duration': 2.342}, {'end': 1918.889, 'text': "so you should be aware of this, and that's the reason I've explained you with this analogy, so that you get some idea and you can relate to it,", 'start': 1911.444, 'duration': 7.445}, {'end': 1922.572, 'text': 'that if you have to explain in interview, you need not remember all this stuff.', 'start': 1918.889, 'duration': 3.683}, {'end': 1923.893, 'text': "okay, let's move further.", 'start': 1922.572, 'duration': 1.321}, {'end': 1930.075, 'text': 'Now. 
so can I ask this question now what are the main Hadoop configuration files?', 'start': 1925.106, 'duration': 4.969}, {'end': 1932.999, 'text': 'So, basically, now we are talking about configuration file.', 'start': 1930.135, 'duration': 2.864}, {'end': 1935.684, 'text': 'this is related to mostly Hadoop administration.', 'start': 1932.999, 'duration': 2.685}, {'end': 1942.552, 'text': 'Now, when you talk about Hadoop administrator, right, so there will be few files which you need to configure.', 'start': 1936.75, 'duration': 5.802}, {'end': 1949.553, 'text': 'so I hope there are people in this batch in this session also where people must have done some Hadoop administrator first, right.', 'start': 1942.552, 'duration': 7.001}, {'end': 1954.215, 'text': "so I'm assuming that you people know that people who do not know that, let's not worry about this.", 'start': 1949.553, 'duration': 4.662}, {'end': 1955.915, 'text': "that the reason I'm answering this directly.", 'start': 1954.215, 'duration': 1.7}, {'end': 1958.656, 'text': 'So there are few important files here.', 'start': 1956.355, 'duration': 2.301}, {'end': 1961.157, 'text': 'one is Hadoop environment.sh.', 'start': 1958.656, 'duration': 2.501}, {'end': 1964.819, 'text': 'this is where you kind of mention all your environmental variables.', 'start': 1961.157, 'duration': 3.662}, {'end': 1967.28, 'text': 'for example, where is your Java home?', 'start': 1964.819, 'duration': 2.461}, {'end': 1968.96, 'text': 'where is your Hadoop home?', 'start': 1967.28, 'duration': 1.68}, {'end': 1971.482, 'text': 'all those things you define in Hadoop env.sh.', 'start': 1968.96, 'duration': 2.522}, {'end': 1978.71, 'text': "So this file basically you define where you let's say your name node is going to run.", 'start': 1973.203, 'duration': 5.507}, {'end': 1986.119, 'text': 'So you need to tell the address of your name node where you want to run that maybe you want to run some machine at 9004.', 'start': 1979.411, 'duration': 6.708}, {'end': 1989.103, 'text': 'So you will be telling all that in your four site XML.', 'start': 1986.119, 'duration': 2.984}, {'end': 1995.445, 'text': 'When we talk about HDFS site XML, here we talk about what should be the replication factor.', 'start': 1989.623, 'duration': 5.822}, {'end': 1998.426, 'text': 'where should physically my data notes should be present?', 'start': 1995.445, 'duration': 2.981}, {'end': 2001.247, 'text': 'where physically my name note should be present?', 'start': 1998.426, 'duration': 2.821}, {'end': 2005.869, 'text': 'all those things to define in HDFS site dot XML.', 'start': 2001.247, 'duration': 4.622}, {'end': 2011.698, 'text': 'yarn side and map red side basically defines the map jobs right.', 'start': 2006.494, 'duration': 5.204}, {'end': 2013.659, 'text': 'what kind of cluster you are going to use?', 'start': 2011.698, 'duration': 1.961}, {'end': 2019.503, 'text': "are you going to use that's in the source or you're going to use yarn or you're going to run a local distributed mode?", 'start': 2013.659, 'duration': 5.844}, {'end': 2021.844, 'text': 'all those things you will be defining there.', 'start': 2019.503, 'duration': 2.341}, {'end': 2025.587, 'text': 'also, you will be defining where your resource manager should be running.', 'start': 2021.844, 'duration': 3.743}, {'end': 2031.871, 'text': 'it should be running on this machine or 9000, one port or whatever port you want to define, right.', 'start': 2025.587, 'duration': 6.284}, {'end': 2040.377, 'text': 'so those information 
you will be defining in yarn side or map red site dot XML, not last.', 'start': 2031.871, 'duration': 8.506}, {'end': 2044.08, 'text': 'two files are masters and slaves file.', 'start': 2040.377, 'duration': 3.703}, {'end': 2050.025, 'text': 'in masters file we usually mention where my secondary name load was written.', 'start': 2044.08, 'duration': 5.945}, {'end': 2053.007, 'text': "when I say secondary name load, it's like a backup.", 'start': 2050.025, 'duration': 2.982}, {'end': 2058.091, 'text': 'not exactly I should call it as a backup, but I should call it as a snapshot of the name load.', 'start': 2053.007, 'duration': 5.084}, {'end': 2062.213, 'text': "it's something like this, like somebody just copying the metadata.", 'start': 2058.791, 'duration': 3.422}, {'end': 2062.893, 'text': "that's it.", 'start': 2062.213, 'duration': 0.68}], 'summary': 'Explanation of main hadoop configuration files and their purposes for hadoop administration.', 'duration': 165.519, 'max_score': 1897.374, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1897374.jpg'}, {'end': 2053.007, 'src': 'embed', 'start': 1954.215, 'weight': 0, 'content': [{'end': 1955.915, 'text': "that the reason I'm answering this directly.", 'start': 1954.215, 'duration': 1.7}, {'end': 1958.656, 'text': 'So there are few important files here.', 'start': 1956.355, 'duration': 2.301}, {'end': 1961.157, 'text': 'one is Hadoop environment.sh.', 'start': 1958.656, 'duration': 2.501}, {'end': 1964.819, 'text': 'this is where you kind of mention all your environmental variables.', 'start': 1961.157, 'duration': 3.662}, {'end': 1967.28, 'text': 'for example, where is your Java home?', 'start': 1964.819, 'duration': 2.461}, {'end': 1968.96, 'text': 'where is your Hadoop home?', 'start': 1967.28, 'duration': 1.68}, {'end': 1971.482, 'text': 'all those things you define in Hadoop env.sh.', 'start': 1968.96, 'duration': 2.522}, {'end': 1978.71, 'text': "So this file basically you define where you let's say your name node is going to run.", 'start': 1973.203, 'duration': 5.507}, {'end': 1986.119, 'text': 'So you need to tell the address of your name node where you want to run that maybe you want to run some machine at 9004.', 'start': 1979.411, 'duration': 6.708}, {'end': 1989.103, 'text': 'So you will be telling all that in your four site XML.', 'start': 1986.119, 'duration': 2.984}, {'end': 1995.445, 'text': 'When we talk about HDFS site XML, here we talk about what should be the replication factor.', 'start': 1989.623, 'duration': 5.822}, {'end': 1998.426, 'text': 'where should physically my data notes should be present?', 'start': 1995.445, 'duration': 2.981}, {'end': 2001.247, 'text': 'where physically my name note should be present?', 'start': 1998.426, 'duration': 2.821}, {'end': 2005.869, 'text': 'all those things to define in HDFS site dot XML.', 'start': 2001.247, 'duration': 4.622}, {'end': 2011.698, 'text': 'yarn side and map red side basically defines the map jobs right.', 'start': 2006.494, 'duration': 5.204}, {'end': 2013.659, 'text': 'what kind of cluster you are going to use?', 'start': 2011.698, 'duration': 1.961}, {'end': 2019.503, 'text': "are you going to use that's in the source or you're going to use yarn or you're going to run a local distributed mode?", 'start': 2013.659, 'duration': 5.844}, {'end': 2021.844, 'text': 'all those things you will be defining there.', 'start': 2019.503, 'duration': 2.341}, {'end': 2025.587, 'text': 'also, you will be defining 
where your resource manager should be running.', 'start': 2021.844, 'duration': 3.743}, {'end': 2031.871, 'text': 'it should be running on this machine or 9000, one port or whatever port you want to define, right.', 'start': 2025.587, 'duration': 6.284}, {'end': 2040.377, 'text': 'so those information you will be defining in yarn side or map red site dot XML, not last.', 'start': 2031.871, 'duration': 8.506}, {'end': 2044.08, 'text': 'two files are masters and slaves file.', 'start': 2040.377, 'duration': 3.703}, {'end': 2050.025, 'text': 'in masters file we usually mention where my secondary name load was written.', 'start': 2044.08, 'duration': 5.945}, {'end': 2053.007, 'text': "when I say secondary name load, it's like a backup.", 'start': 2050.025, 'duration': 2.982}], 'summary': 'Config files like hadoop env.sh, hdfs site xml, yarn/map red site xml define system configurations for hadoop setup.', 'duration': 98.792, 'max_score': 1954.215, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1954215.jpg'}, {'end': 2127.993, 'src': 'embed', 'start': 2097.535, 'weight': 5, 'content': [{'end': 2100.557, 'text': 'now these are very much kind of tool specific.', 'start': 2097.535, 'duration': 3.022}, {'end': 2103.699, 'text': "so that's the reason they will not be called as the main Hadoop configuration file.", 'start': 2100.557, 'duration': 3.142}, {'end': 2109.082, 'text': 'when somebody said main Hadoop configuration file, your answer would be these seven files.', 'start': 2104.198, 'duration': 4.884}, {'end': 2110.983, 'text': 'you need to remember these seven files.', 'start': 2109.082, 'duration': 1.901}, {'end': 2117.768, 'text': 'basically, these seven files are the one which you will mention if you are going for Hadoop administrator interview.', 'start': 2110.983, 'duration': 6.785}, {'end': 2120.81, 'text': 'expect good number of questions in this.', 'start': 2117.768, 'duration': 3.042}, {'end': 2127.993, 'text': 'they will ask to explain each and every file how basically you will be, what you do in which file, which I just explain.', 'start': 2121.988, 'duration': 6.005}], 'summary': 'The main hadoop configuration file consists of seven specific files crucial for hadoop administrator interviews.', 'duration': 30.458, 'max_score': 2097.535, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg2097535.jpg'}, {'end': 2227.191, 'src': 'embed', 'start': 2201.981, 'weight': 6, 'content': [{'end': 2207.684, 'text': 'you are creating one backup or two backups of basically that block P1 and that is called replication.', 'start': 2201.981, 'duration': 5.703}, {'end': 2217.87, 'text': 'This is how Hadoop is ensuring that in case of any failure also there should be no mistake.', 'start': 2208.144, 'duration': 9.726}, {'end': 2224.11, 'text': "we should not be using data because in case it's very much one machine fields also, it's okay, it will all work.", 'start': 2218.708, 'duration': 5.402}, {'end': 2224.99, 'text': 'fine for me.', 'start': 2224.11, 'duration': 0.88}, {'end': 2227.191, 'text': 'so that is what going to happen.', 'start': 2224.99, 'duration': 2.201}], 'summary': 'Hadoop ensures data replication for fault tolerance and reliability.', 'duration': 25.21, 'max_score': 2201.981, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg2201981.jpg'}, {'end': 2316.955, 'src': 'heatmap', 'start': 2227.191, 'weight': 0.717, 
'content': [{'end': 2230.531, 'text': 'so very good, lot of people have given me the right answer in this case.', 'start': 2227.191, 'duration': 3.34}, {'end': 2234.833, 'text': 'so block replication is the answer in this case.', 'start': 2230.531, 'duration': 4.302}, {'end': 2240.415, 'text': 'as you can see in this example also, my block one is replicated three times, if you notice here.', 'start': 2234.833, 'duration': 5.582}, {'end': 2243.458, 'text': 'right. so block one is replicated three times.', 'start': 2240.415, 'duration': 3.043}, {'end': 2246.621, 'text': 'similarly, if you notice, block two is also replicated three times.', 'start': 2243.458, 'duration': 3.163}, {'end': 2254.528, 'text': 'so these are four different machines and I have replicated block one, block two, block three, block four, block five.', 'start': 2246.621, 'duration': 7.907}, {'end': 2256.51, 'text': 'okay. so this is what is happening.', 'start': 2254.528, 'duration': 1.982}, {'end': 2259.473, 'text': 'edit log and fs image is used to recreate the image.', 'start': 2256.51, 'duration': 2.963}, {'end': 2263.396, 'text': 'fs image and edit log are two different things.', 'start': 2260.053, 'duration': 3.343}, {'end': 2265.618, 'text': 'these basically are two different things.', 'start': 2263.396, 'duration': 2.222}, {'end': 2267.579, 'text': 'you cannot rate with replication.', 'start': 2265.618, 'duration': 1.961}, {'end': 2270.462, 'text': 'if you want to know about that, let me just answer you here.', 'start': 2267.579, 'duration': 2.883}, {'end': 2271.562, 'text': 'so what happens?', 'start': 2270.462, 'duration': 1.1}, {'end': 2272.443, 'text': 'what are physically?', 'start': 2271.562, 'duration': 0.881}, {'end': 2276.726, 'text': 'this fs image and edit log is now what happens?', 'start': 2272.443, 'duration': 4.283}, {'end': 2278.588, 'text': 'basically, you have a name node.', 'start': 2276.726, 'duration': 1.862}, {'end': 2282.811, 'text': 'right, you have a name node where you keep the data in name node.', 'start': 2278.588, 'duration': 4.223}, {'end': 2286.214, 'text': 'anybody have answered where you keep the data in name node?', 'start': 2282.811, 'duration': 3.403}, {'end': 2291.789, 'text': 'no, not fs image file initially, where you keep the data in name node and not SDFS detection.', 'start': 2286.946, 'duration': 4.843}, {'end': 2293.43, 'text': "I'm asking in name node.", 'start': 2291.789, 'duration': 1.641}, {'end': 2295.612, 'text': 'this can also be an interview question.', 'start': 2293.43, 'duration': 2.182}, {'end': 2298.734, 'text': 'very good, Anand, we keep it in memory.', 'start': 2295.612, 'duration': 3.122}, {'end': 2300.395, 'text': 'why let me answer this part?', 'start': 2298.734, 'duration': 1.661}, {'end': 2308.53, 'text': "Let's say there is one client came up, this is one client, this is client C2, this is client C3, let's say there are multiple clients.", 'start': 2300.805, 'duration': 7.725}, {'end': 2316.955, 'text': "Now, what is happening in this case is, let's say, if this was my name node and let's say my data in name node is.", 'start': 2309.37, 'duration': 7.585}], 'summary': "In the discussion, block replication is emphasized, with blocks being replicated three times and stored in the name node's memory.", 'duration': 89.764, 'max_score': 2227.191, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg2227191.jpg'}, {'end': 2389.77, 'src': 'embed', 'start': 2359.767, 'weight': 8, 'content': [{'end': 
2365.725, 'text': 'you need to basically bring the block to the memory and then basically do the stuff and give the output now.', 'start': 2359.767, 'duration': 5.958}, {'end': 2368.446, 'text': 'this is something which we want to avoid.', 'start': 2365.725, 'duration': 2.721}, {'end': 2370.886, 'text': 'in order to avoid what they came up with.', 'start': 2368.446, 'duration': 2.44}, {'end': 2380.448, 'text': 'the idea is that whatever you are going, whatever metadata you are going to create, should directly be created and kept in memory.', 'start': 2370.886, 'duration': 9.562}, {'end': 2382.488, 'text': 'okay, that is what the idea they came up.', 'start': 2380.448, 'duration': 2.04}, {'end': 2389.77, 'text': 'they are not going to keep any data in the disk right now, should be directly kept in the memory, which brings another question for you,', 'start': 2382.488, 'duration': 7.282}], 'summary': 'Avoid storing data in disk, keep in memory for faster processing.', 'duration': 30.003, 'max_score': 2359.767, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg2359767.jpg'}, {'end': 2759.847, 'src': 'embed', 'start': 2731.93, 'weight': 7, 'content': [{'end': 2737.555, 'text': 'now we keep a backup in the disk and that, whatever backup we are keeping, we call it as a reference.', 'start': 2731.93, 'duration': 5.625}, {'end': 2742.137, 'text': 'Now FS image backup is always taken in 24 hour slot.', 'start': 2738.075, 'duration': 4.062}, {'end': 2747.1, 'text': 'now the another problem started with this what happened if I lose the data in 23rd hour?', 'start': 2742.137, 'duration': 4.963}, {'end': 2752.583, 'text': 'in that case, I should again create a smaller version of the file called as edit log.', 'start': 2747.1, 'duration': 5.483}, {'end': 2754.404, 'text': 'okay, that will also be a file.', 'start': 2752.583, 'duration': 1.821}, {'end': 2759.847, 'text': 'now these things will be added and will be basically given what can be the RAM size.', 'start': 2754.404, 'duration': 5.443}], 'summary': 'Fs image backup taken every 24 hours to prevent data loss.', 'duration': 27.917, 'max_score': 2731.93, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg2731930.jpg'}, {'end': 2859.439, 'src': 'embed', 'start': 2837.041, 'weight': 10, 'content': [{'end': 2845.127, 'text': 'yeah, because if you will have small data right, if you will have small files, definitely your metadata is going to be for kind of too much.', 'start': 2837.041, 'duration': 8.086}, {'end': 2850.992, 'text': "right, your metadata entry will be too many and that's how you will be kind of filling up your RAM right.", 'start': 2845.127, 'duration': 5.865}, {'end': 2856.737, 'text': 'of your basically name node, we just learned that every metadata is stored in the RAM of the name node.', 'start': 2850.992, 'duration': 5.745}, {'end': 2859.439, 'text': 'Now, what are the solution for it?', 'start': 2856.757, 'duration': 2.682}], 'summary': 'Small files lead to excessive metadata, filling up ram. solution needed.', 'duration': 22.398, 'max_score': 2837.041, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg2837041.jpg'}], 'start': 1897.374, 'title': 'Hadoop configuration and namenode architecture', 'summary': 'Covers hadoop configuration files overview, discussing hadoop environment.sh, four site xml, hdfs site xml, yarn side, map red side, masters file, and slaves file. 
it details hdfs fault tolerance mechanisms such as block replication, fs image, and edit log, and explains hadoop namenode architecture, including ram volatility and handling small files in hdfs.', 'chapters': [{'end': 2073.737, 'start': 1897.374, 'title': 'Hadoop configuration files overview', 'summary': 'Discusses the main hadoop configuration files, including hadoop environment.sh, four site xml, hdfs site xml, yarn side, map red side, masters file, and slaves file, with details about what each file contains and their significance in hadoop administration.', 'duration': 176.363, 'highlights': ['Hadoop environment.sh is used to define environmental variables like Java home and Hadoop home. It is where all environmental variables are defined, such as Java home and Hadoop home.', 'Four site XML is used to specify the address of the name node and other configuration details. It specifies the address of the name node and other configuration details.', 'HDFS site XML is used to define replication factor and the location of data and name nodes. It defines the replication factor and the location of data and name nodes.', 'Yarn side and map red side are used to define the type of cluster, map jobs, and resource manager location. They define the type of cluster, map jobs, and resource manager location.', 'Masters file mentions the location of the secondary name node and serves as a backup for the main node. It mentions the location of the secondary name node and serves as a backup for the main node.']}, {'end': 2583.179, 'start': 2073.737, 'title': 'Hadoop configuration and hdfs fault tolerance', 'summary': 'Details essential hadoop configuration files and emphasizes the importance of understanding them for hadoop administrator interviews. it also explains how hdfs ensures fault tolerance through block replication and the use of fs image and edit log, and addresses the challenges and solutions related to maintaining metadata in memory and disk backups.', 'duration': 509.442, 'highlights': ['The seven major Hadoop configuration files are crucial for Hadoop administrator interviews, with the expectation of detailed explanations for each file, and the commonly asked favorite question about them.', 'HDFS ensures fault tolerance through block replication, where each block is replicated to prevent data loss in case of node failure, with an example showing three replications of each block.', 'The fs image and edit log are used to maintain metadata in memory and disk backups, with the edit log recording activities for 24 hours before being combined with the fs image to prevent data loss.', 'The metadata is kept in memory to avoid expensive input/output operations, with a backup of the metadata being taken at intervals of time and stored in disk as fs image to prevent data loss due to volatile RAM.', 'The concept of maintaining metadata in memory and disk backups, and the use of fs image and edit log to prevent data loss, is explained in detail to ensure clarity and understanding.', 'The interval for taking backups and the process of combining the edit log with the fs image to prevent data loss are explained, addressing the potential confusion and ensuring a clear understanding of the concepts.']}, {'end': 2904.355, 'start': 2583.179, 'title': 'Hadoop namenode architecture', 'summary': 'Explains the architecture of hadoop namenode, detailing the storage of fsimage and edit log for backup, ram volatility, block size impact on memory, and handling small files in hdfs.', 'duration': 321.176, 'highlights': 
['The NameNode keeps its metadata backup as an fsimage plus an edit log: the fsimage snapshot is taken on a 24-hour cycle in this example, and the smaller edit log records the changes made in between, which matters because the working copy of the metadata sits in volatile RAM.', 'Block size affects memory: lots of small files mean an excessive number of metadata entries and can overload the NameNode RAM, so the remedy is a larger block size or merging the small files so the metadata stays reasonable.', 'The architecture of the Hadoop NameNode therefore revolves around managing this backup data, coping with RAM volatility, and choosing block sizes that keep the metadata manageable.']}], 'duration': 1006.981, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg1897374.jpg', 'highlights': ['hadoop-env.sh defines environment variables such as JAVA_HOME and HADOOP_HOME.', 'core-site.xml specifies the address of the NameNode and other core configuration details.', 'hdfs-site.xml defines the replication factor and the storage locations of the NameNode and DataNodes.', 'yarn-site.xml and mapred-site.xml define the type of cluster, how MapReduce jobs run, and where the ResourceManager is located.', 'The masters file lists the secondary NameNode, which serves as a backup for the main node.', 'The seven major Hadoop configuration files are crucial for Hadoop administrator interviews.', 'HDFS ensures fault tolerance through block replication to prevent data loss in case of node failure.', 'The fsimage and edit log are used to maintain metadata in memory and in disk backups.', 'The metadata is kept in memory to avoid expensive input/output operations.', 'The NameNode stores the fsimage and edit log for backup, with the edit log tracking changes between 24-hour fsimage snapshots.', 'Block size impacts memory: many small files mean excessive metadata entries and RAM overload.', 'The NameNode architecture involves managing backup data, addressing RAM volatility, and optimizing block size.']}, {'end': 3864.632, 'segs': [{'end': 2954.724, 'src': 'embed', 'start': 2928.252, 'weight': 0, 'content': [{'end': 2938.734, 'text': 'you can create a .har file, which is called a Hadoop archive, so you can bring all the small files into one folder together,', 'start': 2928.252, 'duration': 10.482}, {'end': 2940.735, 'text': 'kind of zipping it together.', 'start': 2938.734, 'duration': 2.001}, {'end': 2943.515, 'text': "now. 
basically, with that, what's gonna happen?", 'start': 2940.735, 'duration': 2.78}, {'end': 2947.016, 'text': "it's gonna just keep only one metadata for it.", 'start': 2943.515, 'duration': 3.501}, {'end': 2949.62, 'text': 'if the metadata entry is going to be reduced.', 'start': 2947.358, 'duration': 2.262}, {'end': 2950.461, 'text': 'how to do that?', 'start': 2949.62, 'duration': 0.841}, {'end': 2954.724, 'text': 'this is the command I do archive now hyphen, archive name.', 'start': 2950.461, 'duration': 4.263}], 'summary': 'Create hadoop archive to combine small files, reducing metadata entries.', 'duration': 26.472, 'max_score': 2928.252, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg2928252.jpg'}, {'end': 3030.862, 'src': 'embed', 'start': 3004.011, 'weight': 1, 'content': [{'end': 3008.875, 'text': 'So, using a default block size configuration and default replication factor,', 'start': 3004.011, 'duration': 4.864}, {'end': 3014.34, 'text': 'then how many blocks will be created in total and what would be the size of this block?', 'start': 3008.875, 'duration': 5.465}, {'end': 3016.822, 'text': 'Before you answer this, can I get an answer?', 'start': 3014.36, 'duration': 2.462}, {'end': 3023.208, 'text': "what is the default replication factor and what is the default block size if I'm talking about Hadoop 2.x?", 'start': 3016.822, 'duration': 6.386}, {'end': 3026.751, 'text': 'Very good, so replication, as everybody said, 3MB.', 'start': 3024.189, 'duration': 2.562}, {'end': 3028.862, 'text': 'what is the size?', 'start': 3027.401, 'duration': 1.461}, {'end': 3030.862, 'text': 'very good, 128 MB.', 'start': 3028.862, 'duration': 2}], 'summary': 'Hadoop 2.x defaults to 128mb block size and 3x replication factor.', 'duration': 26.851, 'max_score': 3004.011, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg3004011.jpg'}, {'end': 3101.707, 'src': 'embed', 'start': 3072.923, 'weight': 2, 'content': [{'end': 3076.464, 'text': 'they are going to be five block because the replication is five.', 'start': 3072.923, 'duration': 3.541}, {'end': 3081.045, 'text': "now, sorry, replication factor is three, so it's gonna be five into three block.", 'start': 3076.464, 'duration': 4.581}, {'end': 3084.142, 'text': 'okay, this is how basically will be calculating.', 'start': 3081.741, 'duration': 2.401}, {'end': 3086.703, 'text': 'this is very famous interview question.', 'start': 3084.142, 'duration': 2.561}, {'end': 3100.427, 'text': 'moving further, how to copy a file into HDFS with a different block size to that of existing block size configuration.', 'start': 3086.703, 'duration': 13.724}, {'end': 3101.707, 'text': 'can I get an answer?', 'start': 3100.427, 'duration': 1.28}], 'summary': 'Replication factor is 3, resulting in 5 x 3 blocks. 
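The archive command half-spoken a little earlier is hadoop archive -archiveName, and the resulting .har file is what collapses many small files into a single set of NameNode metadata entries. A minimal hedged sketch, with hypothetical HDFS paths, of creating an archive from the shell and then reading it back through the har:// filesystem in Java:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarArchiveExample {
    public static void main(String[] args) throws Exception {
        // Step 1 (shell, run once): pack everything under /user/demo/smallfiles into one archive.
        //   hadoop archive -archiveName small.har -p /user/demo/smallfiles /user/demo/archives
        // Step 2 (Java): list the archived files through the har:// scheme.
        Configuration conf = new Configuration();
        URI archiveUri = URI.create("har:///user/demo/archives/small.har");
        FileSystem harFs = FileSystem.get(archiveUri, conf);
        for (FileStatus status : harFs.listStatus(new Path(archiveUri))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
    }
}

The archive itself is read-only; the point is simply that the NameNode now tracks one archive index instead of one metadata entry per tiny file.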
explains hdfs file copying with different block size.', 'duration': 28.784, 'max_score': 3072.923, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg3072923.jpg'}, {'end': 3363.417, 'src': 'embed', 'start': 3336.656, 'weight': 3, 'content': [{'end': 3342.437, 'text': "okay, so it can basically keep informing the name of that's the reason this has been usually done by administrators,", 'start': 3336.656, 'duration': 5.781}, {'end': 3345.318, 'text': 'because they keep on monitoring the health of the data nodes.', 'start': 3342.437, 'duration': 2.881}, {'end': 3347.429, 'text': 'data block, name node.', 'start': 3345.748, 'duration': 1.681}, {'end': 3349.61, 'text': "they're also responsible for this block right.", 'start': 3347.429, 'duration': 2.181}, {'end': 3355.533, 'text': 'so this is what they keep on doing in order to make sure that they use block scanner basically to do that.', 'start': 3349.61, 'duration': 5.923}, {'end': 3360.856, 'text': 'okay, there is one more way to check the replication factor.', 'start': 3355.533, 'duration': 5.323}, {'end': 3363.417, 'text': 'anybody know what is that?', 'start': 3360.856, 'duration': 2.561}], 'summary': "Administrators monitor data nodes' health using block scanner and check replication factor.", 'duration': 26.761, 'max_score': 3336.656, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg3336656.jpg'}, {'end': 3408.309, 'src': 'embed', 'start': 3377.007, 'weight': 5, 'content': [{'end': 3380.571, 'text': "I'm talking about, let's say, some file got under replicated.", 'start': 3377.007, 'duration': 3.564}, {'end': 3381.832, 'text': 'in that case, how?', 'start': 3380.571, 'duration': 1.261}, {'end': 3385.035, 'text': 'who will kind of inform network?', 'start': 3381.832, 'duration': 3.203}, {'end': 3388.038, 'text': "let's say, blocks scanner is not there.", 'start': 3385.035, 'duration': 3.003}, {'end': 3391.822, 'text': 'there is something called as Hadoop load balancer.', 'start': 3388.038, 'duration': 3.784}, {'end': 3402.905, 'text': "I'm not sure if you have read about that, I do load balancer that basically ensures that if your data blocks are not up right,", 'start': 3393.058, 'duration': 9.847}, {'end': 3408.309, 'text': "if they're under-applicated enough, that basically informs that, okay, this is under-applicated.", 'start': 3402.905, 'duration': 5.404}], 'summary': 'Hadoop load balancer informs about under-replicated data blocks.', 'duration': 31.302, 'max_score': 3377.007, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg3377007.jpg'}, {'end': 3511.134, 'src': 'embed', 'start': 3485.757, 'weight': 4, 'content': [{'end': 3490.88, 'text': 'once one client is writing, it will be kind of file, will be kind of lock for other client.', 'start': 3485.757, 'duration': 5.123}, {'end': 3492.641, 'text': 'once the client have written.', 'start': 3490.88, 'duration': 1.761}, {'end': 3498.565, 'text': 'after that only the other clients can write, but everybody can read concurrently.', 'start': 3492.641, 'duration': 5.924}, {'end': 3502.048, 'text': 'that is one thing which is very important in Hadoop.', 'start': 3498.565, 'duration': 3.483}, {'end': 3508.332, 'text': 'so writing at the same time is not possible concurrently, but reading is possible.', 'start': 3502.048, 'duration': 6.284}, {'end': 3511.134, 'text': "that's how HDFS is basically created.", 'start': 3508.332, 
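Two of the points above can be made concrete. First, the Hadoop 2.x defaults are a 128 MB block size and a replication factor of 3 (a count, not 3 MB), so a 514 MB file becomes four full 128 MB blocks plus one 2 MB block, i.e. 5 blocks, or 15 physical block copies once replication is applied. Second, here is a hedged sketch (paths and sizes are illustrative) of copying a file into HDFS with a block size different from the cluster default, and of inspecting the blocks and replication afterwards:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSizePut {
    public static void main(String[] args) throws Exception {
        // Shell equivalent, overriding the block size only for this one copy:
        //   hadoop fs -Ddfs.blocksize=268435456 -put localfile.txt /user/demo/bigblocks/
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB instead of the 128 MB default
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/bigblocks/sample.txt"))) {
            out.writeBytes("this file will be laid out in 256 MB blocks\n");
        }
        // Block layout and replication can then be checked by an administrator with:
        //   hdfs fsck /user/demo/bigblocks/sample.txt -files -blocks -locations
    }
}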
'duration': 2.802}], 'summary': 'Hdfs allows concurrent reading but not writing, ensuring data integrity and consistency.', 'duration': 25.377, 'max_score': 3485.757, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg3485757.jpg'}, {'end': 3612.474, 'src': 'embed', 'start': 3582.952, 'weight': 6, 'content': [{'end': 3588.176, 'text': 'there will be two name node running one will be active name node and one will be passive name node.', 'start': 3582.952, 'duration': 5.224}, {'end': 3589.797, 'text': 'So what happens is,', 'start': 3588.576, 'duration': 1.221}, {'end': 3598.123, 'text': "let's say this is my active name node which is running okay and what these data nodes are there at for reporting to the active name node.", 'start': 3589.797, 'duration': 8.326}, {'end': 3601.846, 'text': 'Now we also create a passive name node.', 'start': 3598.664, 'duration': 3.182}, {'end': 3604.388, 'text': 'now this passive name node also.', 'start': 3601.846, 'duration': 2.542}, {'end': 3606.189, 'text': 'these data nodes will be reporting.', 'start': 3604.388, 'duration': 1.801}, {'end': 3612.474, 'text': 'this passive name node will not be doing anything, but it will just keep on collecting the data from your data nodes.', 'start': 3606.189, 'duration': 6.285}], 'summary': 'Two name nodes running, one active and one passive, with data nodes reporting to the active name node.', 'duration': 29.522, 'max_score': 3582.952, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg3582952.jpg'}, {'end': 3746.687, 'src': 'embed', 'start': 3721.936, 'weight': 7, 'content': [{'end': 3733.837, 'text': 'but When it comes to the active and passive name node, passive name node is going to ensure that it is not only collecting the data metadata but it,', 'start': 3721.936, 'duration': 11.901}, {'end': 3739.561, 'text': 'as soon as active name node is down, passive name node start acting like a active name node.', 'start': 3733.837, 'duration': 5.724}, {'end': 3741.843, 'text': 'Clear on this difference, Gautam?', 'start': 3740.482, 'duration': 1.361}, {'end': 3744.926, 'text': 'No, Anand, you have had this question right?', 'start': 3743.224, 'duration': 1.702}, {'end': 3746.687, 'text': 'Clear about this question.', 'start': 3745.506, 'duration': 1.181}], 'summary': 'Passive name node acts as an active name node when active name node is down.', 'duration': 24.751, 'max_score': 3721.936, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg3721936.jpg'}], 'start': 2904.355, 'title': 'Hadoop archiving, block size, block scanner, and high availability', 'summary': "Explores creating hadoop archives to consolidate small files, reducing metadata entries, calculating block numbers and sizes, understanding block scanner's role in data integrity and monitoring, and explaining high availability of name node in hadoop.", 'chapters': [{'end': 3215.525, 'start': 2904.355, 'title': 'Hadoop archiving and block size configuration', 'summary': 'Explores the creation of hadoop archives to consolidate small files, reducing metadata entries, and discusses the calculation of block numbers and sizes based on default replication factor and block size configuration, as well as the method to copy a file into hdfs with a different block size.', 'duration': 311.17, 'highlights': ['Creation of Hadoop archives to consolidate small files, reducing metadata entries In Hadoop, a dot har file 
can be created to bring small files together, reducing metadata entries and functioning akin to zipping files together.', 'Calculation of block numbers and sizes based on default replication factor and block size configuration Based on default replication factor (3MB) and block size (128MB) in Hadoop 2.x, the calculation for splitting a 514MB file results in 15 blocks, considering the replication factor and block size distribution.', "Method to copy a file into HDFS with a different block size To change the block size configuration when copying a file into HDFS, the parameter 'dfs.blocksize' can be utilized to define the bytes, allowing for the specification of a different block size."]}, {'end': 3532.872, 'start': 3216.374, 'title': 'Understanding block scanner in hdfs', 'summary': 'Introduces the concept of block scanner in hdfs, highlighting its role in ensuring data integrity and monitoring the health of data nodes, while also explaining the limitations on concurrent writing in hdfs.', 'duration': 316.498, 'highlights': ['The block scanner in HDFS is responsible for checking the integrity of data blocks and ensuring the health of data nodes by monitoring and rectifying issues such as under-replication, data block corruption, and low replica values.', 'Concurrent writing by multiple clients is not allowed in HDFS to prevent file inconsistency, as it locks the file for writing by one client at a time, while allowing concurrent reading by multiple clients.', 'Hadoop load balancer is another method used to check under-replication or over-replication of data blocks in the absence of block scanner, ensuring the balance and proper replication of data blocks.']}, {'end': 3864.632, 'start': 3532.872, 'title': 'High availability of name node in hadoop', 'summary': 'Explains the concept of high availability of name node in hadoop, where an active and passive name node configuration ensures immediate failover without downtime, contrasting it with the previous secondary name node approach.', 'duration': 331.76, 'highlights': ['Hadoop ensures high availability through an active and passive name node configuration, where the passive name node immediately takes over in case of the active node failure, minimizing downtime.', 'The passive name node collects metadata and acts as a backup, ensuring seamless failover without manual intervention, contrasting it with the previous approach of secondary name node that required manual data copying and significant downtime.', 'The concept of high availability is a common interview question, emphasizing the importance of understanding the active and passive name node configuration in Hadoop.', 'The session includes a break before delving into MapReduce questions, with the instructor confirming the availability of video recordings for the participants.']}], 'duration': 960.277, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg2904355.jpg', 'highlights': ['Creation of Hadoop archives to consolidate small files, reducing metadata entries', 'Calculation of block numbers and sizes based on default replication factor and block size configuration', 'Method to copy a file into HDFS with a different block size', 'The block scanner in HDFS is responsible for checking the integrity of data blocks and ensuring the health of data nodes', 'Concurrent writing by multiple clients is not allowed in HDFS to prevent file inconsistency', 'Hadoop load balancer is another method used to check under-replication or over-replication 
of data blocks', 'Hadoop ensures high availability through an active and passive name node configuration', 'The passive name node collects metadata and acts as a backup, ensuring seamless failover without manual intervention']}, {'end': 4393.922, 'segs': [{'end': 3983.309, 'src': 'embed', 'start': 3893.051, 'weight': 0, 'content': [{'end': 3894.172, 'text': 'so can you give me this answer?', 'start': 3893.051, 'duration': 1.121}, {'end': 3907.901, 'text': 'Can you explain the process of spilling in map reduce? Spills to temp folder LFS or when it spill and from where it spill?', 'start': 3895.413, 'duration': 12.488}, {'end': 3910.523, 'text': 'Can I get this answer??', 'start': 3909.762, 'duration': 0.761}, {'end': 3920.49, 'text': 'very good, what usually happens is the output of your mapper task.', 'start': 3912.984, 'duration': 7.506}, {'end': 3921.711, 'text': 'it goes to your wrap.', 'start': 3920.49, 'duration': 1.221}, {'end': 3924.717, 'text': 'now what basically gonna happen?', 'start': 3922.435, 'duration': 2.282}, {'end': 3929.399, 'text': 'they have kept a specific size of that thing.', 'start': 3924.717, 'duration': 4.682}, {'end': 3932.822, 'text': "so let's say they keep, let's say this, 100 MB.", 'start': 3929.399, 'duration': 3.423}, {'end': 3935.383, 'text': 'now 100 MB of data will be kept.', 'start': 3932.822, 'duration': 2.561}, {'end': 3936.724, 'text': "let's say in that.", 'start': 3935.383, 'duration': 1.341}, {'end': 3937.705, 'text': 'but they have.', 'start': 3936.724, 'duration': 0.981}, {'end': 3939.266, 'text': 'they will be keeping a threshold.', 'start': 3937.705, 'duration': 1.561}, {'end': 3941.267, 'text': 'so it will slowly keep on filling up.', 'start': 3939.266, 'duration': 2.001}, {'end': 3942.928, 'text': 'slowly keep on filling up.', 'start': 3941.267, 'duration': 1.661}, {'end': 3944.169, 'text': 'then what will happen?', 'start': 3942.928, 'duration': 1.241}, {'end': 3954.574, 'text': "as soon as it will reach a threshold, let's say 80% of that gram of 100 MB spread, it will start filling that output to the local disk notice.", 'start': 3944.169, 'duration': 10.405}, {'end': 3960.256, 'text': "here I'm not saying HDFS, I am saying local disk.", 'start': 3954.574, 'duration': 5.682}, {'end': 3962.816, 'text': 'okay, to local disk only.', 'start': 3960.256, 'duration': 2.56}, {'end': 3964.817, 'text': 'we will be keeping this data.', 'start': 3962.816, 'duration': 2.001}, {'end': 3968.378, 'text': 'local disk means your C drive D drive wherever you want to keep up.', 'start': 3964.817, 'duration': 3.561}, {'end': 3970.859, 'text': 'so this is how they have designed it.', 'start': 3968.378, 'duration': 2.481}, {'end': 3983.309, 'text': 'So, as soon as the mapper output in the memory reach to a threshold limit, it starts filling that mapper data to your local disk,', 'start': 3971.28, 'duration': 12.029}], 'summary': 'In mapreduce, when mapper output reaches 80% of 100mb, it spills to local disk.', 'duration': 90.258, 'max_score': 3893.051, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg3893051.jpg'}, {'end': 4147.354, 'src': 'embed', 'start': 4073.959, 'weight': 3, 'content': [{'end': 4075.04, 'text': 'so it makes sense.', 'start': 4073.959, 'duration': 1.081}, {'end': 4076.601, 'text': 'record is single line.', 'start': 4075.04, 'duration': 1.561}, {'end': 4080.573, 'text': 'block is set by sdfs, input split is logically split.', 'start': 4077.031, 'duration': 3.542}, {'end': 4081.953, 
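The spill behaviour described above (an in-memory buffer of roughly 100 MB that starts flushing map output to the local disk once it is about 80% full) is controlled by two job properties in Hadoop 2.x. A small hedged sketch of setting them when submitting a job; the values shown are simply the documented defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Size of the in-memory sort buffer that holds map output before it spills.
        conf.setInt("mapreduce.task.io.sort.mb", 100);
        // Fraction of that buffer that may fill before a spill to local disk begins.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        Job job = Job.getInstance(conf, "spill-tuning-demo");
        System.out.println("sort buffer MB = " + job.getConfiguration().get("mapreduce.task.io.sort.mb"));
        // Note: spills land on the node's local disk (mapreduce.cluster.local.dir), not in HDFS.
    }
}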
'text': 'very good, very good.', 'start': 4080.573, 'duration': 1.38}, {'end': 4084.515, 'text': 'so what usually happens, right.', 'start': 4081.953, 'duration': 2.562}, {'end': 4092.098, 'text': "so let's say, when we talk about block, so let's say you have default space is 128 MB, so that will be called as a physical block.", 'start': 4084.515, 'duration': 7.583}, {'end': 4095.16, 'text': 'okay, when we talk about input split, right.', 'start': 4092.098, 'duration': 3.062}, {'end': 4103.924, 'text': "so let's say, if your data is of 130 MB now, in that case don't you think it makes sense to have a logical split of 130 MB here?", 'start': 4095.16, 'duration': 8.764}, {'end': 4105.665, 'text': 'so that will be your input split.', 'start': 4103.924, 'duration': 1.741}, {'end': 4110.287, 'text': 'and record is, when you do MapReduce programming right.', 'start': 4106.145, 'duration': 4.142}, {'end': 4114.189, 'text': 'when you do MapReduce programming, how your mapper take the data.', 'start': 4110.287, 'duration': 3.902}, {'end': 4116.33, 'text': 'it takes line by line right.', 'start': 4114.189, 'duration': 2.141}, {'end': 4117.93, 'text': 'it takes line by line.', 'start': 4116.33, 'duration': 1.6}, {'end': 4119.792, 'text': 'that line is called record.', 'start': 4117.93, 'duration': 1.862}, {'end': 4127.055, 'text': 'So one line of data which it picks up in the mapper phase is called your record.', 'start': 4120.453, 'duration': 6.602}, {'end': 4133.92, 'text': 'Okay, very famous question on this part, so you can say block is a physical division.', 'start': 4127.694, 'duration': 6.226}, {'end': 4138.185, 'text': 'by logical division are called your input splits and records.', 'start': 4133.92, 'duration': 4.265}, {'end': 4142.087, 'text': 'okay, because the logical division is what your MapReduce program do.', 'start': 4138.185, 'duration': 3.902}, {'end': 4147.354, 'text': 'That brings me to another question, again relate to MapReduce.', 'start': 4143.529, 'duration': 3.825}], 'summary': 'In mapreduce, block is 128mb, input split is logically split, and record is line by line.', 'duration': 73.395, 'max_score': 4073.959, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg4073959.jpg'}, {'end': 4259.822, 'src': 'heatmap', 'start': 4038.303, 'weight': 5, 'content': [{'end': 4043.729, 'text': 'if you have done this map reduce part, then you must be knowing this part.', 'start': 4038.303, 'duration': 5.426}, {'end': 4048.133, 'text': 'difference between blocks, input split and records.', 'start': 4043.729, 'duration': 4.404}, {'end': 4049.819, 'text': 'Very good, Jyothi.', 'start': 4048.698, 'duration': 1.121}, {'end': 4050.98, 'text': 'so Jyothi is answering.', 'start': 4049.819, 'duration': 1.161}, {'end': 4054.343, 'text': 'record is a single line of data.', 'start': 4050.98, 'duration': 3.363}, {'end': 4055.624, 'text': 'right, very good.', 'start': 4054.343, 'duration': 1.281}, {'end': 4057.986, 'text': 'Kanan is saying block is hard.', 'start': 4055.624, 'duration': 2.362}, {'end': 4060.508, 'text': 'cut of data, just 128 MB.', 'start': 4057.986, 'duration': 2.522}, {'end': 4064.111, 'text': 'very good, right, block is equal to 128 MB.', 'start': 4060.508, 'duration': 3.603}, {'end': 4067.794, 'text': 'if record is 130 MB, input split will happen.', 'start': 4064.111, 'duration': 3.683}, {'end': 4071.357, 'text': 'very good, block is based on block size, input split.', 'start': 4067.794, 'duration': 3.563}, {'end': 4073.959, 
'text': 'make sure that the line is not broken.', 'start': 4071.357, 'duration': 2.602}, {'end': 4075.04, 'text': 'so it makes sense.', 'start': 4073.959, 'duration': 1.081}, {'end': 4076.601, 'text': 'record is single line.', 'start': 4075.04, 'duration': 1.561}, {'end': 4080.573, 'text': 'block is set by sdfs, input split is logically split.', 'start': 4077.031, 'duration': 3.542}, {'end': 4081.953, 'text': 'very good, very good.', 'start': 4080.573, 'duration': 1.38}, {'end': 4084.515, 'text': 'so what usually happens, right.', 'start': 4081.953, 'duration': 2.562}, {'end': 4092.098, 'text': "so let's say, when we talk about block, so let's say you have default space is 128 MB, so that will be called as a physical block.", 'start': 4084.515, 'duration': 7.583}, {'end': 4095.16, 'text': 'okay, when we talk about input split, right.', 'start': 4092.098, 'duration': 3.062}, {'end': 4103.924, 'text': "so let's say, if your data is of 130 MB now, in that case don't you think it makes sense to have a logical split of 130 MB here?", 'start': 4095.16, 'duration': 8.764}, {'end': 4105.665, 'text': 'so that will be your input split.', 'start': 4103.924, 'duration': 1.741}, {'end': 4110.287, 'text': 'and record is, when you do MapReduce programming right.', 'start': 4106.145, 'duration': 4.142}, {'end': 4114.189, 'text': 'when you do MapReduce programming, how your mapper take the data.', 'start': 4110.287, 'duration': 3.902}, {'end': 4116.33, 'text': 'it takes line by line right.', 'start': 4114.189, 'duration': 2.141}, {'end': 4117.93, 'text': 'it takes line by line.', 'start': 4116.33, 'duration': 1.6}, {'end': 4119.792, 'text': 'that line is called record.', 'start': 4117.93, 'duration': 1.862}, {'end': 4127.055, 'text': 'So one line of data which it picks up in the mapper phase is called your record.', 'start': 4120.453, 'duration': 6.602}, {'end': 4133.92, 'text': 'Okay, very famous question on this part, so you can say block is a physical division.', 'start': 4127.694, 'duration': 6.226}, {'end': 4138.185, 'text': 'by logical division are called your input splits and records.', 'start': 4133.92, 'duration': 4.265}, {'end': 4142.087, 'text': 'okay, because the logical division is what your MapReduce program do.', 'start': 4138.185, 'duration': 3.902}, {'end': 4147.354, 'text': 'That brings me to another question, again relate to MapReduce.', 'start': 4143.529, 'duration': 3.825}, {'end': 4152.559, 'text': 'what is the role of record reader in Hadoop MapReduce?', 'start': 4147.354, 'duration': 5.205}, {'end': 4160.818, 'text': 'what is the role of record reader in Hadoop MapReduce?', 'start': 4155.094, 'duration': 5.724}, {'end': 4164.76, 'text': 'Make sure to read the complete record.', 'start': 4162.499, 'duration': 2.261}, {'end': 4167.182, 'text': 'no, no, how.', 'start': 4164.76, 'duration': 2.422}, {'end': 4170.144, 'text': 'that is how mapper reads a record.', 'start': 4167.182, 'duration': 2.962}, {'end': 4173.386, 'text': 'good, but Digambar, can you give be a little more explicit?', 'start': 4170.144, 'duration': 3.242}, {'end': 4176.006, 'text': "you're coming close to the answer.", 'start': 4173.386, 'duration': 2.62}, {'end': 4180.49, 'text': 'can I get more answer also, Digambar, can you just be a little more explicit?', 'start': 4176.006, 'duration': 4.484}, {'end': 4182.691, 'text': 'that is how mapper reader and record.', 'start': 4180.49, 'duration': 2.201}, {'end': 4183.392, 'text': "you're coming close.", 'start': 4182.691, 'duration': 0.701}, {'end': 4187.952, 'text': 
'what about others?', 'start': 4185.071, 'duration': 2.881}, {'end': 4189.852, 'text': 'what is record?', 'start': 4187.952, 'duration': 1.9}, {'end': 4191.792, 'text': 'record is single line.', 'start': 4189.852, 'duration': 1.94}, {'end': 4193.433, 'text': 'we have read, just understood it.', 'start': 4191.792, 'duration': 1.641}, {'end': 4197.114, 'text': 'so what should be record reader parsing that single line?', 'start': 4193.433, 'duration': 3.681}, {'end': 4198.635, 'text': 'very good work, right.', 'start': 4197.114, 'duration': 1.521}, {'end': 4204.336, 'text': "so don't you think that single write what you are reading and how mapper convert your data.", 'start': 4198.635, 'duration': 5.701}, {'end': 4206.097, 'text': 'it converts into key value pair, right.', 'start': 4204.336, 'duration': 1.761}, {'end': 4209.691, 'text': 'so it initially takes an input as a key value pair.', 'start': 4206.69, 'duration': 3.001}, {'end': 4215.652, 'text': 'so when that conversion is happening that is done basically by record reader.', 'start': 4209.691, 'duration': 5.961}, {'end': 4216.552, 'text': 'look at this.', 'start': 4215.652, 'duration': 0.9}, {'end': 4220.453, 'text': "see, let's say, this is the data it will be getting converted to key value pair,", 'start': 4216.552, 'duration': 3.901}, {'end': 4227.235, 'text': 'where key is called your offset and value is first line right or second line or third line right.', 'start': 4220.453, 'duration': 6.782}, {'end': 4230.736, 'text': 'so this is done by record reader.', 'start': 4227.235, 'duration': 3.501}, {'end': 4234.817, 'text': 'okay, but this is what record reader do now.', 'start': 4230.736, 'duration': 4.081}, {'end': 4238.972, 'text': 'What is the significance of counters in MapReduce?', 'start': 4235.49, 'duration': 3.482}, {'end': 4244.014, 'text': 'Significance of counters in MapReduce?', 'start': 4241.333, 'duration': 2.681}, {'end': 4251.758, 'text': 'Okay, give statistics of data.', 'start': 4248.796, 'duration': 2.962}, {'end': 4254.299, 'text': 'counters will be done in name node.', 'start': 4251.758, 'duration': 2.541}, {'end': 4257.781, 'text': 'okay?. 
Counters to validate the data read.', 'start': 4254.299, 'duration': 3.482}, {'end': 4259.822, 'text': 'to calculate bad records good.', 'start': 4257.781, 'duration': 2.041}], 'summary': 'Understanding mapreduce concepts: blocks, input splits, records, and roles in hadoop mapreduce.', 'duration': 53.132, 'max_score': 4038.303, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg4038303.jpg'}], 'start': 3864.632, 'title': 'Mapreduce internals', 'summary': 'Delves into the process of spilling in mapreduce, where mapper output is stored after reaching a threshold limit, and it also covers the distinction between blocks, input splits, and records in hadoop mapreduce, highlighting block size, input split conditions, role of record reader, and the significance of counters.', 'chapters': [{'end': 4013.26, 'start': 3864.632, 'title': 'Process of spilling in map reduce', 'summary': 'Discusses the process of spilling in map reduce, where the output of the mapper task is stored in a local disk after reaching a threshold limit, typically around 80% of the specified size, providing insights into the internal workings of map reduce.', 'duration': 148.628, 'highlights': ['The output of the mapper task is stored in the local disk after reaching a threshold limit, typically around 80% of the specified size, during the filling phase in Map Reduce.', 'The specified size for the output data is typically 100 MB, and when it reaches around 80% of this size, the data starts filling up in the local disk, not in HDFS.', 'The process of spilling in Map Reduce provides insights into the internal workings of the system, demonstrating how the mapper output is managed and stored.']}, {'end': 4393.922, 'start': 4013.26, 'title': 'Difference between blocks, input splits, and records', 'summary': 'Discusses the difference between blocks, input splits, and records in hadoop mapreduce, where a block is equal to 128 mb, input split occurs if a record is 130 mb, and records are single lines of data. it also covers the role of the record reader in hadoop mapreduce and the significance of counters in mapreduce.', 'duration': 380.662, 'highlights': ['A block is equal to 128 MB. The chapter explains that a block in Hadoop is equivalent to 128 MB, serving as a physical division of data.', 'Input split occurs if a record is 130 MB. It is highlighted that an input split is logically split and occurs when a record is 130 MB in size, ensuring that the line is not broken during processing.', 'Records are single lines of data. The chapter defines records as single lines of data used in MapReduce programming, where the mapper processes the data line by line.', 'The role of the record reader is to convert data into key-value pairs. The role of the record reader in Hadoop MapReduce is explained as the process responsible for converting input data into key-value pairs, where the key is the offset and the value represents lines of data.', 'Counters in MapReduce provide statistics about data operations. 
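A short illustrative mapper makes the block / input split / record distinction concrete: with the default TextInputFormat, the record reader (LineRecordReader) hands the mapper one record per line, keyed by the byte offset of that line within the input split. The word-count style body below is only a placeholder to show the key and value types:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineOffsetMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // 'offset' is the record key supplied by the record reader: the byte position
        // of this line inside the input split. 'line' is the record itself.
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), ONE);
            }
        }
    }
}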
The significance of counters in MapReduce is outlined, indicating that they offer statistics about various data operations, including identifying bad records and other operations, and the ability to print the statistics in the console.']}], 'duration': 529.29, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg3864632.jpg', 'highlights': ['The process of spilling in Map Reduce provides insights into the internal workings of the system, demonstrating how the mapper output is managed and stored.', 'The output of the mapper task is stored in the local disk after reaching a threshold limit, typically around 80% of the specified size, during the filling phase in Map Reduce.', 'The specified size for the output data is typically 100 MB, and when it reaches around 80% of this size, the data starts filling up in the local disk, not in HDFS.', 'A block is equal to 128 MB. The chapter explains that a block in Hadoop is equivalent to 128 MB, serving as a physical division of data.', 'Input split occurs if a record is 130 MB. It is highlighted that an input split is logically split and occurs when a record is 130 MB in size, ensuring that the line is not broken during processing.', 'The role of the record reader is to convert data into key-value pairs. The role of the record reader in Hadoop MapReduce is explained as the process responsible for converting input data into key-value pairs, where the key is the offset and the value represents lines of data.', 'Counters in MapReduce provide statistics about data operations. The significance of counters in MapReduce is outlined, indicating that they offer statistics about various data operations, including identifying bad records and other operations, and the ability to print the statistics in the console.', 'Records are single lines of data. 
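Counters, as discussed around here, are the built-in way to collect per-job statistics such as how many bad records were seen; the framework aggregates them across tasks and the driver can read them back after the job finishes via getCounters. A hedged sketch with a made-up counter group and a simple sanity check:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BadRecordCountingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Increment a custom counter whenever a record fails a simple sanity check.
        if (value.toString().split(",").length < 3) {
            context.getCounter("DataQuality", "BAD_RECORDS").increment(1);
            return; // skip the bad record
        }
        context.write(value, NullWritable.get());
    }

    // In the driver, after job.waitForCompletion(true), the totals can be read back:
    //   long bad = job.getCounters().findCounter("DataQuality", "BAD_RECORDS").getValue();
    //   System.out.println("bad records = " + bad);
}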
The chapter defines records as single lines of data used in MapReduce programming, where the mapper processes the data line by line.']}, {'end': 5180.242, 'segs': [{'end': 4468.782, 'src': 'embed', 'start': 4393.922, 'weight': 6, 'content': [{'end': 4397.004, 'text': 'how many times March, April and all are offering.', 'start': 4393.922, 'duration': 3.082}, {'end': 4399.846, 'text': "let's say, I want to identify all the statistics.", 'start': 4397.004, 'duration': 2.842}, {'end': 4405.677, 'text': 'I can do that with the help of counters very easily, okay.', 'start': 4399.846, 'duration': 5.831}, {'end': 4408.997, 'text': 'so this is the purpose of your counters.', 'start': 4405.677, 'duration': 3.32}, {'end': 4420.44, 'text': 'Now, moving further, why the output of math pass was filled into the local disk and not in HDSS?', 'start': 4410.418, 'duration': 10.022}, {'end': 4421.24, 'text': 'Good question.', 'start': 4420.64, 'duration': 0.6}, {'end': 4422.641, 'text': 'now can you answer me this?', 'start': 4421.24, 'duration': 1.401}, {'end': 4430.423, 'text': 'Remember, we talked about yes, we have just discussed that question, right, so we have just seen spilling.', 'start': 4422.681, 'duration': 7.742}, {'end': 4431.303, 'text': 'we have seen spilling.', 'start': 4430.423, 'duration': 0.88}, {'end': 4435.101, 'text': 'How can we assess counters?', 'start': 4433.78, 'duration': 1.321}, {'end': 4437.462, 'text': 'There are basically libraries available for that.', 'start': 4435.181, 'duration': 2.281}, {'end': 4438.863, 'text': 'there are classes available.', 'start': 4437.462, 'duration': 1.401}, {'end': 4440.424, 'text': 'get counters.', 'start': 4438.863, 'duration': 1.561}, {'end': 4442.566, 'text': 'is the member function to get that?', 'start': 4440.424, 'duration': 2.142}, {'end': 4444.687, 'text': "this is how, basically, we'll be assessing it.", 'start': 4442.566, 'duration': 2.121}, {'end': 4448.89, 'text': 'so counter is the class in which we have member functions, predefined functions, using that.', 'start': 4444.687, 'duration': 4.203}, {'end': 4454.453, 'text': "you can assess that, because it's an intermediate output.", 'start': 4448.89, 'duration': 5.563}, {'end': 4459.677, 'text': "okay, intermediate output, that's fine, but why we are not keeping an SGFS?", 'start': 4454.453, 'duration': 5.224}, {'end': 4460.977, 'text': "that's my question.", 'start': 4459.677, 'duration': 1.3}, {'end': 4462.779, 'text': 'when I can keep intermediate output in SGFS?', 'start': 4460.977, 'duration': 1.802}, {'end': 4468.782, 'text': 'very good, very good Ramya, very good, Narasimha.', 'start': 4464.739, 'duration': 4.043}], 'summary': 'Discussion on using counters to assess statistics and intermediate output storage.', 'duration': 74.86, 'max_score': 4393.922, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg4393922.jpg'}, {'end': 4588.585, 'src': 'embed', 'start': 4559.46, 'weight': 3, 'content': [{'end': 4564.421, 'text': 'it will start running a duplicated task for it to ensure that basically,', 'start': 4559.46, 'duration': 4.961}, {'end': 4570.742, 'text': 'that duplicate task runs faster and once it will finish it will kill all the duplicate tasks.', 'start': 4564.421, 'duration': 6.321}, {'end': 4576.583, 'text': 'so it is just kind of making sure that, because it can happen right, maybe your task is waiting due to some resource,', 'start': 4570.742, 'duration': 5.841}, {'end': 4578.703, 'text': 'it got blocked due to any 
reason.', 'start': 4576.583, 'duration': 2.12}, {'end': 4588.585, 'text': 'so in that case it will start immediately a duplicate task and making sure that your job finished quickly.', 'start': 4578.703, 'duration': 9.882}], 'summary': 'Duplicate task runs faster to ensure quick job completion.', 'duration': 29.125, 'max_score': 4559.46, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg4559460.jpg'}, {'end': 4689.292, 'src': 'embed', 'start': 4662.186, 'weight': 4, 'content': [{'end': 4666.05, 'text': 'basically, in method one, you can just basically increase the size itself.', 'start': 4662.186, 'duration': 3.864}, {'end': 4667.271, 'text': 'it will be all good.', 'start': 4666.05, 'duration': 1.221}, {'end': 4668.452, 'text': 'or what you can do.', 'start': 4667.271, 'duration': 1.181}, {'end': 4678.461, 'text': 'you can go to your input format class and in that you can just update this property you can make this is splitable to be returning first.', 'start': 4668.452, 'duration': 10.009}, {'end': 4686.09, 'text': "usually people prefer method one, because that's the easiest method, right, you need not update the Java code basically to achieve all this.", 'start': 4678.906, 'duration': 7.184}, {'end': 4687.932, 'text': 'so what we will do for this?', 'start': 4686.09, 'duration': 1.842}, {'end': 4689.292, 'text': 'can you please tell us?', 'start': 4687.932, 'duration': 1.36}], 'summary': 'Method one allows increasing size without updating java code, preferred by many.', 'duration': 27.106, 'max_score': 4662.186, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg4662186.jpg'}, {'end': 4761.158, 'src': 'embed', 'start': 4727.617, 'weight': 5, 'content': [{'end': 4733.178, 'text': "Is it legal to set the number of reducer task to zero? That's question number one.", 'start': 4727.617, 'duration': 5.561}, {'end': 4735.999, 'text': 'Where the output will be stored in this case?', 'start': 4733.198, 'duration': 2.801}, {'end': 4741.08, 'text': 'Is it legal to set the number of reducer task to zero?', 'start': 4737.079, 'duration': 4.001}, {'end': 4745.201, 'text': 'Is it legal? Definitely legal, right?', 'start': 4741.98, 'duration': 3.221}, {'end': 4748.981, 'text': 'Everybody have your scope? 
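Two job-level knobs correspond to the last two questions: speculative execution (the duplicate "backup" attempts described above) can be switched on or off per job, and a file can be kept from being split either by raising the minimum split size past the file size or by overriding isSplitable in the input format. A hedged driver-side sketch; the property values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitAndSpeculationSettings {

    // Method 2 from the discussion: an input format whose files are never split,
    // so each file goes to exactly one mapper regardless of its size.
    public static class UnsplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Method 1: push the minimum split size above the file size (here ~10 GB),
        // so the logical split never cuts the file.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 10L * 1024 * 1024 * 1024);
        // Speculative execution toggles for map and reduce attempts.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
        System.out.println(conf.get("mapreduce.input.fileinputformat.split.minsize"));
    }
}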
In scope.', 'start': 4746.201, 'duration': 2.78}, {'end': 4750.242, 'text': 'was there any reducer?', 'start': 4748.981, 'duration': 1.261}, {'end': 4752.742, 'text': 'Was there any reducer in scope?', 'start': 4751.102, 'duration': 1.64}, {'end': 4757.063, 'text': 'No scope, only use mapper.', 'start': 4754.202, 'duration': 2.861}, {'end': 4759.557, 'text': 'no reducer right.', 'start': 4757.856, 'duration': 1.701}, {'end': 4761.158, 'text': "that's itself a tool right.", 'start': 4759.557, 'duration': 1.601}], 'summary': 'It is legal to set the number of reducer tasks to zero, resulting in no reducers being used, only the mapper.', 'duration': 33.541, 'max_score': 4727.617, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg4727617.jpg'}, {'end': 5033.941, 'src': 'heatmap', 'start': 4781.193, 'weight': 0.841, 'content': [{'end': 4788.102, 'text': "but it's not mandatory that all the problem statement in the word require aggregation right.", 'start': 4781.193, 'duration': 6.909}, {'end': 4793.084, 'text': 'so those problems where you do not require aggregation, like I said, scoop.', 'start': 4788.102, 'duration': 4.982}, {'end': 4794.665, 'text': 'because in scoop what happens?', 'start': 4793.084, 'duration': 1.581}, {'end': 4798.966, 'text': 'you copy the data from RDBMS to your HDBFS or vice versa.', 'start': 4794.665, 'duration': 4.301}, {'end': 4801.067, 'text': 'now, are you doing any sort of aggregation?', 'start': 4798.966, 'duration': 2.101}, {'end': 4806.169, 'text': 'no, right, you are just copying the file from RDBMS to your HDBFS system.', 'start': 4801.067, 'duration': 5.102}, {'end': 4807.529, 'text': 'no aggregation required.', 'start': 4806.169, 'duration': 1.36}, {'end': 4812.299, 'text': 'so in those cases you will be having reducer as low.', 'start': 4807.529, 'duration': 4.77}, {'end': 4820.204, 'text': 'Output where to store? Definitely whatever mapper output is coming wherever you are telling it will be stored in that in CFS location.', 'start': 4812.799, 'duration': 7.405}, {'end': 4823.287, 'text': "So that's how you will be using it.", 'start': 4820.505, 'duration': 2.782}, {'end': 4830.011, 'text': 'So definitely the answer is yes and basically wherever the mapper output you are keeping there it will be getting stored.', 'start': 4823.307, 'duration': 6.704}, {'end': 4839.566, 'text': 'What is the role of application master in MapReduce job? Can I get this answer? This is a very famous interview question.', 'start': 4830.792, 'duration': 8.774}, {'end': 4849.954, 'text': "What is the role of application master in MapReduce job? 
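Setting the reducer count to zero is exactly how map-only jobs (Sqoop-style copies, simple filters) are run; the mapper output then goes straight to the configured output path in HDFS, with no shuffle or sort. A minimal hedged driver sketch with placeholder paths, using the identity mapper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-copy");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(Mapper.class);            // identity mapper: passes records through
        job.setNumReduceTasks(0);                    // legal: no shuffle, no sort, no reducers
        job.setOutputKeyClass(LongWritable.class);   // default TextInputFormat key: byte offset
        job.setOutputValueClass(Text.class);         // default TextInputFormat value: the line
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/map-only-output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}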
To assign the task, okay, that's it.", 'start': 4840.307, 'duration': 9.647}, {'end': 4854.377, 'text': 'Only to assign the task, sets the input split, okay.', 'start': 4850.454, 'duration': 3.923}, {'end': 4861.822, 'text': "It manages the application fired and keep track of subprocess, that's it, nothing else.", 'start': 4855.337, 'duration': 6.485}, {'end': 4867.266, 'text': 'To get the resources needed for the task, very good Ramya, now you are coming here.', 'start': 4862.763, 'duration': 4.503}, {'end': 4871.073, 'text': 'you are coming close to create task until antenna turns?', 'start': 4867.591, 'duration': 3.482}, {'end': 4875.336, 'text': 'Yes, what basically application master do?', 'start': 4871.353, 'duration': 3.983}, {'end': 4884.102, 'text': 'First thing is, application master is kind of deciding that how many resources it need.', 'start': 4875.536, 'duration': 8.566}, {'end': 4894.309, 'text': 'okay, and it can basically now inform resource manager that I require this many resource and give me this many resource to execute that.', 'start': 4884.102, 'duration': 10.207}, {'end': 4897.87, 'text': 'then, after that, resource manager give back the container right.', 'start': 4894.309, 'duration': 3.561}, {'end': 4903.951, 'text': 'if you might have gone through a yarn architecture right there, you might have understood all this portion right.', 'start': 4897.87, 'duration': 6.081}, {'end': 4905.111, 'text': 'so basically, what happens?', 'start': 4903.951, 'duration': 1.16}, {'end': 4909.072, 'text': 'it first basically find out how many resources are required.', 'start': 4905.111, 'duration': 3.961}, {'end': 4917.237, 'text': 'Secondly, what it want to do is it basically want to find out, once it happens, when it gets all the container,', 'start': 4909.707, 'duration': 7.53}, {'end': 4920.742, 'text': 'it kind of basically get them working together.', 'start': 4917.237, 'duration': 3.505}, {'end': 4924.387, 'text': 'it collect the output back and return it to basically the master node.', 'start': 4920.742, 'duration': 3.645}, {'end': 4928.128, 'text': 'So it is doing multiple things in MapReduce.', 'start': 4924.747, 'duration': 3.381}, {'end': 4930.348, 'text': "It's basically not doing just one task.", 'start': 4928.348, 'duration': 2}, {'end': 4934.149, 'text': 'It is also checking which is a very important role of it.', 'start': 4930.689, 'duration': 3.46}, {'end': 4943.852, 'text': 'It is checking that how many resources I require and kind of helping the resource manager to take that decision.', 'start': 4934.269, 'duration': 9.583}, {'end': 4946.733, 'text': 'So this is what the same thing being explained here.', 'start': 4944.152, 'duration': 2.581}, {'end': 4950.573, 'text': 'The role of your application master.', 'start': 4947.193, 'duration': 3.38}, {'end': 4952.494, 'text': 'Which brings me to another question.', 'start': 4951.114, 'duration': 1.38}, {'end': 4956.241, 'text': 'what do you mean by over mode?', 'start': 4953.24, 'duration': 3.001}, {'end': 4964.822, 'text': "So when your MapReduce job runs I don't I'm not sure whether you have noticed that or not there is something called as over mode.", 'start': 4956.781, 'duration': 8.041}, {'end': 4967.823, 'text': "it's something comes in your console.", 'start': 4964.822, 'duration': 3.001}, {'end': 4970.064, 'text': 'if you have noticed that, so can you tell me?', 'start': 4967.823, 'duration': 2.241}, {'end': 4973.124, 'text': 'what is that over mode?', 'start': 4970.064, 'duration': 3.06}, {'end': 4976.625, 
'text': 'is there any advantage of switching on the over mode?', 'start': 4973.124, 'duration': 3.501}, {'end': 4981.546, 'text': 'what basically over mode is going to do runs on application master.', 'start': 4976.625, 'duration': 4.921}, {'end': 4981.866, 'text': 'very good.', 'start': 4981.546, 'duration': 0.32}, {'end': 4985.287, 'text': 'Can I get some more answers, more insight on this?', 'start': 4982.366, 'duration': 2.921}, {'end': 4988.368, 'text': 'Can I get some more answers??', 'start': 4987.348, 'duration': 1.02}, {'end': 4998.912, 'text': "Yes, so let's say if you have a small job, if you have a small job, in that case you are basically again.", 'start': 4988.968, 'duration': 9.944}, {'end': 5001.393, 'text': 'application master need to be anywhere now.', 'start': 4998.912, 'duration': 2.481}, {'end': 5005.954, 'text': 'application master will request and then it will allocate container.', 'start': 5001.393, 'duration': 4.561}, {'end': 5009.255, 'text': 'so container sending and all basically creating container.', 'start': 5005.954, 'duration': 3.301}, {'end': 5010.876, 'text': 'all those things are time consuming.', 'start': 5009.255, 'duration': 1.621}, {'end': 5017.875, 'text': 'if the jobs are small, what your application master can do?', 'start': 5011.673, 'duration': 6.202}, {'end': 5020.936, 'text': 'application master can start a JVM.', 'start': 5017.875, 'duration': 3.061}, {'end': 5022.697, 'text': 'in itself, okay.', 'start': 5020.936, 'duration': 1.761}, {'end': 5026.838, 'text': 'so basically, the application master can decide to complete your job.', 'start': 5022.697, 'duration': 4.141}, {'end': 5031.16, 'text': 'because the job is small, it may decide to complete the job on its own.', 'start': 5026.838, 'duration': 4.322}, {'end': 5033.941, 'text': 'in that case, we call it as over mode.', 'start': 5031.16, 'duration': 2.781}], 'summary': 'Scoop is used for non-aggregation data transfer. application master assigns tasks in mapreduce job and can run in over mode for small jobs.', 'duration': 252.748, 'max_score': 4781.193, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg4781193.jpg'}, {'end': 4861.822, 'src': 'embed', 'start': 4830.792, 'weight': 2, 'content': [{'end': 4839.566, 'text': 'What is the role of application master in MapReduce job? Can I get this answer? This is a very famous interview question.', 'start': 4830.792, 'duration': 8.774}, {'end': 4849.954, 'text': "What is the role of application master in MapReduce job? 
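The "over mode" being described here is uber mode: for a sufficiently small job, the ApplicationMaster runs the map (and any reduce) tasks inside its own JVM instead of requesting separate containers from the ResourceManager, which removes the container start-up overhead. A hedged sketch of switching it on; the size thresholds are shown at what are, to the best of my knowledge, the usual defaults, and the small-files trick mentioned just after this is noted in the final comment:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberModeJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Let the ApplicationMaster execute small jobs in its own JVM ("uber" mode).
        conf.setBoolean("mapreduce.job.ubertask.enable", true);
        // A job only qualifies as "uber" if it stays under these thresholds.
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
        Job job = Job.getInstance(conf, "small-uber-job");
        System.out.println("uber enabled: " + job.getConfiguration().get("mapreduce.job.ubertask.enable"));
        // For many small input files, pairing this with CombineTextInputFormat
        // (job.setInputFormatClass(CombineTextInputFormat.class)) packs several files
        // into one split, cutting the number of map tasks and the overall runtime.
    }
}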
To assign the task, okay, that's it.", 'start': 4840.307, 'duration': 9.647}, {'end': 4854.377, 'text': 'Only to assign the task, sets the input split, okay.', 'start': 4850.454, 'duration': 3.923}, {'end': 4861.822, 'text': "It manages the application fired and keep track of subprocess, that's it, nothing else.", 'start': 4855.337, 'duration': 6.485}], 'summary': 'Application master assigns tasks and manages subprocesses in mapreduce job.', 'duration': 31.03, 'max_score': 4830.792, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg4830792.jpg'}, {'end': 5088.371, 'src': 'embed', 'start': 5058.582, 'weight': 1, 'content': [{'end': 5064.144, 'text': 'your application master will start acting like a JVM and will finish the job.', 'start': 5058.582, 'duration': 5.562}, {'end': 5069.466, 'text': 'so thus it will save you to by creating container, making your performance better.', 'start': 5064.144, 'duration': 5.322}, {'end': 5074.547, 'text': 'so usually what people do is whenever they will be having some sort of,', 'start': 5069.466, 'duration': 5.081}, {'end': 5082.49, 'text': 'whenever they will be having small jobs right and they want to improve the performance, they usually kind of enable the uber mode.', 'start': 5074.547, 'duration': 7.943}, {'end': 5088.371, 'text': 'so now the application master itself start executing the task and that way they improve the performance.', 'start': 5082.49, 'duration': 5.881}], 'summary': 'Enabling uber mode allows the application master to execute small jobs, improving performance by acting like a jvm and creating containers.', 'duration': 29.789, 'max_score': 5058.582, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg5058582.jpg'}, {'end': 5186.265, 'src': 'embed', 'start': 5157.472, 'weight': 0, 'content': [{'end': 5162.914, 'text': 'like these are some small files, it basically now combined all the files together.', 'start': 5157.472, 'duration': 5.442}, {'end': 5169.117, 'text': 'now, because these small files got combined together, now my execution time will be passed.', 'start': 5162.914, 'duration': 6.203}, {'end': 5173.299, 'text': 'this is basically one practical thing which is being represented.', 'start': 5169.117, 'duration': 4.182}, {'end': 5180.242, 'text': 'this performance, can you see like the small files was taking this much of time and basically, with this, when you combine this,', 'start': 5173.299, 'duration': 6.943}, {'end': 5186.265, 'text': 'it actually started taking less time, so that improving the performance of your system.', 'start': 5180.242, 'duration': 6.023}], 'summary': 'Combining small files reduced execution time, improving system performance.', 'duration': 28.793, 'max_score': 5157.472, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg5157472.jpg'}], 'start': 4393.922, 'title': 'Counters, outputs, hdfs replication and mapreduce', 'summary': 'Covers the use of counters to identify statistics, intermediate output handling, hdfs replication impact, speculative execution, file splitting prevention, reducer tasks, application master role, uber mode, and performance enhancement using combined file input format in mapreduce jobs.', 'chapters': [{'end': 4468.782, 'start': 4393.922, 'title': 'Assessing counters and intermediate outputs', 'summary': 'Discusses the purpose of counters for identifying statistics, the output of math pass being filled into the 
local disk, and the assessment of counters using predefined functions, with a focus on why intermediate output is not kept in sgfs.', 'duration': 74.86, 'highlights': ['The purpose of counters is to identify statistics, which can be done easily with the help of counters.', 'The output of math pass was filled into the local disk instead of HDSS, prompting a question about the reason for this choice.', "Assessing counters involves using predefined member functions and libraries, such as 'get counters', to evaluate intermediate outputs.", 'Questioning the decision to not keep intermediate output in SGFS is highlighted, prompting a discussion on the topic.']}, {'end': 5180.242, 'start': 4468.782, 'title': 'Hdfs replication and mapreduce optimization', 'summary': 'Discusses the impacts of hdfs replication factor, speculative execution, file splitting prevention, reducer tasks, application master role, uber mode, and performance enhancement using combined file input format in mapreduce jobs.', 'duration': 711.46, 'highlights': ['Speculative Execution The speculative execution duplicates slow tasks to ensure faster job completion, addressing resource blockages.', 'Preventing File Splitting Increasing the minimum split size or updating the input format class are methods to prevent file splitting, useful in scenarios where file splitting is not needed, such as when dealing with only one data node.', 'Number of Reducer Tasks Setting the number of reducer tasks to zero is legal for problems not requiring aggregation, and the mapper output is stored in the specified location.', 'Application Master Role The application master manages resource allocation, container creation, and output collection, playing a crucial role in the MapReduce job execution.', 'Uber Mode Uber mode allows the application master to act like a JVM for small jobs, improving performance by reducing time-consuming processes, and can be enabled by setting a specific property to true.', 'Performance Enhancement with Combined File Input Format Combined file input format packages small files together, reducing execution time and improving job performance when dealing with many small files.']}], 'duration': 786.32, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg4393922.jpg', 'highlights': ['Performance Enhancement with Combined File Input Format packages small files together, reducing execution time and improving job performance when dealing with many small files.', 'Uber Mode allows the application master to act like a JVM for small jobs, improving performance by reducing time-consuming processes, and can be enabled by setting a specific property to true.', 'Application Master Role manages resource allocation, container creation, and output collection, playing a crucial role in the MapReduce job execution.', 'Speculative Execution duplicates slow tasks to ensure faster job completion, addressing resource blockages.', 'Preventing File Splitting by increasing the minimum split size or updating the input format class are methods to prevent file splitting, useful in scenarios where file splitting is not needed, such as when dealing with only one data node.', 'Number of Reducer Tasks setting the number of reducer tasks to zero is legal for problems not requiring aggregation, and the mapper output is stored in the specified location.', "Assessing counters involves using predefined member functions and libraries, such as 'get counters', to evaluate intermediate outputs.", 'The purpose 
of counters is to identify statistics, which can be done easily with the help of counters.', 'The output of math pass was filled into the local disk instead of HDSS, prompting a question about the reason for this choice.', 'Questioning the decision to not keep intermediate output in SGFS is highlighted, prompting a discussion on the topic.']}, {'end': 6968.89, 'segs': [{'end': 5282.928, 'src': 'embed', 'start': 5180.242, 'weight': 0, 'content': [{'end': 5186.265, 'text': 'it actually started taking less time, so that improving the performance of your system.', 'start': 5180.242, 'duration': 6.023}, {'end': 5195.559, 'text': 'So this is what this is how basically you can improve the performance if you have multiple small files.', 'start': 5186.837, 'duration': 8.722}, {'end': 5199.059, 'text': "Now let's move to Hive.", 'start': 5196.519, 'duration': 2.54}, {'end': 5201.94, 'text': 'Hive is very important topic.', 'start': 5199.059, 'duration': 2.881}, {'end': 5209.021, 'text': 'now, question for you where the data of Hive table gets stored?', 'start': 5201.94, 'duration': 7.081}, {'end': 5213.322, 'text': 'I know that you are going to get SGFS, but where is the location of that?', 'start': 5209.101, 'duration': 4.221}, {'end': 5215.622, 'text': 'Where is the location for that?', 'start': 5214.182, 'duration': 1.44}, {'end': 5219.139, 'text': 'but what is that default folder next version?', 'start': 5216.397, 'duration': 2.742}, {'end': 5228.728, 'text': 'very good, now you are coming close, so by default, to keep it in slash user slash hive slash warehouse.', 'start': 5219.139, 'duration': 9.589}, {'end': 5236.414, 'text': 'so this is the location where, by default, all your hive table gets stored.', 'start': 5228.728, 'duration': 7.686}, {'end': 5242.479, 'text': 'if you want to change this, you can go to your highside.xml and can update this setting as well.', 'start': 5236.414, 'duration': 6.065}, {'end': 5250.998, 'text': 'another question Why HDFS is not used by Hive Metastore for storage?', 'start': 5242.479, 'duration': 8.519}, {'end': 5264.285, 'text': 'What I mean is if you might have read in your course that you keep your Hive, basically your Metastore, in your RDPMS right? Not in HDFS.', 'start': 5251.638, 'duration': 12.647}, {'end': 5266.786, 'text': 'what is the reason behind that?', 'start': 5264.285, 'duration': 2.501}, {'end': 5273.125, 'text': 'why we not keep all these things in my HDFS?', 'start': 5267.583, 'duration': 5.542}, {'end': 5277.626, 'text': 'why your meta store is created in your RDBMS?', 'start': 5273.125, 'duration': 4.501}, {'end': 5282.928, 'text': 'why you are configuring your meta store in RDBMS for I?', 'start': 5277.626, 'duration': 5.302}], 'summary': 'Improving system performance by storing hive tables in default /user/hive/warehouse location. 
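On the Hive side, the default table location and the metastore placement are both configuration matters: managed tables land under /user/hive/warehouse in HDFS unless hive.metastore.warehouse.dir says otherwise (the "highside.xml" heard in the recording is hive-site.xml), and the metastore lives in an RDBMS because it needs fast random reads and row-level updates (CRUD) that HDFS does not offer. A hedged sketch of the relevant property names, using a hypothetical MySQL connection string and placeholder credentials; in practice these entries go into hive-site.xml rather than code:

import org.apache.hadoop.conf.Configuration;

public class HiveStorageSettings {
    public static void main(String[] args) {
        Configuration hiveSite = new Configuration(false); // stand-in for hive-site.xml entries
        // Where managed Hive tables are stored in HDFS (this is the default value).
        hiveSite.set("hive.metastore.warehouse.dir", "/user/hive/warehouse");
        // Metastore backed by an RDBMS (hypothetical MySQL host and database), not by HDFS,
        // because the metastore needs low-latency lookups and in-place updates.
        hiveSite.set("javax.jdo.option.ConnectionURL",
                "jdbc:mysql://metastore-db.example.com:3306/hive_metastore?createDatabaseIfNotExist=true");
        hiveSite.set("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver");
        hiveSite.set("javax.jdo.option.ConnectionUserName", "hiveuser");
        hiveSite.set("javax.jdo.option.ConnectionPassword", "hivepassword");
        System.out.println(hiveSite.get("hive.metastore.warehouse.dir"));
    }
}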
, {'end': 6968.89, 'segs': [{'end': 5282.928, 'src': 'embed', 'start': 5180.242, 'weight': 0, 'content': [{'end': 5186.265, 'text': 'it actually started taking less time, so that improves the performance of your system.', 'start': 5180.242, 'duration': 6.023}, {'end': 5195.559, 'text': 'So this is how, basically, you can improve the performance if you have multiple small files.', 'start': 5186.837, 'duration': 8.722}, {'end': 5199.059, 'text': "Now let's move to Hive.", 'start': 5196.519, 'duration': 2.54}, {'end': 5201.94, 'text': 'Hive is a very important topic.', 'start': 5199.059, 'duration': 2.881}, {'end': 5209.021, 'text': 'now, a question for you: where does the data of a Hive table get stored?', 'start': 5201.94, 'duration': 7.081}, {'end': 5213.322, 'text': 'I know that you are going to say HDFS, but where is the location of that?', 'start': 5209.101, 'duration': 4.221}, {'end': 5215.622, 'text': 'Where is the location for that?', 'start': 5214.182, 'duration': 1.44}, {'end': 5219.139, 'text': 'but what is that default folder?', 'start': 5216.397, 'duration': 2.742}, {'end': 5228.728, 'text': 'very good, now you are coming close. so, by default, it is kept in /user/hive/warehouse.', 'start': 5219.139, 'duration': 9.589}, {'end': 5236.414, 'text': 'so this is the location where, by default, all your Hive tables get stored.', 'start': 5228.728, 'duration': 7.686}, {'end': 5242.479, 'text': 'if you want to change this, you can go to your hive-site.xml and update this setting as well.', 'start': 5236.414, 'duration': 6.065}, {'end': 5250.998, 'text': 'another question: why is HDFS not used by the Hive Metastore for storage?', 'start': 5242.479, 'duration': 8.519}, {'end': 5264.285, 'text': 'What I mean is, you might have read in your course that you keep your Hive Metastore in your RDBMS, right? Not in HDFS.', 'start': 5251.638, 'duration': 12.647}, {'end': 5266.786, 'text': 'what is the reason behind that?', 'start': 5264.285, 'duration': 2.501}, {'end': 5273.125, 'text': 'why do we not keep all these things in HDFS?', 'start': 5267.583, 'duration': 5.542}, {'end': 5277.626, 'text': 'why is your metastore created in your RDBMS?', 'start': 5273.125, 'duration': 4.501}, {'end': 5282.928, 'text': 'why are you configuring your metastore in an RDBMS for Hive?', 'start': 5277.626, 'duration': 5.302}], 'summary': 'Improving system performance by storing Hive tables in the default /user/hive/warehouse location. The Hive Metastore is stored in an RDBMS, not in HDFS, for efficiency.', 'duration': 102.686, 'max_score': 5180.242, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg5180242.jpg'}, {'end': 5453.57, 'src': 'heatmap', 'start': 5360.734, 'weight': 1, 'content': [{'end': 5362.795, 'text': 'okay, you are saying you are referring to the same thing.', 'start': 5360.734, 'duration': 2.061}, {'end': 5363.396, 'text': 'that is correct.', 'start': 5362.795, 'duration': 0.601}, {'end': 5364.617, 'text': 'then you are good.', 'start': 5363.396, 'duration': 1.221}, {'end': 5367.399, 'text': 'can you do this, basically, in your HDFS?', 'start': 5364.617, 'duration': 2.782}, {'end': 5371.362, 'text': 'no, right? this itself is a good answer to explain this, right?', 'start': 5367.399, 'duration': 3.963}, {'end': 5372.15, 'text': 'so with.', 'start': 5371.69, 'duration': 0.46}, {'end': 5376.473, 'text': 'and the second thing is, definitely, in an RDBMS the syncing time is going to be fast enough.', 'start': 5372.15, 'duration': 4.323}, {'end': 5377.934, 'text': 'those are the other factors.', 'start': 5376.473, 'duration': 1.461}, {'end': 5382.838, 'text': 'but the first thing is, basically, your CRUD operations cannot be done in HDFS.', 'start': 5377.934, 'duration': 4.904}, {'end': 5387.581, 'text': "that's the reason we will not be able to keep it in HDFS.", 'start': 5382.838, 'duration': 4.743}, {'end': 5396.707, 'text': 'forget about the other factors, right? they do not make any sense, in fact, because my first property itself is failing, basically the CRUD operations.', 'start': 5387.581, 'duration': 9.126}, {'end': 5400.273, 'text': "Let's see some scenario questions.", 'start': 5398.132, 'duration': 2.141}, {'end': 5404.856, 'text': 'usually in Hive you will find some scenario questions coming up now.', 'start': 5400.273, 'duration': 4.583}, {'end': 5413.081, 'text': 'the scenario question is: suppose I have installed Apache Hive on top of my Hadoop cluster.', 'start': 5404.856, 'duration': 8.225}, {'end': 5414.642, 'text': 'can you please show the last answer?', 'start': 5413.081, 'duration': 1.561}, {'end': 5418.905, 'text': 'sure, why not? see this answer.', 'start': 5414.642, 'duration': 4.263}, {'end': 5422.527, 'text': "yeah, let's move forward now.", 'start': 5418.905, 'duration': 3.622}, {'end': 5423.748, 'text': 'can we see this question now?', 'start': 5422.527, 'duration': 1.221}, {'end': 5434.238, 'text': 'Suppose I have installed Apache Hive on top of my Hadoop cluster using the default Metastore configuration.', 'start': 5424.511, 'duration': 9.727}, {'end': 5442.724, 'text': 'then what will happen if we have multiple clients trying to access Hive at the same time?', 'start': 5434.238, 'duration': 8.486}, {'end': 5443.925, 'text': 'Can I get this answer?', 'start': 5443.164, 'duration': 0.761}, {'end': 5453.57, 'text': 'Suppose I have installed Apache Hive on top of my Hadoop cluster using the default Metastore configuration.', 'start': 5444.605, 'duration': 8.965}], 'summary': 'HDFS cannot handle CRUD operations, an RDBMS syncs faster, and Hive scenario questions arise.', 'duration': 92.836, 'max_score': 5360.734, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg5360734.jpg'}, {'end': 5941.269, 'src': 'heatmap', 'start': 5671.098, 'weight': 4, 'content': [{'end': 5676.542, 'text': 'also, order by: one of them uses a reducer, the other uses mappers.', 'start': 5671.098, 'duration': 5.444}, {'end': 5682.246, 'text': "okay, when you use 
group by operation, no, that's not the way.", 'start': 5676.542, 'duration': 5.704}, {'end': 5696.503, 'text': 'actually what happens is if you have huge data set, in that case you should use basically the sort by option.', 'start': 5683.052, 'duration': 13.451}, {'end': 5703.969, 'text': 'it usually do this sorting on multiple reducer while order by do it on one reducer.', 'start': 5696.503, 'duration': 7.466}, {'end': 5707.231, 'text': 'that is basically the major difference.', 'start': 5703.969, 'duration': 3.262}, {'end': 5715.103, 'text': 'so when you have huge data set, use sort by instead of order by.', 'start': 5707.231, 'duration': 7.872}, {'end': 5717.604, 'text': 'okay, lot of people remain confused with this.', 'start': 5715.103, 'duration': 2.501}, {'end': 5719.644, 'text': "that's why this is a very tricky question.", 'start': 5717.604, 'duration': 2.04}, {'end': 5721.025, 'text': 'what people are usually?', 'start': 5719.644, 'duration': 1.381}, {'end': 5722.505, 'text': 'if you ask anyone right?', 'start': 5721.025, 'duration': 1.48}, {'end': 5727.607, 'text': "if you do not go, the answer you will tell you both do the same thing, but actually that's not the fact there is a difference.", 'start': 5722.505, 'duration': 5.102}, {'end': 5733.108, 'text': "Now another question what's the difference between partition and bucket in Hive?", 'start': 5728.547, 'duration': 4.561}, {'end': 5735.409, 'text': 'I think the most easiest question to answer.', 'start': 5733.108, 'duration': 2.301}, {'end': 5741.191, 'text': "can everybody answer this, whoever have done on Hive topic what's the difference between partition and bucket?", 'start': 5735.409, 'duration': 5.782}, {'end': 5745.181, 'text': 'this is the most easiest answer.', 'start': 5742.418, 'duration': 2.763}, {'end': 5746.943, 'text': 'can I get this answer?', 'start': 5745.181, 'duration': 1.762}, {'end': 5748.784, 'text': 'difference between partition and bucket?', 'start': 5746.943, 'duration': 1.841}, {'end': 5750.066, 'text': 'simple right.', 'start': 5748.784, 'duration': 1.282}, {'end': 5753.269, 'text': 'partition is basically at the first level right.', 'start': 5750.066, 'duration': 3.203}, {'end': 5759.675, 'text': 'when you split the data into different directly, bucket is like a sub partitions of that right.', 'start': 5753.269, 'duration': 6.406}, {'end': 5765.841, 'text': 'so even for that partition itself, when you create another sub partitions, you can call them as bucket.', 'start': 5759.675, 'duration': 6.166}, {'end': 5767.232, 'text': 'like in this case.', 'start': 5766.451, 'duration': 0.781}, {'end': 5772.397, 'text': 'can you see, the first partition is TLC department, civil department, electrical department,', 'start': 5767.232, 'duration': 5.165}, {'end': 5778.864, 'text': 'but after that we have also created some sub partitions of it, and that is your pocket.', 'start': 5772.397, 'duration': 6.467}, {'end': 5780.426, 'text': "that's basically the difference.", 'start': 5778.864, 'duration': 1.562}, {'end': 5782.228, 'text': 'Another question.', 'start': 5781.147, 'duration': 1.081}, {'end': 5784.17, 'text': "let's say this is the scenario.", 'start': 5782.228, 'duration': 1.942}, {'end': 5786.913, 'text': 'you are creating a transition table.', 'start': 5784.17, 'duration': 2.743}, {'end': 5789.655, 'text': 'Now this is the table.', 'start': 5787.674, 'duration': 1.981}, {'end': 5792.957, 'text': 'what you have like transaction table is the table you have this.', 'start': 5789.655, 'duration': 
3.302}, {'end': 5796.338, 'text': 'many columns be limited field by comma.', 'start': 5792.957, 'duration': 3.381}, {'end': 5801.101, 'text': "now let's say you have inserted 50, 000 tuples in this table.", 'start': 5796.338, 'duration': 4.763}, {'end': 5811.526, 'text': 'now I want to know the total revenue generated for each month, but high is taking too much time in processing this query.', 'start': 5801.101, 'duration': 10.425}, {'end': 5816.15, 'text': 'can you tell me what the solution you are going to provide?', 'start': 5811.526, 'duration': 4.624}, {'end': 5821.795, 'text': 'this scenario is actually a very good interview scenario, very good.', 'start': 5816.15, 'duration': 5.645}, {'end': 5824.538, 'text': 'Can I get more answers?', 'start': 5823.056, 'duration': 1.482}, {'end': 5828.021, 'text': 'Very good, can I get more answers??', 'start': 5825.659, 'duration': 2.362}, {'end': 5832.185, 'text': 'You will be partitioning this table.', 'start': 5828.722, 'duration': 3.463}, {'end': 5833.646, 'text': 'how you will be partitioning?', 'start': 5832.185, 'duration': 1.461}, {'end': 5837.79, 'text': 'You will be partitioning your table with month.', 'start': 5833.946, 'duration': 3.844}, {'end': 5844.357, 'text': 'So, basically, if you partition a table, you will improve your performance.', 'start': 5839.374, 'duration': 4.983}, {'end': 5846.418, 'text': 'So these are the simple steps.', 'start': 5844.697, 'duration': 1.721}, {'end': 5853.061, 'text': 'you can create a table partition by month, set these properties to true so that you can enable your partition,', 'start': 5846.418, 'duration': 6.643}, {'end': 5859.404, 'text': 'insert the data and then you can retrieve the data where your month is going to January.', 'start': 5853.061, 'duration': 6.343}, {'end': 5865.807, 'text': 'So while partition after partition the table you can improve the performance secondly.', 'start': 5859.784, 'duration': 6.023}, {'end': 5867.884, 'text': 'can I get an answer of this?', 'start': 5866.343, 'duration': 1.541}, {'end': 5873.447, 'text': 'what is dynamic partitioning and when is it used?', 'start': 5867.884, 'duration': 5.563}, {'end': 5874.968, 'text': 'can I get this answer?', 'start': 5873.447, 'duration': 1.521}, {'end': 5879.531, 'text': 'what is dynamic partitioning and when is it used?', 'start': 5874.968, 'duration': 4.563}, {'end': 5882.813, 'text': 'that can be static partitioning, also in the right.', 'start': 5879.531, 'duration': 3.282}, {'end': 5885.895, 'text': 'so I want to know what is dynamic partitioning.', 'start': 5882.813, 'duration': 3.082}, {'end': 5892.138, 'text': 'very good, partition happens when loading the data into table.', 'start': 5885.895, 'duration': 6.243}, {'end': 5899.739, 'text': "right now I don't know that if I do a dynamic partitioning, where it is, which, how many partitions also it is going to play?", 'start': 5892.138, 'duration': 7.601}, {'end': 5906.761, 'text': 'so the value of your partition columns will be known only during your runtime,', 'start': 5899.739, 'duration': 7.022}, {'end': 5914.563, 'text': 'when you will be creating the partition that is called your dynamic partitioning okay.', 'start': 5906.761, 'duration': 7.802}, {'end': 5919.124, 'text': 'How high distribute the rows into bucket?', 'start': 5915.543, 'duration': 3.581}, {'end': 5920.985, 'text': 'can I get this answer?', 'start': 5919.124, 'duration': 1.861}, {'end': 5925.818, 'text': 'how high distributes the rows into bucket?', 'start': 5920.985, 'duration': 4.833}, {'end': 
5928.66, 'text': 'very good, hash algorithm.', 'start': 5925.818, 'duration': 2.842}, {'end': 5933.904, 'text': 'okay, it uses the hash algorithm to understand this part.', 'start': 5928.66, 'duration': 5.244}, {'end': 5941.269, 'text': 'if you look, what we are doing here is now no, you will be using clustered by, but basically are in the how internally.', 'start': 5933.904, 'duration': 7.365}], 'summary': 'Use sort by for huge data sets, partition for improved performance, and dynamic partitioning for runtime-defined values.', 'duration': 36.133, 'max_score': 5671.098, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg5671098.jpg'}, {'end': 5859.404, 'src': 'embed', 'start': 5828.722, 'weight': 5, 'content': [{'end': 5832.185, 'text': 'You will be partitioning this table.', 'start': 5828.722, 'duration': 3.463}, {'end': 5833.646, 'text': 'how you will be partitioning?', 'start': 5832.185, 'duration': 1.461}, {'end': 5837.79, 'text': 'You will be partitioning your table with month.', 'start': 5833.946, 'duration': 3.844}, {'end': 5844.357, 'text': 'So, basically, if you partition a table, you will improve your performance.', 'start': 5839.374, 'duration': 4.983}, {'end': 5846.418, 'text': 'So these are the simple steps.', 'start': 5844.697, 'duration': 1.721}, {'end': 5853.061, 'text': 'you can create a table partition by month, set these properties to true so that you can enable your partition,', 'start': 5846.418, 'duration': 6.643}, {'end': 5859.404, 'text': 'insert the data and then you can retrieve the data where your month is going to January.', 'start': 5853.061, 'duration': 6.343}], 'summary': 'Partition table by month to improve performance.', 'duration': 30.682, 'max_score': 5828.722, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg5828722.jpg'}, {'end': 5933.904, 'src': 'embed', 'start': 5882.813, 'weight': 6, 'content': [{'end': 5885.895, 'text': 'so I want to know what is dynamic partitioning.', 'start': 5882.813, 'duration': 3.082}, {'end': 5892.138, 'text': 'very good, partition happens when loading the data into table.', 'start': 5885.895, 'duration': 6.243}, {'end': 5899.739, 'text': "right now I don't know that if I do a dynamic partitioning, where it is, which, how many partitions also it is going to play?", 'start': 5892.138, 'duration': 7.601}, {'end': 5906.761, 'text': 'so the value of your partition columns will be known only during your runtime,', 'start': 5899.739, 'duration': 7.022}, {'end': 5914.563, 'text': 'when you will be creating the partition that is called your dynamic partitioning okay.', 'start': 5906.761, 'duration': 7.802}, {'end': 5919.124, 'text': 'How high distribute the rows into bucket?', 'start': 5915.543, 'duration': 3.581}, {'end': 5920.985, 'text': 'can I get this answer?', 'start': 5919.124, 'duration': 1.861}, {'end': 5925.818, 'text': 'how high distributes the rows into bucket?', 'start': 5920.985, 'duration': 4.833}, {'end': 5928.66, 'text': 'very good, hash algorithm.', 'start': 5925.818, 'duration': 2.842}, {'end': 5933.904, 'text': 'okay, it uses the hash algorithm to understand this part.', 'start': 5928.66, 'duration': 5.244}], 'summary': 'Dynamic partitioning creates partitions at runtime using hash algorithm.', 'duration': 51.091, 'max_score': 5882.813, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg5882813.jpg'}, {'end': 6095.048, 'src': 'embed', 'start': 
6063.939, 'weight': 8, 'content': [{'end': 6077.111, 'text': 'I have lot of small CSV files present in input direct trainers and I want to create a single table, high table, corresponding to these files.', 'start': 6063.939, 'duration': 13.172}, {'end': 6080.154, 'text': 'the data in these files are in this format.', 'start': 6077.111, 'duration': 3.043}, {'end': 6086.42, 'text': 'now, as we know, Hadoop performance degrades when we use lot of small files.', 'start': 6080.154, 'duration': 6.266}, {'end': 6088.762, 'text': 'so how you will solve this problem?', 'start': 6086.42, 'duration': 2.342}, {'end': 6091.724, 'text': 'can anybody give me a simple answer of this?', 'start': 6089.622, 'duration': 2.102}, {'end': 6093.126, 'text': 'should be easy.', 'start': 6091.724, 'duration': 1.402}, {'end': 6095.048, 'text': 'you have multiple small files now.', 'start': 6093.126, 'duration': 1.922}], 'summary': 'To improve hadoop performance with multiple small csv files, i want to merge them into a single, high table.', 'duration': 31.109, 'max_score': 6063.939, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg6063939.jpg'}, {'end': 6279.882, 'src': 'embed', 'start': 6255.318, 'weight': 9, 'content': [{'end': 6261.76, 'text': 'but when you do dump right, when you do dump, then only the execution start right because of lazy evaluation,', 'start': 6255.318, 'duration': 6.442}, {'end': 6269.556, 'text': 'then your logical plan kind of get converted to like a physical plan, means it start getting executed.', 'start': 6262.492, 'duration': 7.064}, {'end': 6279.882, 'text': "Now let's say, if you have given the wrong file path right in logical plan, it will not give you any error because there is no syntax error.", 'start': 6270.016, 'duration': 9.866}], 'summary': 'Lazy evaluation in dumping data starts execution, no syntax error for wrong file path in logical plan.', 'duration': 24.564, 'max_score': 6255.318, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg6255318.jpg'}, {'end': 6499.407, 'src': 'embed', 'start': 6466.915, 'weight': 10, 'content': [{'end': 6469.676, 'text': 'Now. so there are two modes available.', 'start': 6466.915, 'duration': 2.761}, {'end': 6472.917, 'text': 'one is MapReduce mode, one is Local mode.', 'start': 6469.676, 'duration': 3.241}, {'end': 6479.899, 'text': 'so when you go with pig in MapReduce mode, so when you just type pig right, it takes you to the Gruntion.', 'start': 6472.917, 'duration': 6.982}, {'end': 6488.201, 'text': 'by default, it takes you to the MapReduce mode, which basically also states that that basically, if we are going with MapReduce mode,', 'start': 6479.899, 'duration': 8.302}, {'end': 6493.923, 'text': 'you are assessing your HDFS, while if you are using Local mode, what you need to do?', 'start': 6488.201, 'duration': 5.722}, {'end': 6499.407, 'text': 'you need to go like this hyphen x, local.', 'start': 6493.923, 'duration': 5.484}], 'summary': 'Pig has two modes: mapreduce and local. 
mapreduce mode assesses hdfs, while local mode uses -x local.', 'duration': 32.492, 'max_score': 6466.915, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg6466915.jpg'}, {'end': 6689.262, 'src': 'embed', 'start': 6662.525, 'weight': 11, 'content': [{'end': 6665.946, 'text': "so in HBase you don't have concepts of name, node, you're keeping data.", 'start': 6662.525, 'duration': 3.421}, {'end': 6671.61, 'text': 'so for that to do this part, like zookeeper, play a major role here.', 'start': 6666.406, 'duration': 5.204}, {'end': 6677.814, 'text': 'so zookeeper is kind of going to act like a coordinator inside your HBase environment.', 'start': 6671.61, 'duration': 6.204}, {'end': 6683.478, 'text': "okay, it will help you to coordinate all the things, because here we don't have name, node, data, nodes, you know so,", 'start': 6677.814, 'duration': 5.664}, {'end': 6685.38, 'text': 'and there is no yarn basically here.', 'start': 6683.478, 'duration': 1.902}, {'end': 6689.262, 'text': 'so you can treat it like basically just like how yarn was handling things there.', 'start': 6685.38, 'duration': 3.882}], 'summary': 'In hbase, zookeeper acts as a coordinator, handling data without name or node concepts.', 'duration': 26.737, 'max_score': 6662.525, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg6662525.jpg'}], 'start': 5180.242, 'title': 'Improving hive performance and storage', 'summary': 'Discusses the benefits of small files for system performance, default storage location for hive tables, reasons for using rdbms for hive metastore, optimization techniques for hive queries, and the difference between logical and physical plans in hadoop.', 'chapters': [{'end': 5242.479, 'start': 5180.242, 'title': 'Improving performance with small files and hive storage', 'summary': 'Discusses how having multiple small files can improve system performance and explains the default storage location for hive tables in /user/hive/warehouse, which can be modified in hive-site.xml.', 'duration': 62.237, 'highlights': ['The default location for storing Hive tables is /user/hive/warehouse, and this setting can be updated in hive-site.xml.', 'Having multiple small files can improve system performance.', 'Updating the storage location for Hive tables can be done in hive-site.xml.']}, {'end': 5671.098, 'start': 5242.479, 'title': 'Hdfs vs rdbms for hive metastore', 'summary': 'Discusses the reasons behind storing hive metastore in rdbms instead of hdfs, emphasizing the inability of hdfs to support crud operations and explaining the differences between external and managed tables in hive.', 'duration': 428.619, 'highlights': ['The major reason for storing Hive Metastore in RDBMS instead of HDFS is the inability of HDFS to support CRUD operations, including row-level insertions and deletions, whereas RDBMS provides faster syncing time and better support for CRUD operations.', 'The differences between external and managed tables in Hive are highlighted, with managed tables (internal tables) deleting both the entry in the Metastore and the data file upon table deletion, while external tables retain the data file even after the table is deleted from the Metastore.', "The chapter also touches on the use cases of 'sort by' and 'order by' in Hive, explaining when to use 'sort by' instead of 'order by' and clarifying the purpose of using only one mapper in certain scenarios."]}, {'end': 6210.508, 'start': 5671.098, 
'title': 'Hive query optimization', 'summary': 'The chapter covers optimization techniques for Hive queries, including using sort by for large datasets, partitioning for performance improvement, dynamic partitioning, distributing rows into buckets using a hash algorithm, consuming CSV files into the Hive warehouse using a built-in SerDe, solving performance degradation due to small files using sequence files, and the advantages and disadvantages of using a SerDe.', 'duration': 539.41, 'highlights': ["When dealing with large datasets, it's advisable to use sort by instead of order by to improve query performance, as sort by performs sorting on multiple reducers while order by does it on one reducer.", 'Partitioning a table by month can improve query performance and enable efficient data retrieval.', 'Dynamic partitioning occurs when the value of partition columns is known only during runtime, and it can be used to optimize table partitioning.', 'Hive distributes rows into buckets using a hash algorithm, and the modulo operation is involved in determining the bucket for data distribution.', 'Converting small CSV files into a sequence file can improve performance by reducing the degradation caused by using numerous small files in Hadoop.', 'Using a SerDe for data serialization in Hive offers advantages such as data compression and easier data transfer over the network, but it can also impact performance during deserialization.']}, {'end': 6968.89, 'start': 6210.508, 'title': 'Logical vs physical plan in Hadoop', 'summary': 'Discusses the difference between logical and physical plans in Hadoop, including lazy evaluation, execution modes in Pig, and components of HBase, such as regions, Zookeeper, and the bloom filter.', 'duration': 758.382, 'highlights': ['Lazy evaluation in Hadoop: explains that lazy evaluation causes the logical plan to convert to a physical plan when executing a dump command, leading to the execution of MapReduce jobs.', 'Execution modes in Pig: discusses the two execution modes in Pig - MapReduce mode for accessing HDFS and Local mode for accessing data from the local file system.', 'Components of HBase: details the components of HBase, including regions where data is stored, the role of Zookeeper as a coordinator, and the use of the bloom filter to improve performance and throughput.']}], 'duration': 1788.648, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg5180242.jpg', 'highlights': ['Updating the storage location for Hive tables can be done in hive-site.xml.', 'Combining multiple small files can improve system performance.', 'The default location for storing Hive tables is /user/hive/warehouse, and this setting can be updated in hive-site.xml.', 'The major reason for storing the Hive Metastore in an RDBMS instead of HDFS is the inability of HDFS to support CRUD operations, including row-level insertions and deletions, whereas an RDBMS provides faster syncing time and better support for CRUD operations.', "When dealing with large datasets, it's advisable to use sort by instead of order by to improve query performance, as sort by performs sorting on multiple reducers while order by does it on one reducer.", 'Partitioning a table by month can improve query performance and enable efficient data retrieval (see the sketch below).', 'Dynamic partitioning occurs when the value of partition columns is known only during runtime, and it can be used to optimize table partitioning.', 'Hive distributes rows into buckets using a hash algorithm, and the modulo operation is involved in determining the bucket for data distribution.', 'Converting small CSV files into a sequence file can improve performance by reducing the degradation caused by using numerous small files in Hadoop.', 'Lazy evaluation in Hadoop causes the logical plan to convert to a physical plan when executing a dump command, leading to the execution of MapReduce jobs.', 'Execution modes in Pig: MapReduce mode for accessing HDFS and Local mode for accessing data from the local file system.', 'Components of HBase: regions where data is stored, the role of Zookeeper as a coordinator, and the use of the bloom filter to improve performance and throughput.']}
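The partition-by-month scenario and the dynamic-partitioning answer recapped above translate directly into a few Hive statements. The sketch below assumes HiveServer2 is reachable over JDBC; the host, table, and column names (txn_by_month, txn_staging, cust_id, and so on) are hypothetical and only illustrate the idea.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HivePartitioningSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port and credentials are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Enable dynamic partitioning: partition values come from the data
            // at load time instead of being hard-coded per INSERT.
            stmt.execute("SET hive.exec.dynamic.partition=true");
            stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict");
            stmt.execute("SET hive.enforce.bucketing=true");

            // Month-partitioned, bucketed version of the transaction table:
            // one directory per month, sub-divided by a hash of cust_id.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS txn_by_month ("
                + " cust_id INT, amount DOUBLE, txn_date STRING)"
                + " PARTITIONED BY (txn_month STRING)"
                + " CLUSTERED BY (cust_id) INTO 8 BUCKETS"
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // Dynamic-partition insert: the last selected column feeds txn_month.
            stmt.execute(
                "INSERT OVERWRITE TABLE txn_by_month PARTITION (txn_month)"
                + " SELECT cust_id, amount, txn_date, substr(txn_date, 1, 7)"
                + " FROM txn_staging");

            // A monthly revenue query now scans only the partition it needs.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT sum(amount) FROM txn_by_month WHERE txn_month = '2017-01'")) {
                if (rs.next()) {
                    System.out.println("January revenue: " + rs.getDouble(1));
                }
            }
        }
    }
}
```

Partitioning by month means the revenue query prunes to a single directory instead of scanning every row, which is the fix the scenario asks for; the CLUSTERED BY clause adds the bucketing (a hash of cust_id modulo the bucket count) mentioned in the same discussion.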
, {'end': 8245.617, 'segs': [{'end': 7022.475, 'src': 'embed', 'start': 6968.89, 'weight': 2, 'content': [{'end': 6975.394, 'text': 'you have mentioned to run 8 parallel MapReduce tasks, but Sqoop is only running 4.', 'start': 6968.89, 'duration': 6.504}, {'end': 6978.276, 'text': 'what can be the reason?', 'start': 6975.394, 'duration': 2.882}, {'end': 6980.238, 'text': 'what can be the reason?', 'start': 6978.276, 'duration': 1.962}, {'end': 6981.999, 'text': 'very good, very good, Narasimha.', 'start': 6980.238, 'duration': 1.761}, {'end': 6989.322, 'text': 'Yes, because maybe your number of cores is not allowing you to run it.', 'start': 6982.777, 'duration': 6.545}, {'end': 6991.403, 'text': 'maybe you have a smaller number of cores itself.', 'start': 6989.322, 'duration': 2.081}, {'end': 6995.406, 'text': 'in that case, Sqoop will not be able to take you up to that level of parallelism.', 'start': 6991.403, 'duration': 4.003}, {'end': 6997.648, 'text': 'it will only use a smaller number of cores.', 'start': 6995.406, 'duration': 2.242}, {'end': 7002.932, 'text': 'so basically, if your cores are fewer, this is bound to happen.', 'start': 6997.648, 'duration': 5.284}, {'end': 7007.355, 'text': 'Give a Sqoop command to show all the databases in the MySQL server.', 'start': 7002.952, 'duration': 4.403}, {'end': 7011.398, 'text': 'can you give a Sqoop command to show all the databases in the MySQL server?', 'start': 7007.355, 'duration': 4.043}, {'end': 7014.265, 'text': 'it should be simple.', 'start': 7012.463, 'duration': 1.802}, {'end': 7016.147, 'text': 'it is sqoop.', 'start': 7014.265, 'duration': 1.882}, {'end': 7018.73, 'text': 'list-databases, not show databases.', 'start': 7016.147, 'duration': 2.583}, {'end': 7019.631, 'text': 'list-databases.', 'start': 7018.73, 'duration': 0.901}, {'end': 7022.475, 'text': 'okay, followed by --connect.', 'start': 7019.631, 'duration': 2.844}], 'summary': 'Running only 4 parallel MapReduce tasks due to limited cores and resources.', 'duration': 53.585, 'max_score': 6968.89, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg6968890.jpg'}, {'end': 7179.866, 'src': 'embed', 'start': 7158.321, 'weight': 4, 'content': [{'end': 7167.47, 'text': "Definitely the LMS will be good enough, but I'm pretty sure that you will have a hunger to know more; there can be no stopping for learning.", 'start': 7158.321, 'duration': 9.149}, {'end': 7171.394, 'text': 'So for that my suggestion would be to keep looking at GitHub accounts.', 'start': 7168.131, 'duration': 3.263}, {'end': 7177.34, 'text': 'just go to Google and type Hadoop projects on whatever topic you want, and you will see multiple GitHub links.', 'start': 7171.394, 'duration': 5.946}, {'end': 7179.866, 'text': 'When I say 
GitHub links, like a lot of people,', 'start': 7177.7, 'duration': 2.166}], 'summary': 'Suggests using github to find hadoop projects for continuous learning.', 'duration': 21.545, 'max_score': 7158.321, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg7158321.jpg'}, {'end': 7439.829, 'src': 'embed', 'start': 7411.735, 'weight': 0, 'content': [{'end': 7420.062, 'text': "as I said, it's a hot selling tech thing for outside world, because this is one which is now grabbing of the whole market.", 'start': 7411.735, 'duration': 8.327}, {'end': 7428.246, 'text': "in fact, remember the slide number three, where I showed you that it's 297% and that could just happen just in few years, right?", 'start': 7420.062, 'duration': 8.184}, {'end': 7431.987, 'text': 'so definitely the next step, if you want to look forward, should be a purchase path.', 'start': 7428.246, 'duration': 3.741}, {'end': 7439.829, 'text': 'that would be my suggestion and even like, there are very good trainers physically who teach that, so it will not be a problem.', 'start': 7431.987, 'duration': 7.842}], 'summary': 'Hot selling tech with 297% market growth, suggests purchase path.', 'duration': 28.094, 'max_score': 7411.735, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg7411735.jpg'}, {'end': 7637.351, 'src': 'embed', 'start': 7611.134, 'weight': 7, 'content': [{'end': 7617.518, 'text': 'I think I just answer this and so you need to start with Hadoop, then move forward to Apache Spark site.', 'start': 7611.134, 'duration': 6.384}, {'end': 7620.2, 'text': 'once you move to Apache Spark site, you will be all good.', 'start': 7617.518, 'duration': 2.682}, {'end': 7627.505, 'text': 'then, after that, if you want to still go deep, go with the topics like data sensor, which is definitely going to transform your failure,', 'start': 7620.2, 'duration': 7.305}, {'end': 7632.368, 'text': 'and this is the way you can become a good Hadoop party tech, along with some other good domain block knowledge,', 'start': 7627.505, 'duration': 4.863}, {'end': 7637.351, 'text': 'because these two together are going to currently very strong in the market.', 'start': 7632.688, 'duration': 4.663}], 'summary': 'Start with hadoop, move to apache spark, then dive into data sensor for strong hadoop and domain knowledge.', 'duration': 26.217, 'max_score': 7611.134, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg7611134.jpg'}, {'end': 7941.254, 'src': 'embed', 'start': 7904.936, 'weight': 6, 'content': [{'end': 7905.797, 'text': 'that would be great.', 'start': 7904.936, 'duration': 0.861}, {'end': 7911.381, 'text': "like Cassandra is also one of the force in Adorica and I can tell you it's one of the hot panning cake in Adorica.", 'start': 7905.797, 'duration': 5.584}, {'end': 7917.964, 'text': "So the reason it is not selling cake is because basically, there's a good demand in Cassandra, also these days,", 'start': 7912.341, 'duration': 5.623}, {'end': 7925.528, 'text': 'because there are a lot of companies who are migrating from HBase to Cassandra, and the reason is because Cassandra follows a SQL kind of syntax.', 'start': 7917.964, 'duration': 7.564}, {'end': 7933.992, 'text': 'Okay so, basically, cloud data certification that we carry is a good value in comparison to Horton block, I would say,', 'start': 7927.17, 'duration': 6.822}, {'end': 7941.254, 'text': "because this cloud data 
certification basically it's going to all kind of a hot cake in the market again.", 'start': 7933.992, 'duration': 7.262}], 'summary': 'Cassandra is in high demand with companies migrating from hbase. cloud data certification is also gaining value in the market.', 'duration': 36.318, 'max_score': 7904.936, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg7904936.jpg'}, {'end': 7978.459, 'src': 'embed', 'start': 7950.116, 'weight': 1, 'content': [{'end': 7953.997, 'text': "they're kind of mix match for the questions and it keeps a very good value.", 'start': 7950.116, 'duration': 3.881}, {'end': 7955.978, 'text': 'let me give an example for it.', 'start': 7953.997, 'duration': 1.981}, {'end': 7960.441, 'text': "let's say, when I hire someone right, when I get someone's CV,", 'start': 7956.358, 'duration': 4.083}, {'end': 7973.569, 'text': "and if I see that this person is clouded or certified the CCA 175 certified I 50% I make up my mind that I'm going to hire this person so you can be told the importance of this certification right.", 'start': 7960.441, 'duration': 13.128}, {'end': 7978.459, 'text': 'because we have already kind of proved himself with that exam that you know.', 'start': 7973.956, 'duration': 4.503}], 'summary': 'Certification cca 175 increases hiring probability by 50%.', 'duration': 28.343, 'max_score': 7950.116, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg7950116.jpg'}, {'end': 8019.284, 'src': 'embed', 'start': 7992.944, 'weight': 5, 'content': [{'end': 7998.768, 'text': 'does Edureka provide some practice material for cracking cloud error certification, like sample questions?', 'start': 7992.944, 'duration': 5.824}, {'end': 8005.073, 'text': 'I mean, unfortunately there is no sample set available anywhere, but it depends on your trainer.', 'start': 7998.768, 'duration': 6.305}, {'end': 8008.215, 'text': 'so basically, when, if you are like, say, getting in my classes,', 'start': 8005.073, 'duration': 3.142}, {'end': 8013.699, 'text': 'you know I generally share some of the cloud error dumps as well because I am also Cloudera certified.', 'start': 8008.215, 'duration': 5.484}, {'end': 8019.284, 'text': "so I've prepared some of the Cloudera questions which my students kind of go through.", 'start': 8013.699, 'duration': 5.585}], 'summary': 'Edureka may provide cloudera practice questions for cloud error certification.', 'duration': 26.34, 'max_score': 7992.944, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg7992944.jpg'}], 'start': 6968.89, 'title': 'Big data technologies and career growth', 'summary': 'Covers the limitations of parallel map reduce tasks, mysql server commands, and project suggestions. it also discusses the 297% growth in the tech market, career advice, and the importance of cloud data certification, particularly cca 175, in the industry.', 'chapters': [{'end': 7411.735, 'start': 6968.89, 'title': 'Hadoop and mapreduce: scope, database commands, project suggestions', 'summary': 'Covers the limitations of running parallel map reduce tasks, database commands in mysql server, project suggestions, and the future of mapreduce and apache spark.', 'duration': 442.845, 'highlights': ['Running 8 parallel map reduce tasks may be limited to 4 due to the number of cores available, leading to reduced performance. 
Performance impact of running 4 parallel tasks instead of 8.', "Using 'scope' command to list all databases in MySQL server is demonstrated as 'scope list databases' and recommended for understanding databases. Demonstrated command for listing all databases in MySQL server.", 'Edureka provides multiple question blocks covering various types of questions for practice and understanding, recommended for enhancing understanding. Availability of multiple question blocks for practice and understanding.', 'Suggestion to work on real-time projects and explore GitHub for Hadoop projects to gain practical knowledge and understanding. Encouragement to explore GitHub for practical projects and knowledge.', 'Discussion on the future of MapReduce and Apache Spark, highlighting the shift in the market towards Apache Spark and the specific use cases for MapReduce. Insights into the market shift towards Apache Spark and specific use cases for MapReduce.']}, {'end': 7883.75, 'start': 7411.735, 'title': 'Tech market trends and career advice', 'summary': 'Discusses the growing tech market, with a 297% increase in a few years, suggesting a purchase path as the next step, followed by specific domain learning like statistics and data science, emphasizing the importance of hadoop, apache spark, and data sensor courses for career transformation.', 'duration': 472.015, 'highlights': ['The tech market has seen a 297% increase in just a few years, emphasizing the importance of staying updated with current trends. The speaker highlighted the significant growth in the tech market, with a 297% increase over a few years, indicating the importance of staying updated with current trends to remain competitive in the industry.', 'Suggests a purchase path as the next step and recommends specific domain learning like statistics and data science, underlining their transformative impact on careers. The speaker recommends a purchase path as the next step and suggests specific domain learning like statistics and data science, emphasizing their transformative impact on careers.', 'Emphasizes the importance of Hadoop, Apache Spark, and data sensor courses for career transformation, highlighting their strong position in the job market. The speaker emphasizes the importance of Hadoop, Apache Spark, and data sensor courses for career transformation, highlighting their strong position in the job market.']}, {'end': 8245.617, 'start': 7883.75, 'title': 'Importance of cloud data certification', 'summary': 'Emphasizes the value of cloud data certification, particularly cca 175, in the market, highlighting its impact on hiring decisions and the lack of available sample questions. it also discusses the relevance of hadoop and apache spark, as well as the demand for cassandra in the industry.', 'duration': 361.867, 'highlights': ['Cloud Data Certification, particularly CCA 175, holds significant value in the market, influencing hiring decisions with a 50% chance of being hired upon certification. The speaker emphasizes the impact of Cloud Data Certification on hiring decisions, stating that upon seeing a CCA 175 certification on a CV, there is a 50% chance of making up the mind to hire the individual without interviewing them.', 'The lack of available sample questions for Cloud Data Certification is mentioned, with the provision of practice material being dependent on the trainer. 
The absence of sample questions for Cloud Data Certification is highlighted, with the provision of practice material depending on the individual trainer, and the speaker shares that they personally provide such material to their students.', 'The growing demand for Cassandra is discussed, particularly in the context of companies migrating from HBase to Cassandra due to its SQL-like syntax. The conversation highlights the increasing demand for Cassandra, with many companies migrating from HBase to Cassandra, attributing it to the SQL-like syntax of Cassandra.', 'The relevance of Hadoop and Apache Spark is emphasized, with a suggestion to refer to the official documentation for learning about Spark pair RDD. The importance of Hadoop and Apache Spark is emphasized, with a recommendation to refer to the official documentation for learning about Spark pair RDD.']}], 'duration': 1276.727, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LgSiVWjTIUg/pics/LgSiVWjTIUg6968890.jpg', 'highlights': ['297% growth in the tech market, emphasizing the importance of staying updated with current trends.', 'CCA 175 certification on a CV increases the chance of being hired by 50%.', 'Running 8 parallel map reduce tasks may be limited to 4 due to the number of cores available, leading to reduced performance.', "Using 'scope' command to list all databases in MySQL server is demonstrated as 'scope list databases' and recommended for understanding databases.", 'Suggestion to work on real-time projects and explore GitHub for Hadoop projects to gain practical knowledge and understanding.', 'The lack of available sample questions for Cloud Data Certification is mentioned, with the provision of practice material being dependent on the trainer.', 'The growing demand for Cassandra is discussed, particularly in the context of companies migrating from HBase to Cassandra due to its SQL-like syntax.', 'Emphasizes the importance of Hadoop, Apache Spark, and data sensor courses for career transformation, highlighting their strong position in the job market.']}], 'highlights': ['Job market for Hadoop architects, administrators, and developers increased by 297%, 245%, and 247% from 2012 to 2017.', 'CCA 175 certification on a CV increases the chance of being hired by 50%.', 'The session stresses the significance of Hadoop knowledge for individuals at varying levels of experience.', 'Understanding the 5 Vs of big data is important, with examples from companies like Facebook and Twitter.', 'Emphasizing the importance of understanding at least the 4 major components of big data for interviews and decision-making.', "Apache Spark's rising popularity resulted in notable migration by companies, creating increased job opportunities.", "Hadoop's capability to handle diverse data types and enable distributed storage and processing makes it suitable for large-scale parallel processing.", 'The introduction of semi-structured data stemmed from the necessity to classify data displaying traits of both structured and unstructured data.', 'The distributed nature of Hadoop contrasts with the single-machine operation of RDBMS, impacting computation speed and scalability.', 'Hadoop load balancer is another method used to check under-replication or over-replication of data blocks', 'The process of spilling in Map Reduce provides insights into the internal workings of the system, demonstrating how the mapper output is managed and stored.', 'Performance Enhancement with Combined File Input Format packages small files 
together, reducing execution time and improving job performance when dealing with many small files.', 'Updating the storage location for Hive tables can be done in hive-site.xml.', 'Having multiple small files can improve system performance.', 'The default location for storing Hive tables is /user/hive/warehouse, and this setting can be updated in hive-site.xml.', 'The major reason for storing Hive Metastore in RDBMS instead of HDFS is the inability of HDFS to support CRUD operations, including row-level insertions and deletions, whereas RDBMS provides faster syncing time and better support for CRUD operations.', "When dealing with large datasets, it's advisable to use sort by instead of order by to improve query performance, as sort by performs sorting on multiple reducers while order by does it on one reducer.", 'Partitioning a table by month can improve query performance and enable efficient data retrieval.', 'Dynamic partitioning occurs when the value of partition columns is known only during runtime, and it can be used to optimize table partitioning.', 'Hive distributes rows into buckets using a hash algorithm, and the modulo operation is involved in determining the bucket for data distribution.', 'Converting small CSV files into a sequence file can improve performance by reducing the degradation caused by using numerous small files in Hadoop.', 'Lazy evaluation in Hadoop causes the logical plan to convert to a physical plan when executing a dump command, leading to the execution of MapReduce jobs.', 'Components of HBase Details the components of HBase, including regions where data is stored, the role of Zookeeper as a coordinator, and the use of bloom filter to improve performance and throughput.']}
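The small-files point recurs throughout: many tiny CSV files degrade Hadoop performance, and the suggested fix is to pack them into a single SequenceFile (or to combine them at the input-format level). Below is a minimal packing sketch; the /data/small_csv and /data/packed.seq paths and the filename-as-key layout are assumptions made for illustration, not part of the original session.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Packs many small text files into one SequenceFile (key = file name, value = one line). */
public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/small_csv"); // hypothetical input directory
        Path packed = new Path("/data/packed.seq");  // hypothetical output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue;
                }
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath()), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Key = source file name, value = one CSV record.
                        writer.append(new Text(status.getPath().getName()), new Text(line));
                    }
                }
            }
        }
    }
}
```

A Hive table declared STORED AS SEQUENCEFILE can then be laid over the packed output (Hive treats each value as one row), so queries read one large file instead of thousands of small ones.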