title
Hadoop Architecture | HDFS Architecture | HDFS Tutorial | Hadoop Tutorial | Edureka
description
🔥 Edureka Hadoop Training: https://www.edureka.co/big-data-hadoop-training-certification
Check our Hadoop Architecture blog here: https://goo.gl/I6DKaf
Check our complete Hadoop playlist here: https://goo.gl/ExJdZs
This Edureka Hadoop Architecture Tutorial will help you understand the architecture of Apache Hadoop in detail. Below are the topics covered in this Hadoop Architecture Tutorial:
1) Hadoop Components
2) DFS – Distributed File System
3) HDFS Services
4) Blocks in Hadoop
5) Block Replication
6) Rack Awareness
7) HDFS Architecture
8) HDFS Read/Write Mechanisms
9) Hadoop HDFS Commands
Subscribe to our channel to get video updates. Hit the subscribe button above.
--------------------Edureka Big Data Training and Certifications------------------------
🔵 Edureka Hadoop Training: http://bit.ly/2YBlw29
🔵 Edureka Spark Training: http://bit.ly/2PeHvc9
🔵 Edureka Kafka Training: http://bit.ly/34e7Riy
🔵 Edureka Cassandra Training: http://bit.ly/2E9AK54
🔵 Edureka Talend Training: http://bit.ly/2YzYIjg
🔵 Edureka Hadoop Administration Training: http://bit.ly/2YE8Nf9
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
#edureka #edurekaHadoop #HadoopArchitecture #HDFSArchitecture #HDFSReadWrite #HadoopCommands #HDFSCommands
How it Works?
1. This is a 5-week, instructor-led online course with 40 hours of assignments and 30 hours of project work.
2. We provide 24x7 one-on-one LIVE technical support to help you with any problems you might face or any clarifications you may require during the course.
3. At the end of the training you will take a 2-hour LIVE practical exam, based on which we will provide you with a Grade and a Verifiable Certificate!
- - - - - - - - - - - - - -
About the Course
Edureka’s Big Data and Hadoop online training is designed to help you become a top Hadoop developer. During this course, our expert Hadoop instructors will help you:
1. Master the concepts of HDFS and MapReduce framework
2. Understand Hadoop 2.x Architecture
3. Set up a Hadoop cluster and write complex MapReduce programs
4. Learn data loading techniques using Sqoop and Flume
5. Perform data analytics using Pig, Hive and YARN
6. Implement HBase and MapReduce integration
7. Implement Advanced Usage and Indexing
8. Schedule jobs using Oozie
9. Implement best practices for Hadoop development
10. Work on a real life Project on Big Data Analytics
11. Understand Spark and its Ecosystem
12. Learn how to work with RDDs in Spark
- - - - - - - - - - - - - -
Who should go for this course?
If you belong to any of the following groups, knowledge of Big Data and Hadoop is crucial for progressing in your career:
1. Analytics professionals
2. BI /ETL/DW professionals
3. Project managers
4. Testing professionals
5. Mainframe professionals
6. Software developers and architects
7. Recent graduates passionate about building a successful career in Big Data
- - - - - - - - - - - - - -
Why Learn Hadoop?
Big Data! A Worldwide Problem?
According to Wikipedia, "Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." In simpler terms, Big Data is the term for the large volumes of data that organizations store and process. However, it is becoming very difficult for companies to store, retrieve and process this ever-increasing data. If a company manages its data well, nothing can stop it from becoming the next BIG success!
The problem lies in using traditional systems to store enormous amounts of data. Though these systems were a success a few years ago, with the increasing amount and complexity of data they are quickly becoming obsolete. The good news is Hadoop, which is nothing less than a panacea for companies working with BIG DATA in a variety of applications, and which has become integral to storing, handling, evaluating and retrieving hundreds of terabytes, and even petabytes, of data.
For more information, please write back to us at sales@edureka.in or call us at IND: 9606058406 / US: 18338555775 (toll-free).
Customer Review:
Michael Harkins, System Architect, Hortonworks says: “The courses are top rate. The best part is live instruction, with playback. But my favorite feature is viewing a previous class. Also, they are always there to answer questions, and prompt when you open an issue if you are having any trouble. Added bonus ~ you get lifetime access to the course you took!!! Edureka lets you go back later, when your boss says "I want this ASAP!" ~ This is the killer education app... I've taken two courses, and I'm taking two more.”
detail
{'title': 'Hadoop Architecture | HDFS Architecture | HDFS Tutorial | Hadoop Tutorial | Edureka', 'heatmap': [{'end': 525.1, 'start': 419.542, 'weight': 0.82}, {'end': 2168.83, 'start': 2090, 'weight': 0.834}], 'summary': "Covers hadoop architecture, hdfs, and big data processing, showcasing practical examples of reducing data processing time from 43 minutes to 4.3 minutes using multiple machines, and emphasizes on hadoop's fault tolerance mechanism involving replicating each block three times, resulting in a default replication factor of three.", 'chapters': [{'end': 276.483, 'segs': [{'end': 73.354, 'src': 'embed', 'start': 0.389, 'weight': 0, 'content': [{'end': 2.851, 'text': 'A very warm welcome to each and everyone present today.', 'start': 0.389, 'duration': 2.462}, {'end': 6.914, 'text': 'My name is Vineet and I am going to take a session on Hadoop Architecture.', 'start': 3.211, 'duration': 3.703}, {'end': 17.882, 'text': 'We have Ryan in the session, Ashish, Saurav, Subham, Reshma, Akash, Adam.', 'start': 9.976, 'duration': 7.906}, {'end': 19.623, 'text': 'Thank you all for joining in.', 'start': 18.522, 'duration': 1.101}, {'end': 23.166, 'text': 'There are many more who are joining in at the moment.', 'start': 19.843, 'duration': 3.323}, {'end': 33.503, 'text': "So we'll not be waiting for any more people, we'll just move on to the very first slide that is the agenda for today's session.", 'start': 24.278, 'duration': 9.225}, {'end': 39.267, 'text': "In today's session we'll be discussing on the components of Hadoop.", 'start': 35.665, 'duration': 3.602}, {'end': 43.649, 'text': "we'll understand what is a distributed file system and why do we need it,", 'start': 39.267, 'duration': 4.382}, {'end': 48.272, 'text': 'why is it so important and why Hadoop has implemented a distributed file system.', 'start': 43.649, 'duration': 4.623}, {'end': 52.379, 'text': "We'll understand the various services that are present in Hadoop.", 'start': 49.397, 'duration': 2.982}, {'end': 57.343, 'text': "We'll understand what is blocks in Hadoop and what is a replication factor in Hadoop.", 'start': 52.88, 'duration': 4.463}, {'end': 67.63, 'text': "We'll understand the concept of rack awareness and we'll understand what is the architecture behind HDFS that is Hadoop distributed file system.", 'start': 58.063, 'duration': 9.567}, {'end': 73.354, 'text': "At the end of the session, we'll understand what is the read and write mechanism in HDFS.", 'start': 68.351, 'duration': 5.003}], 'summary': 'Vineet discusses hadoop architecture, covering components, distributed file system, services, blocks, replication factor, rack awareness, and hdfs architecture.', 'duration': 72.965, 'max_score': 0.389, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE389.jpg'}, {'end': 192.168, 'src': 'embed', 'start': 118.899, 'weight': 1, 'content': [{'end': 125.623, 'text': 'Similarly, even though we are, if we are able to store a part of the big data, processing that big data took years.', 'start': 118.899, 'duration': 6.724}, {'end': 133.909, 'text': 'This result that you wanted in minutes was received in weeks or months, right? 
So the value of that result was lost.', 'start': 126.204, 'duration': 7.705}, {'end': 140.66, 'text': "So starting with this problem, we'll just move on and see how did we cope up with this problem.", 'start': 135.299, 'duration': 5.361}, {'end': 144.801, 'text': 'As you all know, Hadoop solved this big data problem.', 'start': 141.86, 'duration': 2.941}, {'end': 151.383, 'text': 'That is using Hadoop HDFS, the storing problem of big data was resolved and it was no more a problem.', 'start': 144.861, 'duration': 6.522}, {'end': 155.884, 'text': 'Similarly, Hadoop MapReduce resolved the processing part of big data.', 'start': 152.243, 'duration': 3.641}, {'end': 161.765, 'text': 'That is, you got the capability of processing big data using Hadoop, MapReduce, right?', 'start': 156.104, 'duration': 5.661}, {'end': 167.473, 'text': 'Now, Hadoop essentially is a distributed file system, but why?', 'start': 162.771, 'duration': 4.702}, {'end': 170.455, 'text': 'Why is Hadoop a distributed file system?', 'start': 168.194, 'duration': 2.261}, {'end': 174.457, 'text': 'That is the most important question that you should be wondering about.', 'start': 170.875, 'duration': 3.582}, {'end': 182.561, 'text': "Let's try and understand what is a distributed file system, at the same time understand the advantages of distributed file system.", 'start': 175.577, 'duration': 6.984}, {'end': 188.506, 'text': 'Right? Guys, if you have any question, you can ask your question in between the session.', 'start': 183.444, 'duration': 5.062}, {'end': 192.168, 'text': "I'll take logical breaks and try and answer your questions in between.", 'start': 188.766, 'duration': 3.402}], 'summary': 'Hadoop solved big data storage and processing, delivering results in minutes instead of weeks or months.', 'duration': 73.269, 'max_score': 118.899, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE118899.jpg'}], 'start': 0.389, 'title': 'Hadoop architecture and big data challenges', 'summary': "Covers hadoop components, distributed file system, and architecture, along with challenges in storing and processing big data. it also highlights hadoop's role in resolving storage and processing problems, with a practical example of reducing data processing time from 43 minutes to 4.3 minutes using multiple machines.", 'chapters': [{'end': 73.354, 'start': 0.389, 'title': 'Hadoop architecture session', 'summary': 'Covers the components of hadoop, distributed file system, important services, blocks and replication factor, rack awareness, and hdfs architecture, with a focus on read and write mechanisms.', 'duration': 72.965, 'highlights': ['The session will cover components of Hadoop, distributed file system, services, blocks, replication factor, rack awareness, and HDFS architecture, as well as read and write mechanisms.', 'The presenter, Vineet, will discuss the importance of a distributed file system and its implementation in Hadoop.', 'Key participants in the session include Ryan, Ashish, Saurav, Subham, Reshma, Akash, and Adam.']}, {'end': 276.483, 'start': 75.171, 'title': 'Challenges and solutions in big data', 'summary': 'Discusses the challenges of storing and processing big data, highlighting the issues with existing systems and the time taken for processing, and introduces hadoop as the solution, showcasing its role in resolving the storage and processing problems. 
it also explains the concept of a distributed file system and its advantages, illustrated with a practical example of reducing data processing time from 43 minutes to 4.3 minutes using multiple machines.', 'duration': 201.312, 'highlights': ['The challenges of storing and processing big data are addressed, emphasizing the inability of existing systems to accommodate the data and the significant time delay in processing, with results taking weeks or months instead of minutes.', 'The introduction of Hadoop as a solution to big data problems is highlighted, specifically addressing the storage issue through Hadoop HDFS and the processing challenge through Hadoop MapReduce.', 'The concept of a distributed file system and its advantages are explained, emphasizing the ability to reduce data processing time from 43 minutes to 4.3 minutes using multiple machines.']}], 'duration': 276.094, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE389.jpg', 'highlights': ['The session will cover components of Hadoop, distributed file system, services, blocks, replication factor, rack awareness, and HDFS architecture, as well as read and write mechanisms.', 'The challenges of storing and processing big data are addressed, emphasizing the inability of existing systems to accommodate the data and the significant time delay in processing, with results taking weeks or months instead of minutes.', 'The presenter, Vineet, will discuss the importance of a distributed file system and its implementation in Hadoop.', 'The introduction of Hadoop as a solution to big data problems is highlighted, specifically addressing the storage issue through Hadoop HDFS and the processing challenge through Hadoop MapReduce.', 'The concept of a distributed file system and its advantages are explained, emphasizing the ability to reduce data processing time from 43 minutes to 4.3 minutes using multiple machines.', 'Key participants in the session include Ryan, Ashish, Saurav, Subham, Reshma, Akash, and Adam.']}, {'end': 589.033, 'segs': [{'end': 300.376, 'src': 'embed', 'start': 276.643, 'weight': 0, 'content': [{'end': 283.386, 'text': 'And that is why the time that was taken to process one TB of data reduced to 1 10th, that is 4.3 minutes.', 'start': 276.643, 'duration': 6.743}, {'end': 284.227, 'text': 'Very simple.', 'start': 283.586, 'duration': 0.641}, {'end': 294.792, 'text': 'Similarly, when we consider big data, that data gets divided into multiple chunks of data and we actually process that data separately,', 'start': 284.827, 'duration': 9.965}, {'end': 300.376, 'text': 'and that is why Hadoop has chosen a distributed file system over a centralized file system.', 'start': 294.792, 'duration': 5.584}], 'summary': "Processing time for 1 tb of data reduced to 4.3 minutes due to hadoop's distributed file system.", 'duration': 23.733, 'max_score': 276.643, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE276643.jpg'}, {'end': 377.564, 'src': 'embed', 'start': 347.132, 'weight': 2, 'content': [{'end': 351.353, 'text': "Now it's time we move on to the next slide, that is Hadoop components.", 'start': 347.132, 'duration': 4.221}, {'end': 356.335, 'text': 'As we already saw, Hadoop HDFS solved the problem of storing big data.', 'start': 351.553, 'duration': 4.782}, {'end': 366.122, 'text': 'Okay, so the very first component of Hadoop is Hadoop HDFS and the second part was processing big data which was solved by Hadoop 
MapReduce.', 'start': 357.54, 'duration': 8.582}, {'end': 371.603, 'text': 'So the two main components of Hadoop are Hadoop HDFS and Hadoop MapReduce.', 'start': 366.462, 'duration': 5.141}, {'end': 377.564, 'text': "In today's session we are going to focus on Hadoop HDFS that is Hadoop Distributed File System.", 'start': 372.083, 'duration': 5.481}], 'summary': 'Hadoop hdfs and mapreduce are key components for big data processing.', 'duration': 30.432, 'max_score': 347.132, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE347132.jpg'}, {'end': 442.646, 'src': 'embed', 'start': 419.542, 'weight': 3, 'content': [{'end': 426.887, 'text': 'Now HDFS as a whole has got two major demons or you can call them as processes or threads right?,', 'start': 419.542, 'duration': 7.345}, {'end': 432.291, 'text': 'Which are nothing but a Java process that is running within a JVM.', 'start': 426.967, 'duration': 5.324}, {'end': 438.144, 'text': 'okay?. So HDFS has got two main components or two main demons, that is, name node and data node.', 'start': 432.291, 'duration': 5.853}, {'end': 442.646, 'text': 'Name. node is a master daemon that runs on the master machine.', 'start': 439.064, 'duration': 3.582}], 'summary': 'Hdfs consists of two main components: name node and data node, each running as a java process within a jvm.', 'duration': 23.104, 'max_score': 419.542, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE419542.jpg'}, {'end': 525.1, 'src': 'heatmap', 'start': 419.542, 'weight': 0.82, 'content': [{'end': 426.887, 'text': 'Now HDFS as a whole has got two major demons or you can call them as processes or threads right?,', 'start': 419.542, 'duration': 7.345}, {'end': 432.291, 'text': 'Which are nothing but a Java process that is running within a JVM.', 'start': 426.967, 'duration': 5.324}, {'end': 438.144, 'text': 'okay?. So HDFS has got two main components or two main demons, that is, name node and data node.', 'start': 432.291, 'duration': 5.853}, {'end': 442.646, 'text': 'Name. node is a master daemon that runs on the master machine.', 'start': 439.064, 'duration': 3.582}, {'end': 449.489, 'text': 'that is a high-end machine essentially, and data node is a slave machine which runs on a commodity hardware.', 'start': 442.646, 'duration': 6.843}, {'end': 454.272, 'text': 'They can be more than one data node as slave machines are more than a master machine.', 'start': 449.85, 'duration': 4.422}, {'end': 460.295, 'text': 'Okay, so we always have one name node and multiple data nodes in running in slave machines.', 'start': 455.132, 'duration': 5.163}, {'end': 466.806, 'text': 'At the same time, we have yarn on the other hand, which has again two main daemons.', 'start': 461.863, 'duration': 4.943}, {'end': 469.788, 'text': 'One is the resource manager, which runs on the master machine.', 'start': 467.086, 'duration': 2.702}, {'end': 477.252, 'text': 'And we have node manager, which runs on the slave machine, just like the data nodes, okay? 
So every slave machine has got two daemons.', 'start': 470.108, 'duration': 7.144}, {'end': 479.994, 'text': 'One is the data node and the other is the node manager.', 'start': 477.432, 'duration': 2.562}, {'end': 487.137, 'text': 'as well as at the same time, the master node has got a name node running and a resource manager running.', 'start': 481.034, 'duration': 6.103}, {'end': 498.203, 'text': 'Name node is responsible for managing the data on the Hadoop distributed file system and the resource manager responsible for executing processing task over this stored data.', 'start': 487.637, 'duration': 10.566}, {'end': 502.55, 'text': 'Guys, are you clear with this? Looks like we have some questions.', 'start': 498.803, 'duration': 3.747}, {'end': 514.693, 'text': 'What is the difference between DFS and HDFS? Previously as well, there were more distributed file systems like Oracle had distributed clusters.', 'start': 503.63, 'duration': 11.063}, {'end': 525.1, 'text': 'However, there was one basic difference that is essentially a part of the MapReduce framework itself wherein the data was going to the processing.', 'start': 515.674, 'duration': 9.426}], 'summary': 'Hdfs has 2 main demons: name node and data node. yarn also has 2 daemons: resource manager and node manager.', 'duration': 105.558, 'max_score': 419.542, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE419542.jpg'}, {'end': 571.572, 'src': 'embed', 'start': 539.218, 'weight': 4, 'content': [{'end': 542.459, 'text': "However, Hadoop doesn't follow this kind of ideology.", 'start': 539.218, 'duration': 3.241}, {'end': 548.661, 'text': 'What Hadoop says is, instead of bringing the data to a centralized machine or the master machine,', 'start': 542.939, 'duration': 5.722}, {'end': 553.483, 'text': 'we can send the code or the processing to the data where it is stored.', 'start': 548.661, 'duration': 4.822}, {'end': 556.924, 'text': 'This way, the network bandwidth is saved.', 'start': 554.083, 'duration': 2.841}, {'end': 564.028, 'text': 'You do not have to bring in the huge big data through the network channels into a centralized machine.', 'start': 557.884, 'duration': 6.144}, {'end': 565.609, 'text': 'That data can stay where it is.', 'start': 564.068, 'duration': 1.541}, {'end': 571.572, 'text': 'However, we can send a small logic, a small code to the data and process it where it is.', 'start': 565.949, 'duration': 5.623}], 'summary': "Hadoop's approach saves network bandwidth by processing data where it's stored.", 'duration': 32.354, 'max_score': 539.218, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE539218.jpg'}], 'start': 276.643, 'title': 'Big data processing with hadoop', 'summary': 'Explains how hadoop reduced processing time for 1 tb of data to 4.3 minutes, using distributed file system and commodity hardware, resulting in cost-effectiveness and efficiency. 
it also introduces hadoop components: hdfs for storage and yarn/mapreduce for processing, enabling storage and processing of big data, along with details on hdfs and yarn components.', 'chapters': [{'end': 346.372, 'start': 276.643, 'title': 'Big data processing with hadoop', 'summary': 'Explains how the processing time for one tb of data reduced to 4.3 minutes, utilizing a distributed file system and commodity hardware in hadoop, resulting in cost-effectiveness and improved efficiency.', 'duration': 69.729, 'highlights': ['The processing time for one TB of data reduced to 4.3 minutes with a distributed file system, showcasing significant efficiency gains.', 'Hadoop utilizes commodity hardware, such as 8GB RAM and one TB hard disk, for cost-effective processing and data storage.', 'Commodity hardware, like day-to-day machines, is employed as slave machines in a distributed file system, ensuring cost-effectiveness.']}, {'end': 417.798, 'start': 347.132, 'title': 'Hadoop components overview', 'summary': 'Introduces the main components of hadoop: hdfs for storage and yarn/mapreduce for processing, enabling the storage and processing of big data.', 'duration': 70.666, 'highlights': ['Hadoop consists of two main components: HDFS for storing big data and MapReduce for processing big data.', 'The architecture of Hadoop includes two wings: storage (HDFS) and processing (YARN and MapReduce).', 'Hadoop enables the storage of big data through HDFS and the processing of the same data through YARN and MapReduce.']}, {'end': 589.033, 'start': 419.542, 'title': 'Hdfs and yarn components', 'summary': "Explains the key components of hdfs and yarn, with hdfs consisting of a name node and multiple data nodes, and yarn comprising of a resource manager and node manager. it also highlights the difference between dfs and hdfs, emphasizing hadoop's approach of processing data at its storage location to save network bandwidth.", 'duration': 169.491, 'highlights': ['The HDFS has two main components: name node (master daemon) running on the master machine and data node (slave machine) running on commodity hardware. There is one name node and multiple data nodes in the system.', 'YARN consists of a resource manager (master daemon) running on the master machine and node manager (slave machine) similar to data nodes. 
Each slave machine has both data node and node manager, while the master node has a name node and a resource manager running.', "Hadoop's approach differs from traditional distributed file systems by processing data at its storage location instead of pulling it back to a centralized machine, thus saving network bandwidth and reducing the load on the centralized machine."]}], 'duration': 312.39, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE276643.jpg', 'highlights': ['Hadoop reduced processing time for 1 TB of data to 4.3 minutes with a distributed file system, showcasing significant efficiency gains.', 'Hadoop utilizes commodity hardware, such as 8GB RAM and one TB hard disk, for cost-effective processing and data storage.', 'Hadoop consists of two main components: HDFS for storing big data and MapReduce for processing big data.', 'The HDFS has two main components: name node (master daemon) running on the master machine and data node (slave machine) running on commodity hardware.', "Hadoop's approach differs from traditional distributed file systems by processing data at its storage location instead of pulling it back to a centralized machine, thus saving network bandwidth and reducing the load on the centralized machine."]}, {'end': 947.974, 'segs': [{'end': 663.42, 'src': 'embed', 'start': 636.417, 'weight': 0, 'content': [{'end': 640.523, 'text': 'hence it is important that we understand what is a name node and a data node,', 'start': 636.417, 'duration': 4.106}, {'end': 645.931, 'text': 'because these are the two main demons that actually runs your HDFS entirely.', 'start': 640.523, 'duration': 5.408}, {'end': 650.115, 'text': 'Just focus on the diagram on the right.', 'start': 647.813, 'duration': 2.302}, {'end': 657.922, 'text': 'Okay, as you can see, there is a centralized machine name node which is controlling various data nodes that are there,', 'start': 650.595, 'duration': 7.327}, {'end': 659.744, 'text': 'which is nothing but commodity hardware.', 'start': 657.922, 'duration': 1.822}, {'end': 661.325, 'text': "I've already explained that.", 'start': 659.824, 'duration': 1.501}, {'end': 663.42, 'text': 'in the earlier slides.', 'start': 662.54, 'duration': 0.88}], 'summary': 'Understanding the role of name node and data node in hdfs is crucial for managing the distributed file system.', 'duration': 27.003, 'max_score': 636.417, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE636417.jpg'}, {'end': 861.852, 'src': 'embed', 'start': 834.456, 'weight': 1, 'content': [{'end': 840.719, 'text': "Okay, so Hadoop cluster is nothing but a master-slave topology in which there's a master machine, as you can see on the top.", 'start': 834.456, 'duration': 6.263}, {'end': 843.8, 'text': 'that is Hadoop cluster, where Hadoop cluster is written.', 'start': 840.719, 'duration': 3.081}, {'end': 849.623, 'text': "In this master machine you'll have your name node and the resource manager running that is the master daemons.", 'start': 843.82, 'duration': 5.803}, {'end': 856.57, 'text': 'Now this master machine is connected to all the slave machines using the core switches.', 'start': 851.227, 'duration': 5.343}, {'end': 861.852, 'text': 'It is because these data nodes are actually stored in various racks, okay?', 'start': 856.85, 'duration': 5.002}], 'summary': 'Hadoop cluster is a master-slave topology with name node, resource manager, and data nodes stored in various racks.', 
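The daemons described above can be verified on a running cluster with the JDK's jps tool, which lists the Java processes (JVMs) on a machine. A minimal sketch, assuming the Hadoop services have already been started on the node being checked (jps also prints a process ID before each name, omitted here):

    # On the master machine, the master daemons show up:
    $ jps
    NameNode
    ResourceManager

    # On each slave machine, the slave daemons show up instead:
    $ jps
    DataNode
    NodeManager
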
'duration': 27.396, 'max_score': 834.456, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE834456.jpg'}, {'end': 947.974, 'src': 'embed', 'start': 923.103, 'weight': 2, 'content': [{'end': 932.89, 'text': 'Shubham says, do name node and resource manager reside on the same machine? Practically, in a production cluster, you will not find the same thing.', 'start': 923.103, 'duration': 9.787}, {'end': 939.492, 'text': 'The name node will be there on a different machine and will have a different server on which the resource manager is running.', 'start': 932.91, 'duration': 6.582}, {'end': 947.974, 'text': 'However, for practical purposes on your end, or if you want to do a POC on your end, that is nothing but proof of concept.', 'start': 940.132, 'duration': 7.842}], 'summary': 'In production clusters, the name node and resource manager are typically on different machines. however, for poc or testing, they can be on the same machine.', 'duration': 24.871, 'max_score': 923.103, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE923103.jpg'}], 'start': 589.454, 'title': 'Hadoop cluster processes and architecture', 'summary': 'Discusses the processes in a hadoop cluster, including the name node and data nodes, with a focus on the master-slave topology and distribution of slave machines.', 'chapters': [{'end': 947.974, 'start': 589.454, 'title': 'Hadoop cluster processes and architecture', 'summary': 'Discusses the processes running in a hadoop cluster, including the name node and data nodes, with a focus on the architecture, such as the master-slave topology and the distribution of slave machines.', 'duration': 358.52, 'highlights': ['The name node and data nodes are the main daemons running in a Hadoop Distributed File System (HDFS), with the name node storing metadata of the data on data nodes and data nodes serving read and write requests from clients. The name node and data nodes are the main daemons running in a Hadoop Distributed File System (HDFS), with the name node storing metadata of the data on data nodes and data nodes serving read and write requests from clients.', 'The Hadoop cluster follows a master-slave topology, with a centralized machine housing the name node and resource manager, connected to multiple slave machines or data nodes distributed over different racks. The Hadoop cluster follows a master-slave topology, with a centralized machine housing the name node and resource manager, connected to multiple slave machines or data nodes distributed over different racks.', 'In a production cluster, the name node and resource manager reside on different machines, but for practical purposes or proof of concept, they can be on the same machine. 
In a production cluster, the name node and resource manager reside on different machines, but for practical purposes or proof of concept, they can be on the same machine.']}], 'duration': 358.52, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE589454.jpg', 'highlights': ['The name node and data nodes are the main daemons running in a Hadoop Distributed File System (HDFS), with the name node storing metadata of the data on data nodes and data nodes serving read and write requests from clients.', 'The Hadoop cluster follows a master-slave topology, with a centralized machine housing the name node and resource manager, connected to multiple slave machines or data nodes distributed over different racks.', 'In a production cluster, the name node and resource manager reside on different machines, but for practical purposes or proof of concept, they can be on the same machine.']}, {'end': 1361.005, 'segs': [{'end': 1006.993, 'src': 'embed', 'start': 976.506, 'weight': 0, 'content': [{'end': 979.228, 'text': "Okay, we'll be exploring the architecture of HDFS.", 'start': 976.506, 'duration': 2.722}, {'end': 980.889, 'text': "So guys, don't lose me.", 'start': 979.248, 'duration': 1.641}, {'end': 982.951, 'text': 'Okay, be connected with me.', 'start': 981.37, 'duration': 1.581}, {'end': 993.708, 'text': 'Now when I say storing a file in HDFS, the data gets stored as blocks in HDFS, okay? So the entire file is not stored in HDFS.', 'start': 984.704, 'duration': 9.004}, {'end': 998.509, 'text': 'It is because, as you know, Hadoop is a distributed file system.', 'start': 993.828, 'duration': 4.681}, {'end': 1002.731, 'text': 'So if I have a file size of maybe one petabyte, okay?', 'start': 998.729, 'duration': 4.002}, {'end': 1006.993, 'text': 'So one petabyte kind of a storage is not present in one single machine.', 'start': 1003.171, 'duration': 3.822}], 'summary': 'Exploring hdfs architecture, data stored as blocks, suitable for large datasets.', 'duration': 30.487, 'max_score': 976.506, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE976506.jpg'}, {'end': 1059.959, 'src': 'embed', 'start': 1029.212, 'weight': 1, 'content': [{'end': 1031.092, 'text': "If not, I'll try and explain once again.", 'start': 1029.212, 'duration': 1.88}, {'end': 1034.785, 'text': 'Okay Great.', 'start': 1033.333, 'duration': 1.452}, {'end': 1039.488, 'text': 'So the data that gets stored into the Hadoop cluster is broken down into blocks.', 'start': 1035.465, 'duration': 4.023}, {'end': 1043.77, 'text': 'Okay Now these blocks are 128 MB in Apache 2 cluster.', 'start': 1039.928, 'duration': 3.842}, {'end': 1047.912, 'text': 'Okay However in Apache 1 it used to be 64 MB.', 'start': 1043.79, 'duration': 4.122}, {'end': 1050.174, 'text': 'Later on they upgraded the size.', 'start': 1048.532, 'duration': 1.642}, {'end': 1059.959, 'text': 'You also have the facility to increase or decrease the file size of the blocks using the configuration file that is HDFS site dot XML that is,', 'start': 1050.694, 'duration': 9.265}], 'summary': 'Data in hadoop cluster stored in 128 mb blocks, upgraded from 64 mb in apache 1.', 'duration': 30.747, 'max_score': 1029.212, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1029212.jpg'}, {'end': 1106.132, 'src': 'embed', 'start': 1075.431, 'weight': 2, 'content': [{'end': 1086.215, 'text': 'Now, if I break this file, or if I move 
this file into Hadoop cluster, that is 2.x, this file will get broken down into one block, that is block A,', 'start': 1075.431, 'duration': 10.784}, {'end': 1089.756, 'text': 'of 128 MB, and another block, that is of 120 MB.', 'start': 1086.215, 'duration': 3.541}, {'end': 1097.35, 'text': 'Okay, as you can see the first block was 128 MB pretty straightforward.', 'start': 1091.989, 'duration': 5.361}, {'end': 1106.132, 'text': 'That is the very first slab cuts down there and that is why the other block was of 120 MB and not 128 MB.', 'start': 1097.51, 'duration': 8.622}], 'summary': 'Moving the file into hadoop 2.x cluster breaks it into blocks of 128 mb and 120 mb.', 'duration': 30.701, 'max_score': 1075.431, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1075431.jpg'}, {'end': 1191.952, 'src': 'embed', 'start': 1163.87, 'weight': 3, 'content': [{'end': 1166.372, 'text': 'Okay, so every file gets replicated as well.', 'start': 1163.87, 'duration': 2.502}, {'end': 1169.734, 'text': 'That means the smaller the block size,', 'start': 1166.812, 'duration': 2.922}, {'end': 1177.12, 'text': 'the number of blocks will increase and then you will have to make copy those files separately as well as create the replicas of those files.', 'start': 1169.734, 'duration': 7.386}, {'end': 1179.181, 'text': 'So it will be an overhead to the cluster.', 'start': 1177.16, 'duration': 2.021}, {'end': 1185.066, 'text': 'At the same time, if you increase the file size of the block again,', 'start': 1180.843, 'duration': 4.223}, {'end': 1191.952, 'text': 'the block size will be too huge and there is a possibility that your commodity hardware is not able to store the entire block.', 'start': 1185.066, 'duration': 6.886}], 'summary': 'Smaller block size increases number of blocks, leading to overhead; larger block size may exceed hardware capacity.', 'duration': 28.082, 'max_score': 1163.87, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1163870.jpg'}, {'end': 1326.499, 'src': 'embed', 'start': 1294.174, 'weight': 4, 'content': [{'end': 1299.038, 'text': "It's because each block will be of 128 MB, right?", 'start': 1294.174, 'duration': 4.864}, {'end': 1310.468, 'text': 'So the entire file size can be divided into four blocks of 128 MB, and the last two MB remaining will be the last block, that is, E.', 'start': 1299.518, 'duration': 10.95}, {'end': 1317.231, 'text': 'Great So I hope you guys are very clear with what is blocks in HDFS.', 'start': 1310.468, 'duration': 6.763}, {'end': 1318.513, 'text': "Let's move on.", 'start': 1317.772, 'duration': 0.741}, {'end': 1325.539, 'text': 'Try and find an answer to another question, that is, is it safe to have just one copy of each block??', 'start': 1319.433, 'duration': 6.106}, {'end': 1326.499, 'text': 'What do you think, guys??', 'start': 1325.739, 'duration': 0.76}], 'summary': 'Hdfs uses 128 mb blocks, with 4 blocks and 2 mb remaining. it discusses block safety.', 'duration': 32.325, 'max_score': 1294.174, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1294174.jpg'}], 'start': 947.974, 'title': 'Hdfs block storage and optimal block size', 'summary': 'Delves into the storage of data in hdfs, focusing on breaking down files into blocks, with a standard block size of 128 mb in apache 2 cluster. 
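As the segment above notes, the 128 MB default block size can be raised or lowered through the hdfs-site.xml configuration file. A minimal sketch of the relevant property, using the standard Hadoop 2.x name dfs.blocksize and an illustrative value of 256 MB (the value is only an example, not a recommendation):

    <!-- hdfs-site.xml: example block-size override, 256 MB expressed in bytes -->
    <property>
      <name>dfs.blocksize</name>
      <value>268435456</value>
    </property>

The value currently in effect can be read back with hdfs getconf -confKey dfs.blocksize.
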
it also emphasizes the significance of choosing an optimal file size for hdfs blocks and explores the safety concerns related to having just one copy of each block in hdfs.', 'chapters': [{'end': 1185.066, 'start': 947.974, 'title': 'Hdfs block storage', 'summary': 'Explores how data is stored in hdfs, emphasizing the concept of breaking down files into blocks, with block sizes of 128 mb in apache 2 cluster and the ability to configure block sizes using hdfs site.xml file.', 'duration': 237.092, 'highlights': ['Data stored as blocks in HDFS due to distributed file system; file broken down into chunks of data called HDFS blocks. The entire file is not stored in HDFS due to its distributed file system nature, necessitating the breaking down of data into blocks, known as HDFS blocks.', 'Block sizes of 128 MB in Apache 2 cluster, 64 MB in Apache 1; ability to configure block sizes using HDFS site.xml file. In Apache 2 cluster, block sizes are 128 MB, whereas in Apache 1 it used to be 64 MB; the block sizes can be configured using the HDFS site.xml file.', 'Impact of block size on cluster overhead; smaller block size leads to more blocks and file replications, increasing cluster overhead. Smaller block sizes lead to an increase in the number of blocks and file replications, resulting in overhead for the cluster.']}, {'end': 1361.005, 'start': 1185.066, 'title': 'Optimal block size and hdfs blocks', 'summary': 'Discusses the importance of choosing an optimal file size for hdfs blocks, such as 128 mb, and highlights the process of dividing a file into blocks, with a demonstration of how a 514 mb file is divided into five blocks of 128 mb each. it also explores the safety concerns related to having just one copy of each block in hdfs due to the vulnerability of commodity hardware.', 'duration': 175.939, 'highlights': ['The chapter discusses the process of dividing a 514 MB file into HDFS blocks, demonstrating that it will be divided into five blocks of 128 MB each.', 'It emphasizes the importance of understanding the concept of blocks in HDFS and the significance of choosing an optimal file size, such as 128 MB, for efficient storage.', 'The discussion highlights the safety concerns related to having just one copy of each block in HDFS, citing the vulnerability of commodity hardware and the potential risk of data loss if a block gets deleted.']}], 'duration': 413.031, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE947974.jpg', 'highlights': ['Data stored as blocks in HDFS due to distributed file system; file broken down into chunks of data called HDFS blocks.', 'Block sizes of 128 MB in Apache 2 cluster, 64 MB in Apache 1; ability to configure block sizes using HDFS site.xml file.', 'The chapter discusses the process of dividing a 514 MB file into HDFS blocks, demonstrating that it will be divided into five blocks of 128 MB each.', 'Impact of block size on cluster overhead; smaller block size leads to more blocks and file replications, increasing cluster overhead.', 'The discussion highlights the safety concerns related to having just one copy of each block in HDFS, citing the vulnerability of commodity hardware and the potential risk of data loss if a block gets deleted.']}, {'end': 1881.739, 'segs': [{'end': 1522.035, 'src': 'embed', 'start': 1474.084, 'weight': 0, 'content': [{'end': 1477.205, 'text': 'Every block as you can see has been replicated thrice.', 'start': 1474.084, 'duration': 3.121}, {'end': 1482.567, 'text': 'That means 
Hadoop follows a default replication factor of three.', 'start': 1477.845, 'duration': 4.722}, {'end': 1488.529, 'text': 'That means any file that you copy into Hadoop distributed file system gets replicated thrice.', 'start': 1482.607, 'duration': 5.922}, {'end': 1491.27, 'text': 'Okay, there will be three copies of every file.', 'start': 1488.809, 'duration': 2.461}, {'end': 1501.475, 'text': "In other words, if I say, if you copy one GB of a file into Hadoop distributed file system, you're actually storing three GB of a file in HDFS.", 'start': 1492.509, 'duration': 8.966}, {'end': 1512.162, 'text': 'Are you guys clear with this? Can this default replication factor be changed? Very good question Shubham.', 'start': 1502.676, 'duration': 9.486}, {'end': 1519.447, 'text': 'Yes, the default replication factor can also be changed using the configuration files of Hadoop.', 'start': 1512.662, 'duration': 6.785}, {'end': 1522.035, 'text': 'You can always configure that.', 'start': 1520.814, 'duration': 1.221}], 'summary': 'Hadoop default replication factor is three, files are stored as three copies in hdfs, and the factor can be changed via hadoop configuration files.', 'duration': 47.951, 'max_score': 1474.084, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1474084.jpg'}, {'end': 1610.125, 'src': 'embed', 'start': 1583.733, 'weight': 3, 'content': [{'end': 1588.415, 'text': 'Okay, I was coming from the perspective of cost that is the storage is very cheap.', 'start': 1583.733, 'duration': 4.682}, {'end': 1591.657, 'text': 'Okay, we are using commodity Hardwares here today.', 'start': 1588.755, 'duration': 2.902}, {'end': 1601.001, 'text': 'We can add as many hard disk to our data nodes as well as we can add as many data nodes as we want and hence storing the data is not a problem.', 'start': 1591.817, 'duration': 9.184}, {'end': 1607.244, 'text': 'Okay, we can store as much data as we want and hence replicating is not a problem for us.', 'start': 1601.261, 'duration': 5.983}, {'end': 1610.125, 'text': 'It is just that we should not over replicate our data.', 'start': 1607.624, 'duration': 2.501}], 'summary': 'Cost-effective storage solution with scalable data nodes for unlimited data storage and replication.', 'duration': 26.392, 'max_score': 1583.733, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1583733.jpg'}, {'end': 1660.727, 'src': 'embed', 'start': 1636.994, 'weight': 4, 'content': [{'end': 1644.884, 'text': 'Now, Hadoop actually follows a concept called rack-awareness to decide where to store which replica of a block.', 'start': 1636.994, 'duration': 7.89}, {'end': 1650.622, 'text': 'okay?. So, as you can see, there are three different racks rack one, rack two and rack three.', 'start': 1644.884, 'duration': 5.738}, {'end': 1653.323, 'text': 'Rack one has got four data nodes, one to four.', 'start': 1651.122, 'duration': 2.201}, {'end': 1656.745, 'text': 'Rack two again has got four data nodes, five, six, seven, eight.', 'start': 1653.823, 'duration': 2.922}, {'end': 1660.727, 'text': 'And rack three has got four data nodes, nine, 10, 11 and 12.', 'start': 1657.085, 'duration': 3.642}], 'summary': 'Hadoop uses rack-awareness to store replicas. 
three racks with 4 data nodes each.', 'duration': 23.733, 'max_score': 1636.994, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1636994.jpg'}, {'end': 1786.256, 'src': 'embed', 'start': 1758.885, 'weight': 5, 'content': [{'end': 1766.228, 'text': 'Finally, block C is stored in data node 11 of rack three, and the copies can only be created in rack two or rack one,', 'start': 1758.885, 'duration': 7.343}, {'end': 1772.49, 'text': 'depending on the network bandwidth that it would be required to push the data into either rack two or rack one.', 'start': 1766.228, 'duration': 6.262}, {'end': 1777.893, 'text': 'Whichever is minimum, that rack will be selected using the rack awareness algorithm.', 'start': 1772.55, 'duration': 5.343}, {'end': 1786.256, 'text': 'So for example, rack one stores the rest of the copies of block C in data node two and data node four.', 'start': 1778.853, 'duration': 7.403}], 'summary': 'Block c is stored in data node 11 of rack three, with copies in rack one and rack two.', 'duration': 27.371, 'max_score': 1758.885, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1758885.jpg'}], 'start': 1361.185, 'title': 'Hadoop storage and fault tolerance', 'summary': "Delves into hadoop's fault tolerance mechanism, which involves replicating each block three times, resulting in a default replication factor of three and potential triple storage space. it also explores the storage and replication strategy, highlighting cost-effectiveness, use of commodity hardware, limited over-replication, and the rack-awareness algorithm.", 'chapters': [{'end': 1583.693, 'start': 1361.185, 'title': 'Hadoop fault tolerance and replication', 'summary': 'Explains how hadoop ensures fault tolerance by replicating every block three times in the hadoop distributed file system, leading to a default replication factor of three and potentially triple the storage space required, while also allowing for the adjustment of the replication factor through configuration files.', 'duration': 222.508, 'highlights': ['Hadoop replicates every block three times in the Hadoop distributed file system, ensuring fault tolerance. Fault tolerance, default replication factor of three', 'The default replication factor of three means that any file copied into Hadoop distributed file system gets replicated thrice, leading to triple the storage space required. Default replication factor of three, triple the storage space', 'The default replication factor of three can be changed using the configuration files of Hadoop, allowing for adjustments to the replication factor. 
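The transcript also points out that the default replication factor of three can be changed through Hadoop's configuration files. A minimal sketch of the two usual knobs, where /data/sample.txt is only a placeholder path: the cluster-wide default comes from the dfs.replication property in hdfs-site.xml, while files already stored in HDFS can be re-replicated with setrep.

    # Per-file change for data that is already in HDFS; -w waits until
    # the blocks have actually been re-replicated to the new factor.
    $ hdfs dfs -setrep -w 2 /data/sample.txt
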
Adjustable replication factor, configuration files of Hadoop']}, {'end': 1881.739, 'start': 1583.733, 'title': 'Hadoop storage and replication', 'summary': 'Explains the storage and replication strategy in hadoop, emphasizing the cost-effectiveness of storage, the use of commodity hardware, the limited need for over-replication, and the rack-awareness algorithm for deciding where to store replicas of blocks.', 'duration': 298.006, 'highlights': ['The storage in Hadoop is cost-effective, utilizing commodity hardware and allowing for the addition of numerous data nodes and hard disks, accommodating as much data as needed and reducing the need for over-replication.', 'Hadoop employs the rack-awareness algorithm to determine where to store replicas of blocks, ensuring that replicas are not stored in the same rack as the original data for fault tolerance, and selecting racks based on network bandwidth for efficient data transfer.', 'The decision to store multiple copies of a block in the same rack is based on the low probability of simultaneous failure of multiple racks, minimizing the need for unnecessary data transfer between racks and conserving network bandwidth.']}], 'duration': 520.554, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1361185.jpg', 'highlights': ['Hadoop replicates every block three times in the Hadoop distributed file system, ensuring fault tolerance.', 'The default replication factor of three means that any file copied into Hadoop distributed file system gets replicated thrice, leading to triple the storage space required.', 'The default replication factor of three can be changed using the configuration files of Hadoop, allowing for adjustments to the replication factor.', 'The storage in Hadoop is cost-effective, utilizing commodity hardware and allowing for the addition of numerous data nodes and hard disks, accommodating as much data as needed and reducing the need for over-replication.', 'Hadoop employs the rack-awareness algorithm to determine where to store replicas of blocks, ensuring that replicas are not stored in the same rack as the original data for fault tolerance, and selecting racks based on network bandwidth for efficient data transfer.', 'The decision to store multiple copies of a block in the same rack is based on the low probability of simultaneous failure of multiple racks, minimizing the need for unnecessary data transfer between racks and conserving network bandwidth.']}, {'end': 2756.779, 'segs': [{'end': 2174.836, 'src': 'heatmap', 'start': 2085.918, 'weight': 1, 'content': [{'end': 2089.5, 'text': 'So right now we are talking about writing mechanism into the HDFS.', 'start': 2085.918, 'duration': 3.582}, {'end': 2098.383, 'text': "So the step one is there's a write request generated for block A by the client to the name node.", 'start': 2090, 'duration': 8.383}, {'end': 2105.965, 'text': 'What the name node does is it sends the list of IP addresses where the client can actually write the block.', 'start': 2098.983, 'duration': 6.982}, {'end': 2107.526, 'text': 'Okay, that is block A.', 'start': 2106.385, 'duration': 1.141}, {'end': 2118.043, 'text': 'Now this client connects to the switch and then finally sends a notification to data node 1, data node 4 and data node 6.', 'start': 2108.638, 'duration': 9.405}, {'end': 2123.545, 'text': 'Why these data nodes itself? 
It is because these are the data nodes that was sent by the name node.', 'start': 2118.043, 'duration': 5.502}, {'end': 2125.666, 'text': 'Name node specified.', 'start': 2123.926, 'duration': 1.74}, {'end': 2133.51, 'text': 'you can write the data in data node 1, 4 and 6, and that is why client has connected to all these three data nodes at the same time.', 'start': 2125.666, 'duration': 7.844}, {'end': 2138.476, 'text': 'Now, In the very first step,', 'start': 2134.591, 'duration': 3.885}, {'end': 2146.101, 'text': 'client actually takes a acknowledgement from all these data nodes whether they are ready to perform the right operation on them or not.', 'start': 2138.476, 'duration': 7.625}, {'end': 2152.405, 'text': 'It is because it could be like a data node is executing a task and is not available as of now.', 'start': 2146.401, 'duration': 6.004}, {'end': 2155.988, 'text': 'So very first step is to take an acknowledgement if they are ready or not.', 'start': 2152.465, 'duration': 3.523}, {'end': 2161.191, 'text': 'As soon as they say they are ready the right pipeline is created.', 'start': 2156.728, 'duration': 4.463}, {'end': 2168.83, 'text': 'Now the client What it does, it sends the write request on data node 1, 4 and 6.', 'start': 2161.531, 'duration': 7.299}, {'end': 2174.836, 'text': 'The very first copy is created in data node 1 that is block A is created in data node 1.', 'start': 2168.83, 'duration': 6.006}], 'summary': 'Writing mechanism in hdfs: client sends write request to name node, receives list of ip addresses, connects to specified data nodes, takes acknowledgements, and creates a write pipeline.', 'duration': 88.918, 'max_score': 2085.918, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE2085918.jpg'}, {'end': 2409.498, 'src': 'embed', 'start': 2379.559, 'weight': 2, 'content': [{'end': 2384.18, 'text': 'if you are clear, Ryan says got it good.', 'start': 2379.559, 'duration': 4.621}, {'end': 2390.948, 'text': 'Okay, so what happens is the first copy of every block is created in parallel.', 'start': 2385.666, 'duration': 5.282}, {'end': 2400.351, 'text': 'So block A gets created in data node one and at the same time parallelly block B gets created in data node seven of rack five.', 'start': 2391.228, 'duration': 9.123}, {'end': 2407.237, 'text': 'Now once the first copy is created the replicas gets created in a sequential fashion.', 'start': 2401.536, 'duration': 5.701}, {'end': 2409.498, 'text': "First we'll talk about block A.", 'start': 2407.557, 'duration': 1.941}], 'summary': 'First block copies created in parallel, then replicas sequentially.', 'duration': 29.939, 'max_score': 2379.559, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE2379559.jpg'}, {'end': 2632.388, 'src': 'embed', 'start': 2607.556, 'weight': 0, 'content': [{'end': 2614.698, 'text': "Okay, so the name node will ensure that client doesn't have to work a lot to get the data or read the data.", 'start': 2607.556, 'duration': 7.142}, {'end': 2626.463, 'text': "Okay, it will ensure that the data nodes where the actual data is stored are very close enough and then the client doesn't have to consume a lot of network bandwidth to just read the data.", 'start': 2614.778, 'duration': 11.685}, {'end': 2632.388, 'text': 'Okay? 
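To see the block placement that the write pipeline and rack-awareness policy produced, fsck can be asked to list every block of a file along with the DataNodes holding its replicas. A minimal sketch, assuming /data/sample.txt is an existing HDFS path used here only as a placeholder:

    # Lists each block of the file and the DataNode addresses of its replicas
    $ hadoop fsck /data/sample.txt -files -blocks -locations
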
This is a very crucial thing that is taken care by NameNode and it helps a lot.', 'start': 2626.903, 'duration': 5.485}], 'summary': 'Namenode ensures efficient data access and minimizes network bandwidth consumption for clients.', 'duration': 24.832, 'max_score': 2607.556, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE2607556.jpg'}], 'start': 1883.317, 'title': 'Hdfs architecture & write mechanism', 'summary': "Covers hdfs architecture, highlighting the name node, data node, replication factor, and the write mechanism. it also discusses the read mechanism and the name node's role in optimizing data retrieval to save network bandwidth.", 'chapters': [{'end': 2756.779, 'start': 1883.317, 'title': 'Hdfs architecture & write mechanism', 'summary': "Covers the architecture of hdfs, with a focus on the name node, data node, replication factor, and the write mechanism, where the client generates a write request, the name node provides a list of data nodes, and the client connects to the specified data nodes to create replicas in a sequential manner. the read mechanism is also discussed, highlighting the name node's role in optimizing data retrieval by providing the addresses of data nodes closer to the client to save network bandwidth.", 'duration': 873.462, 'highlights': ['The write mechanism involves the client generating a write request for a block, the name node providing a list of IP addresses for the client to write the block, the client connecting to the specified data nodes, creating a pipeline for writing the block, and the metadata being updated by the name node after successful write.', 'The first copy of each block is created in parallel, while the replicas are created in a sequential fashion by their subsequent data nodes.', 'The name node optimizes data retrieval by providing the addresses of data nodes closer to the client to save network bandwidth during the read mechanism.']}], 'duration': 873.462, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE1883317.jpg', 'highlights': ['The name node optimizes data retrieval by providing the addresses of data nodes closer to the client to save network bandwidth during the read mechanism.', 'The write mechanism involves the client generating a write request for a block, the name node providing a list of IP addresses for the client to write the block, the client connecting to the specified data nodes, creating a pipeline for writing the block, and the metadata being updated by the name node after successful write.', 'The first copy of each block is created in parallel, while the replicas are created in a sequential fashion by their subsequent data nodes.']}, {'end': 3482.275, 'segs': [{'end': 2850.858, 'src': 'embed', 'start': 2758.641, 'weight': 0, 'content': [{'end': 2766.669, 'text': "Now, let's try and find out the version of Hadoop that is running on this pseudo distributed cluster of Edureka Virtual Machine.", 'start': 2758.641, 'duration': 8.028}, {'end': 2772.235, 'text': 'The command that we need to execute is Hadoop space version and I press enter.', 'start': 2767.59, 'duration': 4.645}, {'end': 2778.3, 'text': 'As you can see, it is running a version of Hadoop 2.2.', 'start': 2775.359, 'duration': 2.941}, {'end': 2785.164, 'text': '0, right? So guys, are you clear with this? 
Okay.', 'start': 2778.3, 'duration': 6.864}, {'end': 2795.545, 'text': 'Another command that we saw in the presentation was Hadoop fsck and then backslash.', 'start': 2786.464, 'duration': 9.081}, {'end': 2798.928, 'text': 'so fsck is a command to get the health of a file system.', 'start': 2795.545, 'duration': 3.383}, {'end': 2800.889, 'text': 'ok, and backslash is nothing.', 'start': 2798.928, 'duration': 1.961}, {'end': 2810.417, 'text': "but I'm trying to tell the system that I want to get the health of Hadoop distributed file system and hence it can be represented by a backslash.", 'start': 2800.889, 'duration': 9.528}, {'end': 2820.424, 'text': "ok, so when I press enter I'm going to get the health of the root directory of Hadoop distributed file system.", 'start': 2810.417, 'duration': 10.007}, {'end': 2824.346, 'text': 'take some time, Okay.', 'start': 2820.424, 'duration': 3.922}, {'end': 2826.188, 'text': 'so we have all the details right here.', 'start': 2824.346, 'duration': 1.842}, {'end': 2830.993, 'text': 'Total size of the data that is stored is this.', 'start': 2828.17, 'duration': 2.823}, {'end': 2838.371, 'text': 'Okay, the directories are this, total files is 626, total blocks are 602.', 'start': 2831.434, 'duration': 6.937}, {'end': 2842.113, 'text': 'replicated minimally replicated blocks over replicated blocks.', 'start': 2838.371, 'duration': 3.742}, {'end': 2846.936, 'text': 'under replicated blocks, default replication factor for this cluster is one.', 'start': 2842.113, 'duration': 4.823}, {'end': 2850.858, 'text': 'since it is a training VM, we have kept the replication factor as one.', 'start': 2846.936, 'duration': 3.922}], 'summary': 'Hadoop version 2.2.0 is running on the pseudo distributed cluster. the file system has 626 files, 602 blocks, and a replication factor of 1.', 'duration': 92.217, 'max_score': 2758.641, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE2758641.jpg'}, {'end': 2955.909, 'src': 'embed', 'start': 2924.721, 'weight': 5, 'content': [{'end': 2930.004, 'text': "Moreover, the difference between them is standalone doesn't have a daemon running and always uses one JVM.", 'start': 2924.721, 'duration': 5.283}, {'end': 2934.948, 'text': 'However, pseudo distributed mode has all the daemons running in a single machine.', 'start': 2930.525, 'duration': 4.423}, {'end': 2938.63, 'text': 'However, every daemon uses a different JVM.', 'start': 2935.388, 'duration': 3.242}, {'end': 2949.567, 'text': "Am I clear Shubham? Are you clear with this? 
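For reference, the two commands demonstrated so far in this part of the session are reproduced below as they would be typed on the training VM; the figures quoted above (Hadoop 2.2.0, 626 files, 602 blocks, replication factor 1) are specific to that VM.

    $ hadoop version    # prints the running Hadoop release, e.g. 2.2.0 on the Edureka VM
    $ hadoop fsck /     # reports the health of the HDFS root: total size, files, blocks, replication
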
Great, let's move on, see the next commands.", 'start': 2939.591, 'duration': 9.976}, {'end': 2955.909, 'text': 'So we have a command that is HDFS DFS hyphen LS backslash.', 'start': 2951.148, 'duration': 4.761}], 'summary': 'Standalone uses one JVM, pseudo-distributed mode has all daemons running in a single machine, each using a different JVM.', 'duration': 31.188, 'max_score': 2924.721, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE2924721.jpg'}, {'end': 3140.252, 'src': 'embed', 'start': 3112.151, 'weight': 3, 'content': [{'end': 3115.393, 'text': "HDFS means that I'm trying to execute a Hadoop command.", 'start': 3112.151, 'duration': 3.242}, {'end': 3118.155, 'text': 'DFS is the distributed file system.', 'start': 3115.874, 'duration': 2.281}, {'end': 3124.02, 'text': 'Hyphen put is the action that needs to be taken, that is, put the file.', 'start': 3119.076, 'duration': 4.944}, {'end': 3128.563, 'text': 'Welcome is the name of the file or the path of the file that I want to move to HDFS.', 'start': 3124.02, 'duration': 4.543}, {'end': 3133.067, 'text': 'And this is finally the path where I want to move it in HDFS.', 'start': 3128.563, 'duration': 4.504}, {'end': 3135.609, 'text': 'Right guys, are you clear with the command?', 'start': 3133.067, 'duration': 2.542}, {'end': 3140.252, 'text': 'A quick confirmation will help.', 'start': 3135.609, 'duration': 4.643}], 'summary': 'Executing hadoop command to move a file to hdfs using hyphen put action.', 'duration': 28.101, 'max_score': 3112.151, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE3112151.jpg'}, {'end': 3293.879, 'src': 'embed', 'start': 3255.279, 'weight': 4, 'content': [{'end': 3258.221, 'text': 'I can always download this file from here and use it.', 'start': 3255.279, 'duration': 2.942}, {'end': 3262.445, 'text': "I'll just get back to the presentation quickly.", 'start': 3260.463, 'duration': 1.982}, {'end': 3267.399, 'text': "So there's another command that is HDFS DFS hyphen help.", 'start': 3263.916, 'duration': 3.483}, {'end': 3269.62, 'text': "And we'll see that as well.", 'start': 3268.299, 'duration': 1.321}, {'end': 3278.788, 'text': "Okay. And then let's come back to the terminal.", 'start': 3272.983, 'duration': 5.805}, {'end': 3282.831, 'text': 'So HDFS DFS hyphen help.', 'start': 3278.948, 'duration': 3.883}, {'end': 3291.677, 'text': 'And we should get the list of commands or the help that you would need to work on the Hadoop cluster.', 'start': 3285.311, 'duration': 6.366}, {'end': 3293.879, 'text': "So then there's various commands.", 'start': 3292.137, 'duration': 1.742}], 'summary': 'Learning hadoop commands: hdfs dfs -help provides list of commands.', 'duration': 38.6, 'max_score': 3255.279, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE3255279.jpg'}], 'start': 2758.641, 'title': 'Hadoop version and hdfs overview', 'summary': 'Demonstrates finding hadoop version 2.2.0 and checking hdfs health, along with an overview of hdfs, data size, directory count, file count, block count, replication factor, and commands for file operations and ui access.', 'chapters': [{'end': 2824.346, 'start': 2758.641, 'title': 'Hadoop version and file system health check', 'summary': 'Demonstrates the process of finding the version of hadoop running on a pseudo distributed cluster, revealing it to be hadoop 2.2.0, and using the hadoop fsck command to check the health 
of the hadoop distributed file system.', 'duration': 65.705, 'highlights': ["The 'Hadoop version' command reveals that the pseudo distributed cluster is running Hadoop 2.2.0, providing clear insight into the version of Hadoop being utilized.", "The 'Hadoop fsck' command is used to check the health of the Hadoop distributed file system, offering a method to ensure the robustness and stability of the file system."]}, {'end': 3482.275, 'start': 2824.346, 'title': 'Hdfs overview and commands', 'summary': 'Provides an overview of hadoop distributed file system (hdfs) including details on data size, directory count, file count, block count, replication factor, and commands for listing files, putting files, and accessing the ui.', 'duration': 657.929, 'highlights': ['The total size of the stored data is [quantifiable data]. The speaker mentions the total size of the stored data without providing the specific quantifiable data.', 'Total files in the system are 626, and total blocks are 602. The transcript provides specific quantifiable data on the total number of files and blocks in the system.', 'Explanation of standalone and pseudo distributed modes, highlighting the differences and use cases. The chapter explains the differences and use cases of standalone and pseudo distributed modes, providing valuable insights for understanding Hadoop architecture.', 'Demonstration of HDFS commands including DFS -LS and DFS -put with detailed explanation. The chapter demonstrates and explains HDFS commands such as DFS -LS for listing files and DFS -put for moving files to the Hadoop distributed file system.', 'Introduction to HDFS UI and commands for accessing it, including a demonstration of the help command. The chapter introduces the HDFS UI, demonstrates accessing it, and provides details on using the help command for understanding Hadoop cluster commands.']}], 'duration': 723.634, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/m9v9lky3zcE/pics/m9v9lky3zcE2758641.jpg', 'highlights': ["The 'Hadoop version' command reveals that the pseudo distributed cluster is running Hadoop 2.2.0, providing clear insight into the version of Hadoop being utilized.", "The 'Hadoop fsck' command is used to check the health of the Hadoop distributed file system, offering a method to ensure the robustness and stability of the file system.", 'Total files in the system are 626, and total blocks are 602. The transcript provides specific quantifiable data on the total number of files and blocks in the system.', 'Demonstration of HDFS commands including DFS -LS and DFS -put with detailed explanation. The chapter demonstrates and explains HDFS commands such as DFS -LS for listing files and DFS -put for moving files to the Hadoop distributed file system.', 'Introduction to HDFS UI and commands for accessing it, including a demonstration of the help command. The chapter introduces the HDFS UI, demonstrates accessing it, and provides details on using the help command for understanding Hadoop cluster commands.', 'Explanation of standalone and pseudo distributed modes, highlighting the differences and use cases. 
The chapter explains the differences and use cases of standalone and pseudo distributed modes, providing valuable insights for understanding Hadoop architecture.']}], 'highlights': ['Hadoop reduced processing time for 1 TB of data to 4.3 minutes with a distributed file system, showcasing significant efficiency gains.', 'The concept of a distributed file system and its advantages are explained, emphasizing the ability to reduce data processing time from 43 minutes to 4.3 minutes using multiple machines.', 'The challenges of storing and processing big data are addressed, emphasizing the inability of existing systems to accommodate the data and the significant time delay in processing, with results taking weeks or months instead of minutes.', 'The default replication factor of three means that any file copied into Hadoop distributed file system gets replicated thrice, leading to triple the storage space required.', 'The session will cover components of Hadoop, distributed file system, services, blocks, replication factor, rack awareness, and HDFS architecture, as well as read and write mechanisms.', 'The HDFS has two main components: name node (master daemon) running on the master machine and data node (slave machine) running on commodity hardware.', 'The name node and data nodes are the main daemons running in a Hadoop Distributed File System (HDFS), with the name node storing metadata of the data on data nodes and data nodes serving read and write requests from clients.', 'The decision to store multiple copies of a block in the same rack is based on the low probability of simultaneous failure of multiple racks, minimizing the need for unnecessary data transfer between racks and conserving network bandwidth.', 'The name node optimizes data retrieval by providing the addresses of data nodes closer to the client to save network bandwidth during the read mechanism.', "The 'Hadoop version' command reveals that the pseudo distributed cluster is running Hadoop 2.2.0, providing clear insight into the version of Hadoop being utilized."]}
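For quick reference, here is a minimal sketch of the HDFS shell commands demonstrated in this section, written the way they are typically typed on the training VM. Note that although the narration says "backslash", the HDFS root path is actually the forward slash (/). The local file name "welcome" comes from the demo, while the HDFS target directory "/user/edureka" is only an assumed example path, not one confirmed in the transcript.

  # Print the Hadoop release installed on the cluster (2.2.0 on the demo VM)
  hadoop version

  # Report the health of the HDFS root directory: total size, directories, files,
  # blocks, under-/over-replicated blocks, and the default replication factor
  hadoop fsck /

  # List the files and directories under the HDFS root
  hdfs dfs -ls /

  # Copy the local file "welcome" into HDFS (target directory is an assumed example)
  hdfs dfs -put welcome /user/edureka

  # Print usage information for the HDFS shell commands
  hdfs dfs -help

After the -put command succeeds, running hdfs dfs -ls on the target directory should list the copied file, and the same file can also be browsed and downloaded from the NameNode web UI mentioned in the session.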