title
Lecture 1 — Distributed File Systems | Stanford University
detail
Summary: The lecture motivates MapReduce for mining massive datasets, using Google's example of 10 billion web pages totaling 200 terabytes and the disk-bandwidth limitation that makes a single machine take more than 46 days just to read that data, versus roughly an hour when the work is spread across a thousand disks and CPUs. It then describes the standard cluster-computing architecture, the challenges of operating at data-center scale (node failures, persistent storage, network bandwidth), and the architecture and components of a distributed file system.

Chapter 1 (0:00-3:22): MapReduce and the data bandwidth limitation

Welcome to Mining of Massive Datasets. I'm Anand Rajaraman, and today's topic is MapReduce. In the last few years, MapReduce has emerged as the leading paradigm for mining really massive datasets. Before getting into MapReduce proper, it is worth spending a few minutes on why we need it in the first place.

Start with the basic computational model: a CPU and memory. The algorithm runs on the CPU and accesses data in memory; when the data does not fit in memory, it has to be read from disk. Now consider a representative example: Google crawls and indexes about 10 billion web pages, and the average web page is about 20 kilobytes, for a total dataset of roughly 200 terabytes. In the classical data-mining model, all of that data sits on a single disk and must be read into the CPU to be processed. The fundamental limitation is the data bandwidth between disk and CPU: a modern SATA disk reads at about 50 megabytes per second. Reading 200 terabytes at 50 megabytes per second takes about 4 million seconds, which is more than 46 days. That is just the time to read the data into memory; doing anything useful with it takes even longer. Clearly this is unacceptable.

The obvious fix is to split the data into chunks spread over many disks and CPUs and process it in parallel. With a thousand disks and CPUs working completely in parallel, the 4 million seconds become 4 million / 1,000 = 4,000 seconds, which is just about an hour, a very acceptable time. This is the fundamental idea behind cluster computing.
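The back-of-the-envelope numbers above can be checked in a few lines of Python. The constants below are the lecture's illustrative round figures (decimal units), not measured values.

```python
# Rough sketch of the lecture's back-of-the-envelope arithmetic.
# Constants are the lecture's illustrative round numbers (decimal units).

PAGES = 10_000_000_000        # 10 billion web pages
PAGE_SIZE = 20_000            # ~20 KB per page
DISK_BW = 50_000_000          # ~50 MB/s sequential read from one SATA disk
NUM_NODES = 1_000             # disks/CPUs working in parallel

dataset_bytes = PAGES * PAGE_SIZE
single_disk_seconds = dataset_bytes / DISK_BW
parallel_seconds = single_disk_seconds / NUM_NODES

print(f"dataset: {dataset_bytes / 1e12:.0f} TB")
print(f"one disk: {single_disk_seconds / 86_400:.1f} days just to read")
print(f"{NUM_NODES} disks in parallel: {parallel_seconds / 3600:.1f} hours")
```

Running this reproduces the lecture's figures: a 200 TB dataset, about 46 days on a single disk, and about 1.1 hours when striped across a thousand disks.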
Chapter 2 (3:22-8:01): Cluster computing architecture and challenges

The standard architecture that has emerged for cluster computing looks like this. You have racks of commodity Linux nodes; commodity nodes are used because they are cheap, so you can buy thousands of them and rack them up. Each rack holds 16 to 64 nodes connected by a switch, typically a gigabit switch, so there is about one gigabit per second of bandwidth between any pair of nodes in a rack. The racks themselves are connected by a backbone switch with two to ten gigabits per second of bandwidth, which is how data moves between racks in the data center.

Operating at this scale introduces new challenges. Even if individual servers fail rarely and independently, a cluster of around a thousand servers will see roughly one failure a day, which is still manageable. But at the scale of Google, with on the order of a million servers in its cluster, that becomes about 1,000 failures per day, and you need infrastructure to cope with that failure rate. Failures at this scale create two kinds of problems: first, if the nodes that store your data can fail, how do you keep the data stored persistently and available? Second, how do you make sure a long-running computation completes even when nodes fail partway through? A further limitation is network bandwidth: moving ten terabytes of data across a one-gigabit-per-second connection takes roughly a day, so shipping large datasets around the cluster is itself expensive.
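The scale arguments can be sanity-checked the same way. The per-server failure rate used below (one failure per server per 1,000 days) is an assumption chosen to reproduce the lecture's figures, not a number stated in the lecture.

```python
# Sanity check of the lecture's scale arguments. The per-server failure
# rate is an assumed figure that reproduces the lecture's numbers.

FAILURES_PER_SERVER_PER_DAY = 1 / 1000   # assume ~one failure per server per 1,000 days

for servers in (1_000, 1_000_000):
    expected = servers * FAILURES_PER_SERVER_PER_DAY
    print(f"{servers:>9,} servers -> ~{expected:,.0f} expected failures/day")

# Moving 10 TB over a 1 Gb/s link:
DATA_BITS = 10e12 * 8          # 10 TB expressed in bits
LINK_BPS = 1e9                 # 1 gigabit per second
print(f"10 TB over 1 Gb/s: {DATA_BITS / LINK_BPS / 3600:.0f} hours (~a day)")
```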
Chapter 3 (8:02-15:47): Distributed file systems

MapReduce addresses these problems in three ways. First, it relies on redundant storage: the same data is stored on multiple nodes, so even if one node is lost the data is still available on another. Second, it moves the computation close to the data rather than copying data around the network, which minimizes the network-bottleneck problem. Third, it provides a very simple programming model that hides the complexity of everything going on underneath.

The redundant storage infrastructure is provided by a distributed file system: a file system that stores data across a cluster and keeps each piece of data in multiple copies. Google's GFS and Hadoop's HDFS are the two most popular examples. These systems are optimized for a particular usage pattern: huge files, in the hundreds of gigabytes to terabytes, that are very rarely updated in place. Once written, data is read many times, and when it is updated, it is updated through appends. Google's crawler, for example, appends newly crawled pages; it never goes back and updates the content of a web page it has already crawled.
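As a taste of the "simple programming model" mentioned above, here is a minimal, framework-free word-count sketch in plain Python. It only illustrates the shape of the map and reduce functions that the programmer supplies; it is not the Hadoop API, and the function names (map_fn, reduce_fn, run_mapreduce) are hypothetical.

```python
# Minimal illustration of the MapReduce programming model: the programmer
# writes a map function and a reduce function; the framework (simulated
# here by a plain loop) handles distribution, scheduling, and failures.
from collections import defaultdict

def map_fn(document: str):
    # Emit (key, value) pairs; here, (word, 1) for each word.
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # Combine all values emitted for the same key.
    return key, sum(values)

def run_mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:                    # "map" phase
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # "reduce" phase

print(run_mapreduce(["the cat sat", "the dog sat"]))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```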
The typical usage pattern, then, is that data is written once, read many times, and appended to occasionally. Under the hood, a distributed file system keeps data in chunks that are spread across machines: each file is divided into chunks, and the chunks are distributed across multiple machines, which in this context are called chunk servers. For example, a file might be divided into chunks C1, C2, and so on, with one chunk stored on chunk server 2 and another on chunk server 3. Spreading the chunks around is not sufficient by itself: each chunk must also be stored in multiple copies. So the chunks are replicated, say a copy of C1 on chunk server 2 and a copy of C2 on chunk server 3, and if you look carefully you will notice that the replicas of a chunk are never placed on the same chunk server. Other files are chunked and spread across chunk servers in the same way.

It turns out that the chunk servers also act as compute servers: whenever a computation has to access some data, that computation is scheduled on a chunk server that actually holds the data, so the data does not have to move to where the computation runs. To summarize so far: files are split into chunks, and each chunk is replicated.
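Below is a small sketch of the chunk-placement policy just described: replicas of a chunk never land on the same chunk server, and it also anticipates the rack-awareness rule discussed next (prefer at least one replica in a different rack). The Server tuple and place_replicas function are hypothetical illustrations, not the GFS or HDFS placement algorithm.

```python
# Toy chunk-placement sketch: never put two replicas of a chunk on the
# same chunk server, and try to put at least one replica in another rack.
# Hypothetical illustration only; not the GFS/HDFS placement code.
import random
from collections import namedtuple

Server = namedtuple("Server", ["name", "rack"])

SERVERS = [Server(f"cs{i}", rack=f"rack{i % 3}") for i in range(9)]

def place_replicas(chunk_id: str, replication: int = 3):
    first = random.choice(SERVERS)
    placed = [first]
    # Prefer a second replica in a different rack, if one exists.
    off_rack = [s for s in SERVERS if s.rack != first.rack]
    if off_rack and replication > 1:
        placed.append(random.choice(off_rack))
    # Fill remaining replicas on servers not already used.
    while len(placed) < replication:
        candidate = random.choice(SERVERS)
        if candidate not in placed:
            placed.append(candidate)
    return {chunk_id: [(s.name, s.rack) for s in placed]}

print(place_replicas("C1"))
# e.g. {'C1': [('cs4', 'rack1'), ('cs6', 'rack0'), ('cs2', 'rack2')]}
```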
In the example above each chunk was replicated twice, but the replication factor can be 2x or 3x, with 3x being the most common. The replicas are kept on different chunk servers, and with 3x replication the system usually makes an effort to keep at least one replica in an entirely different rack if possible. The reason is that while the most common failure is a single node going down, an entire rack can also become inaccessible, for example if its switch fails, and keeping a replica in another rack protects against that.

Besides the chunk servers, the distributed file system has a master node that stores metadata: it knows which chunks make up each file and where each of those chunks and its replicas are located. The master node itself may be replicated, because otherwise it would be a single point of failure.

The final component is a client library. When a client, or an algorithm that needs the data, tries to access a file, it goes through the client library. The library talks to the master to find the chunk servers that store the file's chunks; once that is done, the client connects directly to those chunk servers and reads the data without going through the master. Data access therefore happens in a peer-to-peer fashion, with the master involved only in the metadata lookup.
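The read path just described (metadata lookup at the master, then direct reads from the chunk servers) can be sketched as follows. The Master, ChunkServer, and read_file names are hypothetical; real systems such as GFS and HDFS expose this flow through their own client libraries.

```python
# Sketch of a distributed-file-system read path: ask the master which
# chunk servers hold each chunk, then fetch the chunk data directly from
# those servers (peer to peer, bypassing the master for the data itself).
# Hypothetical classes and methods, for illustration only.

class Master:
    def __init__(self, chunk_locations):
        # file name -> ordered list of (chunk_id, [chunk server names])
        self.chunk_locations = chunk_locations

    def lookup(self, filename):
        return self.chunk_locations[filename]

class ChunkServer:
    def __init__(self, chunks):
        self.chunks = chunks             # chunk_id -> bytes

    def read_chunk(self, chunk_id):
        return self.chunks[chunk_id]

def read_file(master, servers, filename):
    data = b""
    for chunk_id, replicas in master.lookup(filename):   # metadata only
        server = servers[replicas[0]]                     # pick any replica
        data += server.read_chunk(chunk_id)               # direct read
    return data

servers = {"cs1": ChunkServer({"C1": b"hello "}),
           "cs2": ChunkServer({"C1": b"hello ", "C2": b"world"})}
master = Master({"greeting.txt": [("C1", ["cs1", "cs2"]), ("C2", ["cs2"])]})
print(read_file(master, servers, "greeting.txt"))   # b'hello world'
```

Because the master handles only metadata while the bulk data flows directly between clients and chunk servers, it stays off the data path and does not become a bottleneck.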