title
Introducing Apache Hadoop: The Modern Data Operating System
description
(November 16, 2011) Amr Awadallah introduces Apache Hadoop and asserts that it is the data operating system of the future. He explains many of the data problems faced by modern data systems while highlighting the benefits and features of Hadoop.
Stanford University:
http://www.stanford.edu/
Stanford School of Engineering:
http://engineering.stanford.edu/
Stanford Electrical Engineering Department:
http://ee.stanford.edu
Stanford EE380 Computer Systems Colloquium
http://www.stanford.edu/class/ee380/
Stanford University Channel on YouTube:
http://www.youtube.com/stanford
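The talk describes MapReduce as both a resource scheduler and a programming model for thinking in parallel. A rough sketch of that programming model — a local word-count simulation only, with hypothetical helper names, not the actual Hadoop API — might look like:

```python
from collections import defaultdict

def map_fn(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce phase: sum all the counts collected for a single word."""
    return word, sum(counts)

def run_local(lines):
    """Simulate Hadoop's shuffle/sort step: group map output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

print(run_local(["the quick brown fox", "the lazy dog"]))
# → {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In a real cluster the map and reduce functions would run as separate tasks on many nodes (e.g. via Hadoop Streaming for non-Java languages), with HDFS storing the input and the framework performing the shuffle between phases.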
detail
{'title': 'Introducing Apache Hadoop: The Modern Data Operating System', 'heatmap': [{'end': 691.932, 'start': 504.933, 'weight': 0.987}, {'end': 1703.879, 'start': 1564.462, 'weight': 0.898}, {'end': 1938.84, 'start': 1795.163, 'weight': 0.849}, {'end': 2676.782, 'start': 2576.487, 'weight': 0.733}], 'summary': "Introduces apache hadoop, covering its impact on big data processing, challenges of traditional data analytics, data processing models, scalability, hdfs overview, network speeds, power efficiency, system enhancements, cloudera's cdh, setting up a hadoop cluster with mapreduce, open source benefits, and handling unstructured data.", 'chapters': [{'end': 479.34, 'segs': [{'end': 103.572, 'src': 'embed', 'start': 79.258, 'weight': 0, 'content': [{'end': 85.343, 'text': 'As with Linux, the other great open source success, various companies have sprung up to push this along.', 'start': 79.258, 'duration': 6.085}, {'end': 93.548, 'text': 'Amr Awadallah of Cloudera is here to tell us about Hadoop, the various tools in this infrastructure,', 'start': 86.043, 'duration': 7.505}, {'end': 100.958, 'text': 'how it all works together and how the ecosystem is going to continue to make it possible for the rest of us to do big data.', 'start': 93.548, 'duration': 7.41}, {'end': 103.572, 'text': 'Thank you for a nice introduction.', 'start': 102.452, 'duration': 1.12}], 'summary': "Hadoop's ecosystem enables big data processing, supported by companies like cloudera.", 'duration': 24.314, 'max_score': 79.258, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI79258.jpg'}, {'end': 322.205, 'src': 'embed', 'start': 296.452, 'weight': 1, 'content': [{'end': 302.135, 'text': 'So the three problems that we got hit with are these three red stars that you have there on the chart.', 'start': 296.452, 'duration': 5.683}, {'end': 304.076, 'text': 'This one was a forcing function.', 'start': 302.755, 'duration': 1.321}, {'end': 
305.897, 'text': 'This one here is that there is no escape from it.', 'start': 304.096, 'duration': 1.801}, {'end': 313.141, 'text': "is, the amount of data we have accumulated every day was reaching a point where we couldn't finish economically,", 'start': 306.537, 'duration': 6.604}, {'end': 316.742, 'text': 'finish processing the data from the previous day before the new day started.', 'start': 313.141, 'duration': 3.601}, {'end': 322.205, 'text': "Once you hit that point, there's nothing you can do, you're toast, unless you can find a new solution.", 'start': 317.603, 'duration': 4.602}], 'summary': 'Three key problems: data accumulation, processing speed, and finding a new solution.', 'duration': 25.753, 'max_score': 296.452, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI296452.jpg'}, {'end': 401.481, 'src': 'embed', 'start': 371.668, 'weight': 3, 'content': [{'end': 379.453, 'text': 'But once it goes to tape, from my experience at Yahoo and others that worked at large corporations with backup policies, you never see the data again.', 'start': 371.668, 'duration': 7.785}, {'end': 380.674, 'text': "It's as good as dead.", 'start': 379.733, 'duration': 0.941}, {'end': 387.496, 'text': 'The cost of archiving solutions is very economical for storage, extremely expensive for retrieval.', 'start': 381.674, 'duration': 5.822}, {'end': 390.457, 'text': 'To get something back from archival storage is extremely expensive.', 'start': 387.736, 'duration': 2.721}, {'end': 393.018, 'text': 'So we wanted to keep the data alive for longer.', 'start': 390.497, 'duration': 2.521}, {'end': 401.481, 'text': 'And that meant that we had to find a solution that the economics of storing the bytes justified doing that.', 'start': 393.418, 'duration': 8.063}], 'summary': 'Archiving data is economical for storage but extremely expensive for retrieval, prompting the need for a cost-effective solution to keep data alive longer.',
'duration': 29.813, 'max_score': 371.668, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI371668.jpg'}, {'end': 448.619, 'src': 'embed', 'start': 414.888, 'weight': 5, 'content': [{'end': 418.111, 'text': 'So essentially we had to bring the cost down to a point where we can justify doing that.', 'start': 414.888, 'duration': 3.223}, {'end': 425.07, 'text': 'Last but not least, We wanted the ability to go back and explore the original highest fidelity data.', 'start': 418.991, 'duration': 6.079}, {'end': 428.351, 'text': "Because when you go through this ETL process, you're actually aggregating your data.", 'start': 425.15, 'duration': 3.201}, {'end': 429.392, 'text': 'You are losing fidelity.', 'start': 428.411, 'duration': 0.981}, {'end': 437.434, 'text': "You're either aggregating or you're normalizing to conform dimensions, for those of you with business intelligence experience.", 'start': 429.412, 'duration': 8.022}, {'end': 439.975, 'text': "That process means you're going to lose data.", 'start': 438.294, 'duration': 1.681}, {'end': 448.619, 'text': "And every now and then you'll get a question from the business or a question that you, as somebody trying to help the business, you want to ask,", 'start': 440.595, 'duration': 8.024}], 'summary': 'Reduced cost, maintained data fidelity, and preserved original data for analysis.', 'duration': 33.731, 'max_score': 414.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI414888.jpg'}], 'start': 5.051, 'title': 'Hadoop, big data, and analytics challenges', 'summary': 'Discusses the rise of hadoop technology, its impact on big data processing, and the challenges of traditional data analytics and business intelligence stack, including speed issues, archiving costs, and data fidelity loss.', 'chapters': [{'end': 214.941, 'start': 5.051, 'title': 'Hadoop and big data revolution', 'summary': 'Discusses the 
rise of hadoop, a technology causing disruption in big data processing space, and its foundational role in solving business problems, as well as the passion and belief of the speaker in its potential impact.', 'duration': 209.89, 'highlights': ['Hadoop technology causing disruption in big data processing space', 'Foundational role of Hadoop in solving business problems', 'Passion and belief of the speaker in the potential impact of Hadoop']}, {'end': 479.34, 'start': 215.722, 'title': 'Data analytics stack challenges', 'summary': 'Highlights the challenges of traditional data analytics and business intelligence stack, including the issues related to data processing speed, archiving costs, and data fidelity loss, which led to the need for a new solution.', 'duration': 263.618, 'highlights': ["The massive data accumulation was reaching a point where it couldn't be economically processed before the new day started.", 'Archiving solutions were economically viable for storage but extremely expensive for retrieval, necessitating a cost-effective way to keep the data alive for longer.', 'The process of ETL led to data fidelity loss, making it challenging to explore the original highest fidelity data and limiting the ability to ask new questions not supported by the existing schema.']}], 'duration': 474.289, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI5051.jpg', 'highlights': ['Hadoop technology causing disruption in big data processing space', "The massive data accumulation was reaching a point where it couldn't be economically processed before the new day started", 'Foundational role of Hadoop in solving business problems', 'Archiving solutions were economically viable for storage but extremely expensive for retrieval, necessitating a cost-effective way to keep the data alive for longer', 'Passion and belief of the speaker in the potential impact of Hadoop', 'The process of ETL led to data fidelity loss, making 
it challenging to explore the original highest fidelity data and limiting the ability to ask new questions not supported by the existing schema']}, {'end': 1011.232, 'segs': [{'end': 691.932, 'src': 'heatmap', 'start': 479.34, 'weight': 2, 'content': [{'end': 485.367, 'text': 'amend our ETL logic to process that and get the data out of our storage was an extremely expensive proposition.', 'start': 479.34, 'duration': 6.027}, {'end': 487.169, 'text': 'It took weeks, if not months, to do.', 'start': 485.407, 'duration': 1.762}, {'end': 490.23, 'text': 'And so, hence, it was very hard to justify doing.', 'start': 487.829, 'duration': 2.401}, {'end': 492.53, 'text': 'So these are the three driving problems.', 'start': 490.49, 'duration': 2.04}, {'end': 496.831, 'text': "And I'll go over them from the opposite side, like benefits.", 'start': 492.87, 'duration': 3.961}, {'end': 500.652, 'text': "What are the benefits of overcoming these problems? Though I hope it's very clear.", 'start': 497.291, 'duration': 3.361}, {'end': 503.232, 'text': 'Scalability of compute over the data.', 'start': 501.112, 'duration': 2.12}, {'end': 507.853, 'text': 'Scalability and economics of keeping the data alive for much longer.', 'start': 504.933, 'duration': 2.92}, {'end': 513.955, 'text': 'Flexibility and agility of going back and asking questions from the raw unstructured data.', 'start': 508.674, 'duration': 5.281}, {'end': 515.054, 'text': 'These are the forcing functions.', 'start': 514.015, 'duration': 1.039}, {'end': 517.102, 'text': 'So here comes Hadoop.', 'start': 516.383, 'duration': 0.719}, {'end': 520.764, 'text': "So what is Hadoop? 
Hadoop is an elephant, for those who don't know.", 'start': 517.722, 'duration': 3.042}, {'end': 529.606, 'text': 'The creator of Hadoop, Doug Cutting, his son, three-year-old son at the time, had a nice plush elephant toy, a yellow elephant, small elephant.', 'start': 521.604, 'duration': 8.002}, {'end': 531.226, 'text': 'Oops, sorry.', 'start': 529.626, 'duration': 1.6}, {'end': 540.688, 'text': "And that elephant, his son, one day out of the blue, called it Hadoop, right? It's just he made up that name.", 'start': 531.986, 'duration': 8.702}, {'end': 545.15, 'text': 'And Doug Cutting decided to pick that name to be the name of this framework.', 'start': 540.988, 'duration': 4.162}, {'end': 550.371, 'text': "Now the son is actually 11 years old, and he's very proud of himself because of this accomplishment.", 'start': 546.15, 'duration': 4.221}, {'end': 558.454, 'text': "So Hadoop, in a nutshell, it's a scalable, fault-tolerant distributed system for data storage and processing.", 'start': 552.572, 'duration': 5.882}, {'end': 563.016, 'text': "And it's licensed under the Apache license, which I can talk about briefly later in the presentation.", 'start': 558.614, 'duration': 4.402}, {'end': 566.517, 'text': "It's one of the friendliest licenses out there for open source consumption.", 'start': 563.036, 'duration': 3.481}, {'end': 570.618, 'text': 'I like to describe Hadoop in two ways.', 'start': 568.877, 'duration': 1.741}, {'end': 577.06, 'text': "One way is, what's an operating system? When I ask you about Linux or Windows, what's the heart of these systems? 
It's two things at the heart.", 'start': 570.858, 'duration': 6.202}, {'end': 581.021, 'text': "It's the ability to store files and the ability to run applications on top of files.", 'start': 577.76, 'duration': 3.261}, {'end': 581.781, 'text': "That's the core.", 'start': 581.241, 'duration': 0.54}, {'end': 588.803, 'text': 'And then there is Windows environment, GUIs, device drivers, security credentials, access lists, and all these things, libraries and so on.', 'start': 582.261, 'duration': 6.542}, {'end': 593.864, 'text': "There's just things around that core of storing files and running stuff on top of these files.", 'start': 589.203, 'duration': 4.661}, {'end': 598.527, 'text': "And that's what Hadoop is, with the difference that Hadoop does that on many, many, many machines, right?", 'start': 594.424, 'duration': 4.103}, {'end': 602.309, 'text': "It's a data center kind of operating system as opposed to a single node operating system.", 'start': 598.567, 'duration': 3.742}, {'end': 607.493, 'text': "In fact, it's an abstraction above, like Hadoop leverages Windows and leverages Linux to give that layer.", 'start': 602.91, 'duration': 4.583}, {'end': 610.715, 'text': 'The other way I like to describe Hadoop is like the opposite of a virtual machine.', 'start': 608.013, 'duration': 2.702}, {'end': 614.998, 'text': 'So, if you think of VMware essentially or any other virtualization technology,', 'start': 610.755, 'duration': 4.243}, {'end': 619.521, 'text': "it's about taking a physical server and chopping that up to be many small virtual servers.", 'start': 614.998, 'duration': 4.523}, {'end': 621.602, 'text': 'right, and Hadoop is the other way around.', 'start': 619.921, 'duration': 1.681}, {'end': 628.227, 'text': "it's about taking many, many physical servers and merging them all together to look like one big massive virtual server.", 'start': 621.602, 'duration': 6.625}, {'end': 633.53, 'text': "so that's another way to think about Hadoop and the two 
underlying systems, which were inspired by Google,", 'start': 628.227, 'duration': 5.303}, {'end': 635.011, 'text': 'and thanks Google for publishing the papers.', 'start': 633.53, 'duration': 1.481}, {'end': 639.334, 'text': "they didn't publish the source code for their internal systems, but they did publish the papers that had", 'start': 635.011, 'duration': 4.323}, {'end': 645.878, 'text': 'all the concepts behind these systems: the Hadoop Distributed File System, which is about creating this scalable, distributed,', 'start': 639.334, 'duration': 6.544}, {'end': 652.582, 'text': 'self-healing storage layer, and MapReduce, which has a dual nature.', 'start': 645.878, 'duration': 6.704}, {'end': 656.964, 'text': 'MapReduce is both a scheduling system for scheduling resources on top of a big cluster of machines,', 'start': 652.642, 'duration': 4.322}, {'end': 660.466, 'text': "but it's also a programming model that makes it easy for developers to think in a parallel way.", 'start': 656.964, 'duration': 3.502}, {'end': 666.168, 'text': "And it's important to differentiate between that dual nature of MapReduce, which I'll highlight later on in the presentation.", 'start': 661.106, 'duration': 5.062}, {'end': 666.909, 'text': "So that's it.", 'start': 666.469, 'duration': 0.44}, {'end': 668.43, 'text': "That's what Hadoop is in a nutshell.", 'start': 666.949, 'duration': 1.481}, {'end': 677.494, 'text': "It's an operating system purposely built for data processing that can be installed on many, many nodes and make them look like one big mainframe.", 'start': 668.89, 'duration': 8.604}, {'end': 685.487, 'text': 'This is the key slide, because this is the key thing that differentiates Hadoop from previous technologies.', 'start': 679.642, 'duration': 5.845}, {'end': 690.371, 'text': "And I would like you to pay attention for this one, because it's a very key concept.", 'start': 685.667, 'duration': 4.704}, {'end': 691.932, 'text': "It's really what makes 
Hadoop stand out.", 'start': 690.551, 'duration': 1.381}], 'summary': 'Hadoop is a scalable, fault-tolerant system for data storage and processing, with a unique operating system design.', 'duration': 41.424, 'max_score': 479.34, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI479340.jpg'}, {'end': 577.06, 'src': 'embed', 'start': 546.15, 'weight': 0, 'content': [{'end': 550.371, 'text': "Now the son is actually 11 years old, and he's very proud of himself because of this accomplishment.", 'start': 546.15, 'duration': 4.221}, {'end': 558.454, 'text': "So Hadoop, in a nutshell, it's a scalable, fault-tolerant distributed system for data storage and processing.", 'start': 552.572, 'duration': 5.882}, {'end': 563.016, 'text': "And it's licensed under the Apache license, which I can talk about briefly later in the presentation.", 'start': 558.614, 'duration': 4.402}, {'end': 566.517, 'text': "It's one of the friendliest licenses out there for open source consumption.", 'start': 563.036, 'duration': 3.481}, {'end': 570.618, 'text': 'I like to describe Hadoop in two ways.', 'start': 568.877, 'duration': 1.741}, {'end': 577.06, 'text': "One way is, what's an operating system? When I ask you about Linux or Windows, what's the heart of these systems? 
It's two things at the heart.", 'start': 570.858, 'duration': 6.202}], 'summary': 'Hadoop is a scalable, fault-tolerant distributed system for data, licensed under apache license.', 'duration': 30.91, 'max_score': 546.15, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI546150.jpg'}, {'end': 787.511, 'src': 'embed', 'start': 759.874, 'weight': 1, 'content': [{'end': 767.639, 'text': 'Now, the problem is because of this explicit load operation and because of the explicit schema being present,', 'start': 759.874, 'duration': 7.765}, {'end': 775.324, 'text': 'new data cannot flow in until you prepare for it, until you created the column for it and you created the ETL logic for it.', 'start': 767.639, 'duration': 7.685}, {'end': 776.285, 'text': 'And that is the problem.', 'start': 775.524, 'duration': 0.761}, {'end': 777.326, 'text': 'So it limits your agility.', 'start': 776.325, 'duration': 1.001}, {'end': 782.109, 'text': 'It limits your ability to be flexible and to grow at the speed that your data is evolving.', 'start': 777.806, 'duration': 4.303}, {'end': 787.511, 'text': 'At Yahoo, and I would like to say I had a very agile team that knew how to do things very quickly.', 'start': 783.149, 'duration': 4.362}], 'summary': 'Explicit load operation and schema limit agility and flexibility at yahoo.', 'duration': 27.637, 'max_score': 759.874, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI759874.jpg'}, {'end': 867.072, 'src': 'embed', 'start': 839.936, 'weight': 4, 'content': [{'end': 843.099, 'text': "And that's the point is these systems are not built to replace each other.", 'start': 839.936, 'duration': 3.163}, {'end': 844.139, 'text': 'They augment each other.', 'start': 843.119, 'duration': 1.02}, {'end': 846.061, 'text': "It's a very important point which I'll further.", 'start': 844.219, 'duration': 1.842}, {'end': 846.621, 'text': 'You want 
both.', 'start': 846.141, 'duration': 0.48}, {'end': 847.722, 'text': 'You want both environments.', 'start': 846.741, 'duration': 0.981}, {'end': 850.664, 'text': "The problem is we only had this and we didn't have that and now we have that.", 'start': 848.162, 'duration': 2.502}, {'end': 852.785, 'text': 'So schema on read is about late binding.', 'start': 851.184, 'duration': 1.601}, {'end': 856.627, 'text': "It's like, let's not bind our schema until the latest stage in the process.", 'start': 853.045, 'duration': 3.582}, {'end': 859.088, 'text': 'So with Hadoop, you load your data as it is.', 'start': 857.027, 'duration': 2.061}, {'end': 859.588, 'text': "You don't load it.", 'start': 859.108, 'duration': 0.48}, {'end': 860.308, 'text': 'You just copy the data.', 'start': 859.608, 'duration': 0.7}, {'end': 863.01, 'text': 'You drag and drop the files inside your Hadoop cluster.', 'start': 860.369, 'duration': 2.641}, {'end': 867.072, 'text': "And then at read time, when you're actually querying your data, that's when you apply your own lens.", 'start': 863.49, 'duration': 3.582}], 'summary': 'Systems should not replace but augment each other. hadoop allows late schema binding and data loading.', 'duration': 27.136, 'max_score': 839.936, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI839936.jpg'}], 'start': 479.34, 'title': 'Hadoop and data processing models', 'summary': 'Discusses hadoop as a scalable data processing system, highlighting its benefits and unique capabilities. 
it also explores the differences between schema on read and schema on write models, emphasizing their impact on data management, loading times, and governance.', 'chapters': [{'end': 691.932, 'start': 479.34, 'title': 'Understanding hadoop: a scalable data processing system', 'summary': 'Discusses the challenges of data processing, the benefits of overcoming these challenges, and introduces hadoop as a scalable, fault-tolerant distributed system for data storage and processing, licensed under the apache license, with a unique ability to make many nodes look like one big mainframe.', 'duration': 212.592, 'highlights': ['Hadoop is a scalable, fault-tolerant distributed system for data storage and processing.', 'Challenges of data processing include scalability, economics of data retention, and flexibility of querying unstructured data.', 'Benefits of overcoming data processing problems include scalability, economics, and agility of querying unstructured data.']}, {'end': 1011.232, 'start': 692.052, 'title': 'Schema on read vs. 
schema on write', 'summary': 'Explains the benefits and limitations of schema on write and schema on read models, with a focus on agility and flexibility in data management, highlighting the impact on data loading times and organizational governance.', 'duration': 319.18, 'highlights': ['Schema on read provides agility and flexibility by allowing the loading of data as is, offering the ability to interpret and parse unstructured data at query time, enabling new data to flow in with the option to augment the parsing lens later, without the need for reloading.', 'Schema on write, while providing optimizations such as indexes, compression, special data structures, and partitioning, imposes limitations on agility and flexibility due to the explicit load operation and the need for creating and preparing the schema before new data can flow in, resulting in slower adaptation to evolving data needs.', 'The chapter emphasizes the complementary nature of schema on write and schema on read, advocating for their simultaneous usage to leverage their respective strengths and mitigate their limitations, ultimately promoting innovation in data management.']}], 'duration': 531.892, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI479340.jpg', 'highlights': ['Hadoop is a scalable, fault-tolerant distributed system for data storage and processing.', 'Schema on read provides agility and flexibility by allowing the loading of data as is.', 'Challenges of data processing include scalability, economics of data retention, and flexibility of querying unstructured data.', 'Benefits of overcoming data processing problems include scalability, economics, and agility of querying unstructured data.', 'The chapter emphasizes the complementary nature of schema on write and schema on read.']}, {'end': 1534.781, 'segs': [{'end': 1111.052, 'src': 'embed', 'start': 1079.032, 'weight': 0, 'content': [{'end': 1080.373, 'text': 'if you go in with 
Java MapReduce.', 'start': 1079.032, 'duration': 1.341}, {'end': 1087.276, 'text': "So you only want to go to Java MapReduce when you know what you're doing and you really care about special things that you can get from the higher abstraction frameworks.", 'start': 1080.693, 'duration': 6.583}, {'end': 1093.399, 'text': 'Streaming MapReduce is an extension that still depends on the MapReduce model.', 'start': 1088.177, 'duration': 5.222}, {'end': 1094.8, 'text': 'So you need to know how to think in that.', 'start': 1093.419, 'duration': 1.381}, {'end': 1096.661, 'text': "And I'll cover the MapReduce model briefly later on.", 'start': 1094.84, 'duration': 1.821}, {'end': 1101.885, 'text': "However, it expands the universe of languages that you don't have to use Java.", 'start': 1098.562, 'duration': 3.323}, {'end': 1111.052, 'text': "You can use Java, Python, Perl, C++, Ruby, whatever language you're comfortable with, right? So that makes it more appealing for folks.", 'start': 1101.905, 'duration': 9.147}], 'summary': 'Streaming mapreduce extends to multiple languages, making it more accessible for users.', 'duration': 32.02, 'max_score': 1079.032, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1079032.jpg'}, {'end': 1241.403, 'src': 'embed', 'start': 1217.185, 'weight': 1, 'content': [{'end': 1223.51, 'text': "This is very, very, very important and it's fundamental to the MapReduce model itself, but also the way the system Hadoop system,", 'start': 1217.185, 'duration': 6.325}, {'end': 1227.553, 'text': 'the MapReduce and GFS systems have been built, which materialized in Hadoop as well.', 'start': 1223.51, 'duration': 4.043}, {'end': 1228.994, 'text': 'Two concepts here.', 'start': 1228.293, 'duration': 0.701}, {'end': 1230.895, 'text': 'First, the first one is the system itself is scalable.', 'start': 1229.034, 'duration': 1.861}, {'end': 1231.896, 'text': 'You can start with a few nodes.', 'start': 
1230.915, 'duration': 0.981}, {'end': 1235.099, 'text': 'When you add more nodes, the system just automatically grows right?', 'start': 1232.257, 'duration': 2.842}, {'end': 1241.403, 'text': 'It repartitions the data, redistributes the data and the jobs coming in will now start taking advantage of the new nodes, et cetera, et cetera.', 'start': 1235.159, 'duration': 6.244}], 'summary': 'Hadoop system is scalable, repartitions data, redistributes jobs, and grows automatically.', 'duration': 24.218, 'max_score': 1217.185, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1217185.jpg'}, {'end': 1320.048, 'src': 'embed', 'start': 1289.57, 'weight': 2, 'content': [{'end': 1291.211, 'text': 'You can take the hardest algorithms in the world.', 'start': 1289.57, 'duration': 1.641}, {'end': 1296.494, 'text': "But if you take the simplest algorithms with lots of data, that's the premise, you will beat the hardest algorithms.", 'start': 1292.031, 'duration': 4.463}, {'end': 1299.535, 'text': 'And one of the examples that was used was natural language translation.', 'start': 1296.634, 'duration': 2.901}, {'end': 1303.718, 'text': 'So natural language translation between different human languages existed for many, many years.', 'start': 1299.916, 'duration': 3.802}, {'end': 1306.72, 'text': "It's a very hard problem.", 'start': 1304.798, 'duration': 1.922}, {'end': 1310.522, 'text': 'Lots of very sophisticated, very complex algorithms to try and attack that problem.', 'start': 1307.22, 'duration': 3.302}, {'end': 1320.048, 'text': 'And Google came in and by just looking at n-grams and correlations of them between different corpora of text they have across all of the web index they have,', 'start': 1311.002, 'duration': 9.046}], 'summary': 'Simple algorithms with lots of data can outperform complex algorithms. 
Google used n-grams for language translation.', 'duration': 30.478, 'max_score': 1289.57, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1289570.jpg'}, {'end': 1373.277, 'src': 'embed', 'start': 1344.159, 'weight': 3, 'content': [{'end': 1345.5, 'text': "And I don't want to see data dying anymore.", 'start': 1344.159, 'duration': 1.341}, {'end': 1348.122, 'text': "So that's one of the things that Hadoop allows you to do.", 'start': 1345.58, 'duration': 2.542}, {'end': 1351.745, 'text': 'It brings down the economics of storage by orders of magnitude.', 'start': 1348.362, 'duration': 3.383}, {'end': 1362.214, 'text': 'So now, before I jump into the internals of Hadoop, I wanted to very quickly show you which cases Hadoop is the right tool for,', 'start': 1352.631, 'duration': 9.583}, {'end': 1364.274, 'text': 'versus which cases the relational database is the right tool for.', 'start': 1362.214, 'duration': 2.06}, {'end': 1370.256, 'text': 'And the analogy I use here is the analogy of a sports car, not necessarily because sports cars are expensive,', 'start': 1365.174, 'duration': 5.082}, {'end': 1373.277, 'text': 'though that is part of it, versus a freight train.', 'start': 1370.256, 'duration': 3.021}], 'summary': 'Hadoop reduces storage costs, making it ideal for big data, while relational databases are suitable for specific cases.', 'duration': 29.118, 'max_score': 1344.159, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1344159.jpg'}, {'end': 1463.75, 'src': 'embed', 'start': 1431.244, 'weight': 4, 'content': [{'end': 1432.765, 'text': 'So latency, these guys win.', 'start': 1431.244, 'duration': 1.521}, {'end': 1434.826, 'text': 'Throughput, Hadoop wins.', 'start': 1433.065, 'duration': 1.761}, {'end': 1441.914, 'text': 'Second. 
so Hadoop can do interactive OLAP, but not at the same speed that these systems can because of all the optimizations they are able to do,', 'start': 1436.69, 'duration': 5.224}, {'end': 1448.359, 'text': 'because they paid the cost of parsing the data at load time, as opposed to Hadoop, which is paying it at read time.', 'start': 1441.914, 'duration': 6.445}, {'end': 1450.841, 'text': 'So obviously, for the big queries, the cost is amortized.', 'start': 1448.499, 'duration': 2.342}, {'end': 1453.443, 'text': 'And hence, it starts to win for the bigger queries.', 'start': 1451.242, 'duration': 2.201}, {'end': 1455.765, 'text': 'Multi-step ACID transactions.', 'start': 1454.404, 'duration': 1.361}, {'end': 1457.066, 'text': "Hadoop can't even touch that.", 'start': 1455.805, 'duration': 1.261}, {'end': 1463.75, 'text': "So if you're doing banking transactions, moving money between accounts, and you're going to say begin statement and you're doing select a cursor,", 'start': 1457.186, 'duration': 6.564}], 'summary': 'Latency: these systems win. throughput: hadoop wins. hadoop for big queries, but not for multi-step transactions.', 'duration': 32.506, 'max_score': 1431.244, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1431244.jpg'}], 'start': 1011.572, 'title': 'Hadoop ecosystem and data scalability', 'summary': 'Covers accessing original data within the hadoop ecosystem using various languages and tools. 
it also discusses the scalability of the hadoop system and the trade-off between speed and flexibility in data processing.', 'chapters': [{'end': 1214.797, 'start': 1011.572, 'title': 'Accessing hadoop ecosystem', 'summary': 'Outlines the foundational movement of accessing original, raw, high fidelity data within the hadoop ecosystem, highlighting the flexibility of using various languages and the spectrum of tools, including java mapreduce, streaming mapreduce, pig latin, hive, and oozie, to run algorithms and create data pipelines.', 'duration': 203.225, 'highlights': ['Java MapReduce provides the most performance and flexibility within Hadoop ecosystem, but requires expertise and may involve longer development time.', 'The flexibility of using different languages such as Java, Python, Perl, C++, Ruby expands the appeal of Hadoop ecosystem.', 'Pig Latin provides a higher abstraction language for building data pipelines, abstracting complex concepts and converting scripts into MapReduce.', 'Hive, a system from Facebook, enables the use of SQL and converts it into MapReduce, with the ability to plug in custom MapReduce functions and user-defined aggregates.', 'Oozie provides a high-level abstraction for creating workflows of jobs, linking various tools like Hive, Pig, MapReduce, and has the capability of retrying failed jobs.']}, {'end': 1364.274, 'start': 1217.185, 'title': 'Scalability and unreasonable effectiveness of data', 'summary': "Discusses the scalability of the hadoop system, highlighting its ability to automatically grow by repartitioning and redistributing data, as well as the concept of 'data beats algorithm' illustrated by google's approach to natural language translation.", 'duration': 147.089, 'highlights': ['The Hadoop system is scalable, allowing for automatic growth by repartitioning and redistributing data as nodes are added, facilitating the parallelization of jobs and eliminating the need for program redesign (quantifiable: automatic growth, 
parallelization of jobs, no program redesign).', "Google's approach to natural language translation using simple algorithms and massive amounts of data demonstrates the concept of 'data beats algorithm', emphasizing the potential of leveraging large datasets to surpass sophisticated algorithms in various domains such as finance and biotech (quantifiable: use of massive data for translation, surpassing sophisticated algorithms).", 'Hadoop brings down the economics of storage, enabling significant cost reductions and preventing data from becoming economically unsustainable, emphasizing its role in preserving data viability and longevity (quantifiable: cost reduction, data viability preservation).']}, {'end': 1534.781, 'start': 1365.174, 'title': 'Data processing: speed vs flexibility', 'summary': 'Explains the trade-off between speed and flexibility in data processing, where hadoop excels in throughput but lags in latency compared to traditional relational database systems, while also highlighting the importance of choosing the right tool for specific data processing needs.', 'duration': 169.607, 'highlights': ['Traditional relational database systems excel in latency for interactive OLAP and multi-step ACID transactions, while Hadoop outperforms in throughput and flexibility for unstructured data processing.', "A sports car's acceleration compared to a freight train illustrates the difference in speed between the two systems, with traditional relational database systems being able to achieve millisecond query times for certain operations, while Hadoop may take seconds or minutes, but excels in larger queries that would take hours or days in relational database systems.", "The importance of using the right tool for specific data processing needs is emphasized through the analogy of touching one's ear, where the efficient way represents the right tool and the inefficient way represents the wrong tool for the job."]}], 'duration': 523.209, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1011572.jpg', 'highlights': ['Java MapReduce provides the most performance and flexibility within Hadoop ecosystem, but requires expertise and may involve longer development time.', 'The Hadoop system is scalable, allowing for automatic growth by repartitioning and redistributing data as nodes are added, facilitating the parallelization of jobs and eliminating the need for program redesign (quantifiable: automatic growth, parallelization of jobs, no program redesign).', "Google's approach to natural language translation using simple algorithms and massive amounts of data demonstrates the concept of 'data beats algorithm', emphasizing the potential of leveraging large datasets to surpass sophisticated algorithms in various domains such as finance and biotech (quantifiable: use of massive data for translation, surpassing sophisticated algorithms).", 'Hadoop brings down the economics of storage, enabling significant cost reductions and preventing data from becoming economically unsustainable, emphasizing its role in preserving data viability and longevity (quantifiable: cost reduction, data viability preservation).', 'Traditional relational database systems excel in latency for interactive OLAP and multi-step asset transactions, while Hadoop outperforms in throughput and flexibility for unstructured data processing.']}, {'end': 2139.337, 'segs': [{'end': 1703.879, 'src': 'heatmap', 'start': 1564.462, 'weight': 0.898, 'content': [{'end': 1573.169, 'text': 'Last but not least, nothing, and I say this with confidence, except for Google, nothing scales to the same level that Hadoop scales today.', 'start': 1564.462, 'duration': 8.707}, {'end': 1576.872, 'text': "Nothing, whether commercial or open source, other than Google's internal infrastructure.", 'start': 1573.349, 'duration': 3.523}, {'end': 1585.72, 'text': 'And that exists in deployments that range from Yahoo having more 
than 40,000 servers running in Hadoop clusters, Facebook having a single namespace,', 'start': 1577.933, 'duration': 7.787}, {'end': 1588.262, 'text': 'single file system with 70 petabytes in it.', 'start': 1585.72, 'duration': 2.542}, {'end': 1590.624, 'text': 'Nothing is as big as that, proven to run as big as that.', 'start': 1588.662, 'duration': 1.962}, {'end': 1595.789, 'text': "So with that now, I'm going to start jumping into the internals of Hadoop itself.", 'start': 1592.806, 'duration': 2.983}, {'end': 1598.89, 'text': 'HDFS, the Hadoop Distributed File System.', 'start': 1597.25, 'duration': 1.64}, {'end': 1600.991, 'text': "And again, I'm giving a very quick overview.", 'start': 1599.171, 'duration': 1.82}, {'end': 1602.611, 'text': 'This is a very introductory talk.', 'start': 1601.491, 'duration': 1.12}, {'end': 1607.813, 'text': 'If you guys want to hear more, maybe you should invite the Google guys over and they can tell you at way more depth than I can.', 'start': 1602.792, 'duration': 5.021}, {'end': 1614.655, 'text': 'But essentially, the idea of the distributed file system in very simple terms is a file comes in, you chop up that file into blocks.', 'start': 1608.453, 'duration': 6.202}, {'end': 1616.695, 'text': 'The default is 64 megabytes.', 'start': 1615.355, 'duration': 1.34}, {'end': 1622.637, 'text': 'And then you take these blocks and you spread them out through your infrastructure, your data nodes that store the data.', 'start': 1617.375, 'duration': 5.262}, {'end': 1626.949, 'text': 'And you replicate each block a number of times.', 'start': 1624.807, 'duration': 2.142}, {'end': 1628.93, 'text': 'The default is three across that infrastructure.', 'start': 1626.969, 'duration': 1.961}, {'end': 1634.014, 'text': 'The system is built and optimized for these kinds of operations.', 'start': 1630.131, 'duration': 3.883}, {'end': 1636.196, 'text': "It's not a fully POSIX-compliant file system.", 'start': 1634.074, 'duration': 2.122}, 
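The block-and-replicate scheme described here (chop a file into 64 MB blocks, spread them over the data nodes, keep three copies of each) can be sketched in a few lines of Python. This is a toy illustration only: the function names and the round-robin placement policy are invented for the sketch, while real HDFS placement is rack-aware and decided by the NameNode.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # the HDFS default block size mentioned in the talk
REPLICATION = 3                # the HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the file, like HDFS block boundaries."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes by cycling the node list.

    Real HDFS is rack-aware (one replica near the writer, others on a remote
    rack); this toy policy only guarantees distinct nodes per block.
    """
    ring = itertools.cycle(data_nodes)
    return {i: [next(ring) for _ in range(replication)]
            for i in range(len(blocks))}
```

A 200 MB file, for instance, becomes four blocks (three full 64 MB blocks plus an 8 MB remainder), each mapped to three distinct data nodes, which is what lets several jobs read the same block at full local-disk throughput.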
{'end': 1638.498, 'text': "It's optimized for throughput, not for latency.", 'start': 1636.276, 'duration': 2.222}, {'end': 1645.884, 'text': "It's optimized for you can put a file, you can get the file, and open the file and scan the file, and you can delete the file.", 'start': 1639.139, 'duration': 6.745}, {'end': 1647.806, 'text': "But you can't insert stuff in the middle of the file.", 'start': 1646.144, 'duration': 1.662}, {'end': 1649.307, 'text': 'You cannot update stuff in the middle of the file.', 'start': 1647.826, 'duration': 1.481}, {'end': 1650.628, 'text': 'You cannot do that with Hadoop.', 'start': 1649.327, 'duration': 1.301}, {'end': 1651.629, 'text': 'You have to create another file.', 'start': 1650.648, 'duration': 0.981}, {'end': 1653.632, 'text': 'and do that as you create the other file.', 'start': 1652.25, 'duration': 1.382}, {'end': 1655.134, 'text': 'That said, you can do appends.', 'start': 1654.073, 'duration': 1.061}, {'end': 1656.797, 'text': 'So you can append stuff at the end of files.', 'start': 1655.274, 'duration': 1.523}, {'end': 1662.145, 'text': "So you can see that the problem, and with these assumptions, that's why we're able to achieve the scalability that we achieve.", 'start': 1656.877, 'duration': 5.268}, {'end': 1666.137, 'text': "The block replication, so some people would look at the replication and say, it's very wasteful.", 'start': 1663.055, 'duration': 3.082}, {'end': 1669.859, 'text': 'Why create three blocks? 
We can use RAID-like erasure coding techniques and be way more efficient.', 'start': 1666.177, 'duration': 3.682}, {'end': 1672.02, 'text': 'True. For durability, that is true.', 'start': 1670.64, 'duration': 1.38}, {'end': 1676.643, 'text': 'So for durability, you are able to achieve it at a much lower cost than doing 3x.', 'start': 1672.461, 'duration': 4.182}, {'end': 1679.525, 'text': 'However, for availability and throughput, you are not.', 'start': 1677.103, 'duration': 2.422}, {'end': 1687.069, 'text': 'By having that block repeated three times in three different machines, you can now have three different jobs leveraging the cores inside the machines,', 'start': 1680.305, 'duration': 6.764}, {'end': 1689.691, 'text': 'reading this data at the same time at full throughput from the local disks.', 'start': 1687.069, 'duration': 2.622}, {'end': 1697.195, 'text': 'So performance of running multi-tenants, multi-jobs in the same cluster served a lot by the replication.', 'start': 1690.651, 'duration': 6.544}, {'end': 1701.798, 'text': 'In fact, when you have a hot file, a new file that came in that you know a lot of people are going to want,', 'start': 1697.235, 'duration': 4.563}, {'end': 1703.879, 'text': 'you can set the replication for it to be much higher.', 'start': 1701.798, 'duration': 2.081}], 'summary': 'Hadoop scales to 70 petabytes, yahoo runs 40,000 servers, and facebook has a single file system with 70 petabytes. 
hdfs is optimized for throughput, not for latency.', 'duration': 139.417, 'max_score': 1564.462, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1564462.jpg'}, {'end': 1616.695, 'src': 'embed', 'start': 1577.933, 'weight': 0, 'content': [{'end': 1585.72, 'text': 'And that exists in deployments that range from Yahoo having more than 40, 000 servers running in Hadoop clusters, Facebook having a single namespace,', 'start': 1577.933, 'duration': 7.787}, {'end': 1588.262, 'text': 'single file system with 70 petabytes in it.', 'start': 1585.72, 'duration': 2.542}, {'end': 1590.624, 'text': 'Nothing is as big as that, proven to run as big as that.', 'start': 1588.662, 'duration': 1.962}, {'end': 1595.789, 'text': "So with that now, I'm going to start jumping into the internals of Hadoop itself.", 'start': 1592.806, 'duration': 2.983}, {'end': 1598.89, 'text': 'HDFS, the Hadoop Distributed File System.', 'start': 1597.25, 'duration': 1.64}, {'end': 1600.991, 'text': "And again, I'm giving a very quick overview.", 'start': 1599.171, 'duration': 1.82}, {'end': 1602.611, 'text': 'This is a very introductory talk.', 'start': 1601.491, 'duration': 1.12}, {'end': 1607.813, 'text': 'If you guys want to hear more, maybe you should invite the Google guys over and they can tell you at way more depth than I can.', 'start': 1602.792, 'duration': 5.021}, {'end': 1614.655, 'text': 'But essentially, the idea of the distributed file system in very simple terms is a file comes in, you chop up that file into blocks.', 'start': 1608.453, 'duration': 6.202}, {'end': 1616.695, 'text': 'The default is 64 megabytes.', 'start': 1615.355, 'duration': 1.34}], 'summary': 'Yahoo has over 40,000 servers running hadoop, facebook has a single file system with 70 petabytes.', 'duration': 38.762, 'max_score': 1577.933, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1577933.jpg'}, {'end': 
1722.793, 'src': 'embed', 'start': 1687.069, 'weight': 3, 'content': [{'end': 1689.691, 'text': 'reading this data at the same time at full throughput from the local disks.', 'start': 1687.069, 'duration': 2.622}, {'end': 1697.195, 'text': 'So performance of running multi-tenants, multi-jobs in the same cluster served a lot by the replication.', 'start': 1690.651, 'duration': 6.544}, {'end': 1701.798, 'text': 'In fact, when you have a hot file, a new file that came in that you know a lot of people are going to want,', 'start': 1697.235, 'duration': 4.563}, {'end': 1703.879, 'text': 'you can set the replication for it to be much higher.', 'start': 1701.798, 'duration': 2.081}, {'end': 1705.78, 'text': 'You can actually do that selectively on a per file basis.', 'start': 1704.159, 'duration': 1.621}, {'end': 1706.921, 'text': 'So I say, this file is hot.', 'start': 1705.8, 'duration': 1.121}, {'end': 1708.882, 'text': "Let's make 10 replicas of every block in that file.", 'start': 1706.941, 'duration': 1.941}, {'end': 1711.184, 'text': 'So now I can have multiple users accessing it at full speed.', 'start': 1708.902, 'duration': 2.282}, {'end': 1713.564, 'text': 'Last but not least, availability.', 'start': 1712.102, 'duration': 1.462}, {'end': 1722.793, 'text': 'By having these replicas spread out in more than one machine and across racks as well, we protect against single machine failure,', 'start': 1714.164, 'duration': 8.629}], 'summary': 'Using replication for performance and availability in multi-tenant clusters.', 'duration': 35.724, 'max_score': 1687.069, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1687069.jpg'}, {'end': 1938.84, 'src': 'heatmap', 'start': 1795.163, 'weight': 0.849, 'content': [{'end': 1796.986, 'text': 'Any other HDFS questions? 
Cool.', 'start': 1795.163, 'duration': 1.823}, {'end': 1802.435, 'text': 'Key points, very scalable, very distributed, highly, highly available.', 'start': 1798.168, 'duration': 4.267}, {'end': 1805.139, 'text': "And I'll go over the architecture of the subsystems in a future slide.", 'start': 1802.455, 'duration': 2.684}, {'end': 1809.12, 'text': 'Yes The block placement? That could be the bottleneck.', 'start': 1805.24, 'duration': 3.88}, {'end': 1810.46, 'text': "Yes I'll talk about that.", 'start': 1809.16, 'duration': 1.3}, {'end': 1811.401, 'text': 'Yes Very good point.', 'start': 1810.52, 'duration': 0.881}, {'end': 1812.662, 'text': "I'll talk about that in a little bit.", 'start': 1811.481, 'duration': 1.181}, {'end': 1813.042, 'text': 'Thank you.', 'start': 1812.702, 'duration': 0.34}, {'end': 1814.923, 'text': 'So this is the MapReduce.', 'start': 1813.842, 'duration': 1.081}, {'end': 1817.484, 'text': 'As I said earlier in the talk, MapReduce has a dual nature.', 'start': 1814.983, 'duration': 2.501}, {'end': 1822.827, 'text': 'There is the MapReduce framework for making us, as developers,', 'start': 1817.704, 'duration': 5.123}, {'end': 1832.112, 'text': 'think in a parallel way and having a prescriptive development framework imposed on us so we can write jobs that scale without having to rewrite them.', 'start': 1822.827, 'duration': 9.285}, {'end': 1838.375, 'text': 'But there is also MapReduce, the execution engine, the resource manager that manages lots of resources and executes them.', 'start': 1833.012, 'duration': 5.363}, {'end': 1839.735, 'text': "I'm going to cover both.", 'start': 1838.755, 'duration': 0.98}, {'end': 1840.776, 'text': 'So first, the framework.', 'start': 1839.995, 'duration': 0.781}, {'end': 1842.717, 'text': 'And again, you should invite the guys from Google.', 'start': 1841.056, 'duration': 1.661}, {'end': 1844.197, 'text': 'They can tell you way more about this than I can.', 'start': 1842.737, 'duration': 1.46}, {'end': 
1848.54, 'text': 'But at a very high level, you as developer, you only write two little functions.', 'start': 1844.618, 'duration': 3.922}, {'end': 1851.101, 'text': 'You write something called a map function and a reduce function.', 'start': 1848.8, 'duration': 2.301}, {'end': 1852.341, 'text': "And that's all you write.", 'start': 1851.621, 'duration': 0.72}, {'end': 1853.942, 'text': 'And the system takes care of everything else.', 'start': 1852.622, 'duration': 1.32}, {'end': 1859.825, 'text': 'It takes care of the distribution, of the fault tolerance, of the aggregation, the shuffle, the sorting.', 'start': 1854.482, 'duration': 5.343}, {'end': 1861.366, 'text': 'All these things are taken care of for you.', 'start': 1859.865, 'duration': 1.501}, {'end': 1866.752, 'text': 'So a very simple example here: if you have a bunch of documents and you want to count the frequency of words across the documents,', 'start': 1861.946, 'duration': 4.806}, {'end': 1869.956, 'text': 'then each one of your mappers, the system will give it part of the data automatically.', 'start': 1866.752, 'duration': 3.204}, {'end': 1871.939, 'text': "It will figure out which, depending on the file name you're going in.", 'start': 1869.976, 'duration': 1.963}, {'end': 1874.484, 'text': 'And then you will take the part of data.', 'start': 1872.743, 'duration': 1.741}, {'end': 1879.565, 'text': 'Your job now as a single function that you wrote here is just to count each one of these words and how many times they showed up.', 'start': 1874.504, 'duration': 5.061}, {'end': 1881.326, 'text': 'You spit out these words.', 'start': 1880.366, 'duration': 0.96}, {'end': 1883.307, 'text': 'So you have the word and the number of times you saw it.', 'start': 1881.346, 'duration': 1.961}, {'end': 1884.667, 'text': 'There is a hashing.', 'start': 1884.007, 'duration': 0.66}, {'end': 1889.749, 'text': 'There is a consistent hashing algorithm now that will take these words and will make sure that, based on this key,' 
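The word-count walkthrough here can be condensed into a single-process Python sketch: the map function emits (word, 1) pairs, a hash partitioner routes every occurrence of the same key to the same reducer, and the reduce function sums the counts. The in-memory shuffle below is an assumption of the sketch; real Hadoop sorts the pairs and moves them between machines.

```python
from collections import defaultdict

def map_fn(document):
    # Mapper: emit a (word, 1) pair for every word in this mapper's slice of data.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reducer: every pair sharing this key arrives at the same reducer, so just sum.
    return word, sum(counts)

def run_word_count(documents, num_reducers=2):
    # Shuffle: hash each key to a reducer partition, mimicking the hashing step
    # described in the talk (same word -> same reducer).
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for doc in documents:
        for word, count in map_fn(doc):
            partitions[hash(word) % num_reducers][word].append(count)
    # Reduce phase: each partition is processed independently, as separate
    # reducer tasks would be on a cluster.
    result = {}
    for partition in partitions:
        for word, counts in partition.items():
            key, total = reduce_fn(word, counts)
            result[key] = total
    return result
```

Because the framework owns the splitting, shuffling, and fault tolerance, the developer really does supply only the two small functions, which is the point being made about interns scaling to thousands of machines.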
'start': 1884.687, 'duration': 5.062}, {'end': 1894.953, 'text': 'that the words that have the same key go to the same reducer function that you have on the other side.', 'start': 1889.749, 'duration': 5.204}, {'end': 1896.494, 'text': 'So you can start your aggregation phase.', 'start': 1895.213, 'duration': 1.281}, {'end': 1902.061, 'text': "And then in your aggregation phase you have a very simple reducer function that says as long as I'm seeing b, b, b, b, b.", 'start': 1896.955, 'duration': 5.106}, {'end': 1904.264, 'text': "I'm just going to add up the numbers I'm seeing and then I'll get the total count.", 'start': 1902.061, 'duration': 2.203}, {'end': 1909.17, 'text': "So that's the MapReduce model at a very, very high level simple explanation of it.", 'start': 1905.245, 'duration': 3.925}, {'end': 1913.693, 'text': "What that gives you is now, you as a developer, you don't have to worry about the scalability.", 'start': 1910.211, 'duration': 3.482}, {'end': 1915.054, 'text': 'You solve this very simple problem.', 'start': 1913.833, 'duration': 1.221}, {'end': 1918.436, 'text': "You write this very simple function here, there, and you're done.", 'start': 1915.114, 'duration': 3.322}, {'end': 1922.118, 'text': "And Google would brag about this all the time and say we'll have interns come in.", 'start': 1918.996, 'duration': 3.122}, {'end': 1927.141, 'text': "Within a week, they're writing algorithms running on thousands and thousands of machines, which is very true.", 'start': 1922.138, 'duration': 5.003}, {'end': 1933.475, 'text': 'Now, this is the MapReduce resource manager and scheduler, which is the other half of MapReduce.', 'start': 1928.43, 'duration': 5.045}, {'end': 1938.84, 'text': "But actually in the new version of Hadoop we're splitting these two out from each other the MapReduce framework from the MapReduce resource manager.", 'start': 1933.915, 'duration': 4.925}], 'summary': 'Mapreduce simplifies parallel processing, with google interns 
running algorithms on thousands of machines within a week.', 'duration': 143.677, 'max_score': 1795.163, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1795163.jpg'}, {'end': 1871.939, 'src': 'embed', 'start': 1842.737, 'weight': 2, 'content': [{'end': 1844.197, 'text': 'They can tell you way more about this than I can.', 'start': 1842.737, 'duration': 1.46}, {'end': 1848.54, 'text': 'But at a very high level, you as developer, you only write two little functions.', 'start': 1844.618, 'duration': 3.922}, {'end': 1851.101, 'text': 'You write something called a map function and a reduce function.', 'start': 1848.8, 'duration': 2.301}, {'end': 1852.341, 'text': "And that's all you write.", 'start': 1851.621, 'duration': 0.72}, {'end': 1853.942, 'text': 'And the system takes care of everything else.', 'start': 1852.622, 'duration': 1.32}, {'end': 1859.825, 'text': 'It takes care of the distribution, of the fault tolerance, of the aggregation, the shuffle, the sorting.', 'start': 1854.482, 'duration': 5.343}, {'end': 1861.366, 'text': 'All these things are taken care of for you.', 'start': 1859.865, 'duration': 1.501}, {'end': 1866.752, 'text': 'So a very simple example here if you have a bunch of documents and you want to count the frequency of word of documents,', 'start': 1861.946, 'duration': 4.806}, {'end': 1869.956, 'text': 'then each one of your mappers, the system will give it part of the data automatically.', 'start': 1866.752, 'duration': 3.204}, {'end': 1871.939, 'text': "It will figure out which, depending on the file name you're going in.", 'start': 1869.976, 'duration': 1.963}], 'summary': 'Developers write two functions, map and reduce, and system handles distribution, fault tolerance, aggregation, shuffle, and sorting.', 'duration': 29.202, 'max_score': 1842.737, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1842737.jpg'}], 'start': 
1535.261, 'title': 'Hadoop and hdfs overview', 'summary': 'Discusses the capabilities of hadoop, including its support for complex data processing with various programming languages, its immense scalability demonstrated by yahoo and facebook, and an overview of the hadoop distributed file system (hdfs) focusing on its block replication strategy and its impact on availability and throughput. it also provides an overview of hadoop mapreduce and hdfs, highlighting the scalable and distributed nature of hdfs, the simplicity of the mapreduce framework, and the fault tolerance and failure recovery mechanisms of mapreduce.', 'chapters': [{'end': 1784.854, 'start': 1535.261, 'title': 'Hadoop and hdfs overview', 'summary': 'Discusses the capabilities of hadoop, including its support for complex data processing with various programming languages, its immense scalability demonstrated by yahoo and facebook, and an overview of the hadoop distributed file system (hdfs) focusing on its block replication strategy and its impact on availability and throughput.', 'duration': 249.593, 'highlights': ["Hadoop's immense scalability is demonstrated by Yahoo's 40,000 servers running in Hadoop clusters and Facebook's single namespace with 70 petabytes, emphasizing its unmatched scale in both commercial and open source systems.", 'Hadoop Distributed File System (HDFS) overview includes its block replication strategy, default block size of 64 megabytes, and its optimization for throughput rather than latency, allowing file operations such as put, get, open, and scan but not allowing updates or inserts in the middle of a file.', 'The impact of block replication on availability and throughput is discussed, highlighting the benefits of replication for performance, multi-tenancy, and availability, as well as the trade-offs in durability and efficiency compared to erasure coding.']}, {'end': 2139.337, 'start': 1784.894, 'title': 'Hadoop mapreduce and hdfs overview', 'summary': 'Provides an 
overview of hadoop mapreduce and hdfs, highlighting the scalable and distributed nature of hdfs, the simplicity of the mapreduce framework, and the fault tolerance and failure recovery mechanisms of mapreduce.', 'duration': 354.443, 'highlights': ['HDFS is very scalable, distributed, and highly available.', 'The MapReduce framework simplifies development by handling distribution, fault tolerance, aggregation, shuffle, and sorting automatically.', 'MapReduce is optimized for batch processing and failure recovery, constantly monitoring and optimizing tasks for fault tolerance.']}], 'duration': 604.076, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI1535261.jpg', 'highlights': ["Hadoop's immense scalability demonstrated by Yahoo's 40,000 servers and Facebook's 70 petabytes.", 'HDFS overview includes block replication strategy and default block size of 64 megabytes.', 'MapReduce framework simplifies development by handling distribution, fault tolerance, aggregation, shuffle, and sorting automatically.', 'The impact of block replication on availability and throughput is discussed, highlighting the benefits of replication for performance, multi-tenancy, and availability.']}, {'end': 2561.442, 'segs': [{'end': 2223.748, 'src': 'embed', 'start': 2189.143, 'weight': 1, 'content': [{'end': 2191.305, 'text': "Again, I'm multiplying by 10 and stuff like that.", 'start': 2189.143, 'duration': 2.162}, {'end': 2192.666, 'text': 'So just make the math simple.', 'start': 2191.325, 'duration': 1.341}, {'end': 2197.909, 'text': 'A server today, typically, a typical server for Hadoop has 12 hard disks at 2 terabytes each.', 'start': 2193.206, 'duration': 4.703}, {'end': 2199.17, 'text': "That's 24 terabytes of storage.", 'start': 2197.929, 'duration': 1.241}, {'end': 2201.252, 'text': 'And the cost is not that much.', 'start': 2200.271, 'duration': 0.981}, {'end': 2203.433, 'text': "It's a few thousand dollars to get a server 
like that.", 'start': 2201.372, 'duration': 2.061}, {'end': 2206.075, 'text': "That's 1.2 gigabytes per second.", 'start': 2204.134, 'duration': 1.941}, {'end': 2208.557, 'text': "And that's not maxing out the SATA bus speeds and the PCI.", 'start': 2206.095, 'duration': 2.462}, {'end': 2211.759, 'text': "There's room to grow even more than that if the hard disk speeds grow more than this.", 'start': 2208.577, 'duration': 3.182}, {'end': 2214.041, 'text': "That's 12 gigabits per second.", 'start': 2212.7, 'duration': 1.341}, {'end': 2216.723, 'text': "So that's barely now touching the 10G speeds.", 'start': 2214.081, 'duration': 2.642}, {'end': 2218.944, 'text': "But now we're going to take the server.", 'start': 2217.483, 'duration': 1.461}, {'end': 2220.605, 'text': "We're going to have a rack of them, which is 20 servers.", 'start': 2218.964, 'duration': 1.641}, {'end': 2223.748, 'text': 'And that rack is going to be 240 gigabytes per second.', 'start': 2221.166, 'duration': 2.582}], 'summary': 'A typical hadoop server has 24tb storage, costing a few thousand dollars, with a rack of 20 servers achieving 240gb/s.', 'duration': 34.605, 'max_score': 2189.143, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2189143.jpg'}, {'end': 2282.593, 'src': 'embed', 'start': 2244.162, 'weight': 0, 'content': [{'end': 2246.965, 'text': 'And then if you have a big cluster, actually the cluster size is doubling every year.', 'start': 2244.162, 'duration': 2.803}, {'end': 2250.509, 'text': 'Like when we did a survey last year, cluster sizes were about 60 nodes.', 'start': 2247.005, 'duration': 3.504}, {'end': 2251.53, 'text': "Now they're 120 nodes.", 'start': 2250.589, 'duration': 0.941}, {'end': 2254.012, 'text': 'So people are putting more and more data in their systems.', 'start': 2251.77, 'duration': 2.242}, {'end': 2259.017, 'text': 'Large clusters, which are the 4, 000 kind of node clusters, can get you 48 terabytes 
per second.', 'start': 2254.493, 'duration': 4.524}, {'end': 2263.462, 'text': 'So the whole point is the scalability of doing this divide and conquer parallelism.', 'start': 2259.178, 'duration': 4.284}, {'end': 2265.404, 'text': "You just can't beat that with a single pipe.", 'start': 2263.482, 'duration': 1.922}, {'end': 2265.844, 'text': "You just can't.", 'start': 2265.424, 'duration': 0.42}, {'end': 2270.63, 'text': 'And just an example here, 4.8 terabytes is not that much data.', 'start': 2266.909, 'duration': 3.721}, {'end': 2273.011, 'text': 'So this is 4.8 terabytes.', 'start': 2271.15, 'duration': 1.861}, {'end': 2277.972, 'text': 'You can fit 4.8 terabytes on your hard disk, on your laptop.', 'start': 2273.051, 'duration': 4.921}, {'end': 2282.593, 'text': "However, if you try to scan 4.8 terabytes using your laptop, you're going to take 13 hours.", 'start': 2278.252, 'duration': 4.341}], 'summary': 'Cluster size doubles yearly, 48tb/sec for 4000-node clusters, 4.8tb scan takes 13 hours.', 'duration': 38.431, 'max_score': 2244.162, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2244162.jpg'}, {'end': 2357.234, 'src': 'embed', 'start': 2331.52, 'weight': 4, 'content': [{'end': 2335.983, 'text': "Yeah So, yeah, I mean, that's an argument that I hear a lot.", 'start': 2331.52, 'duration': 4.463}, {'end': 2336.444, 'text': 'Two answers.', 'start': 2336.023, 'duration': 0.421}, {'end': 2341.148, 'text': 'There is a trend moving more towards power-efficient CPUs, obviously, with Atom and ARM and so on.', 'start': 2336.644, 'duration': 4.504}, {'end': 2346.712, 'text': "There's designs now being built specifically for this type of workload to try and turn off the CPUs from power consumption point of view.", 'start': 2341.168, 'duration': 5.544}, {'end': 2349.715, 'text': 'The parts of the CPU are not using them and so on.', 'start': 2346.772, 'duration': 2.943}, {'end': 2351.977, 'text': "Still very 
early, but we'll get there.', 'start': 2350.556, 'duration': 1.421}, {'end': 2353.338, 'text': 'The other argument.', 'start': 2352.457, 'duration': 0.881}, {'end': 2355.653, 'text': "I'm sorry.", 'start': 2355.293, 'duration': 0.36}, {'end': 2356.914, 'text': 'Say again? Yes.', 'start': 2355.813, 'duration': 1.101}, {'end': 2357.234, 'text': 'Go ahead.', 'start': 2356.954, 'duration': 0.28}], 'summary': 'Trend towards power-efficient cpus like atom and arm for reducing power consumption in cpu design.', 'duration': 25.714, 'max_score': 2331.52, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2331520.jpg'}, {'end': 2477.014, 'src': 'embed', 'start': 2447.164, 'weight': 5, 'content': [{'end': 2454.048, 'text': 'compared to a standard storage area network (SAN) solution is an order of magnitude less when you factor in everything.', 'start': 2447.164, 'duration': 6.884}, {'end': 2455.629, 'text': "So that's the key today.", 'start': 2454.548, 'duration': 1.081}, {'end': 2457.911, 'text': "That might change in the future, but that's how it is today.", 'start': 2456.17, 'duration': 1.741}, {'end': 2462.354, 'text': "Because of the system and other things have been done, you don't run anything idle.", 'start': 2457.931, 'duration': 4.423}, {'end': 2466.456, 'text': 'These clusters run hot and are kept.', 'start': 2462.654, 'duration': 3.802}, {'end': 2477.014, 'text': 'So would they run hot in both dimensions? 
Data has to stay on, but it may be not so much.', 'start': 2466.537, 'duration': 10.477}], 'summary': 'Aeronautics storage solution is an order of magnitude less compared to standard storage, with efficient system utilization.', 'duration': 29.85, 'max_score': 2447.164, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2447164.jpg'}, {'end': 2546.937, 'src': 'embed', 'start': 2510.794, 'weight': 6, 'content': [{'end': 2513.975, 'text': 'and how can we proactively turn parts of the system on and off, and so on.', 'start': 2510.794, 'duration': 3.181}, {'end': 2516.116, 'text': "So yes, there's lots of room for improvements.", 'start': 2514.435, 'duration': 1.681}, {'end': 2518.976, 'text': 'However, the economics of it today still beats the other options.', 'start': 2516.356, 'duration': 2.62}, {'end': 2520.577, 'text': "So I'm agreeing with you.", 'start': 2519.876, 'duration': 0.701}, {'end': 2522.517, 'text': "I'm just trying to show.", 'start': 2521.717, 'duration': 0.8}, {'end': 2530.504, 'text': 'So it does make sense to have the concept of a in the concept of a Hadoop cluster, of a storage-only node or a compute-only node.', 'start': 2523.055, 'duration': 7.449}, {'end': 2533.988, 'text': "Not every node needs- Storage-heavy node versus, so yes, there's different workloads.", 'start': 2530.544, 'duration': 3.444}, {'end': 2540.295, 'text': "There's workloads which are more towards, I want to store, and I want to have my data alive, and disks start spinning.", 'start': 2534.008, 'duration': 6.287}, {'end': 2543.036, 'text': "But I'm not necessarily running very heavy computes.", 'start': 2540.975, 'duration': 2.061}, {'end': 2544.556, 'text': 'And there is jobs that are more about computes.', 'start': 2543.116, 'duration': 1.44}, {'end': 2546.937, 'text': 'So an example I love to give is a company called eHarmony.', 'start': 2544.596, 'duration': 2.341}], 'summary': 'Economics favor system; potential for 
storage-only or compute-only nodes; diverse workloads exist.', 'duration': 36.143, 'max_score': 2510.794, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2510794.jpg'}], 'start': 2139.337, 'title': 'Network speeds and power efficiency', 'summary': 'Discusses the rapid growth of network speeds, increasing density of hard disks, and scalability of clusters, highlighting the ability to process 4.8 terabytes of data in a second. it also covers challenges and trends in power efficiency for co-locating compute and storage, shift towards power-efficient cpus, and economics of hadoop clusters in terms of power consumption and price, emphasizing potential separate storage-heavy and compute-heavy nodes.', 'chapters': [{'end': 2303.282, 'start': 2139.337, 'title': 'Scalability of network speeds', 'summary': 'Discusses the rapid growth of network speeds, the increasing density of hard disks, and the scalability of clusters, highlighting the ability to process 4.8 terabytes of data in a second in a large cluster.', 'duration': 163.945, 'highlights': ['The scalability of clusters is emphasized, with large clusters of 4,000 nodes capable of processing 48 terabytes per second, showcasing the ability to handle large amounts of data at high speeds.', 'The increasing density of hard disks and cores per server allows for locality, with a server having 24 terabytes of storage and processing at 1.2 gigabytes per second, demonstrating the efficient utilization of resources.', 'Network speeds are growing significantly faster than hard disk speeds, with a rack of 20 servers capable of reaching 240 gigabytes per second, highlighting the rapid advancements in network technology.', 'The comparison of scanning 4.8 terabytes using a laptop taking 13 hours versus finishing it in a second in a large cluster illustrates the scalability and speed of managing data in a clustered environment, emphasizing the efficiency of parallel 
processing.']}, {'end': 2561.442, 'start': 2304.043, 'title': 'Power efficiency in compute and storage', 'summary': 'Discusses the challenges and trends in power efficiency for co-locating compute and storage, the shift towards power-efficient cpus, the increasing speed of networks, and the economics of hadoop clusters in terms of power consumption and price, highlighting the potential for separate storage-heavy and compute-heavy nodes.', 'duration': 257.399, 'highlights': ['The trend is moving towards power-efficient CPUs, including designs specifically built for minimizing power consumption, such as Atom and ARM processors.', 'Networks are advancing with terabit speeds and the potential for multi-fiber setups, despite current cost considerations.', 'The economics of Hadoop clusters today offer an order of magnitude less power consumption and cost compared to standard storage solutions, due to efficient utilization and the inclusion of extra cores for free.', 'There is room for improvements in power management for Hadoop clusters, but the current economics still favor Hadoop over other options.', "The concept of separate storage-heavy and compute-heavy nodes in Hadoop clusters is sensible due to different workload requirements, as exemplified by eHarmony's use case with a relatively small data set."]}], 'duration': 422.105, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2139337.jpg', 'highlights': ['Large clusters of 4,000 nodes can process 48 terabytes per second, showcasing high-speed data handling.', 'A server with 24 terabytes of storage processes at 1.2 gigabytes per second, demonstrating efficient resource utilization.', 'Network speeds are growing significantly faster than hard disk speeds, with a rack of 20 servers reaching 240 gigabytes per second.', 'Scanning 4.8 terabytes in a second in a large cluster illustrates scalability and speed of parallel processing.', 'Trend towards power-efficient 
CPUs, including Atom and ARM processors, is gaining momentum.', 'Economics of Hadoop clusters offer an order of magnitude less power consumption and cost compared to standard storage solutions.', 'Separate storage-heavy and compute-heavy nodes in Hadoop clusters are sensible due to different workload requirements.', 'Room for improvements in power management for Hadoop clusters, but current economics still favor Hadoop over other options.']}, {'end': 2920.876, 'segs': [{'end': 2676.782, 'src': 'heatmap', 'start': 2561.962, 'weight': 1, 'content': [{'end': 2567.744, 'text': 'And they still needed the solution like this because of the throughput they can get, even though the hard disks were only like 1% filled.', 'start': 2561.962, 'duration': 5.782}, {'end': 2570.885, 'text': "But it's the throughput that they wanted to get.", 'start': 2569.605, 'duration': 1.28}, {'end': 2572.906, 'text': "Don't forget there's two aspects.", 'start': 2570.905, 'duration': 2.001}, {'end': 2576.067, 'text': "There's the I-O throughput, but there's also the archival storage.", 'start': 2572.926, 'duration': 3.141}, {'end': 2582.849, 'text': "And the earlier discussion of the data going dead, if you could just keep it on the same spindles, even if you're not accessing it, it's there.", 'start': 2576.487, 'duration': 6.362}, {'end': 2584.289, 'text': "And that's hugely powerful.", 'start': 2583.269, 'duration': 1.02}, {'end': 2589.671, 'text': "Yeah, but the argument is, now can I turn off the CPUs on these nodes when they're not running? You can slow them down.", 'start': 2584.429, 'duration': 5.242}, {'end': 2602.52, 'text': "Do you have examples of the cost metrics for various I-O metric or throughput metric? For a typical cluster? 
I don't have that with me right now.", 'start': 2589.771, 'duration': 12.749}, {'end': 2605.543, 'text': "If you're really curious about that, we can send it offline.", 'start': 2602.84, 'duration': 2.703}, {'end': 2609.366, 'text': 'So this is the very high level architecture of Hadoop.', 'start': 2607.445, 'duration': 1.921}, {'end': 2613.691, 'text': 'And this is where somebody tried to get me to say Hadoop has a single point of failure, which it does.', 'start': 2609.407, 'duration': 4.284}, {'end': 2615.693, 'text': 'And I want to speak about that.', 'start': 2614.772, 'duration': 0.921}, {'end': 2617.495, 'text': "And I want to speak about what we're doing about it, which we are.", 'start': 2615.833, 'duration': 1.662}, {'end': 2623.482, 'text': "So, the data nodes, these are the servers, right? So, the actual servers, there's lots of these, right? There's hundreds, thousands of these.", 'start': 2617.975, 'duration': 5.507}, {'end': 2626.365, 'text': 'Each one of these servers run two subsystems.', 'start': 2623.942, 'duration': 2.423}, {'end': 2629.97, 'text': 'They run a data node subsystem and a task tracker subsystem.', 'start': 2626.806, 'duration': 3.164}, {'end': 2637.419, 'text': 'The data node subsystem, all it does is manage the blocks, which are really files in ext4 in Linux, or NTFS in Windows.', 'start': 2630.11, 'duration': 7.309}, {'end': 2641.321, 'text': "It just manages these blocks, and it doesn't know what's inside the blocks.", 'start': 2637.899, 'duration': 3.422}, {'end': 2642.762, 'text': "It doesn't know which file belongs to these blocks.", 'start': 2641.361, 'duration': 1.401}, {'end': 2643.763, 'text': 'It just has block ID, block.', 'start': 2642.782, 'duration': 0.981}, {'end': 2644.523, 'text': "That's all it does, right?", 'start': 2643.783, 'duration': 0.74}, {'end': 2651.747, 'text': 'And then it reports back to a central node called the name node, which maintains the namespace, the file allocation table, if you want,',
'start': 2644.943, 'duration': 6.804}, {'end': 2658.671, 'text': 'for the whole cluster, and knows the mapping between a file name to a block, to which server has the block, which servers have the block right?', 'start': 2651.747, 'duration': 6.924}, {'end': 2660.673, 'text': 'So the name node is responsible for doing that.', 'start': 2659.072, 'duration': 1.601}, {'end': 2661.633, 'text': "So that's on the storage side.", 'start': 2660.713, 'duration': 0.92}, {'end': 2662.974, 'text': 'So these are the HDFS subsystems.', 'start': 2661.673, 'duration': 1.301}, {'end': 2665.495, 'text': 'And then these are the MapReduce subsystems.', 'start': 2663.474, 'duration': 2.021}, {'end': 2671.019, 'text': 'You have a task tracker that is given a piece of Java, runs the Java encapsulated inside the JVM.', 'start': 2665.675, 'duration': 5.344}, {'end': 2676.782, 'text': 'And then there is a job tracker that essentially manages the job flow and hands out when there is free slots in different servers.', 'start': 2671.879, 'duration': 4.903}], 'summary': 'Hadoop architecture discussed, focusing on throughput and storage management.', 'duration': 53.731, 'max_score': 2561.962, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2561962.jpg'}, {'end': 2723.242, 'src': 'embed', 'start': 2698.29, 'weight': 0, 'content': [{'end': 2705.772, 'text': 'So the problem is these two subsystems today in the current version of Hadoop are a single point of failure.', 'start': 2698.29, 'duration': 7.482}, {'end': 2707.872, 'text': 'These are not.', 'start': 2707.411, 'duration': 0.461}, {'end': 2709.193, 'text': 'These can all go down.', 'start': 2708.172, 'duration': 1.021}, {'end': 2711.314, 'text': "There's lots of videos on the internet, on YouTube,", 'start': 2709.593, 'duration': 1.721}, {'end': 2717.338, 'text': 'where you have somebody running a cluster and they bring a big hammer and just hammer down one of the nodes in the cluster
and it still keeps running.', 'start': 2711.314, 'duration': 6.024}, {'end': 2719.78, 'text': 'Any one of these guys go down, no issues.', 'start': 2717.758, 'duration': 2.022}, {'end': 2723.242, 'text': "These guys, if they go down from a durability point of view, there's no issues.", 'start': 2720.5, 'duration': 2.742}], 'summary': 'Current hadoop version has two subsystems as single point of failure, but new ones are not.', 'duration': 24.952, 'max_score': 2698.29, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2698290.jpg'}], 'start': 2561.962, 'title': 'Hadoop system enhancements', 'summary': "Discusses current shortcomings of hadoop's data and method use subsystems, implications of single points of failure, and ongoing efforts to implement high availability solutions and separate the mapreduce application logic from the resource scheduler.", 'chapters': [{'end': 2615.693, 'start': 2561.962, 'title': 'Hadoop throughput and storage solution', 'summary': 'Discusses the importance of throughput in utilizing hadoop, the benefits of keeping data on the same spindles for archival storage, and the high-level architecture of hadoop with a single point of failure.', 'duration': 53.731, 'highlights': ['The importance of throughput in utilizing Hadoop, even when hard disks are only 1% filled, for achieving desired performance.', 'Benefits of keeping data on the same spindles for archival storage, providing powerful access to data even when not actively accessed.', 'Discussion about the high-level architecture of Hadoop and its single point of failure, emphasizing the need to address this issue for improved reliability.', 'The potential to turn off CPUs on nodes when not running to conserve energy and resources.', 'Lack of specific cost metrics for I-O or throughput, with an offer to provide detailed information offline.']}, {'end': 2920.876, 'start': 2615.833, 'title': 'Hadoop system enhancements', 'summary': 
"Discusses the current shortcomings of hadoop's HDFS and MapReduce subsystems, the implications of single points of failure, and the ongoing efforts to address these issues by implementing high availability solutions and separating the mapreduce application logic from the resource scheduler.", 'duration': 305.043, 'highlights': ["Hadoop's HDFS and MapReduce subsystems are currently single points of failure, leading to availability issues and inefficiencies.", 'Efforts are underway to address these issues by implementing high availability solutions, such as running the name node on multiple nodes simultaneously.', 'The separation of the MapReduce application logic from the resource scheduler is being introduced to improve availability and efficiency, enabling the support of multiple computational frameworks on the same infrastructure.']}], 'duration': 358.914, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2561962.jpg', 'highlights': ["Efforts to implement high availability solutions for Hadoop's HDFS and MapReduce subsystems", 'Discussion about the high-level architecture of Hadoop and its single point of failure', 'The importance of throughput in utilizing Hadoop for achieving desired performance', 'Benefits of keeping data on the same spindles for archival storage', 'The potential to turn off CPUs on nodes when not running to conserve energy and resources']}, {'end': 3336.955, 'segs': [{'end': 2968.011, 'src': 'embed', 'start': 2938.308, 'weight': 0, 'content': [{'end': 2942.231, 'text': 'And then each client has a database of the full namespace, and they can reach any one of them.', 'start': 2938.308, 'duration': 3.923}, {'end': 2946.114, 'text': "So virtually, it still looks like one single big namespace, but you're much more scalable.", 'start': 2942.631, 'duration': 3.483}, {'end': 2951.238, 'text': 'Right now, with a single name node model, you can only go up to 4,000 servers in a single
cluster.', 'start': 2946.454, 'duration': 4.784}, {'end': 2954.561, 'text': 'This will allow us to go way, way bigger than that in terms of namespace scalability.', 'start': 2951.778, 'duration': 2.783}, {'end': 2956.002, 'text': 'Same thing with the job tracker.', 'start': 2955.001, 'duration': 1.001}, {'end': 2959.164, 'text': 'In the current model, the job tracker is responsible for managing the state of all the jobs.', 'start': 2956.122, 'duration': 3.042}, {'end': 2961.086, 'text': 'In the new one, each job has its own tracker.', 'start': 2959.544, 'duration': 1.542}, {'end': 2963.788, 'text': 'So the scalability actually is also significantly enhanced.', 'start': 2961.546, 'duration': 2.242}, {'end': 2968.011, 'text': 'The bigger issue for our customers today is the availability, but scalability is also being addressed.', 'start': 2964.528, 'duration': 3.483}], 'summary': 'New model allows for much more scalable namespace and job tracker, addressing availability and scalability concerns for customers.', 'duration': 29.703, 'max_score': 2938.308, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2938308.jpg'}, {'end': 3096.405, 'src': 'embed', 'start': 3065.249, 'weight': 3, 'content': [{'end': 3067.771, 'text': "You have a key, and you're looking at the columns for that key.", 'start': 3065.249, 'duration': 2.522}, {'end': 3070.113, 'text': 'You cannot do joins very low latency.', 'start': 3068.112, 'duration': 2.001}, {'end': 3073.957, 'text': "So it doesn't have OLAP indexes, cubes, and stuff like that.", 'start': 3070.174, 'duration': 3.783}, {'end': 3083.202, 'text': 'The other key thing it adds is, I told you before, Hadoop HDFS is optimized for append, right? So you cannot insert and delete.', 'start': 3074.357, 'duration': 8.845}, {'end': 3086.843, 'text': 'HBase adds transactional support, right? 
And it has that in memory.', 'start': 3083.942, 'duration': 2.901}, {'end': 3089.683, 'text': "It throws the changes and then writes the files behind so you don't get to see that.", 'start': 3086.863, 'duration': 2.82}, {'end': 3096.405, 'text': 'But essentially, in HBase you can for every key and for every row, you have atomic transactions on that row right?', 'start': 3090.184, 'duration': 6.221}], 'summary': 'Hbase adds transactional support for atomic transactions on every row.', 'duration': 31.156, 'max_score': 3065.249, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3065249.jpg'}, {'end': 3130.492, 'src': 'embed', 'start': 3102.947, 'weight': 1, 'content': [{'end': 3106.448, 'text': 'The consistency for them will be properly maintained, but only at the at the row level.', 'start': 3102.947, 'duration': 3.501}, {'end': 3109.429, 'text': 'There is research now about how to extend that beyond multiple rows.', 'start': 3106.848, 'duration': 2.581}, {'end': 3113.15, 'text': "So that's HBase, very, very key system.", 'start': 3111.39, 'duration': 1.76}, {'end': 3116.131, 'text': 'Another key system is Apache Flume.', 'start': 3113.631, 'duration': 2.5}, {'end': 3117.952, 'text': 'And Apache Flume is about.', 'start': 3116.712, 'duration': 1.24}, {'end': 3122.233, 'text': 'how do we like, if you have a Hadoop system but you have no data in it, then what am I going to do with it??', 'start': 3117.952, 'duration': 4.281}, {'end': 3123.694, 'text': "OK, there's a question over there.", 'start': 3122.514, 'duration': 1.18}, {'end': 3129.251, 'text': 'When you were talking about the last thing you said about HBase, the transactions are on a single row.', 'start': 3124.724, 'duration': 4.527}, {'end': 3130.492, 'text': "It's atomic at the row level.", 'start': 3129.271, 'duration': 1.221}], 'summary': 'Hbase ensures consistency at row level, flume resolves empty hadoop system issue.', 'duration': 27.545, 
'max_score': 3102.947, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3102947.jpg'}], 'start': 2920.936, 'title': "Enhanced scalability and cloudera's cdh", 'summary': "Discusses implementing a federation model for enhanced scalability, allowing for a larger namespace scalability than single name node model. it also highlights cloudera's cdh, a complete data operating system including apache hadoop with key components like hbase, apache flume, apache mahout, and apache big top.", 'chapters': [{'end': 2968.011, 'start': 2920.936, 'title': 'Enhanced scalability with federation model', 'summary': 'Discusses the implementation of a federation model to enhance scalability, allowing for a much larger namespace scalability than the existing single name node model which can only handle up to 4,000 servers in a single cluster.', 'duration': 47.075, 'highlights': ['The federation model allows for much larger namespace scalability than the existing single name node model which can only handle up to 4,000 servers in a single cluster.', 'Each job has its own tracker in the new model, enhancing scalability.', 'The availability is the bigger issue for customers, but scalability is also being addressed.']}, {'end': 3336.955, 'start': 2969.695, 'title': "Cloudera's cdh: a complete data operating system", 'summary': "Highlights cloudera's cdh, a complete data operating system including apache hadoop, with key components like hbase for low latency access and transactional support, apache flume for data collection, apache mahout for data mining algorithms, and apache big top for building, testing, and integrating hadoop components.", 'duration': 367.26, 'highlights': ["Cloudera's CDH includes key components like HBase for low latency access and transactional support, Apache Flume for data collection, Apache Mahout for data mining algorithms, and Apache Big Top for building, testing, and integrating Hadoop components.", 'HBase 
provides low latency access and transactional support at the row level, buffering changes in memory before writing them to files behind the scenes.', 'Apache Flume is scalable and reliable, allowing data collection across various servers and network equipment, and materializing it inside Apache Hadoop.', 'Apache Mahout consists of data mining algorithms, such as support vector machines and clustering algorithms, designed to leverage the Hadoop parallel processing model.', 'Apache Big Top is a build test system for integrating individual Hadoop projects into final bits, ensuring cross-integration testing and compatibility between different components.']}], 'duration': 416.019, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI2920936.jpg', 'highlights': ['The federation model allows for much larger namespace scalability than the existing single name node model which can only handle up to 4,000 servers in a single cluster.', "Cloudera's CDH includes key components like HBase for low latency access and transactional support, Apache Flume for data collection, Apache Mahout for data mining algorithms, and Apache Big Top for building, testing, and integrating Hadoop components.", 'Each job has its own tracker in the new model, enhancing scalability.', 'HBase provides low latency access and transactional support at the row level, buffering changes in memory before writing them to files behind the scenes.', 'Apache Flume is scalable and reliable, allowing data collection across various servers and network equipment, and materializing it inside Apache Hadoop.']}, {'end': 3716.303, 'segs': [{'end': 3406.979, 'src': 'embed', 'start': 3336.955, 'weight': 3, 'content': [{'end': 3343.141, 'text': "I want to have MapReduce and then you'll have nice installation progress bars and we'll go fetch all the code and you'll have a cluster up and running in a few minutes.", 'start': 3336.955, 'duration': 6.186},
{'end': 3348.806, 'text': 'We brag about this and say that our CEO, Mike Olson, was able to actually get a Hadoop cluster up and running using this.', 'start': 3343.781, 'duration': 5.025}, {'end': 3349.987, 'text': "so that's how we knew we made it.", 'start': 3348.806, 'duration': 1.181}, {'end': 3352.209, 'text': 'The software achieved its goal.', 'start': 3351.248, 'duration': 0.961}, {'end': 3356.453, 'text': 'Any questions about the stack, the completeness of the stack? Yeah.', 'start': 3353.871, 'duration': 2.582}, {'end': 3362.242, 'text': 'Do you have integration with any other NoSQL databases out there, like Couchbase? We have integration.', 'start': 3357.699, 'duration': 4.543}, {'end': 3363.903, 'text': 'So integration is a different thing.', 'start': 3362.262, 'duration': 1.641}, {'end': 3370.186, 'text': 'So now there is, how can we get data between these systems and other, actually, which is a good question because Sqoop addresses that problem.', 'start': 3363.923, 'duration': 6.263}, {'end': 3377.39, 'text': "Apache Sqoop, which is short for SQL to Hadoop, is about bridging the gap between the Hadoop world and the SQL databases' world.", 'start': 3370.667, 'duration': 6.723}, {'end': 3384.214, 'text': 'So Sqoop is a very simple framework where you can say essentially sqoop the name of the file in HDFS, the name of the table in the database,', 'start': 3378.231, 'duration': 5.983}, {'end': 3385.115, 'text': 'the credentials, and it will.', 'start': 3384.214, 'duration': 0.901}, {'end': 3387.757, 'text': 'would take care of copying that over and vice versa.', 'start': 3385.535, 'duration': 2.222}, {'end': 3392.16, 'text': 'Sqoop out of the box works over JDBC, which is single threaded, which is very, very slow.', 'start': 3388.217, 'duration': 3.943}, {'end': 3396.543, 'text': 'However, we, with a number of the big database providers, built parallel versions of Sqoop.', 'start': 3392.6, 'duration': 3.943}, {'end': 3398.164, 'text': 'So you can do
parallel transfer between these systems.', 'start': 3396.563, 'duration': 1.601}, {'end': 3400.525, 'text': 'And some of it is CouchDB is one of them as well.', 'start': 3398.244, 'duration': 2.281}, {'end': 3402.427, 'text': "In the NoSQL world, there's a bunch of them.", 'start': 3400.866, 'duration': 1.561}, {'end': 3406.979, 'text': 'So it depends on the vendor.', 'start': 3404.358, 'duration': 2.621}], 'summary': 'The software achieved its goal, enabling quick hadoop cluster setup and integration with various nosql databases.', 'duration': 70.024, 'max_score': 3336.955, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3336955.jpg'}, {'end': 3469.761, 'src': 'embed', 'start': 3444.301, 'weight': 1, 'content': [{'end': 3451.085, 'text': 'In conclusion, the things I would like you guys to remember again the money slide which I had very earlier on the schema.', 'start': 3444.301, 'duration': 6.784}, {'end': 3451.806, 'text': 'on read versus schema.', 'start': 3451.085, 'duration': 0.721}, {'end': 3452.726, 'text': 'on write.', 'start': 3451.806, 'duration': 0.92}, {'end': 3458.83, 'text': 'the differentiating thing about Hadoop is its ability to work with unstructured data using any language you would like.', 'start': 3452.726, 'duration': 6.104}, {'end': 3462.493, 'text': 'So you have the agility and flexibility of evolving at the speed of your data.', 'start': 3458.91, 'duration': 3.583}, {'end': 3469.761, 'text': "Once you find this perfect meal, once you invent the McDonald's, then you can move it into a pipeline and have it happen with the predefined schema.", 'start': 3463.192, 'duration': 6.569}], 'summary': "Hadoop's agility allows working with unstructured data using any language, providing flexibility and the ability to evolve at the speed of data.", 'duration': 25.46, 'max_score': 3444.301, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3444301.jpg'}, {'end': 3505.584, 'src': 'embed', 'start': 3482.337, 'weight': 2, 'content': [{'end': 3489.22, 'text': 'You have very high scalability of your storage and your compute, both in terms of systems, in terms of people being able to write parallel problems,', 'start': 3482.337, 'duration': 6.883}, {'end': 3492.761, 'text': 'but also in terms of economics, which allows you to keep your data alive forever.', 'start': 3489.22, 'duration': 3.541}, {'end': 3497.383, 'text': 'And then the two core subsystems are HDFS and MapReduce.', 'start': 3493.201, 'duration': 4.182}, {'end': 3501.245, 'text': 'So with that, how much time I have for questions? Nice.', 'start': 3498.143, 'duration': 3.102}, {'end': 3505.584, 'text': 'Cool Question of Hadoop and so forth.', 'start': 3501.705, 'duration': 3.879}], 'summary': 'High scalability of storage and compute, with core subsystems hdfs and mapreduce.', 'duration': 23.247, 'max_score': 3482.337, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3482337.jpg'}, {'end': 3547.083, 'src': 'embed', 'start': 3525.297, 'weight': 0, 'content': [{'end': 3534.26, 'text': 'And Hadoop and this whole suite of software, which we call essentially the data center, data operating system, is our business.', 'start': 3525.297, 'duration': 8.963}, {'end': 3534.94, 'text': "And that's what we do.", 'start': 3534.34, 'duration': 0.6}, {'end': 3536.721, 'text': "That said, it's open source.", 'start': 3535.841, 'duration': 0.88}, {'end': 3541.282, 'text': 'So people frequently ask, how can you make money? 
This thing here, 100% open source, 100% free.', 'start': 3536.841, 'duration': 4.441}, {'end': 3544.083, 'text': "We don't charge anything for it, both the source code and the compiled bits.", 'start': 3541.302, 'duration': 2.781}, {'end': 3547.083, 'text': 'So we make it up in volume.', 'start': 3546.182, 'duration': 0.901}], 'summary': 'Hadoop suite is 100% open source, generates revenue through volume.', 'duration': 21.786, 'max_score': 3525.297, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3525297.jpg'}, {'end': 3662.004, 'src': 'embed', 'start': 3636.591, 'weight': 6, 'content': [{'end': 3641.854, 'text': 'it spreads violently inside organizations much easier than if you try at the end to go and convince somebody to buy a new proprietary solution that you have.', 'start': 3636.591, 'duration': 5.263}, {'end': 3645.636, 'text': "So that's one of the benefits of open source, is the viral consumption.", 'start': 3642.194, 'duration': 3.442}, {'end': 3649.518, 'text': 'In addition, another benefit we get obviously is Facebook is contributing to this platform.', 'start': 3646.437, 'duration': 3.081}, {'end': 3650.979, 'text': 'we are contributing to this platform.', 'start': 3649.518, 'duration': 1.461}, {'end': 3652.26, 'text': 'Yahoo is contributing to this platform.', 'start': 3650.979, 'duration': 1.281}, {'end': 3657.123, 'text': 'Twitter is LinkedIn, is Hortonworks, which is a spinoff from Yahoo is, et cetera, et cetera.', 'start': 3652.26, 'duration': 4.863}, {'end': 3660.664, 'text': 'So we get all of that free IP coming in.', 'start': 3657.523, 'duration': 3.141}, {'end': 3662.004, 'text': 'So lots of benefits to open source.', 'start': 3660.704, 'duration': 1.3}], 'summary': 'Open source software has benefits of viral consumption and free ip, with contributions from big companies like facebook, yahoo, twitter, and linkedin.', 'duration': 25.413, 'max_score': 3636.591, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3636591.jpg'}, {'end': 3694.179, 'src': 'embed', 'start': 3667.946, 'weight': 7, 'content': [{'end': 3674.228, 'text': "And then, once the technology becomes production, once now it's running a mission-critical pipeline that has revenues in it the CIO,", 'start': 3667.946, 'duration': 6.282}, {'end': 3681.051, 'text': "the CTO of the company they can't have that without insurance, without having a maintenance contract with the company that if there's an issue,", 'start': 3674.228, 'duration': 6.823}, {'end': 3682.231, 'text': 'it will be resolved within a given time.', 'start': 3681.051, 'duration': 1.18}, {'end': 3689.335, 'text': 'Or going and hiring their own super rocket scientist engineers that can do that for them.', 'start': 3683.352, 'duration': 5.983}, {'end': 3691.097, 'text': 'It has to be one of these two things.', 'start': 3690.096, 'duration': 1.001}, {'end': 3694.179, 'text': 'In fact a very nice thing that Mårten Mickos.', 'start': 3691.417, 'duration': 2.762}], 'summary': 'Technology must have insurance or maintenance for mission-critical pipeline with revenues.', 'duration': 26.233, 'max_score': 3667.946, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3667946.jpg'}], 'start': 3336.955, 'title': 'Hadoop and cloudera', 'summary': "Discusses setting up a hadoop cluster with mapreduce and integrating apache sqoop with nosql databases like couchbase, along with the benefits of hadoop's agility and scalability, cloudera's business model, and the advantages of open source software, including the viral spread and the need for maintenance contracts.", 'chapters': [{'end': 3422.344, 'start': 3336.955, 'title': 'Hadoop cluster setup and apache sqoop', 'summary': 'Discusses setting up a hadoop cluster with mapreduce, achieving a running cluster in a few minutes, the
integration of apache sqoop with nosql databases like couchbase, and the parallel transfer functionality of sqoop with big database providers.', 'duration': 85.389, 'highlights': ['Apache Sqoop facilitates bridging the gap between Hadoop and SQL databases, allowing for data transfer with simple commands and credentials.', 'Sqoop offers parallel versions for faster data transfer, including integration with big database providers like CouchDB, enhancing the speed and efficiency of data transfer between systems.', 'The setup of a Hadoop cluster with MapReduce boasts a quick installation process and the ability to have a running cluster within minutes, showcasing the efficiency and effectiveness of the software.', 'The discussion includes the integration of Apache Sqoop with NoSQL databases like Couchbase, providing insights into the comprehensive functionality of the software and its compatibility with various database systems.']}, {'end': 3716.303, 'start': 3424.465, 'title': 'Hadoop and cloudera business model', 'summary': "Discusses the benefits of hadoop's agility and scalability, cloudera's business model, and the advantages of open source software, including the viral spread and the need for maintenance contracts.", 'duration': 291.838, 'highlights': ["Cloudera's business model is based on monetizing the open-source Hadoop platform through volume deployment, similar to companies like Red Hat and Sleepycat.", 'The benefits of open source software include viral spreading of the technology within companies and contributions from major players like Facebook, Yahoo, Twitter, and LinkedIn.', "Hadoop's agility and flexibility allow working with unstructured data using any programming language, enabling the discovery and solving of almost any problem.", 'Hadoop offers high scalability in terms of storage, compute, and economics, allowing data to be kept alive forever and supporting parallel processing.', 'The
importance of maintenance contracts for mission-critical pipelines running on open-source platforms like Hadoop to ensure timely issue resolution.']}], 'duration': 379.348, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3336955.jpg', 'highlights': ["Cloudera's business model monetizes the open-source Hadoop platform through volume deployment.", "Hadoop's agility enables working with unstructured data using any programming language.", 'Hadoop offers high scalability in terms of storage, compute, and economics.', 'Apache Sqoop facilitates bridging the gap between Hadoop and SQL databases.', 'Sqoop offers parallel versions for faster data transfer, including integration with big database providers like Couchbase.', 'The setup of a Hadoop cluster with MapReduce boasts a quick installation process.', 'The benefits of open source software include viral spreading of the technology within companies.', 'The importance of maintenance contracts for mission-critical pipelines running on open-source platforms like Hadoop.', 'The discussion includes the integration of Apache Sqoop with NoSQL databases like Couchbase.']}, {'end': 4601.824, 'segs': [{'end': 3746.744, 'src': 'embed', 'start': 3716.644, 'weight': 1, 'content': [{'end': 3719.228, 'text': 'The two meanings are free, as in freedom, liberty.', 'start': 3716.644, 'duration': 2.584}, {'end': 3721.098, 'text': 'Right? No lock-in.', 'start': 3720.337, 'duration': 0.761}, {'end': 3721.819, 'text': "I'm not locked in.", 'start': 3721.138, 'duration': 0.681}, {'end': 3723.341, 'text': 'And free as in no money.', 'start': 3722.299, 'duration': 1.042}, {'end': 3724.141, 'text': 'Zero cost.', 'start': 3723.621, 'duration': 0.52}, {'end': 3727.605, 'text': 'Right? 
And open source is about the former, not the latter.', 'start': 3724.942, 'duration': 2.663}, {'end': 3728.927, 'text': 'Open source is about liberty.', 'start': 3727.705, 'duration': 1.222}, {'end': 3732.41, 'text': "It's about choosing your destiny by having all of your data in an open platform.", 'start': 3729.367, 'duration': 3.043}, {'end': 3736.255, 'text': "A year from now, Cloudera can come and say, hey, we're going to up our prices by this much.", 'start': 3732.931, 'duration': 3.324}, {'end': 3740.039, 'text': 'And nothing you can do because all of your data is locked into this platform.', 'start': 3736.475, 'duration': 3.564}, {'end': 3741.56, 'text': 'No, you can say, thank you, Cloudera.', 'start': 3740.279, 'duration': 1.281}, {'end': 3742.461, 'text': "We're not happy with you.", 'start': 3741.6, 'duration': 0.861}, {'end': 3746.744, 'text': "We're going to go work with this other company instead, if you have an open source underlying foundational platform.", 'start': 3742.481, 'duration': 4.263}], 'summary': 'Open source is about freedom from lock-in, not zero cost.', 'duration': 30.1, 'max_score': 3716.644, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3716644.jpg'}, {'end': 3797.127, 'src': 'embed', 'start': 3766.841, 'weight': 3, 'content': [{'end': 3770.345, 'text': 'the other half works on a management suite that is proprietary.', 'start': 3766.841, 'duration': 3.504}, {'end': 3777.131, 'text': 'so we do have a proprietary management suite that you only get when you are a maintenance subscriber with us,', 'start': 3770.345, 'duration': 6.786}, {'end': 3782.297, 'text': 'and that management suite makes it easier for you to deploy the software, to configure the software, monitor proactively,', 'start': 3777.131, 'duration': 5.166}, {'end': 3784.038, 'text': 'predict problems before they happen.', 'start': 3782.297, 'duration': 1.741}, {'end': 3786.461, 'text': 'when you call in 
our support line with an issue,', 'start': 3784.038, 'duration': 2.423}, {'end': 3792.365, 'text': 'there is a snapshot button you click that takes a snapshot of the full state across the cluster and sends that back to us, a snapshot of the metrics,', 'start': 3786.461, 'duration': 5.904}, {'end': 3793.005, 'text': 'not the actual data.', 'start': 3792.365, 'duration': 0.64}, {'end': 3797.127, 'text': 'That helps us then debug what the problem is and tell you, hey, you need to change this, change that.', 'start': 3793.846, 'duration': 3.281}], 'summary': 'Proprietary management suite for maintenance subscribers enables proactive monitoring and issue resolution with snapshot feature.', 'duration': 30.286, 'max_score': 3766.841, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3766841.jpg'}, {'end': 3884.416, 'src': 'embed', 'start': 3855.24, 'weight': 0, 'content': [{'end': 3860.543, 'text': 'Log files from Linux servers, from web servers, from Java application servers, from network equipment, from mobile devices.', 'start': 3855.24, 'duration': 5.303}, {'end': 3863.325, 'text': 'These are the stats.', 'start': 3862.404, 'duration': 0.921}, {'end': 3869.508, 'text': "The unstructured data is growing at a rate that far exceeds the relational data in today's world.", 'start': 3863.485, 'duration': 6.023}, {'end': 3873.99, 'text': 'So we believe this solution, this space, is going to be bigger than the relational space.', 'start': 3870.208, 'duration': 3.782}, {'end': 3878.333, 'text': 'What about the analytics marketplace?', 'start': 3874.01, 'duration': 4.323}, {'end': 3884.416, 'text': 'How much money is spent doing analytics in unstructured data versus structured data?', 'start': 3878.433, 'duration': 5.983}], 'summary': 'Unstructured data growth is surpassing relational data, indicating potential for a larger analytics marketplace.', 'duration': 29.176, 'max_score': 3855.24, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3855240.jpg'}, {'end': 4147.872, 'src': 'embed', 'start': 4118.924, 'weight': 2, 'content': [{'end': 4120.305, 'text': 'We also have the founders of the project.', 'start': 4118.924, 'duration': 1.381}, {'end': 4120.885, 'text': 'So Doug Cutting,', 'start': 4120.345, 'duration': 0.54}, {'end': 4127.49, 'text': 'the creator of the technology, works at Cloudera, and the creators of many of the other key projects in the big slide of projects are at the company.', 'start': 4120.885, 'duration': 6.605}, {'end': 4129.491, 'text': 'So we have all these things that differentiate us.', 'start': 4127.93, 'duration': 1.561}, {'end': 4131.872, 'text': 'But the most important thing that differentiates us is we came in', 'start': 4129.711, 'duration': 2.161}, {'end': 4133.921, 'text': 'way ahead of these other players.', 'start': 4132.64, 'duration': 1.281}, {'end': 4140.386, 'text': "We believe that we're going to be the leader of this wave, like VMware is to the virtualization wave.", 'start': 4134.001, 'duration': 6.385}, {'end': 4143.609, 'text': 'There were many other companies that tried to do what VMware did.', 'start': 4140.386, 'duration': 3.223}, {'end': 4144.63, 'text': 'XenSource is one example.', 'start': 4143.609, 'duration': 1.021}, {'end': 4147.872, 'text': 'And some of them died, some of them made good money, but they were not the leader.', 'start': 4144.71, 'duration': 3.162}], 'summary': 'The project founders believe they will lead the wave of technology, similar to vmware in virtualization.', 'duration': 28.948, 'max_score': 4118.924, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI4118924.jpg'}, {'end': 4200.444, 'src': 'embed', 'start': 4175.491, 'weight': 4, 'content': [{'end': 4183.075, 'text': "So there's companies that already have big deployments of BI tools, like BusinessObjects, Cognos, and SAS for 
data mining, and so many other ones.", 'start': 4175.491, 'duration': 7.584}, {'end': 4185.975, 'text': 'They want to use them with Hadoop.', 'start': 4184.035, 'duration': 1.94}, {'end': 4188.298, 'text': 'They want to figure out how to bridge that gap.', 'start': 4186.015, 'duration': 2.283}, {'end': 4194.1, 'text': "So that's one thing we're working on, is better integration between this ecosystem and the existing IT investments, which are very important.", 'start': 4188.318, 'duration': 5.782}, {'end': 4200.444, 'text': "and foresee that there's going to be many other solutions built from the ground up to work with this system,", 'start': 4195.761, 'duration': 4.683}], 'summary': 'Companies aim to integrate bi tools like businessobjects, cognos, and sas with hadoop to leverage existing it investments and expect new solutions to emerge.', 'duration': 24.953, 'max_score': 4175.491, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI4175491.jpg'}], 'start': 3716.644, 'title': 'Open source and unstructured data', 'summary': "Discusses the importance of open source in providing liberty and freedom from vendor lock-in, as well as the growth of unstructured data and cloudera's position in the market for future scalability and application development.", 'chapters': [{'end': 3801.249, 'start': 3716.644, 'title': 'Open source and liberty', 'summary': 'Discusses the importance of open source in providing liberty and freedom from vendor lock-in, emphasizing the benefits of open platforms and the complementary role of proprietary management suites.', 'duration': 84.605, 'highlights': ['Open source provides liberty and freedom from vendor lock-in, allowing users to choose their destiny by having all of their data in an open platform, thus avoiding potential price increases by a specific vendor.', 'The proprietary management suite provided by Cloudera is only accessible to maintenance subscribers, offering ease of deployment, 
configuration, proactive monitoring, and predictive issue resolution.', 'Half of the engineering team works on an open source platform, while the other half focuses on a proprietary management suite, showcasing the dual approach taken by the company in its product offerings.']}, {'end': 4601.824, 'start': 3801.309, 'title': 'The future of data: unstructured data and analytics', 'summary': "Discusses the growth of unstructured data, highlighting its rapid expansion compared to relational data, the potential of advanced analytics and business intelligence, and cloudera's position as a leader in the market, with plans for future scalability and application development.", 'duration': 800.515, 'highlights': ["Unstructured data is growing at a rate that far exceeds the relational data in today's world, indicating the potential for unstructured data to surpass the relational space.", 'Cloudera aims to be a leader in the unstructured data and analytics market, leveraging its first-mover advantage, industry experience, and cutting-edge technology.', 'Plans for scalability and overcoming limitations in Hadoop, including addressing node boundary limits and working on resource manager scalability using ZooKeeper technology.', 'The potential for application development and integration with existing IT investments, with a focus on building new solutions for the Hadoop ecosystem and better integration with existing BI tools.', 'The importance of advanced analytics and business intelligence in leveraging unstructured data, with examples of potential use cases in industries and government agencies.']}], 'duration': 885.18, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/d2xeNpfzsYI/pics/d2xeNpfzsYI3716644.jpg', 'highlights': ["Unstructured data is growing at a rate that far exceeds the relational data in today's world, indicating the potential for unstructured data to surpass the relational space.", 'Open source provides liberty and freedom from vendor 
lock-in, allowing users to choose their destiny by having all of their data in an open platform, thus avoiding potential price increases by a specific vendor.', 'Cloudera aims to be a leader in the unstructured data and analytics market, leveraging its first-mover advantage, industry experience, and cutting-edge technology.', 'The proprietary management suite provided by Cloudera is only accessible to maintenance subscribers, offering ease of deployment, configuration, proactive monitoring, and predictive issue resolution.', 'The potential for application development and integration with existing IT investments, with a focus on building new solutions for the Hadoop ecosystem and better integration with existing BI tools.']}], 'highlights': ["Hadoop's immense scalability demonstrated by Yahoo's 40,000 servers and Facebook's 70 petabytes", 'Large clusters of 4,000 nodes can process 48 terabytes per second, showcasing high-speed data handling', "Efforts to implement high availability solutions for Hadoop's data and metadata subsystems", "Hadoop's agility enables working with unstructured data using any programming language", "Unstructured data is growing at a rate that far exceeds the relational data in today's world, indicating the potential for unstructured data to surpass the relational space", 'Hadoop technology causing disruption in big data processing space', 'Hadoop offers high scalability in terms of storage, compute, and economics', 'Cloudera aims to be a leader in the unstructured data and analytics market, leveraging its first-mover advantage, industry experience, and cutting-edge technology', 'The federation model allows for much larger namespace scalability than the existing single name node model which can only handle up to 4,000 servers in a single cluster', 'The process of ETL led to data fidelity loss, making it challenging to explore the original highest fidelity data and limiting the ability to ask new questions not supported by the existing 
schema']}