title
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial | Hadoop Training | Edureka
description
🔥 Edureka Hadoop Training: https://www.edureka.co/big-data-hadoop-training-certification
This Edureka "Hadoop Tutorial for Beginners" video will help you learn Big Data Hadoop and Apache Spark from scratch. We discuss two Hadoop projects in this Hadoop tutorial: the US Primary Election and Instant Cabs use cases. You will also learn how to use k-means clustering and Zeppelin to analyze and visualize your data.
Comment below in case you need the datasets used in this Hadoop Tutorial For Beginners video.
Below are the topics covered in this Hadoop tutorial for beginners:
1. Big Data Use Cases - US Election & Instant Cabs
2. Solution strategy of the use cases
3. Hadoop & Spark Introduction
4. Hadoop Master/Slave Architecture
5. Hadoop Core Components
6. HDFS Data Blocks
7. HDFS Read/Write Mechanism
8. YARN Components
9. Spark Components
10. Spark Architecture
11. K-Means and Zeppelin
12. Implementing the solution for both use cases (a minimal PySpark sketch of this pipeline follows the topic list)
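To give you an idea of what the use-case solution looks like, here is a minimal PySpark sketch of the pipeline covered above (HDFS -> Spark SQL -> k-means with Spark MLlib -> Zeppelin). It is illustrative only: the HDFS path, column names and candidate filter are assumptions, not the actual course dataset or code.

# Minimal, illustrative PySpark sketch of the pipeline above -- not the course's actual code.
# The HDFS path and columns (state, county, candidate, votes, fraction_votes) are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("PrimaryElectionSketch").getOrCreate()

# 1. Load the election dataset from HDFS.
df = spark.read.csv("hdfs:///data/primary_results.csv", header=True, inferSchema=True)

# 2. Transform with Spark SQL: keep only the rows and columns we need.
df.createOrReplaceTempView("results")
clinton = spark.sql(
    "SELECT state, county, votes, fraction_votes "
    "FROM results WHERE candidate = 'Hillary Clinton'")

# 3. Cluster the counties with k-means from Spark MLlib.
features = VectorAssembler(
    inputCols=["votes", "fraction_votes"], outputCol="features").transform(clinton)
model = KMeans(k=3, seed=1).fit(features)

# 4. The clustered output is what gets visualized in Zeppelin.
model.transform(features).select("state", "county", "prediction").show(5)

spark.stop()

The clustered output is the kind of table that Zeppelin then turns into the bar, line and area charts discussed in the video.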
Subscribe to our channel to get video updates. Hit the subscribe button above.
Check our complete Hadoop playlist here: https://goo.gl/ExJdZs
--------------------Edureka Big Data Training and Certifications------------------------
🔵 Edureka Hadoop Training: http://bit.ly/2YBlw29
🔵 Edureka Spark Training: http://bit.ly/2PeHvc9
🔵 Edureka Kafka Training: http://bit.ly/34e7Riy
🔵 Edureka Cassandra Training: http://bit.ly/2E9AK54
🔵 Edureka Talend Training: http://bit.ly/2YzYIjg
🔵 Edureka Hadoop Administration Training: http://bit.ly/2YE8Nf9
PG in Big Data Engineering with NIT Rourkela: https://www.edureka.co/post-graduate/big-data-engineering (450+ Hrs || 9 Months || 20+ Projects & 100+ Case studies)
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
#edureka #edurekaHadoop #HadoopTutorial #Hadoop #HadoopTutorialForBeginners #HadoopArchitecture #LearnHadoop #HadoopTraining #HadoopCertification
How it Works?
1. This is a 5-week Instructor-led Online Course with 40 hours of assignments and 30 hours of project work.
2. We provide 24x7 One-on-One LIVE Technical Support to help you with any problems you face or any clarifications you need during the course.
3. At the end of the training, you will undergo a 2-hour LIVE Practical Exam, based on which we will provide you with a Grade and a Verifiable Certificate!
- - - - - - - - - - - - - -
About the Course
Edureka’s Big Data and Hadoop online training is designed to help you become a top Hadoop developer. During this course, our expert Hadoop instructors will help you:
1. Master the concepts of HDFS and MapReduce framework
2. Understand Hadoop 2.x Architecture
3. Set up a Hadoop cluster and write complex MapReduce programs
4. Learn data loading techniques using Sqoop and Flume
5. Perform data analytics using Pig, Hive and YARN
6. Implement HBase and MapReduce integration
7. Implement advanced HBase usage and indexing
8. Schedule jobs using Oozie
9. Implement best practices for Hadoop development
10. Work on a real-life Project on Big Data Analytics
11. Understand Spark and its Ecosystem
12. Learn how to work with RDDs in Spark (a short RDD example follows this list)
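As a quick illustration of point 12, here is a tiny, self-contained RDD word count in PySpark. It is a sketch for orientation only, not material taken from the course.

# Illustrative only: a classic word count showing RDD transformations and actions.
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDWordCountSketch")

# Parallelize a small in-memory collection into an RDD.
lines = sc.parallelize(["big data with hadoop", "real time processing with spark"])

# flatMap/map/reduceByKey are lazy transformations; collect() is the action
# that triggers the actual distributed computation.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
sc.stop()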
- - - - - - - - - - - - - -
Who should go for this course?
If you belong to any of the following groups, knowledge of Big Data and Hadoop is crucial for progressing in your career:
1. Analytics professionals
2. BI /ETL/DW professionals
3. Project managers
4. Testing professionals
5. Mainframe professionals
6. Software developers and architects
7. Recent graduates passionate about building a successful career in Big Data
- - - - - - - - - - - - - -
Why Learn Hadoop?
Big Data! A Worldwide Problem?
The problem lies in using traditional systems to store enormous volumes of data. Though these systems worked well a few years ago, they are fast becoming obsolete as the amount and complexity of data keep increasing. The good news is that Hadoop has become integral to storing, handling, evaluating and retrieving hundreds of terabytes, and even petabytes, of data.
- - - - - - - - - - - - - -
Opportunities for Hadoopers!
Opportunities for Hadoopers are infinite - from Hadoop Developer to Hadoop Tester or Hadoop Architect, and so on. If cracking and managing Big Data is your passion, then think no more: join Edureka's Hadoop online course and carve a niche for yourself!
For more information, please write back to us at sales@edureka.in or call us at IND: 9606058406 / US: 18338555775 (toll free).
Customer Review:
Michael Harkins, System Architect, Hortonworks says: “The courses are top rate. The best part is live instruction, with playback. But my favorite feature is viewing a previous class. Also, they are always there to answer questions, and prompt when you open an issue if you are having any trouble. Added bonus ~ you get lifetime access to the course you took!!! ~ This is the killer education app... I've taken two courses, and I'm taking two more.”
detail
{'title': 'Hadoop Tutorial For Beginners | Apache Hadoop Tutorial | Hadoop Training | Edureka', 'heatmap': [{'end': 869.276, 'start': 806.217, 'weight': 0.732}, {'end': 1497.782, 'start': 1323.608, 'weight': 0.971}, {'end': 1728.583, 'start': 1669.119, 'weight': 0.758}, {'end': 2132.893, 'start': 1840.635, 'weight': 0.717}, {'end': 2596.707, 'start': 2531.56, 'weight': 0.75}], 'summary': 'This tutorial covers the fundamentals of hadoop and spark, their application in us primary election analysis, big data analysis strategies, data analytics applications, hadoop daemons, commands, yarn, hadoop and spark architectures, data analysis and clustering with apache zeppelin, and analyzing u.s. county data and cab/uber pickups.', 'chapters': [{'end': 254.199, 'segs': [{'end': 76.008, 'src': 'embed', 'start': 33.788, 'weight': 0, 'content': [{'end': 37.994, 'text': 'As I already told you, we have two big data use cases that we will study about.', 'start': 33.788, 'duration': 4.206}, {'end': 44.743, 'text': 'First is the U.S. primary election, and the second is the instant cabs startup, much like Uber cabs.', 'start': 38.354, 'duration': 6.389}, {'end': 53.848, 'text': 'We will start with the problem statements of both the use cases and then proceed ahead to learn the big data technologies and concepts in order to solve them.', 'start': 45.58, 'duration': 8.268}, {'end': 59.775, 'text': 'Since we are going to use Hadoop and Spark, we will start with a brief introduction to Hadoop and Spark.', 'start': 54.529, 'duration': 5.246}, {'end': 65.54, 'text': 'Next, we will understand the components of Hadoop in detail, which are HDFS and YARN.', 'start': 60.235, 'duration': 5.305}, {'end': 69.223, 'text': 'After understanding Hadoop, we will move on to Spark.', 'start': 66.341, 'duration': 2.882}, {'end': 72.385, 'text': 'We will be learning how Spark works and its different components.', 'start': 69.503, 'duration': 2.882}, {'end': 76.008, 'text': 'Then, we will go ahead to understand K-Means and Zeppelin.', 'start': 72.725, 'duration': 3.283}], 'summary': 'Studying 2 big data use cases: u.s. primary election and a startup similar to uber. 
learning hadoop, spark, k-means, and zeppelin.', 'duration': 42.22, 'max_score': 33.788, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o33788.jpg'}, {'end': 137.771, 'src': 'embed', 'start': 99.945, 'weight': 2, 'content': [{'end': 106.21, 'text': 'the contenders from each party compete against each other to represent his or her own political party in the final elections.', 'start': 99.945, 'duration': 6.265}, {'end': 110.734, 'text': 'There are two major political parties in the US, the Democrats and Republicans.', 'start': 106.67, 'duration': 4.064}, {'end': 114.819, 'text': 'From the Democrats, the contenders were Hillary Clinton and Bernie Sanders.', 'start': 111.314, 'duration': 3.505}, {'end': 117.884, 'text': 'And out of them, Hillary Clinton won the primary elections.', 'start': 115.18, 'duration': 2.704}, {'end': 122.811, 'text': 'And from the Republicans, the contenders were Donald Trump, Ted Cruz, and a few others.', 'start': 118.204, 'duration': 4.607}, {'end': 126.396, 'text': 'As you already know, Donald Trump was the winner from the Republicans.', 'start': 123.311, 'duration': 3.085}, {'end': 131.984, 'text': 'So now let us assume that you are an analyst already and you have been hired by Donald Trump,', 'start': 126.758, 'duration': 5.226}, {'end': 137.771, 'text': 'and he tells you that I want to know what were the different reasons because of which Hillary Clinton won.', 'start': 131.984, 'duration': 5.787}], 'summary': 'In the us primary elections, hillary clinton won against bernie sanders for the democrats, while donald trump emerged as the winner for the republicans.', 'duration': 37.826, 'max_score': 99.945, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o99945.jpg'}], 'start': 11.522, 'title': 'Hadoop, spark, and us primary election analysis', 'summary': "Covers the fundamentals of hadoop and spark, their application in us primary election analysis, and hands-on training of big data technologies. it also discusses the victory of hillary clinton and donald trump in their respective parties, and the task of a data analyst hired by trump to analyze reasons for clinton's win.", 'chapters': [{'end': 99.945, 'start': 11.522, 'title': 'Hadoop and spark fundamentals', 'summary': 'Covers the fundamentals of hadoop and spark and their application in two big data use cases - the u.s. primary election and instant cabs startup, along with hands-on training and learning of big data technologies and concepts.', 'duration': 88.423, 'highlights': ['The chapter covers the fundamentals of Hadoop and Spark, along with the application in two big data use cases - the U.S. primary election and instant cabs startup.', 'Hands-on training is provided for learning technology, including understanding the components of Hadoop (HDFS and YARN) and Spark, K-Means machine learning algorithm, and Zeppelin for data visualization.', 'Two big data use cases are discussed - U.S. 
primary election and instant cabs startup, much like Uber cabs.', 'The chapter also includes a brief introduction to Hadoop and Spark, and the use of Zeppelin to visualize data.']}, {'end': 254.199, 'start': 99.945, 'title': 'Us primary election analysis', 'summary': "Discusses the us primary elections, highlighting the victory of hillary clinton in the democratic party and donald trump in the republican party, and the task of a data analyst hired by donald trump to analyze the reasons for clinton's win in preparation for upcoming campaigns.", 'duration': 154.254, 'highlights': ['The victory of Hillary Clinton and Donald Trump in the Democratic and Republican primary elections respectively. Hillary Clinton won the primary elections for the Democrats, and Donald Trump emerged as the winner from the Republicans.', "The task assigned to a data analyst by Donald Trump to analyze the reasons for Hillary Clinton's win in preparation for upcoming campaigns. The data analyst is hired by Donald Trump to analyze the reasons behind Hillary Clinton's victory, with the objective of leveraging these insights for future campaigns.", 'The description of the U.S. primary election data set and the fields it contains, including state, county, FIPS code, party, candidate, and number of votes received. The U.S. primary election data set contains fields such as state, county, FIPS code, party, candidate, and the number of votes each candidate received.']}], 'duration': 242.677, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o11522.jpg', 'highlights': ['The chapter covers the fundamentals of Hadoop and Spark, along with the application in two big data use cases - the U.S. primary election and instant cabs startup.', 'Hands-on training is provided for learning technology, including understanding the components of Hadoop (HDFS and YARN) and Spark, K-Means machine learning algorithm, and Zeppelin for data visualization.', 'The victory of Hillary Clinton and Donald Trump in the Democratic and Republican primary elections respectively.', "The task assigned to a data analyst by Donald Trump to analyze the reasons for Hillary Clinton's win in preparation for upcoming campaigns."]}, {'end': 1077.253, 'segs': [{'end': 299.818, 'src': 'embed', 'start': 254.419, 'weight': 0, 'content': [{'end': 262.645, 'text': "You won't know what this exactly contains because it is written in a coded form, but let me give you an example what this dataset contains.", 'start': 254.419, 'duration': 8.226}, {'end': 267.089, 'text': "Let me tell you that I'm just showing you a few rows of the dataset.", 'start': 263.246, 'duration': 3.843}, {'end': 268.69, 'text': 'This is not the entire dataset.', 'start': 267.129, 'duration': 1.561}, {'end': 276.357, 'text': 'So this contains different fields, like population in 2014 and 2010,, the sex ratio how many females, males?', 'start': 269.15, 'duration': 7.207}, {'end': 279.36, 'text': 'and then based on some ethnicity, how many Asians?', 'start': 276.357, 'duration': 3.003}, {'end': 280.742, 'text': 'how many Hispanic?', 'start': 279.36, 'duration': 1.382}, {'end': 282.844, 'text': 'how many black American people?', 'start': 280.742, 'duration': 2.102}, {'end': 285.346, 'text': 'how many black African people?', 'start': 282.844, 'duration': 2.502}, {'end': 291.973, 'text': 'And then there is also based on the age groups how many infants, how many senior citizens, How many adults?', 'start': 285.486, 'duration': 6.487}, {'end': 299.818, 
'text': 'So there are a lot of fields in our data set and this will help us to analyze and actually find out what led to the winning of Hillary Clinton.', 'start': 292.293, 'duration': 7.525}], 'summary': "The dataset contains demographic data including population, sex ratio, ethnicity, and age groups to analyze factors influencing hillary clinton's win.", 'duration': 45.399, 'max_score': 254.419, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o254419.jpg'}, {'end': 367.757, 'src': 'embed', 'start': 340.937, 'weight': 2, 'content': [{'end': 344.779, 'text': 'So the next task is to transform that data using Spark SQL.', 'start': 340.937, 'duration': 3.842}, {'end': 352.884, 'text': 'Transforming here means filtering out the data and the rows and columns that you might need in order to implement or in order to process this.', 'start': 345.16, 'duration': 7.724}, {'end': 356.466, 'text': 'The next step is clustering this data using Spark MLib.', 'start': 353.404, 'duration': 3.062}, {'end': 359.808, 'text': 'And for clustering our data, we will be using k-means.', 'start': 356.566, 'duration': 3.242}, {'end': 362.99, 'text': 'And the final step is to visualize the result using Zeppelin.', 'start': 359.968, 'duration': 3.022}, {'end': 367.757, 'text': 'Now. visualizing this data is also very important, because without the visualization,', 'start': 363.53, 'duration': 4.227}], 'summary': 'Using spark sql, filter and transform data, cluster with k-means, and visualize results with zeppelin.', 'duration': 26.82, 'max_score': 340.937, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o340937.jpg'}, {'end': 483.849, 'src': 'embed', 'start': 450.259, 'weight': 3, 'content': [{'end': 455.764, 'text': "So now we've got a line graph that compares the votes of Hillary Clinton and Bernie Sanders together.", 'start': 450.259, 'duration': 5.505}, {'end': 465.593, 'text': 'Again, we have got an area graph also that compares Bernie Sanders and Hillary Clinton votes, and hence we have a lot more visualization.', 'start': 456.465, 'duration': 9.128}, {'end': 469.417, 'text': 'We have got our bar charts and everything finally.', 'start': 466.574, 'duration': 2.843}, {'end': 474.141, 'text': 'We also have state and county-wise distribution of votes.', 'start': 469.717, 'duration': 4.424}, {'end': 483.849, 'text': 'So these The visualizations that will help you derive a conclusion to derive an answer, whatever answer that Donald Trump wants from you.', 'start': 474.441, 'duration': 9.408}], 'summary': 'Visualizations compare votes for hillary clinton and bernie sanders, including state and county-wise distribution.', 'duration': 33.59, 'max_score': 450.259, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o450259.jpg'}, {'end': 545.811, 'src': 'embed', 'start': 518.572, 'weight': 4, 'content': [{'end': 529.924, 'text': 'And what they want to do is they want to maximize their profit by finding out the beehive points where they can get a lot of pickups and getting their cabs there during peak hours.', 'start': 518.572, 'duration': 11.352}, {'end': 532.067, 'text': 'So this is your second task.', 'start': 530.365, 'duration': 1.702}, {'end': 535.248, 'text': 'So again, the first thing is that you needed a data set.', 'start': 532.327, 'duration': 2.921}, {'end': 545.811, 'text': 'So this is our Uber data set that has been given to you in order to 
analyze and find out what were the peak hours and how much cabs are expected in those locations during the peak hours.', 'start': 535.688, 'duration': 10.123}], 'summary': 'Maximize profit by identifying beehive points for cab pickups during peak hours using uber data set.', 'duration': 27.239, 'max_score': 518.572, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o518572.jpg'}, {'end': 869.276, 'src': 'heatmap', 'start': 806.217, 'weight': 0.732, 'content': [{'end': 810.98, 'text': "and it'll be stored in a distributed manner in commodity hardware for processing.", 'start': 806.217, 'duration': 4.763}, {'end': 817.203, 'text': "You've got YARN, which stands for Yet Another Resource Negotiator, And this is the processing unit of Hadoop,", 'start': 811.06, 'duration': 6.143}, {'end': 822.986, 'text': 'which allows parallel processing of the distributed data across your Hadoop cluster in HDFS.', 'start': 817.203, 'duration': 5.783}, {'end': 825.208, 'text': "Then we've got Spark.", 'start': 823.767, 'duration': 1.441}, {'end': 828.73, 'text': 'So Apache Spark is one of the most popular projects by Apache.', 'start': 825.348, 'duration': 3.382}, {'end': 834.193, 'text': 'And this is an open source cluster computing framework for real-time processing.', 'start': 829.03, 'duration': 5.163}, {'end': 838.035, 'text': 'Where on the other hand, Hadoop is used for batch processing.', 'start': 834.633, 'duration': 3.402}, {'end': 839.996, 'text': 'Spark is used for real-time processing.', 'start': 838.115, 'duration': 1.881}, {'end': 842.598, 'text': 'Because with Spark, the processing happens in memory.', 'start': 840.256, 'duration': 2.342}, {'end': 849.521, 'text': 'And it provides you with an interface for programming entire clusters with implicit data parallelism and fault tolerance.', 'start': 842.998, 'duration': 6.523}, {'end': 858.786, 'text': 'So what is data parallelism? 
Data parallelism is a form of parallelization across multiple processes in parallel computing environments.', 'start': 850.182, 'duration': 8.604}, {'end': 861.108, 'text': 'A lot of parallel words in that sentence.', 'start': 859.447, 'duration': 1.661}, {'end': 869.276, 'text': 'So let me tell you simply that it basically means distributing your data across nodes which operate on the data parallel,', 'start': 862.789, 'duration': 6.487}], 'summary': 'Hadoop enables batch processing, while spark allows real-time processing with in-memory operations and data parallelism.', 'duration': 63.059, 'max_score': 806.217, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o806217.jpg'}, {'end': 1036.41, 'src': 'embed', 'start': 995.06, 'weight': 5, 'content': [{'end': 1001.185, 'text': "It's got enhanced features like in memory processing, machine learning capabilities, and you can use it with Hadoop.", 'start': 995.06, 'duration': 6.125}, {'end': 1007.53, 'text': 'Hadoop uses commodity hardware, which can give you better processing with minimum cost.', 'start': 1001.445, 'duration': 6.085}, {'end': 1013.715, 'text': 'These are the benefits that you get when you combine Spark and Hadoop together in order to analyze big data.', 'start': 1008.451, 'duration': 5.264}, {'end': 1016.277, 'text': "Let's see some of the big data use cases.", 'start': 1014.176, 'duration': 2.101}, {'end': 1019.64, 'text': 'So the first big data use case is web e-tailing.', 'start': 1016.357, 'duration': 3.283}, {'end': 1022.362, 'text': 'The recommendation engines.', 'start': 1020.28, 'duration': 2.082}, {'end': 1027.146, 'text': 'whenever you go out on Amazon or any other online shopping site in order to buy something,', 'start': 1022.362, 'duration': 4.784}, {'end': 1032.328, 'text': 'you will see some recommended items popping below your screen or to the side of your screen.', 'start': 1027.506, 'duration': 4.822}, {'end': 1036.41, 'text': 'And that is all generated using big data analytics and ad targeting.', 'start': 1032.627, 'duration': 3.783}], 'summary': 'Spark and hadoop offer enhanced features, enabling better processing with minimum cost, and are used in big data analytics for web e-tailing and recommendation engines.', 'duration': 41.35, 'max_score': 995.06, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o995060.jpg'}], 'start': 254.419, 'title': 'Dataset fields and big data analysis strategy', 'summary': "Discusses dataset fields including population data from 2014 and 2010, sex ratio, ethnicity breakdown, and age groups for analyzing hillary clinton's victory. it also outlines a big data analysis strategy involving steps such as data processing, visualization, and learning components like hadoop, spark, k-means, and zeppelin for deriving insights for u.s. 
elections and instant cabs.", 'chapters': [{'end': 299.818, 'start': 254.419, 'title': 'Dataset fields for analyzing election results', 'summary': "Discusses a dataset containing population data from 2014 and 2010, sex ratio, ethnicity breakdown, and age groups, aiming to analyze factors contributing to hillary clinton's victory.", 'duration': 45.399, 'highlights': ['The dataset contains fields such as population in 2014 and 2010, sex ratio, ethnicity breakdown including Asians, Hispanics, black Americans, and black Africans, and age group distribution including infants, senior citizens, and adults, providing comprehensive data for analysis.', 'The dataset aims to analyze factors contributing to the winning of Hillary Clinton, indicating the relevance and significance of the data for political analysis.']}, {'end': 1077.253, 'start': 300.378, 'title': 'Big data analysis strategy', 'summary': 'Outlines the strategy for analyzing big data, including steps such as data processing, visualization, and learning components like hadoop, spark, k-means, and zeppelin, to derive insights for use cases such as u.s. elections and instant cabs.', 'duration': 776.875, 'highlights': ['Data Processing Strategy The first step involves understanding the dataset, storing it in HDFS, processing data using Spark components like Spark SQL and Spark MLib, transforming the data by filtering rows and columns, clustering the data using k-means, and visualizing the results using Zeppelin.', 'Use Case: U.S. Elections Visualization of the U.S. elections data includes analysis of ethnicities, comparing votes of Hillary Clinton and Bernie Sanders, creating area graphs, bar charts, and state and county-wise distribution of votes, all to derive insights and conclusions for the use case.', 'Use Case: Instant Cabs The second use case focuses on analyzing Uber data to identify peak hours and demand for cabs at specific locations, involving steps such as data transformation, clustering using k-means, and visualization to identify beehive points for maximizing profit.', 'Learning Components The learning components include an introduction to Hadoop and Spark, understanding the Hadoop ecosystem, using tools like Apache Spark for improved analysis, and the benefits of combining Spark and Hadoop for big data analysis.', 'Big Data Use Cases Various big data use cases are highlighted, including web e-tailing, recommendation engines, ad targeting, search quality abuse and click fraud detection, telecommunications, government applications, healthcare and life sciences, fraud detection, and cybersecurity.']}], 'duration': 822.834, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o254419.jpg', 'highlights': ['The dataset contains fields such as population in 2014 and 2010, sex ratio, ethnicity breakdown, and age group distribution, providing comprehensive data for analysis.', 'The dataset aims to analyze factors contributing to the winning of Hillary Clinton, indicating the relevance and significance of the data for political analysis.', 'The first step involves understanding the dataset, storing it in HDFS, processing data using Spark components like Spark SQL and Spark MLib, transforming the data by filtering rows and columns, clustering the data using k-means, and visualizing the results using Zeppelin.', 'Visualization of the U.S. 
elections data includes analysis of ethnicities, comparing votes of Hillary Clinton and Bernie Sanders, creating area graphs, bar charts, and state and county-wise distribution of votes, all to derive insights and conclusions for the use case.', 'The second use case focuses on analyzing Uber data to identify peak hours and demand for cabs at specific locations, involving steps such as data transformation, clustering using k-means, and visualization to identify beehive points for maximizing profit.', 'The learning components include an introduction to Hadoop and Spark, understanding the Hadoop ecosystem, using tools like Apache Spark for improved analysis, and the benefits of combining Spark and Hadoop for big data analysis.', 'Various big data use cases are highlighted, including web e-tailing, recommendation engines, ad targeting, search quality abuse and click fraud detection, telecommunications, government applications, healthcare and life sciences, fraud detection, and cybersecurity.']}, {'end': 1683.83, 'segs': [{'end': 1141.836, 'src': 'embed', 'start': 1077.253, 'weight': 0, 'content': [{'end': 1080.216, 'text': 'healthcare service quality improvements and drug safety.', 'start': 1077.253, 'duration': 2.963}, {'end': 1088.029, 'text': 'Now let me tell you that with big data analytics, it has been very easy in order to diagnose a particular disease and find out the cure also.', 'start': 1080.66, 'duration': 7.369}, {'end': 1090.892, 'text': 'So these are some more big data use cases.', 'start': 1088.489, 'duration': 2.403}, {'end': 1099.823, 'text': 'It is also used in banks and financial services for modeling true risk, fraud detection, credit card scoring, analysis, and many more.', 'start': 1091.413, 'duration': 8.41}, {'end': 1105.008, 'text': 'It can be used in retail, transportation services, hotels, and food delivery services.', 'start': 1100.303, 'duration': 4.705}, {'end': 1111.375, 'text': 'Actually every field you name, no matter whatever business you have, if you are able to use Big Data efficiently,', 'start': 1105.689, 'duration': 5.686}, {'end': 1118.363, 'text': 'your company will grow and you will be gaining different insights by using Big Data analytics and hence improve your business even more.', 'start': 1111.375, 'duration': 6.988}, {'end': 1126.248, 'text': "Nowadays everyone is using big data and you've seen different fields and everything is different from each other,", 'start': 1119.264, 'duration': 6.984}, {'end': 1133.091, 'text': 'but everyone is using big data analytics and big data analysis can be done with tools like Hadoop and Spark, etc.', 'start': 1126.248, 'duration': 6.843}, {'end': 1141.836, 'text': 'So this is why big data analytics is very much in demand today and why it is very important for you to learn how to perform big data analytics with tools like this.', 'start': 1134.012, 'duration': 7.824}], 'summary': 'Big data analytics enables improvements in healthcare, finance, retail, and more, leading to business growth and valuable insights.', 'duration': 64.583, 'max_score': 1077.253, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1077253.jpg'}, {'end': 1223.261, 'src': 'embed', 'start': 1194.541, 'weight': 3, 'content': [{'end': 1201.965, 'text': 'So HDFS stands for Hadoop Distributed File System, and this is the storage unit for Hadoop, and here is the architecture of HDFS.', 'start': 1194.541, 'duration': 7.424}, {'end': 1207.789, 'text': 'Since I already told you that it is a 
master-slave architecture.', 'start': 1204.387, 'duration': 3.402}, {'end': 1213.493, 'text': 'so the master node is known as name node and slave nodes are known as data nodes.', 'start': 1207.789, 'duration': 5.704}, {'end': 1217.096, 'text': "and then we've got another node here which is known as secondary name node.", 'start': 1213.493, 'duration': 3.603}, {'end': 1223.261, 'text': "Now don't get confused that secondary name node is just going to be a replacement for name node, because it is not.", 'start': 1217.436, 'duration': 5.825}], 'summary': 'Hdfs is the storage unit for hadoop, with master-slave architecture, including name nodes, data nodes, and secondary name node.', 'duration': 28.72, 'max_score': 1194.541, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1194541.jpg'}, {'end': 1497.782, 'src': 'heatmap', 'start': 1323.608, 'weight': 0.971, 'content': [{'end': 1328.13, 'text': 'It just tells the name node what the data node is alive and functioning properly.', 'start': 1323.608, 'duration': 4.522}, {'end': 1330.511, 'text': 'Now comes the secondary name node.', 'start': 1328.37, 'duration': 2.141}, {'end': 1335.353, 'text': 'The secondary name node does a very important task, and that task is known as checkpointing.', 'start': 1330.831, 'duration': 4.522}, {'end': 1340.235, 'text': 'Checkpointing is the process of combining edit logs with FSImage.', 'start': 1335.793, 'duration': 4.442}, {'end': 1344.599, 'text': 'Now let me tell you what an edit log is and what is an FS image.', 'start': 1340.595, 'duration': 4.004}, {'end': 1353.846, 'text': "Let's say that I have set up my Hadoop cluster 20 days back and whatever transactions that happen with every new data blocks are stored in my HDFS.", 'start': 1344.799, 'duration': 9.047}, {'end': 1359.932, 'text': 'Whatever data blocks are deleted, every transaction is combined in a file known as an FS image.', 'start': 1354.067, 'duration': 5.865}, {'end': 1362.454, 'text': 'And the FS image resides in your disk.', 'start': 1360.172, 'duration': 2.282}, {'end': 1365.616, 'text': "And there's one more similar file which is known as an edit log.", 'start': 1362.474, 'duration': 3.142}, {'end': 1371.761, 'text': 'Now, edit logs will not keep the record of transactions 20 days back, but just a few hours back.', 'start': 1366.357, 'duration': 5.404}, {'end': 1377.584, 'text': "Now, let's say it will keep the record and the transaction details that happened in the past four hours,", 'start': 1372.001, 'duration': 5.583}, {'end': 1381.887, 'text': 'and checkpointing is the task of combining the edit log with an FS image.', 'start': 1377.584, 'duration': 4.303}, {'end': 1386.43, 'text': 'It allows faster failover as we have a backup of the metadata.', 'start': 1382.147, 'duration': 4.283}, {'end': 1393.002, 'text': "So a situation where a name node goes down and the entire metadata is lost, we don't have to worry.", 'start': 1387.777, 'duration': 5.225}, {'end': 1399.769, 'text': 'We can set up a new name node and get the same transactional files and the metadata from the secondary name node,', 'start': 1393.443, 'duration': 6.326}, {'end': 1406.095, 'text': 'because it has been keeping an updated copy and checkpointing happens after every hour, but you can also configure it.', 'start': 1399.769, 'duration': 6.326}, {'end': 1410.119, 'text': "So let's understand the process of checkpointing in detail.", 'start': 1406.816, 'duration': 3.303}, {'end': 1413.082, 'text': "Here's the FS 
image and the edit log.", 'start': 1410.92, 'duration': 2.162}, {'end': 1417.747, 'text': 'So the FS image in the disk and the edit log resides in your RAM.', 'start': 1413.803, 'duration': 3.944}, {'end': 1427.156, 'text': 'What the secondary name node does is that it first copies the FS image and the edit log and adds them together in order to get the updated FS image,', 'start': 1418.147, 'duration': 9.009}, {'end': 1430.219, 'text': 'and then this FS image is copied back to the name node.', 'start': 1427.156, 'duration': 3.063}, {'end': 1437.306, 'text': 'Now the name node has an updated FS image and in the meantime a new edit log is created when the checkpointing is happening.', 'start': 1430.539, 'duration': 6.767}, {'end': 1445.454, 'text': 'So this process keeps going on and hence it helps the name node in order to keep an updated copy of the FS image of the transactions every hour.', 'start': 1437.426, 'duration': 8.028}, {'end': 1449.079, 'text': "Now let's talk about the data nodes.", 'start': 1446.997, 'duration': 2.082}, {'end': 1454.425, 'text': 'These are the slave daemons, and this is where your actual data is stored.', 'start': 1449.96, 'duration': 4.465}, {'end': 1461.492, 'text': 'And whenever a client gives a read or write request, the data node serves it because the data is actually stored in the data nodes.', 'start': 1454.545, 'duration': 6.947}, {'end': 1465.316, 'text': 'So this is all about the components in HDFS.', 'start': 1462.373, 'duration': 2.943}, {'end': 1469.119, 'text': "Now let's understand the entire HDFS architecture in detail.", 'start': 1465.456, 'duration': 3.663}, {'end': 1473.583, 'text': "So we've got different data nodes here and we can set up different data nodes in racks.", 'start': 1469.68, 'duration': 3.903}, {'end': 1478.728, 'text': "In rack one, we've got three different data nodes and in rack two, we've got two different data nodes.", 'start': 1473.883, 'duration': 4.845}, {'end': 1486.775, 'text': 'And each of the data nodes contains different data block because in data nodes, the data is stored in blocks.', 'start': 1479.508, 'duration': 7.267}, {'end': 1489.037, 'text': "And so we'll learn about that in the coming slides.", 'start': 1486.935, 'duration': 2.102}, {'end': 1492.399, 'text': 'So the client can request either a read or write.', 'start': 1489.597, 'duration': 2.802}, {'end': 1495.781, 'text': "And let's say that the client requests to read a particular file.", 'start': 1492.619, 'duration': 3.162}, {'end': 1497.782, 'text': 'It will first go to the name node.', 'start': 1495.941, 'duration': 1.841}], 'summary': 'Secondary name node performs checkpointing every hour, ensuring metadata backup for faster failover in hadoop cluster.', 'duration': 174.174, 'max_score': 1323.608, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1323608.jpg'}, {'end': 1359.932, 'src': 'embed', 'start': 1330.831, 'weight': 4, 'content': [{'end': 1335.353, 'text': 'The secondary name node does a very important task, and that task is known as checkpointing.', 'start': 1330.831, 'duration': 4.522}, {'end': 1340.235, 'text': 'Checkpointing is the process of combining edit logs with FSImage.', 'start': 1335.793, 'duration': 4.442}, {'end': 1344.599, 'text': 'Now let me tell you what an edit log is and what is an FS image.', 'start': 1340.595, 'duration': 4.004}, {'end': 1353.846, 'text': "Let's say that I have set up my Hadoop cluster 20 days back and whatever transactions that happen with every 
new data blocks are stored in my HDFS.", 'start': 1344.799, 'duration': 9.047}, {'end': 1359.932, 'text': 'Whatever data blocks are deleted, every transaction is combined in a file known as an FS image.', 'start': 1354.067, 'duration': 5.865}], 'summary': 'Secondary name node performs checkpointing by combining edit logs with fsimage in hadoop cluster, storing transactions and deleted data blocks.', 'duration': 29.101, 'max_score': 1330.831, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1330831.jpg'}, {'end': 1445.454, 'src': 'embed', 'start': 1418.147, 'weight': 5, 'content': [{'end': 1427.156, 'text': 'What the secondary name node does is that it first copies the FS image and the edit log and adds them together in order to get the updated FS image,', 'start': 1418.147, 'duration': 9.009}, {'end': 1430.219, 'text': 'and then this FS image is copied back to the name node.', 'start': 1427.156, 'duration': 3.063}, {'end': 1437.306, 'text': 'Now the name node has an updated FS image and in the meantime a new edit log is created when the checkpointing is happening.', 'start': 1430.539, 'duration': 6.767}, {'end': 1445.454, 'text': 'So this process keeps going on and hence it helps the name node in order to keep an updated copy of the FS image of the transactions every hour.', 'start': 1437.426, 'duration': 8.028}], 'summary': 'Secondary name node copies and updates fs image and edit log, ensuring an updated copy of transactions every hour.', 'duration': 27.307, 'max_score': 1418.147, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1418147.jpg'}, {'end': 1613.021, 'src': 'embed', 'start': 1590.261, 'weight': 8, 'content': [{'end': 1600.348, 'text': "How many blocks will it create? 
All right, so AJ says it's four, and Rohit says it's four, and of course you guys are right, it's four blocks.", 'start': 1590.261, 'duration': 10.087}, {'end': 1607.873, 'text': 'The first three blocks will be 128 megabytes, and the last block will just occupy the remaining file size, which is 116 megabytes.', 'start': 1600.588, 'duration': 7.285}, {'end': 1613.021, 'text': "So now let's discuss block replication.", 'start': 1610.799, 'duration': 2.222}], 'summary': 'Four blocks created, with 3 blocks at 128mb and 1 at 116mb.', 'duration': 22.76, 'max_score': 1590.261, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1590261.jpg'}, {'end': 1683.83, 'src': 'embed', 'start': 1639.181, 'weight': 6, 'content': [{'end': 1645.263, 'text': 'So my block one is there three times and block two is also in there three times in three different data nodes.', 'start': 1639.181, 'duration': 6.082}, {'end': 1654.425, 'text': 'So we use this replication factor so that if any of the data nodes goes down, we can retrieve the data block back from the two different data nodes.', 'start': 1645.583, 'duration': 8.842}, {'end': 1660.21, 'text': 'So this is how data blocks are replicated in HDFS.', 'start': 1655.245, 'duration': 4.965}, {'end': 1668.699, 'text': "Now in order to do the replication properly, there's an algorithm which is known as Rack Awareness and it provides us optimized fault tolerance.", 'start': 1660.45, 'duration': 8.249}, {'end': 1678.088, 'text': 'Rack Awareness algorithm says that the first replica of a block will be stored in a local rack and the next two replicas will be there in a different rack.', 'start': 1669.119, 'duration': 8.969}, {'end': 1683.83, 'text': 'So we store a data block in rack 1 so that our latency is decreased.', 'start': 1678.869, 'duration': 4.961}], 'summary': 'Data blocks are replicated three times across three different data nodes using a replication factor, with rack awareness algorithm optimizing fault tolerance by storing the first replica in a local rack and the next two replicas in a different rack.', 'duration': 44.649, 'max_score': 1639.181, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1639181.jpg'}], 'start': 1077.253, 'title': 'Big data analytics applications', 'summary': "Highlights the wide applications of big data analytics across industries, emphasizing the importance of efficient big data usage for business growth, and provides a detailed explanation of hadoop's hdfs architecture and its components. it also explains the significance of hdfs checkpointing, interactions with data nodes, and details the block replication process with a focus on default block size and replication factor.", 'chapters': [{'end': 1330.511, 'start': 1077.253, 'title': 'Big data analytics in various industries', 'summary': "Highlights the wide applications of big data analytics across industries like healthcare, finance, retail, transportation, hotels, and food delivery, emphasizing the importance of efficient big data usage for business growth and insights, and provides a detailed explanation of hadoop's hdfs architecture and the functions of its components.", 'duration': 253.258, 'highlights': ['Big data analytics is used in healthcare for diagnosing diseases and finding cures, and in banks and financial services for risk modeling, fraud detection, and credit card scoring. 
Key applications of big data analytics in healthcare and financial services showcase its impact on disease diagnosis, cure discovery, risk modeling, fraud detection, and credit card scoring.', 'Efficient use of big data analytics in any business field leads to company growth and valuable insights. Emphasizes the importance of efficient big data usage in all business fields for company growth and valuable insights.', 'Big data analysis can be performed using tools like Hadoop and Spark, with a focus on the high demand and importance of learning these tools. Highlights the high demand and importance of learning big data analysis tools like Hadoop and Spark for performing big data analysis.', "Detailed explanation of Hadoop's HDFS architecture and the functions of its components, including the name node, data nodes, and secondary name node. Provides a detailed explanation of Hadoop's HDFS architecture and the functions of its components, such as the name node, data nodes, and secondary name node."]}, {'end': 1683.83, 'start': 1330.831, 'title': 'Hdfs checkpointing and block replication', 'summary': 'Explains the significance of hdfs checkpointing in updating fs image, illustrates how hdfs architecture interacts with data nodes, and details the block replication process with a focus on the default block size and replication factor.', 'duration': 352.999, 'highlights': ['HDFS checkpointing is crucial for combining edit logs with FSImage, ensuring faster failover by providing a backup of metadata, and allowing the setup of a new name node with updated transactional files and metadata from the secondary name node. Checkpointing happens after every hour.', 'The process of checkpointing involves the secondary name node copying the FS image and edit log, combining them, and then copying the updated FS image back to the name node, ensuring an updated copy of FS image every hour. The secondary name node keeps an updated copy of FS image of the transactions every hour.', 'HDFS replicates each block two times, resulting in a replication factor of three to ensure fault tolerance and data retrieval in case of data node failures. Each block is copied two times.', 'Rack Awareness algorithm dictates the storage of the first block replica in a local rack and the next two replicas in different racks, optimizing fault tolerance and decreasing latency. The algorithm ensures the first replica of a block is stored in a local rack.', 'Files in HDFS are broken down into blocks, with a default block size of 128 MB, and each block is replicated two times to ensure fault tolerance. 
Default block size of each block is 128 MB.']}], 'duration': 606.577, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1077253.jpg', 'highlights': ['Key applications of big data analytics in healthcare and financial services showcase its impact on disease diagnosis, cure discovery, risk modeling, fraud detection, and credit card scoring.', 'Emphasizes the importance of efficient big data usage in all business fields for company growth and valuable insights.', 'Highlights the high demand and importance of learning big data analysis tools like Hadoop and Spark for performing big data analysis.', "Provides a detailed explanation of Hadoop's HDFS architecture and the functions of its components, such as the name node, data nodes, and secondary name node.", 'HDFS checkpointing is crucial for combining edit logs with FSImage, ensuring faster failover by providing a backup of metadata, and allowing the setup of a new name node with updated transactional files and metadata from the secondary name node.', 'The process of checkpointing involves the secondary name node copying the FS image and edit log, combining them, and then copying the updated FS image back to the name node, ensuring an updated copy of FS image every hour.', 'HDFS replicates each block two times, resulting in a replication factor of three to ensure fault tolerance and data retrieval in case of data node failures.', 'Rack Awareness algorithm dictates the storage of the first block replica in a local rack and the next two replicas in different racks, optimizing fault tolerance and decreasing latency.', 'Default block size of each block is 128 MB.']}, {'end': 2128.111, 'segs': [{'end': 1713.458, 'src': 'embed', 'start': 1684.17, 'weight': 0, 'content': [{'end': 1687.811, 'text': 'Now these are the commands that you will use to start your Hadoop daemons.', 'start': 1684.17, 'duration': 3.641}, {'end': 1696.013, 'text': 'Your Hadoop daemons like your name node, your secondary name node, and your data nodes in the slave machine.', 'start': 1688.371, 'duration': 7.642}, {'end': 1700.234, 'text': 'So in order to start all the Hadoop daemons in HDFS and YARN.', 'start': 1696.593, 'duration': 3.641}, {'end': 1705.335, 'text': 'I have not explained YARN yet, but YARN is the processing unit of Hadoop, so it will start all the YARN daemons,', 'start': 1700.234, 'duration': 5.101}, {'end': 1707.956, 'text': 'like the resource manager and the node manager also.', 'start': 1705.335, 'duration': 2.621}, {'end': 1713.458, 'text': 'Then this is the command to stop all the Hadoop daemons with JPS.', 'start': 1708.516, 'duration': 4.942}], 'summary': 'Commands to start and stop hadoop daemons for hdfs and yarn explained.', 'duration': 29.288, 'max_score': 1684.17, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1684170.jpg'}, {'end': 1817.049, 'src': 'embed', 'start': 1773.731, 'weight': 2, 'content': [{'end': 1776.092, 'text': 'This is the terminal of my slave machine.', 'start': 1773.731, 'duration': 2.361}, {'end': 1782.236, 'text': "I'm just going to run JPS here and these are the processes or the daemons that are running in my slave machine.", 'start': 1776.653, 'duration': 5.583}, {'end': 1787.64, 'text': 'So node manager and data node both are slave daemons and they are running in my slave machine.', 'start': 1782.276, 'duration': 5.364}, {'end': 1794.642, 'text': 'If you want to stop all the daemons, you can run the same command, 
only instead of start, you can just put a stop here.', 'start': 1788.3, 'duration': 6.342}, {'end': 1800.444, 'text': "So since I'm going to use my HDFS, I'm not going to stop it and show you, but the process is the same.", 'start': 1795.362, 'duration': 5.082}, {'end': 1803.825, 'text': 'So these are a few commands that you can use to write or delete a file in Hadoop.', 'start': 1800.504, 'duration': 3.321}, {'end': 1810.647, 'text': 'If you want to copy a file from your local file system to your HDFS, use this command, hadoopfs-put.', 'start': 1803.965, 'duration': 6.682}, {'end': 1812.288, 'text': 'This is the name of your file.', 'start': 1811.148, 'duration': 1.14}, {'end': 1817.049, 'text': 'So you have to type the proper path of the file so that you can copy to HDFS.', 'start': 1812.688, 'duration': 4.361}], 'summary': 'Slave machine runs node manager and data node daemons. commands to manage hdfs files are demonstrated.', 'duration': 43.318, 'max_score': 1773.731, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1773731.jpg'}, {'end': 1903.242, 'src': 'embed', 'start': 1860.083, 'weight': 1, 'content': [{'end': 1868.666, 'text': 'So when Hadoop came up with its new version, Hadoop 2.0,, it introduced YARN as the new framework and it stands for Yet Another Resource Negotiator,', 'start': 1860.083, 'duration': 8.583}, {'end': 1872.549, 'text': 'and it provides the ability to run non-map-reduced applications.', 'start': 1868.666, 'duration': 3.883}, {'end': 1878.575, 'text': 'and because of YARN we are able to integrate with different tools like Apache, Spark, Hive Pig, etc.', 'start': 1872.549, 'duration': 6.026}, {'end': 1881.899, 'text': ', and it provides us with a paradigm for parallel processing over Hadoop.', 'start': 1878.575, 'duration': 3.324}, {'end': 1886.669, 'text': "Now, when you're dumping all your data into HDFS,", 'start': 1883.226, 'duration': 3.443}, {'end': 1892.233, 'text': 'it is getting distributed and all this distributed data we are processing in parallel and it is done with the help of YARN.', 'start': 1886.669, 'duration': 5.564}, {'end': 1898.338, 'text': 'And you can see over here that the architecture of YARN, so that it is, again, a master-slave topology.', 'start': 1892.954, 'duration': 5.384}, {'end': 1903.242, 'text': 'So the master daemon here is known as resource manager, and slave daemons are known as node manager.', 'start': 1898.538, 'duration': 4.704}], 'summary': 'Hadoop 2.0 introduced yarn, enabling parallel processing and integration with tools like apache, spark, and hive, with a master-slave topology.', 'duration': 43.159, 'max_score': 1860.083, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1860083.jpg'}, {'end': 1958.653, 'src': 'embed', 'start': 1926.813, 'weight': 5, 'content': [{'end': 1932.516, 'text': 'the resource manager takes that request and passes the request to the corresponding node managers.', 'start': 1926.813, 'duration': 5.703}, {'end': 1935.238, 'text': "Now let's see what is a node manager.", 'start': 1932.896, 'duration': 2.342}, {'end': 1943.242, 'text': 'Node managers are slave daemons and they are installed on every data node so that you know that our data is divided up into blocks and are stored in the data nodes,', 'start': 1935.418, 'duration': 7.824}, {'end': 1945.183, 'text': 'and they are processed in the same machine.', 'start': 1943.242, 'duration': 1.941}, {'end': 1949.326, 'text': 'So, in the 
same machine where the data node is set,', 'start': 1946.224, 'duration': 3.102}, {'end': 1958.653, 'text': 'a node manager is also present to process all the data present in that data node and it is responsible for the execution of the task on every single data node.', 'start': 1949.326, 'duration': 9.327}], 'summary': 'Node managers process data blocks on each data node.', 'duration': 31.84, 'max_score': 1926.813, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1926813.jpg'}, {'end': 2006.283, 'src': 'embed', 'start': 1983.183, 'weight': 6, 'content': [{'end': 1990.509, 'text': 'Now an AppMaster is launched for every specific application code or every job or every processing task that the client comes up with.', 'start': 1983.183, 'duration': 7.326}, {'end': 1998.716, 'text': 'So the application master of the AppMaster is responsible to handle and take care of all the resources that is required in order to execute that code.', 'start': 1990.969, 'duration': 7.747}, {'end': 2006.283, 'text': 'So if there is any requirement for any resource, it is the app master who asks for the resources from the resource manager.', 'start': 1999.096, 'duration': 7.187}], 'summary': 'An appmaster is launched for every application code, responsible for handling resources and making resource requests from the resource manager.', 'duration': 23.1, 'max_score': 1983.183, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1983183.jpg'}, {'end': 2095.135, 'src': 'embed', 'start': 2064.315, 'weight': 7, 'content': [{'end': 2069.618, 'text': 'After that, when all the resources are provided, the node manager launches the container and starts the container.', 'start': 2064.315, 'duration': 5.303}, {'end': 2071.539, 'text': 'This is where the job executes.', 'start': 2070.058, 'duration': 1.481}, {'end': 2076.179, 'text': "Now let's take a look at the entire YARN application workflow step by step.", 'start': 2072.495, 'duration': 3.684}, {'end': 2080.683, 'text': 'So the first step is the client submits an application to the resource manager.', 'start': 2076.899, 'duration': 3.784}, {'end': 2084.426, 'text': 'Then the resource manager allocates the container to start the app master.', 'start': 2080.763, 'duration': 3.663}, {'end': 2095.135, 'text': 'Then the app master registers with the resource manager and tells the resource manager that an app master has been created and it is ready to oversee the execution of the code.', 'start': 2084.786, 'duration': 10.349}], 'summary': 'Yarn application workflow: client submits, resource manager allocates, app master oversees.', 'duration': 30.82, 'max_score': 2064.315, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2064315.jpg'}], 'start': 1684.17, 'title': 'Hadoop daemons, commands, and yarn', 'summary': "Covers starting/stopping hadoop daemons, using jps to check running daemons, and commands to manage files in hdfs, while introducing yarn's architecture, components, workflow, enabling parallel processing, and integration with apache, spark, hive pig, etc.", 'chapters': [{'end': 1840.135, 'start': 1684.17, 'title': 'Hadoop daemons and commands', 'summary': 'Covers starting and stopping hadoop daemons, using jps to check running daemons, and commands to copy, list, and remove files in hdfs.', 'duration': 155.965, 'highlights': ['Hadoop daemons like name node, secondary name node, and data nodes can be 
started using specific commands. The commands to start Hadoop daemons include name node, secondary name node, and data nodes.', 'YARN, the processing unit of Hadoop, can start resource manager and node manager daemons. YARN is the processing unit of Hadoop and can start resource manager and node manager daemons.', 'Using JPS command helps in checking the running daemons in the master and slave machines. The JPS command can be used to check the running daemons in both master and slave machines.', 'Commands like hadoopfs-put, hadoopfs-ls, and hadoopfs-rm can be used for file operations in HDFS. Specific commands like hadoopfs-put, hadoopfs-ls, and hadoopfs-rm are used for file operations in HDFS.']}, {'end': 2128.111, 'start': 1840.635, 'title': 'Understanding yarn in hadoop', 'summary': 'Introduces yarn as the processing unit of hadoop, explaining its architecture, components, and workflow, enabling parallel processing and integration with tools like apache, spark, hive pig, etc.', 'duration': 287.476, 'highlights': ['YARN introduced as the processing unit of Hadoop, providing parallel processing and integration with various tools like Apache, Spark, Hive Pig, etc. YARN serves as the processing unit of Hadoop, enabling parallel processing and integration with tools like Apache, Spark, Hive Pig, etc.', 'Resource Manager and Node Manager explained, with the former receiving and managing processing requests, and the latter responsible for executing tasks on data nodes. Resource Manager receives and manages processing requests, while Node Manager executes tasks on data nodes.', "AppMaster's role in handling resources for executing specific application codes and tasks, working in coordination with the Resource Manager and Node Managers. AppMaster handles resources for executing specific application codes and tasks, coordinating with Resource Manager and Node Managers.", "Detailed workflow of YARN application, including client's submission, resource allocation, app master's registration, container launch, code execution, and status verification. 
The detailed workflow of YARN application encompasses client's submission, resource allocation, app master's registration, container launch, code execution, and status verification."]}], 'duration': 443.941, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o1684170.jpg', 'highlights': ['Commands to start Hadoop daemons include name node, secondary name node, and data nodes.', 'YARN is the processing unit of Hadoop and can start resource manager and node manager daemons.', 'The JPS command can be used to check the running daemons in both master and slave machines.', 'Specific commands like hadoopfs-put, hadoopfs-ls, and hadoopfs-rm are used for file operations in HDFS.', 'YARN serves as the processing unit of Hadoop, enabling parallel processing and integration with tools like Apache, Spark, Hive Pig, etc.', 'Resource Manager receives and manages processing requests, while Node Manager executes tasks on data nodes.', 'AppMaster handles resources for executing specific application codes and tasks, coordinating with Resource Manager and Node Managers.', "The detailed workflow of YARN application encompasses client's submission, resource allocation, app master's registration, container launch, code execution, and status verification."]}, {'end': 2910.458, 'segs': [{'end': 2214.65, 'src': 'embed', 'start': 2187.791, 'weight': 2, 'content': [{'end': 2192.995, 'text': "So don't get confused that these will not also be in the chain machine, which is not the case.", 'start': 2187.791, 'duration': 5.204}, {'end': 2197.579, 'text': 'Now let me tell you about the Hadoop cluster hardware specification,', 'start': 2193.396, 'duration': 4.183}, {'end': 2202.964, 'text': 'some of the hardware specifics that you should keep in mind if you want to set up a Hadoop cluster.', 'start': 2197.579, 'duration': 5.385}, {'end': 2209.307, 'text': 'So for the name node you need a ram with 64 gigs and your hard disk should be a minimum of one terabyte.', 'start': 2203.544, 'duration': 5.763}, {'end': 2214.65, 'text': 'The processor should be a xenon 8 core and the ethernet should be 3 by 10 gigabytes.', 'start': 2209.367, 'duration': 5.283}], 'summary': 'Hadoop cluster hardware: name node requires 64gb ram, 1tb hdd, xenon 8-core cpu, 3x10gb ethernet.', 'duration': 26.859, 'max_score': 2187.791, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2187791.jpg'}, {'end': 2309.956, 'src': 'embed', 'start': 2281.868, 'weight': 0, 'content': [{'end': 2284.529, 'text': 'So these are the hardware specifications required to do that.', 'start': 2281.868, 'duration': 2.661}, {'end': 2288.271, 'text': 'Now let me tell you about some real Hadoop cluster deployment.', 'start': 2285.229, 'duration': 3.042}, {'end': 2291.872, 'text': 'Let us consider our favorite example which is Facebook.', 'start': 2289.091, 'duration': 2.781}, {'end': 2303.774, 'text': 'So Facebook has got 21 petabytes of storage in a single HDFS cluster, and 21 petabytes is equal to 10 raised to the power of 15 bytes,', 'start': 2292.792, 'duration': 10.982}, {'end': 2309.956, 'text': "and they've got 2000 machines per cluster and 32 gig of RAM per machine.", 'start': 2303.774, 'duration': 6.182}], 'summary': "Facebook's hadoop cluster has 21 petabytes of storage, 2000 machines per cluster, and 32gb ram per machine.", 'duration': 28.088, 'max_score': 2281.868, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2281868.jpg'}, {'end': 2405.543, 'src': 'embed', 'start': 2374.794, 'weight': 1, 'content': [{'end': 2382.979, 'text': 'And Spotify has 70 terabytes of RAM and they run more than 25,000 daily Hadoop jobs.', 'start': 2374.794, 'duration': 8.185}, {'end': 2386.301, 'text': "It's got 43,000 virtualized cores.", 'start': 2383.499, 'duration': 2.802}, {'end': 2396.872, 'text': "So it's an even larger cluster than Facebook, right? So these were the two use cases that use Hadoop clusters in order to process and store big data.", 'start': 2386.982, 'duration': 9.89}, {'end': 2405.543, 'text': 'Now that you have learned all about Hadoop, the HDFS and YARN, both the storage and the processing components of Hadoop.', 'start': 2397.793, 'duration': 7.75}], 'summary': "Spotify has 70tb ram, 25,000 daily hadoop jobs, 43,000 virtualized cores, larger than facebook's cluster.", 'duration': 30.749, 'max_score': 2374.794, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2374794.jpg'}, {'end': 2457.083, 'src': 'embed', 'start': 2430.254, 'weight': 3, 'content': [{'end': 2434.696, 'text': 'So Apache Spark has got the following components.', 'start': 2430.254, 'duration': 4.442}, {'end': 2436.796, 'text': "It's got the Spark core engine.", 'start': 2434.716, 'duration': 2.08}, {'end': 2440.537, 'text': 'Now the core engine is for the entire Spark framework.', 'start': 2437.336, 'duration': 3.201}, {'end': 2444.819, 'text': 'Every component is based on and placed on top of the core engine.', 'start': 2441.058, 'duration': 3.761}, {'end': 2447.82, 'text': "So at first we've got Spark SQL.", 'start': 2445.239, 'duration': 2.581}, {'end': 2455.102, 'text': 'So Spark SQL is a Spark module for structured data processing and you can run modified Hive queries on existing Hadoop deployments.', 'start': 2448.46, 'duration': 6.642}, {'end': 2457.083, 'text': "And then we've got Spark streaming.", 'start': 2455.602, 'duration': 1.481}], 'summary': 'Apache spark has spark core engine, spark sql for structured data processing, and spark streaming.', 'duration': 26.829, 'max_score': 2430.254, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2430254.jpg'}, {'end': 2599.789, 'src': 'heatmap', 'start': 2522.175, 'weight': 4, 'content': [{'end': 2525.639, 'text': 'You can write all your code in the R shell and Spark will process it for you.', 'start': 2522.175, 'duration': 3.464}, {'end': 2530.84, 'text': "Now let's take a deeper look at all these important components.", 'start': 2525.819, 'duration': 5.021}, {'end': 2539.182, 'text': "So we've got Spark Core and Spark Core is the basic engine for large-scale parallel and distributed data processing.", 'start': 2531.56, 'duration': 7.622}, {'end': 2542.522, 'text': 'The core is the distributed execution engine,', 'start': 2539.722, 'duration': 2.8}, {'end': 2556.826, 'text': 'and Java, Scala, and Python APIs offer a platform for distributed ETL development, and additional libraries which are built on top of the core allow for diverse streaming, SQL and machine learning workloads.', 'start': 2542.522, 'duration': 14.304}, {'end': 2565.05, 'text': "It's also responsible for scheduling, distributing and monitoring jobs in a cluster and also interacting with storage systems.", 'start': 2557.447, 'duration': 7.603}, {'end': 2567.712, 'text': "Let's take a look at
the Spark architecture.", 'start': 2565.511, 'duration': 2.201}, {'end': 2577.517, 'text': 'So Apache Spark has a well-defined and layered architecture where all the Spark components and layers are loosely coupled and integrated with various extensions and libraries.', 'start': 2567.952, 'duration': 9.565}, {'end': 2580.238, 'text': "First let's talk about the driver program.", 'start': 2578.177, 'duration': 2.061}, {'end': 2585.681, 'text': 'This is the Spark driver which contains the driver program and Spark context.', 'start': 2581.039, 'duration': 4.642}, {'end': 2590.284, 'text': 'This is the central point and entry point of the Spark shell,', 'start': 2586.622, 'duration': 3.662}, {'end': 2596.707, 'text': 'and the driver program runs the main function of the application and this is the place where Spark context is created.', 'start': 2590.284, 'duration': 6.423}, {'end': 2599.789, 'text': 'well, what is spark context?', 'start': 2598.148, 'duration': 1.641}], 'summary': 'Spark core is the engine for parallel processing, with support for java, scala, and python apis. it also handles job scheduling and interacts with storage systems.', 'duration': 77.614, 'max_score': 2522.175, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2522175.jpg'}, {'end': 2590.284, 'src': 'embed', 'start': 2567.952, 'weight': 5, 'content': [{'end': 2577.517, 'text': 'So Apache Spark has a well-defined and layered architecture where all the Spark components and layers are loosely coupled and integrated with various extensions and libraries.', 'start': 2567.952, 'duration': 9.565}, {'end': 2580.238, 'text': "First let's talk about the driver program.", 'start': 2578.177, 'duration': 2.061}, {'end': 2585.681, 'text': 'This is the Spark driver which contains the driver program and Spark context.', 'start': 2581.039, 'duration': 4.642}, {'end': 2590.284, 'text': 'This is the central point and entry point of the Spark shell,', 'start': 2586.622, 'duration': 3.662}], 'summary': 'Apache spark has a well-defined, layered architecture with a central driver program and spark context.', 'duration': 22.332, 'max_score': 2567.952, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2567952.jpg'}, {'end': 2807.71, 'src': 'embed', 'start': 2778.251, 'weight': 6, 'content': [{'end': 2781.214, 'text': "Now let's take a look at Spark SQL and its architecture.", 'start': 2778.251, 'duration': 2.963}, {'end': 2793.005, 'text': "So Spark SQL is the new module in Spark and it integrates relational processing with Spark's functional programming API and it supports querying of data either via SQL or via Hive query language.", 'start': 2781.254, 'duration': 11.751}, {'end': 2802.209, 'text': 'So, for those of you who have been familiar with RDBM, So Spark SQL will be a very easy transition from your earlier tools,', 'start': 2793.586, 'duration': 8.623}, {'end': 2807.71, 'text': 'because you can extend the boundaries of traditional relational data processing with Spark SQL,', 'start': 2802.209, 'duration': 5.501}], 'summary': "Spark sql integrates relational processing with spark's functional api, supporting sql and hive query language, making it an easy transition for rdbm users.", 'duration': 29.459, 'max_score': 2778.251, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2778251.jpg'}], 'start': 2128.111, 'title': 'Hadoop and spark architectures', 'summary': "Covers 
hadoop cluster architecture and hardware specifications, with real-world use cases like facebook's 21 petabytes of storage and spotify's 65 petabytes, while also introducing apache spark, its components like spark sql and spark streaming, and describing its architecture comprising the driver program, cluster manager, and executors.", 'chapters': [{'end': 2405.543, 'start': 2128.111, 'title': 'Hadoop cluster architecture and hardware specifications', 'summary': "Introduces the hadoop cluster architecture, hardware specifications, and real-world use cases, including facebook's 21 petabytes of storage and spotify's 65 petabytes of storage, showcasing the significance of hadoop in big data processing.", 'duration': 277.432, 'highlights': ["Facebook's Hadoop cluster has 21 petabytes of storage, 2000 machines per cluster, and 32GB of RAM per machine. Facebook's Hadoop cluster comprises 21 petabytes of storage, 2000 machines per cluster, and 32GB of RAM per machine, surpassing Yahoo's previous 14 petabyte cluster.", "Spotify's Hadoop cluster includes 1650 nodes, 65 petabytes of storage, 70 terabytes of RAM, and runs over 25,000 daily Hadoop jobs. Spotify's Hadoop cluster features 1650 nodes, 65 petabytes of storage, 70 terabytes of RAM, and processes over 25,000 daily Hadoop jobs, indicating a substantial big data processing capacity.", 'Hadoop cluster hardware specifications include 64GB RAM and 1TB hard disk for name node, 16GB RAM and 12TB hard disk for data node, and 32GB RAM and 1TB hard disk for secondary name node. Hadoop cluster hardware specifications entail 64GB RAM and 1TB hard disk for name node, 16GB RAM and 12TB hard disk for data node, and 32GB RAM and 1TB hard disk for secondary name node, crucial for setting up a Hadoop cluster.']}, {'end': 2910.458, 'start': 2405.543, 'title': 'Apache spark components and architecture', 'summary': "Introduces apache spark, an open-source cluster computing framework with components like spark sql, spark streaming, spark mlib, graphx, and sparkr, and describes its architecture, consisting of the driver program, cluster manager, and executors, as well as spark sql's data source api and dataframe api.", 'duration': 504.915, 'highlights': ['Components of Apache Spark The chapter describes various components of Apache Spark such as Spark SQL, Spark Streaming, Spark MLib, GraphX, and SparkR, each serving specific purposes in structured data processing, real-time streaming data processing, machine learning, graph computation, and R language integration.', 'Apache Spark Core Engine The Spark core engine is the basic engine for large-scale parallel and distributed data processing, offering Java, Scala, and Python APIs for distributed ETL development, streaming, SQL, and machine learning, and is responsible for scheduling, distributing, and monitoring jobs in a cluster.', 'Spark Architecture The architecture of Apache Spark consists of the driver program as the central entry point, the cluster manager responsible for acquiring and allocating resources, and the executors as distributed agents for task execution, with the driver program converting application code into a logical DAG and negotiating with the cluster manager for resource allocation.', "Spark SQL and its Architecture Spark SQL integrates relational processing with Spark's functional programming API, supports querying data via SQL or Hive query language, and provides the data source API for loading and storing structured data, as well as the DataFrame API for processing structured and 
semi-structured data from kilobytes to petabytes."]}], 'duration': 782.347, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2128111.jpg', 'highlights': ["Facebook's Hadoop cluster has 21 petabytes of storage, 2000 machines per cluster, and 32GB of RAM per machine, surpassing Yahoo's previous 14 petabyte cluster.", "Spotify's Hadoop cluster includes 1650 nodes, 65 petabytes of storage, 70 terabytes of RAM, and runs over 25,000 daily Hadoop jobs, indicating a substantial big data processing capacity.", 'Hadoop cluster hardware specifications include 64GB RAM and 1TB hard disk for name node, 16GB RAM and 12TB hard disk for data node, and 32GB RAM and 1TB hard disk for secondary name node, crucial for setting up a Hadoop cluster.', 'Components of Apache Spark The chapter describes various components of Apache Spark such as Spark SQL, Spark Streaming, Spark MLib, GraphX, and SparkR, each serving specific purposes in structured data processing, real-time streaming data processing, machine learning, graph computation, and R language integration.', 'Apache Spark Core Engine The Spark core engine is the basic engine for large-scale parallel and distributed data processing, offering Java, Scala, and Python APIs for distributed ETL development, streaming, SQL, and machine learning, and is responsible for scheduling, distributing, and monitoring jobs in a cluster.', 'Spark Architecture The architecture of Apache Spark consists of the driver program as the central entry point, the cluster manager responsible for acquiring and allocating resources, and the executors as distributed agents for task execution, with the driver program converting application code into a logical DAG and negotiating with the cluster manager for resource allocation.', "Spark SQL and its Architecture Spark SQL integrates relational processing with Spark's functional programming API, supports querying data via SQL or Hive query language, and provides the data source API for loading and storing structured data, as well as the DataFrame API for processing structured and semi-structured data from kilobytes to petabytes."]}, {'end': 4319.611, 'segs': [{'end': 2953.402, 'src': 'embed', 'start': 2930.431, 'weight': 0, 'content': [{'end': 2937.774, 'text': 'K-means is one of the most simplest unsupervised learning algorithms that solves the well-known clustering problem.', 'start': 2930.431, 'duration': 7.343}, {'end': 2945.677, 'text': 'So the procedure of k-means follows a simple and easy way to classify a data set to a certain number of clusters,', 'start': 2938.294, 'duration': 7.383}, {'end': 2948.838, 'text': 'which is fixed prior to performing the clustering method.', 'start': 2945.677, 'duration': 3.161}, {'end': 2953.402, 'text': 'So the main idea is define k centroids one for each cluster,', 'start': 2949.558, 'duration': 3.844}], 'summary': 'K-means is a simple unsupervised learning algorithm for clustering data into a fixed number of clusters using k centroids.', 'duration': 22.971, 'max_score': 2930.431, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2930431.jpg'}, {'end': 3193.928, 'src': 'embed', 'start': 3162.645, 'weight': 1, 'content': [{'end': 3167.307, 'text': 'So this is where I need to set my schools up so that my students do not have to travel that much.', 'start': 3162.645, 'duration': 4.662}, {'end': 3171.97, 'text': "So that was all about k-means and now let's talk about Apache Zeppelin.", 'start': 
3168.107, 'duration': 3.863}, {'end': 3181.077, 'text': 'This is a web page notebook which brings in data ingestion, data exploration, visualization, sharing, and collaboration features to Hadoop and Spark.', 'start': 3172.511, 'duration': 8.566}, {'end': 3187.202, 'text': 'So remember when I showed you my Zeppelin notebook you can see that we have written the code there.', 'start': 3181.738, 'duration': 5.464}, {'end': 3193.928, 'text': 'We have even run SQL codes there and we have more visualizations by executing code there.', 'start': 3187.703, 'duration': 6.225}], 'summary': 'Implementing schools to reduce student travel; discussing k-means and apache zeppelin for data exploration and visualization.', 'duration': 31.283, 'max_score': 3162.645, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o3162645.jpg'}, {'end': 3244.484, 'src': 'embed', 'start': 3222.394, 'weight': 2, 'content': [{'end': 3231.737, 'text': 'So the first thing we will do is we will store the data into HDFS, and then we will analyze the data by using Scala, Spark SQL, and Spark ML Lab.', 'start': 3222.394, 'duration': 9.343}, {'end': 3237.059, 'text': "And then finally, we'll find out the results and visualize them using Zeppelin.", 'start': 3231.997, 'duration': 5.062}, {'end': 3241.482, 'text': 'So this was the entire U.S. election solution strategy that I told you.', 'start': 3238.059, 'duration': 3.423}, {'end': 3244.484, 'text': "I don't think I should repeat it again, but if you want me, I can.", 'start': 3241.762, 'duration': 2.722}], 'summary': 'Store data in hdfs, analyze using scala, spark sql, and spark ml lab, visualize results using zeppelin for u.s. election solution strategy.', 'duration': 22.09, 'max_score': 3222.394, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o3222394.jpg'}, {'end': 3414.665, 'src': 'embed', 'start': 3387.732, 'weight': 3, 'content': [{'end': 3392.155, 'text': 'So for that you have to use the command spark read option header true.', 'start': 3387.732, 'duration': 4.423}, {'end': 3398.599, 'text': 'Header true means that you have mentioned and you have told Spark that my data set already contains column headers,', 'start': 3392.595, 'duration': 6.004}, {'end': 3403.161, 'text': 'Because state as AVVR they are nothing, but they are column headers.', 'start': 3398.979, 'duration': 4.182}, {'end': 3407.442, 'text': "So you don't have to explicitly define the column headers for it.", 'start': 3403.381, 'duration': 4.061}, {'end': 3410.583, 'text': 'Neither will Spark choose any random row as a column header.', 'start': 3407.822, 'duration': 2.761}, {'end': 3414.665, 'text': 'So it will choose only the column headers your data set has.', 'start': 3411.023, 'duration': 3.642}], 'summary': "Use 'spark read option header true' to indicate dataset has column headers, avoiding explicit definition and random selection.", 'duration': 26.933, 'max_score': 3387.732, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o3387732.jpg'}, {'end': 3788.743, 'src': 'embed', 'start': 3759.854, 'weight': 4, 'content': [{'end': 3763.555, 'text': "and what i'm going to do is i'm going to make a temporary Variable again.", 'start': 3759.854, 'duration': 3.701}, {'end': 3766.856, 'text': "so I'm using the temporary variable to store some data temporarily.", 'start': 3763.555, 'duration': 3.301}, {'end': 3772.418, 'text': "So I'm writing to the 
Spark SQL code to select only the columns that I want.", 'start': 3766.876, 'duration': 5.542}, {'end': 3778.36, 'text': 'I want the state, state abbreviation, county, FIPS, party candidate, votes, fashion votes from election one.', 'start': 3772.818, 'duration': 5.542}, {'end': 3780.1, 'text': "I'm storing everything in DWinner.", 'start': 3778.7, 'duration': 1.4}, {'end': 3785.602, 'text': "I've created this new variable and whatever there was in temp, I'm assigning it to DWinner.", 'start': 3780.54, 'duration': 5.062}, {'end': 3788.743, 'text': "And now I've got only the winner data.", 'start': 3785.922, 'duration': 2.821}], 'summary': 'Using spark sql to select and store specific election data, resulting in winner data.', 'duration': 28.889, 'max_score': 3759.854, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o3759854.jpg'}, {'end': 3925.221, 'src': 'embed', 'start': 3899.126, 'weight': 5, 'content': [{'end': 3909.35, 'text': "First thing, again, you have to define a schema, and this time I'm naming that schema schema 1, since you know that we have got almost 54 columns.", 'start': 3899.126, 'duration': 10.224}, {'end': 3911.451, 'text': 'so I have to define all those 54 columns also.', 'start': 3909.35, 'duration': 2.101}, {'end': 3917.078, 'text': 'So you remember what each of those columns contains.', 'start': 3913.897, 'duration': 3.181}, {'end': 3925.221, 'text': "So this is exactly what I have done, and I don't need to go through every line, but, like I already told you how to define a schema,", 'start': 3917.538, 'duration': 7.683}], 'summary': "Defining schema 'schema 1' with 54 columns for data organization.", 'duration': 26.095, 'max_score': 3899.126, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o3899126.jpg'}], 'start': 2910.718, 'title': 'Data analysis and clustering with apache zeppelin', 'summary': 'Introduces k-means clustering for data clustering and explores its application in real-world problems. it also presents apache zeppelin as a versatile web-based notebook for data ingestion, exploration, and visualization. additionally, it covers spark sql data analysis, creating election tables, refining data, and performing feature selection for analysis.', 'chapters': [{'end': 3262.914, 'start': 2910.718, 'title': 'K-means clustering and apache zeppelin', 'summary': "Introduces k-means clustering as a simple unsupervised learning algorithm used for clustering data into a fixed number of clusters, and discusses the method's application in solving real-world problems such as school placement. additionally, it presents apache zeppelin as a powerful web-based notebook for data ingestion, exploration, visualization, and collaboration, supporting various language interpreters and integration with hadoop and spark.", 'duration': 352.196, 'highlights': ['K-Means clustering is a simple unsupervised learning algorithm used to classify a data set into a fixed number of clusters, and it is applied to solve real-world problems such as school placement based on student location data. The K-means clustering algorithm is introduced as a simple unsupervised learning method used to classify a data set into a fixed number of clusters. 
It is then applied to solve the problem of school placement based on student location data, aiming to minimize travel distance for students.', 'Apache Zeppelin is a web-based notebook that facilitates data ingestion, exploration, visualization, sharing, and collaboration, supporting various language interpreters and integration with Hadoop and Spark. Apache Zeppelin is presented as a web-based notebook tool that supports data ingestion, exploration, visualization, sharing, and collaboration. It is noted for its support of various language interpreters, including R, Python, and its integration with Hadoop and Spark.', 'The solution strategy for the U.S. county data involves storing the data into HDFS, analyzing the data using Scala, Spark SQL, and Spark ML Lab, and visualizing the results using Zeppelin. The solution strategy for the U.S. county data includes storing the data into HDFS, analyzing it using Scala, Spark SQL, and Spark ML Lab, and visualizing the results using Zeppelin.']}, {'end': 3512.105, 'start': 3263.514, 'title': 'Spark sql data analysis', 'summary': 'Covers importing packages for spark sql and spark ml lib, defining schema, reading data from hdfs, dividing data by party, and assigning analysis tasks for democrat and republican data.', 'duration': 248.591, 'highlights': ['The first step involves importing Spark SQL and Spark ML lib packages for k-means clustering and vector assembler for machine learning functions. The speaker imports Spark SQL and Spark ML lib packages for k-means clustering and vector assembler for machine learning functions.', 'Defining a proper schema is crucial, starting with a struct type and an array of fields, each with specific data types and structures. The importance of defining a proper schema is emphasized, starting with a struct type and an array of fields, each with specific data types and structures.', "The process of reading the data set from HDFS is outlined, including the command 'spark read option header true' to recognize column headers and specifying the schema and file format. The process of reading the data set from HDFS is outlined, including the command 'spark read option header true' to recognize column headers and specifying the schema and file format.", 'Demonstrates how to divide the data set by party, extracting only the Democrat data for analysis. The process of dividing the data set by party is demonstrated, with a focus on extracting only the Democrat data for analysis.', 'Assigns the task of analyzing Republican data as practice for the next class, encouraging students to analyze and present their findings. The speaker assigns the task of analyzing Republican data as practice for the next class, encouraging students to analyze and present their findings.']}, {'end': 3737.707, 'start': 3512.905, 'title': 'Creating election table and analyzing winners', 'summary': "Details the process of creating a table 'election' to analyze winning candidates by refining the data using sql code, resulting in a new table 'election one' with distinct fips entries and corresponding winning candidates.", 'duration': 224.802, 'highlights': ["Creating a table 'election' to store Democrat data and refining it using SQL code to analyze winning candidates. Creation of 'election' table, refining data to analyze winning candidates.", "Using SQL code to select maximum fraction votes for each FIPS to determine the winner and create a new table 'election one' with distinct FIPS entries and winning candidates. 
Selection of maximum fraction votes for each FIPS, creation of 'election one' table with distinct FIPS entries and winning candidates."]}, {'end': 3897.766, 'start': 3739.128, 'title': 'Data refinement and analysis', 'summary': 'Discusses data refinement by filtering out unnecessary columns, selecting specific columns using spark sql code, and creating new variables for storing refined data, resulting in the creation of tables for winners and states.', 'duration': 158.638, 'highlights': ["The speaker filters out unnecessary columns 'a' and 'b' and selects specific columns including state, state abbreviation, county, FIPS, party candidate, votes, and fraction votes using Spark SQL code, resulting in the creation of a new variable DWinner to store the winner data.", 'The process involves refining the dataset to facilitate drawing conclusions and insights, emphasizing that different approaches can also be utilized based on individual requirements and preferences.', 'The creation of a table for DWinner named Democrat allows for the visualization of the winner data and further analysis to determine the winning candidates in different states, enabling the creation of a new variable dState and the State table view showcasing the winning candidates in various states.']}, {'end': 4319.611, 'start': 3899.126, 'title': 'Data analysis and feature selection', 'summary': 'Details the process of defining schema for a dataset with 54 columns, selecting specific demographic features for analysis, dividing hillary clinton and bernie sanders data to analyze their wins, and performing k-means analysis on selected features.', 'duration': 420.485, 'highlights': ["The chapter details the process of defining schema for a dataset with 54 columns The speaker defines a schema named schema 1 for a dataset with 54 columns, ensuring the understanding of each column's content.", 'Selecting specific demographic features for analysis The speaker chooses specific demographic features like FIPS, state, state abbreviation, area name, and ethnicity percentages to analyze the popularity of Hillary Clinton among different ethnic groups.', 'Dividing Hillary Clinton and Bernie Sanders data to analyze their wins The speaker separates the data of Hillary Clinton and Bernie Sanders, performs one hot encoding to add columns for their wins, and creates views and unions for the counties they won.', 'Performing k-means analysis on selected features The chapter concludes with the speaker defining feature columns for k-means analysis, including demographic and socioeconomic factors, to obtain results and insights from the dataset.']}], 'duration': 1408.893, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o2910718.jpg', 'highlights': ['K-Means clustering is a simple unsupervised learning algorithm used to classify a data set into a fixed number of clusters, and it is applied to solve real-world problems such as school placement based on student location data.', 'Apache Zeppelin is a web-based notebook that facilitates data ingestion, exploration, visualization, sharing, and collaboration, supporting various language interpreters and integration with Hadoop and Spark.', 'The solution strategy for the U.S. 
county data involves storing the data into HDFS, analyzing the data using Scala, Spark SQL, and Spark ML Lab, and visualizing the results using Zeppelin.', "The process of reading the data set from HDFS is outlined, including the command 'spark read option header true' to recognize column headers and specifying the schema and file format.", "The speaker filters out unnecessary columns 'a' and 'b' and selects specific columns including state, state abbreviation, county, FIPS, party candidate, votes, and fraction votes using Spark SQL code, resulting in the creation of a new variable DWinner to store the winner data.", "The chapter details the process of defining schema for a dataset with 54 columns The speaker defines a schema named schema 1 for a dataset with 54 columns, ensuring the understanding of each column's content."]}, {'end': 5026.736, 'segs': [{'end': 4407.625, 'src': 'embed', 'start': 4366.74, 'weight': 0, 'content': [{'end': 4371.604, 'text': "And then we're going to perform the k-means clustering and we're going to store it in a variable called k-means.", 'start': 4366.74, 'duration': 4.864}, {'end': 4380.111, 'text': "So we're using different functions from Spark library and we have chosen Spark with clustering k-means.", 'start': 4371.784, 'duration': 8.327}, {'end': 4386.555, 'text': 'And you know that in k-means we already defined that how many clusters do we need and we need four.', 'start': 4380.611, 'duration': 5.944}, {'end': 4391.179, 'text': "So we have selected four clusters and then we're going to set feature columns as features.", 'start': 4386.956, 'duration': 4.223}, {'end': 4394.581, 'text': 'Then set prediction column as predictions.', 'start': 4391.899, 'duration': 2.682}, {'end': 4397.903, 'text': "So after that, we're going to make a model.", 'start': 4395.462, 'duration': 2.441}, {'end': 4403.844, 'text': "And we have defined our input and output columns in rows, so we're going to use keynes.fitrow.", 'start': 4398.243, 'duration': 5.601}, {'end': 4407.625, 'text': "And whatever predictions we will get, we're going to store it in a model.", 'start': 4404.304, 'duration': 3.321}], 'summary': 'Performed k-means clustering with 4 clusters using spark library.', 'duration': 40.885, 'max_score': 4366.74, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o4366740.jpg'}, {'end': 4503.524, 'src': 'embed', 'start': 4474.704, 'weight': 1, 'content': [{'end': 4486.172, 'text': 'And then if you observe the differences in the cluster centers for each feature, here you can see that there is not much difference, not even here.', 'start': 4474.704, 'duration': 11.468}, {'end': 4499.541, 'text': "So it's 50, 49, 49, 51, and then it's, well, again it is not much of a difference, but if you see here that it's 9 and it's going to 16.", 'start': 4486.272, 'duration': 13.269}, {'end': 4503.524, 'text': 'So you can do a more detailed analysis on black or African American.', 'start': 4499.541, 'duration': 3.983}], 'summary': 'Cluster centers show little difference, e.g., 50, 49, 49, 51. 
more analysis on black/african american needed.', 'duration': 28.82, 'max_score': 4474.704, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o4474704.jpg'}, {'end': 4698.224, 'src': 'embed', 'start': 4661.329, 'weight': 5, 'content': [{'end': 4664.55, 'text': "And that's what the conclusion we can find out from this scatterplot.", 'start': 4661.329, 'duration': 3.221}, {'end': 4673.552, 'text': 'And we can see that as the number of foreign people increases, the popularity of Hillary Clinton is more in larger groups of foreign people.', 'start': 4664.57, 'duration': 8.982}, {'end': 4679.014, 'text': 'You can also choose different parameters out of all the different features that you have chosen.', 'start': 4674.452, 'duration': 4.562}, {'end': 4682.776, 'text': 'So remember that we have also seen the variation in veterans.', 'start': 4679.595, 'duration': 3.181}, {'end': 4684.677, 'text': "So let's choose veterans and y-axis.", 'start': 4682.796, 'duration': 1.881}, {'end': 4688.059, 'text': "So let's also change x-axis and let me just use y alone here.", 'start': 4684.697, 'duration': 3.362}, {'end': 4698.224, 'text': 'So you can see here that there is the x-axis that has white alone and this is the veterans.', 'start': 4688.499, 'duration': 9.725}], 'summary': "As the number of foreign people increases, hillary clinton's popularity increases in larger groups. there is also variation in veterans.", 'duration': 36.895, 'max_score': 4661.329, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o4661329.jpg'}, {'end': 4907.126, 'src': 'embed', 'start': 4854.83, 'weight': 3, 'content': [{'end': 4858.513, 'text': 'So again, you can go ahead and we have created the same graph.', 'start': 4854.83, 'duration': 3.683}, {'end': 4862.076, 'text': "It's only just area graph instead of a line graph.", 'start': 4858.573, 'duration': 3.503}, {'end': 4865.539, 'text': 'The key here are state and candidates.', 'start': 4862.856, 'duration': 2.683}, {'end': 4870.805, 'text': "So I've got states and candidates over here and the values is counties.", 'start': 4865.579, 'duration': 5.226}, {'end': 4877.432, 'text': 'Once, if you just hover onto this bar chart, you can see that in Connecticut, Bernie Sanders won 115 counties in Connecticut.', 'start': 4871.385, 'duration': 6.047}, {'end': 4881.957, 'text': 'Hillary Clinton won 55 only.', 'start': 4879.854, 'duration': 2.103}, {'end': 4887.004, 'text': 'So in Florida, Hillary Clinton is 58 and in Florida, Bernie Sanders is 9.', 'start': 4882.138, 'duration': 4.866}, {'end': 4891.43, 'text': 'And here you can see in the Maine, Bernie Sanders won 462.', 'start': 4887.004, 'duration': 4.426}, {'end': 4894.755, 'text': 'So Bernie Sanders got a majority of votes for Maine.', 'start': 4891.43, 'duration': 3.325}, {'end': 4898.758, 'text': 'So you can also classify it statewide.', 'start': 4896.556, 'duration': 2.202}, {'end': 4901.821, 'text': 'You can find out which are the states.', 'start': 4899.038, 'duration': 2.783}, {'end': 4903.843, 'text': 'And as Donald Trump.', 'start': 4902.582, 'duration': 1.261}, {'end': 4907.126, 'text': 'now you will know that which are the states that you can target right?', 'start': 4903.843, 'duration': 3.283}], 'summary': 'Graph shows bernie sanders won majority of votes in maine with 462 counties.', 'duration': 52.296, 'max_score': 4854.83, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o4854830.jpg'}], 'start': 4320.211, 'title': 'Analyzing u.s. county data', 'summary': "Discusses the use of k-means clustering in spark to create four clusters based on features like winning percentage, demographic groups, and voting patterns. it also explains the analysis of u.s. county data to identify clusters and draw insights such as hillary clinton's popularity and potential campaign targeting.", 'chapters': [{'end': 4553.562, 'start': 4320.211, 'title': 'K-means clustering in spark', 'summary': 'Discusses using the vector assembler and k-means algorithm in spark to create four clusters based on different features, such as winning percentage, demographic groups, and voting patterns, with specific focus on analyzing variations and differences within the clusters.', 'duration': 233.351, 'highlights': ['The chapter explains using the vector assembler and k-means algorithm in Spark to create four clusters based on different features, such as winning percentage, demographic groups, and voting patterns, with specific focus on analyzing variations and differences within the clusters.', 'The transcript details the process of transforming input data to create a final table view and performing k-means clustering with four clusters using Spark, with a specific emphasis on setting feature columns, prediction columns, and creating a model.', 'The discussion includes insights into the differences in cluster centers for features such as winning chances, demographic groups, and voting patterns, and highlights the potential for detailed analysis on specific groups, such as black or African American, Hispanic or Latino, and veterans based on the variations observed within the clusters.']}, {'end': 5026.736, 'start': 4554.302, 'title': 'Analyzing u.s. county data', 'summary': "Explains the process of analyzing u.s. 
county data to identify key clusters, visualize results, and draw insights such as hillary clinton's popularity among foreign people, the significance of maine, and the potential targeting of states for campaigns.", 'duration': 472.434, 'highlights': ["Hillary Clinton's popularity among foreign people and those speaking different languages Hillary Clinton's popularity among foreign-born people and those speaking languages other than English is evident, as seen in the scatter plot analysis.", 'Significance of Maine in the election results Maine stands out as a state where Bernie Sanders received a majority of votes, suggesting potential targeting opportunities for political campaigns.', 'Visualization of election results across states The area graph and bar chart provide a comprehensive visualization of the distribution of votes for Hillary Clinton and Bernie Sanders across different states, offering insights for strategic campaign planning.']}], 'duration': 706.525, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o4320211.jpg', 'highlights': ['The chapter explains using the vector assembler and k-means algorithm in Spark to create four clusters based on different features, such as winning percentage, demographic groups, and voting patterns, with specific focus on analyzing variations and differences within the clusters.', 'The discussion includes insights into the differences in cluster centers for features such as winning chances, demographic groups, and voting patterns, and highlights the potential for detailed analysis on specific groups, such as black or African American, Hispanic or Latino, and veterans based on the variations observed within the clusters.', 'The transcript details the process of transforming input data to create a final table view and performing k-means clustering with four clusters using Spark, with a specific emphasis on setting feature columns, prediction columns, and creating a model.', 'Significance of Maine in the election results Maine stands out as a state where Bernie Sanders received a majority of votes, suggesting potential targeting opportunities for political campaigns.', 'Visualization of election results across states The area graph and bar chart provide a comprehensive visualization of the distribution of votes for Hillary Clinton and Bernie Sanders across different states, offering insights for strategic campaign planning.', "Hillary Clinton's popularity among foreign people and those speaking different languages Hillary Clinton's popularity among foreign-born people and those speaking languages other than English is evident, as seen in the scatter plot analysis."]}, {'end': 5746.854, 'segs': [{'end': 5075.724, 'src': 'embed', 'start': 5052.114, 'weight': 0, 'content': [{'end': 5058.619, 'text': 'that is the point where we will find the maximum pickups, and then we will also have to find out what is the peak hour of the day.', 'start': 5052.114, 'duration': 6.505}, {'end': 5060.86, 'text': 'So this was the entire strategy.', 'start': 5059.099, 'duration': 1.761}, {'end': 5066.021, 'text': "So we've got the uber pickup data set and then we store the data into HDFS.", 'start': 5060.92, 'duration': 5.101}, {'end': 5075.724, 'text': 'We will transform the data set and make predictions by using k-means clustering on the latitude and longitude and find out the beehive point.', 'start': 5066.161, 'duration': 9.563}], 'summary': 'Analyzing uber pickup data using k-means clustering to find peak hour and 
beehive points.', 'duration': 23.61, 'max_score': 5052.114, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o5052114.jpg'}, {'end': 5141.861, 'src': 'embed', 'start': 5113.016, 'weight': 2, 'content': [{'end': 5114.196, 'text': "Now I don't have many fields.", 'start': 5113.016, 'duration': 1.18}, {'end': 5116.057, 'text': "I've got only four fields if I remember.", 'start': 5114.236, 'duration': 1.821}, {'end': 5120.477, 'text': 'So the first field was the date and timestamp that defines the time.', 'start': 5116.117, 'duration': 4.36}, {'end': 5122.778, 'text': "We're defining it as DT.", 'start': 5120.937, 'duration': 1.841}, {'end': 5126.018, 'text': 'The next field is the latitude, the longitude, and base.', 'start': 5122.918, 'duration': 3.1}, {'end': 5127.859, 'text': "Then I'm going to read my data set.", 'start': 5126.058, 'duration': 1.801}, {'end': 5131.979, 'text': 'This is the path in my HDFS where my Uber data set is.', 'start': 5128.259, 'duration': 3.72}, {'end': 5136.24, 'text': 'There So I define schema as schema here.', 'start': 5132.559, 'duration': 3.681}, {'end': 5141.861, 'text': "The header is true because again my data set contains column headers and I'm going to store in DF.", 'start': 5136.62, 'duration': 5.241}], 'summary': 'Data set contains four fields: date, timestamp, latitude, longitude. stored in hdfs as schema with headers.', 'duration': 28.845, 'max_score': 5113.016, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o5113016.jpg'}, {'end': 5254.787, 'src': 'embed', 'start': 5224.418, 'weight': 3, 'content': [{'end': 5227.579, 'text': 'and that is where we are replacing the school or building the new school.', 'start': 5224.418, 'duration': 3.161}, {'end': 5234.162, 'text': 'So, similarly, this is going to be my beehive point, and this is where I will place my maximum number of cabs, okay?', 'start': 5228.019, 'duration': 6.143}, {'end': 5237.123, 'text': 'So we found out the beehive points.', 'start': 5234.762, 'duration': 2.361}, {'end': 5245.465, 'text': 'The next thing we will need to do is we need to find the peak hours because I also need to know at what time should I place my cabs in the location.', 'start': 5237.183, 'duration': 8.282}, {'end': 5247.145, 'text': "So what we're doing now?", 'start': 5245.965, 'duration': 1.18}, {'end': 5254.787, 'text': 'we are taking a new variable called q and we are selecting hour from the timestamp column, and then the alias name should be hour,', 'start': 5247.145, 'duration': 7.642}], 'summary': 'Identifying beehive points for maximum cab placement and determining peak hours using timestamp data', 'duration': 30.369, 'max_score': 5224.418, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o5224418.jpg'}, {'end': 5285.417, 'src': 'embed', 'start': 5260.308, 'weight': 4, 'content': [{'end': 5268.831, 'text': "so now we are grouping it and then we'll have the different hours of the day and then it will just show me the pickups at the different hours of the day in the location.", 'start': 5260.308, 'duration': 8.523}, {'end': 5274.793, 'text': "that we found out are the beehive points, and then we're going to count.", 'start': 5268.831, 'duration': 5.962}, {'end': 5278.695, 'text': 'uh, how many pickups we are going to get from that place.', 'start': 5274.793, 'duration': 3.902}, {'end': 5285.417, 'text': "right. 
so we're ordering it by descending, so the smaller pickup count will be the first and then the larger will be at the bottom.", 'start': 5278.695, 'duration': 6.722}], 'summary': 'Grouping pickups by hour to count and analyze activity in beehive locations.', 'duration': 25.109, 'max_score': 5260.308, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o5260308.jpg'}], 'start': 5028.077, 'title': 'Analyzing cab and uber pickups', 'summary': 'Discusses the analysis of uber and cab pickup data using k-means clustering to identify beehive locations with maximum pickups, determining peak hours for deployment, and visualizing pickup distribution, contributing to efficient cab and uber placements and operations.', 'chapters': [{'end': 5160.955, 'start': 5028.077, 'title': 'Uber pickup analysis', 'summary': 'Discusses analyzing uber pickup data to find beehive locations with maximum pickups and peak hour using k-means clustering on latitude and longitude, and storing the data into hdfs.', 'duration': 132.878, 'highlights': ["The chapter discusses analyzing Uber pickup data to find beehive locations with maximum pickups and peak hour using k-means clustering on latitude and longitude. This involves utilizing the Uber dataset containing pickup time, location (latitude and longitude), and driver's license number to determine beehive locations with maximum pickups and peak hour.", 'Storing the data into HDFS and transforming the dataset for making predictions using k-means clustering on the latitude and longitude to find out the beehive point. The process involves storing the Uber pickup dataset into HDFS and transforming it to make predictions using k-means clustering on latitude and longitude to identify the beehive point with the maximum pickups.', 'Defining a schema for the Uber dataset and reading the data into a DataFrame for analysis. The chapter explains defining a schema for the Uber dataset with fields for date, timestamp, latitude, longitude, and base, and reading the data into a DataFrame for further analysis.']}, {'end': 5746.854, 'start': 5161.395, 'title': 'Analysis of cab pickups and peak hours', 'summary': 'Discusses the analysis of cab pickups using k-means clustering to find beehive points for cab placements, determining peak hours for cab deployment, and visualizing the distribution of pickups at different hours and locations.', 'duration': 585.459, 'highlights': ["The beehive points found after k-means clustering will indicate the exact locations for maximum cab pickups, with a recommendation to place cabs around 4 or 5 o'clock in the evening based on the analysis of peak hours. Identification of beehive points for cab placements and recommendation for peak hour deployment.", 'Visualization of the distribution of pickups at different hours and locations provides insights such as the peak hours for cab deployments and the sparse pickup counts during certain hours. Insights gained from the visualization of pickup distribution at different hours and locations.', 'The Edureka LMS provides access to course content, recorded classes, and projects, with lifetime access to older classes and 24/7 support. 
Features of the Edureka LMS including access to course content, recorded classes, and projects.']}], 'duration': 718.777, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/YirUQhskJ3o/pics/YirUQhskJ3o5028077.jpg', 'highlights': ['Analyzing Uber pickup data to find beehive locations with maximum pickups and peak hour using k-means clustering on latitude and longitude.', 'Storing the Uber pickup dataset into HDFS and transforming it to make predictions using k-means clustering on latitude and longitude to identify the beehive point with the maximum pickups.', 'Defining a schema for the Uber dataset with fields for date, timestamp, latitude, longitude, and base, and reading the data into a DataFrame for further analysis.', 'Identification of beehive points for cab placements and recommendation for peak hour deployment.', 'Insights gained from the visualization of pickup distribution at different hours and locations.']}], 'highlights': ['The chapter covers the fundamentals of Hadoop and Spark, along with the application in two big data use cases - the U.S. primary election and instant cabs startup.', 'The dataset contains fields such as population in 2014 and 2010, sex ratio, ethnicity breakdown, and age group distribution, providing comprehensive data for analysis.', 'Key applications of big data analytics in healthcare and financial services showcase its impact on disease diagnosis, cure discovery, risk modeling, fraud detection, and credit card scoring.', 'Commands to start Hadoop daemons include name node, secondary name node, and data nodes.', "Facebook's Hadoop cluster has 21 petabytes of storage, 2000 machines per cluster, and 32GB of RAM per machine, surpassing Yahoo's previous 14 petabyte cluster.", 'K-Means clustering is a simple unsupervised learning algorithm used to classify a data set into a fixed number of clusters, and it is applied to solve real-world problems such as school placement based on student location data.', 'The chapter explains using the vector assembler and k-means algorithm in Spark to create four clusters based on different features, such as winning percentage, demographic groups, and voting patterns, with specific focus on analyzing variations and differences within the clusters.', 'Analyzing Uber pickup data to find beehive locations with maximum pickups and peak hour using k-means clustering on latitude and longitude.']}
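
The walkthrough above mentions starting the Hadoop daemons, checking them with JPS, and the HDFS shell commands hadoop fs -put, hadoop fs -ls, and hadoop fs -rm. As a minimal sketch of the same file operations done programmatically from a spark-shell or Zeppelin Scala paragraph, the snippet below uses Hadoop's FileSystem API; the /data path and the local file name are placeholders, not taken from the video.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration already carried by the running SparkSession,
// so fs.defaultFS resolves to the cluster's NameNode.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Equivalent of: hadoop fs -put /tmp/localfile.csv /data/localfile.csv
fs.copyFromLocalFile(new Path("file:///tmp/localfile.csv"), new Path("/data/localfile.csv"))

// Equivalent of: hadoop fs -ls /data
fs.listStatus(new Path("/data")).foreach(status => println(status.getPath))

// Equivalent of: hadoop fs -rm /data/localfile.csv (non-recursive delete)
fs.delete(new Path("/data/localfile.csv"), false)
```

The shell commands remain the quickest way to do this interactively; the API form is mainly useful when the same operations have to run inside a Spark or Scala job.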
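For the election use case, the steps described above are: read the primary-results CSV from HDFS with an explicit schema and header=true, register the Democrat rows as a table, and use Spark SQL to keep, for every FIPS code, the candidate with the maximum fraction of votes. Below is a rough sketch of those steps, written as Zeppelin/spark-shell paragraphs where a SparkSession named spark already exists; the column names and the HDFS path are assumptions based on the description above, not copied from the video.

```scala
import org.apache.spark.sql.types._

// Explicit schema so Spark does not infer types (column names assumed).
val electionSchema = StructType(Array(
  StructField("state", StringType, true),
  StructField("state_abbreviation", StringType, true),
  StructField("county", StringType, true),
  StructField("fips", IntegerType, true),
  StructField("party", StringType, true),
  StructField("candidate", StringType, true),
  StructField("votes", IntegerType, true),
  StructField("fraction_votes", DoubleType, true)
))

// header=true: the first row of the file already holds the column headers.
val df = spark.read
  .option("header", "true")
  .schema(electionSchema)
  .csv("hdfs:///user/edureka/primary_results.csv")   // placeholder path

// Keep only the Democrat rows and expose them to Spark SQL.
df.filter(df("party") === "Democrat").createOrReplaceTempView("election")

// County winner = the row holding the maximum fraction_votes for each FIPS code.
val dWinner = spark.sql("""
  SELECT e.state, e.state_abbreviation, e.county, e.fips,
         e.party, e.candidate, e.votes, e.fraction_votes
  FROM election e
  JOIN (SELECT fips, MAX(fraction_votes) AS max_frac
        FROM election
        GROUP BY fips) m
    ON e.fips = m.fips AND e.fraction_votes = m.max_frac
""")
dWinner.createOrReplaceTempView("democrat")
dWinner.show(5)
```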
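The highlights also describe splitting the winner data into Hillary Clinton and Bernie Sanders rows, adding one-hot style win-indicator columns, and unioning them back to count counties won per state. A sketch of that step is below, continuing from the democrat view registered in the previous sketch; the candidate name spellings and the indicator column names are assumptions.

```scala
// Indicator columns mark which candidate won each county.
val clinton = spark.sql(
  "SELECT *, 1 AS clinton_win, 0 AS sanders_win FROM democrat WHERE candidate = 'Hillary Clinton'")
val sanders = spark.sql(
  "SELECT *, 0 AS clinton_win, 1 AS sanders_win FROM democrat WHERE candidate = 'Bernie Sanders'")

// Union the two labelled sets and count counties won per state and candidate;
// this is the data behind the state-wise bar and area charts described above.
val labelled = clinton.union(sanders)
labelled.createOrReplaceTempView("labelled_winners")
spark.sql("""
  SELECT state, candidate, COUNT(*) AS counties_won
  FROM labelled_winners
  GROUP BY state, candidate
  ORDER BY state
""").show(10)
```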
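For the k-means step, the section above selects demographic feature columns from the 54-column county-facts file, assembles them into a feature vector with a vector assembler, fits k-means with k = 4, and then inspects the cluster centres. A hedged sketch with Spark MLlib follows; the county_facts path and the readable feature column names are placeholders (the real file uses coded column names), so treat them as assumptions.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Placeholder read of the demographic data; inferSchema keeps the sketch short.
val counties = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/edureka/county_facts.csv")

// Illustrative stand-ins for the ethnicity / foreign-born / veteran percentages
// discussed above.
val featureCols = Array("white_pct", "black_pct", "hispanic_pct", "foreign_born_pct", "veterans_pct")

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .setHandleInvalid("skip")          // drop rows with missing feature values
val rows = assembler.transform(counties)

// k = 4 clusters, as in the walkthrough; predictions land in the "prediction" column.
val kmeans = new KMeans()
  .setK(4)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
val model = kmeans.fit(rows)

// Comparing the centres feature by feature is what surfaces the differences noted
// above (for example, the Black or African American percentage varying across clusters).
model.clusterCenters.foreach(println)
model.transform(rows).groupBy("prediction").count().show()
```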
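The Instant Cabs / Uber use case follows the same pattern: read the four-column pickup file (timestamp, latitude, longitude, base) with an explicit schema, cluster on latitude and longitude to approximate the "beehive" points, then extract the hour from the timestamp and count pickups per hour to find the peak hours. The sketch below assumes the same spark-shell/Zeppelin setting; the HDFS path, the timestamp format, and the choice of k are assumptions, not values taken from the video.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{desc, hour}
import org.apache.spark.sql.types._

// Schema follows the four fields described above: timestamp, latitude, longitude, base.
val uberSchema = StructType(Array(
  StructField("dt", TimestampType, true),
  StructField("lat", DoubleType, true),
  StructField("lon", DoubleType, true),
  StructField("base", StringType, true)
))

val uber = spark.read
  .option("header", "true")
  .option("timestampFormat", "M/d/yyyy H:mm:ss")   // assumed format of the dt column
  .schema(uberSchema)
  .csv("hdfs:///user/edureka/uber.csv")            // placeholder path

// Beehive points: cluster the pickups on latitude/longitude and read off the centres.
val assembler = new VectorAssembler().setInputCols(Array("lat", "lon")).setOutputCol("features")
val feats = assembler.transform(uber.na.drop(Array("lat", "lon")))
val model = new KMeans().setK(8).setFeaturesCol("features").fit(feats)   // k is a placeholder choice
model.clusterCenters.foreach(println)

// Peak hours: group the pickups by hour of day and sort by pickup count, descending.
uber.withColumn("hour", hour(uber("dt")))
  .groupBy("hour").count()
  .orderBy(desc("count"))
  .show(24)
```

Plotting the hourly counts (for example in a Zeppelin bar chart) is what shows the late-afternoon peak the walkthrough points out.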