title
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka
description
Apache Spark Training: https://www.edureka.co/apache-spark-scala-certification-training
This Edureka Spark Tutorial (Spark Blog Series: https://goo.gl/WrEKX9) will help you understand the basics of Apache Spark. It is suited to both beginners and professionals who want to learn or brush up on Apache Spark concepts. The topics covered in this tutorial are:
02:13 Big Data Introduction
13:02 Batch vs Real Time Analytics
1:00:02 What is Apache Spark?
1:01:16 Why Apache Spark?
1:03:27 Using Spark with Hadoop
1:06:37 Apache Spark Features
1:14:58 Apache Spark Ecosystem
1:18:01 Brief introduction to complete Spark Ecosystem Stack
1:40:24 Demo: Earthquake Detection Using Apache Spark
Subscribe to our channel to get video updates. Hit the subscribe button above.
PG in Big Data Engineering with NIT Rourkela: https://www.edureka.co/post-graduate/big-data-engineering (450+ hours | 9 months | 20+ projects and 100+ case studies)
#edureka #edurekaSpark #SparkTutorial #SparkOnlineTraining
Check our complete Apache Spark and Scala playlist here: https://goo.gl/ViRJ2K
How does it work?
1. This is a 4-week, instructor-led online course with 32 hours of assignments and 20 hours of project work.
2. We offer 24x7 one-on-one live technical support to help with any problems you face or clarifications you need during the course.
3. At the end of the training, you will work on a project, based on which we will award you a grade and a verifiable certificate.
- - - - - - - - - - - - - -
About the Course
This Spark training will enable learners to understand how Spark processes data in memory and why that can make it much faster than Hadoop MapReduce. Learners will master Scala programming and be trained on the APIs Spark offers, such as Spark Streaming, Spark SQL, Spark RDD, Spark MLlib and Spark GraphX. This Edureka course is an integral part of a Big Data developer's learning path.
After completing the Apache Spark and Scala training, you will be able to:
1) Understand Scala and its implementation
2) Master the concepts of traits and OOP in Scala programming
3) Install Spark and implement Spark operations on Spark Shell
4) Understand the role of Spark RDD
5) Implement Spark applications on YARN (Hadoop)
6) Learn the Spark Streaming API
7) Implement machine learning algorithms using the Spark MLlib API
8) Analyze Hive and Spark SQL architecture
9) Understand the Spark GraphX API and implement graph algorithms
10) Implement broadcast variables and accumulators for performance tuning
11) Work on real-time Spark projects
- - - - - - - - - - - - - -
Who should go for this Course?
This course is a must for anyone who wants to enter the field of big data and keep abreast of the latest developments in fast, efficient processing of ever-growing data using Spark and related projects. The course is ideal for:
1. Big Data enthusiasts
2. Software Architects, Engineers and Developers
3. Data Scientists and Analytics professionals
- - - - - - - - - - - - - -
Why learn Apache Spark?
In this era of ever-growing data, the need to analyze it for meaningful business insights is paramount. There are several big data processing alternatives, such as Hadoop, Spark, Storm and many more. Spark, however, provides both batch and streaming capabilities, which makes it a preferred choice for lightning-fast big data analysis platforms.
The following Edureka blogs will help you understand the significance of Spark training:
5 Reasons to Learn Spark: https://goo.gl/7nMcS0
Apache Spark with Hadoop, Why it matters: https://goo.gl/I2MCeP
For more information, please write to us at sales@edureka.co or call us at IND: 9606058406 / US: 18338555775 (toll-free).
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Telegram: https://t.me/edurekaupdates
Customer Review:
Michael Harkins, System Architect, Hortonworks says: “The courses are top rate. The best part is live instruction, with playback. But my favorite feature is viewing a previous class. Also, they are always there to answer questions, and prompt when you open an issue if you are having any trouble. Added bonus ~ you get lifetime access to the course you took!!! Edureka lets you go back later, when your boss says "I want this ASAP!" ~ This is the killer education app... I've taken two courses, and I'm taking two more.”
detail
{'title': 'Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka', 'heatmap': [{'end': 4807.262, 'start': 4663.935, 'weight': 1}], 'summary': 'This tutorial comprehensively covers apache spark, highlighting its features, use cases, and hands-on examples, including its 100x faster processing speed, real-time and batch processing capabilities, and practical applications such as earthquake prediction, emphasizing its significance in big data analytics and industries like stock market, social media, and medical domain.', 'chapters': [{'end': 133.501, 'segs': [{'end': 114.211, 'src': 'embed', 'start': 40.841, 'weight': 0, 'content': [{'end': 41.982, 'text': 'What is Apache Spark?', 'start': 40.841, 'duration': 1.141}, {'end': 43.643, 'text': 'Why Apache Spark?', 'start': 42.543, 'duration': 1.1}, {'end': 45.485, 'text': 'Why we are learning this new technology?', 'start': 43.743, 'duration': 1.742}, {'end': 49.228, 'text': 'And today, what you must be hearing a lot about this Apache Spark.', 'start': 45.905, 'duration': 3.323}, {'end': 52.33, 'text': 'that Apache Spark is the next big thing in the world.', 'start': 49.228, 'duration': 3.102}, {'end': 57.154, 'text': 'why?. Why people are talking about that Apache Spark is the next big thing?', 'start': 52.33, 'duration': 4.824}, {'end': 62.978, 'text': 'What are the features in Apache Spark due to which we are talking like that?', 'start': 57.954, 'duration': 5.024}, {'end': 65.16, 'text': 'that Apache Spark is the next big thing?', 'start': 62.978, 'duration': 2.182}, {'end': 69.125, 'text': 'again?. 
what are the use cases related to Apache Spark?', 'start': 65.16, 'duration': 3.965}, {'end': 71.927, 'text': 'How Apache Spark ecosystem looks like?', 'start': 69.665, 'duration': 2.262}, {'end': 81.775, 'text': 'We will also do some hands-on example during the session and in the end I will walk you through a project which will be related to Apache Spark.', 'start': 72.387, 'duration': 9.388}, {'end': 85.778, 'text': 'So that is what you can expect from this session moving further.', 'start': 82.215, 'duration': 3.563}, {'end': 91.381, 'text': 'Now, first we, before we even talk about what is Apache Spark.', 'start': 86.338, 'duration': 5.043}, {'end': 96.823, 'text': "it's very important to understand big data, because that is what we are going to use.", 'start': 91.381, 'duration': 5.442}, {'end': 99.024, 'text': 'right, Apache Spark will be used on big data.', 'start': 96.823, 'duration': 2.201}, {'end': 104.367, 'text': 'Now what is this keyword big data that is first thing which we are going to discuss.', 'start': 99.444, 'duration': 4.923}, {'end': 111.85, 'text': 'Now, if I ask you what is big data, what do you understand by big data?', 'start': 105.347, 'duration': 6.503}, {'end': 114.211, 'text': 'what would be your response?', 'start': 111.85, 'duration': 2.361}], 'summary': 'Introduction to apache spark, its features, use cases, and hands-on examples for big data processing.', 'duration': 73.37, 'max_score': 40.841, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo40841.jpg'}], 'start': 0.629, 'title': 'Apache spark: next big thing', 'summary': 'Introduces the significance of apache spark, its features, use cases, and hands-on examples, emphasizing its importance in the big data ecosystem, aiming to provide a comprehensive understanding of the technology.', 'chapters': [{'end': 133.501, 'start': 0.629, 'title': 'Apache spark: next big thing', 'summary': 'Introduces the significance of apache spark, 
its features, use cases, and hands-on examples, emphasizing its importance in the big data ecosystem, aiming to provide a comprehensive understanding of the technology.', 'duration': 132.872, 'highlights': ["Apache Spark's significance and features The chapter explores the reasons behind the buzz around Apache Spark, emphasizing its status as the next big thing in the world of technology and delving into the features that contribute to its prominence.", 'Introduction to big data and its relevance to Apache Spark Before discussing Apache Spark, the chapter stresses the importance of understanding big data, highlighting its essential role in utilizing Apache Spark and encouraging interactive participation for comprehensive learning.', 'Interactive learning approach The speaker encourages interactive participation to enhance understanding, ensuring that participants will gain a solid comprehension of Apache Spark by the end of the session.']}], 'duration': 132.872, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo629.jpg', 'highlights': ['Introduction to big data and its relevance to Apache Spark Before discussing Apache Spark, the chapter stresses the importance of understanding big data, highlighting its essential role in utilizing Apache Spark and encouraging interactive participation for comprehensive learning.', "Apache Spark's significance and features The chapter explores the reasons behind the buzz around Apache Spark, emphasizing its status as the next big thing in the world of technology and delving into the features that contribute to its prominence.", 'Interactive learning approach The speaker encourages interactive participation to enhance understanding, ensuring that participants will gain a solid comprehension of Apache Spark by the end of the session.']}, {'end': 1621.852, 'segs': [{'end': 270.17, 'src': 'embed', 'start': 240.932, 'weight': 0, 'content': [{'end': 255.734, 'text': 'he mentioned it 
as a fun fact and mentioned that Facebook today have number of users equivalent to number of people living in this globe 100 years ago.', 'start': 240.932, 'duration': 14.802}, {'end': 257.435, 'text': "That's a big statement.", 'start': 256.414, 'duration': 1.021}, {'end': 261.808, 'text': "We can also deal with unstructured data, I'm coming to that point.", 'start': 259.207, 'duration': 2.601}, {'end': 267.51, 'text': 'So they are talking a big thing, right? So now this is a challenge with Facebook.', 'start': 262.308, 'duration': 5.202}, {'end': 270.17, 'text': 'You can imagine how much big amount of data is talking about.', 'start': 267.53, 'duration': 2.64}], 'summary': "Facebook today has as many users as the world's population 100 years ago, posing a big data challenge.", 'duration': 29.238, 'max_score': 240.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo240932.jpg'}, {'end': 394.255, 'src': 'embed', 'start': 344.371, 'weight': 1, 'content': [{'end': 354.575, 'text': 'because what if I have a unstructured data, even if it is small in nature, but still it we have to use, still use this Hadoop tools, big data tools,', 'start': 344.371, 'duration': 10.204}, {'end': 355.115, 'text': 'to solve them.', 'start': 354.575, 'duration': 0.54}, {'end': 361.557, 'text': 'So in those cases also use the data tools because RDBMS is not efficient to solve all those kind of problem.', 'start': 355.435, 'duration': 6.122}, {'end': 368.72, 'text': 'So that is one thing now whatever data you get they also can have some sort of problems.', 'start': 362.378, 'duration': 6.342}, {'end': 376.965, 'text': 'Like there can be a missing data, there can be a corrupt data, how to deal with that data that is called veracity.', 'start': 369.101, 'duration': 7.864}, {'end': 380.467, 'text': 'That is also one property of big data.', 'start': 377.365, 'duration': 3.102}, {'end': 390.852, 'text': 'So you can see big data is not just 
about now volume part, it consists of multiple other factors like velocity, variety, veracity.', 'start': 380.707, 'duration': 10.145}, {'end': 394.255, 'text': 'All these are important component of big data.', 'start': 391.452, 'duration': 2.803}], 'summary': 'Use big data tools for unstructured data, handling missing/corrupt data and veracity issues, alongside volume, velocity, and variety considerations.', 'duration': 49.884, 'max_score': 344.371, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo344371.jpg'}, {'end': 477.273, 'src': 'embed', 'start': 445.023, 'weight': 3, 'content': [{'end': 446.924, 'text': 'We can talk about Facebook.', 'start': 445.023, 'duration': 1.901}, {'end': 452.566, 'text': 'every minute, so many users are posting the things or liking something.', 'start': 446.924, 'duration': 5.642}, {'end': 454.327, 'text': 'so much of event is occurring.', 'start': 452.566, 'duration': 1.761}, {'end': 458.749, 'text': 'We can talk about Twitter every minute 3,47,222 tweets are happening.', 'start': 454.647, 'duration': 4.102}, {'end': 464.751, 'text': 'So much of activity is happening per minute.', 'start': 462.05, 'duration': 2.701}, {'end': 468.831, 'text': 'We are talking about per minute, so you can imagine what must be happening now.', 'start': 465.071, 'duration': 3.76}, {'end': 477.273, 'text': "So there's a fun fact which there's a statistics in fact what tells that every two years data is getting doubled.", 'start': 469.151, 'duration': 8.122}], 'summary': 'Facebook and twitter see high user activity, with millions of posts and tweets per minute, and data doubling every two years.', 'duration': 32.25, 'max_score': 445.023, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo445023.jpg'}, {'end': 593.481, 'src': 'embed', 'start': 567.889, 'weight': 2, 'content': [{'end': 577.552, 'text': 'and imagine, right now it says when you go to 
this indeed.com or nokri.com, you see so many jobs popping up for Apache, Spark, Big Data and all.', 'start': 567.889, 'duration': 9.663}, {'end': 581.213, 'text': 'Imagine what is going to happen in 2020.', 'start': 577.932, 'duration': 3.281}, {'end': 585.056, 'text': 'there will be a huge demand and less supply of people.', 'start': 581.213, 'duration': 3.843}, {'end': 593.481, 'text': "I generally say this so in your company, if you're working, let's say in a database company, you must be seeing your managers.", 'start': 585.756, 'duration': 7.725}], 'summary': 'Anticipate high demand for apache, spark, and big data jobs in 2020.', 'duration': 25.592, 'max_score': 567.889, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo567889.jpg'}, {'end': 689.172, 'src': 'embed', 'start': 659.049, 'weight': 7, 'content': [{'end': 665.374, 'text': 'In fact, lot of people have also come to this level and said that people in next five years,', 'start': 659.049, 'duration': 6.325}, {'end': 673.339, 'text': 'the companies who will not be transforming towards big data or Apache SPA, they will not even be able to survive in the market.', 'start': 665.374, 'duration': 7.965}, {'end': 676.682, 'text': 'This is also being said by the analyst.', 'start': 673.76, 'duration': 2.922}, {'end': 682.486, 'text': 'Now, imagine, by 2020, how much of data we will be dealing with you.', 'start': 677.342, 'duration': 5.144}, {'end': 689.172, 'text': 'talk from animals, shopping cart, vehicles, any sort of event which is generating data.', 'start': 682.486, 'duration': 6.686}], 'summary': 'Analysts predict companies need to embrace big data to survive market in 5 years.', 'duration': 30.123, 'max_score': 659.049, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo659049.jpg'}, {'end': 984.017, 'src': 'embed', 'start': 958.151, 'weight': 4, 'content': [{'end': 964.336, 'text': 'If we talk 
about real-time analysis I just talked about few use cases just right now, like credit card in banking.', 'start': 958.151, 'duration': 6.185}, {'end': 965.057, 'text': "it's very important.", 'start': 964.336, 'duration': 0.721}, {'end': 968.46, 'text': "For government agencies, you're applying for Aadhaar cards and all.", 'start': 965.457, 'duration': 3.003}, {'end': 970.682, 'text': "So if you're in India, you might be doing it.", 'start': 968.76, 'duration': 1.922}, {'end': 974.485, 'text': 'Can you give one more instance for real-time processing? This is in front of you, Sameer, now.', 'start': 970.702, 'duration': 3.783}, {'end': 979.992, 'text': 'now, if we talk about any stock market analysis right stock market analysis.', 'start': 974.905, 'duration': 5.087}, {'end': 984.017, 'text': 'if we talk about that right now, immediately what happens?', 'start': 979.992, 'duration': 4.025}], 'summary': 'Real-time analysis is crucial in banking, government, and stock market for immediate processing.', 'duration': 25.866, 'max_score': 958.151, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo958151.jpg'}, {'end': 1193.468, 'src': 'embed', 'start': 1168.528, 'weight': 5, 'content': [{'end': 1173.711, 'text': "Is it the only advantage? 
No, let's understand few more things with respect to Apache Spark.", 'start': 1168.528, 'duration': 5.183}, {'end': 1182.279, 'text': 'now when we talk about Hadoop, right so as it just into this part, like it happens with batch processing now when it comes to Spark,', 'start': 1174.432, 'duration': 7.847}, {'end': 1186.122, 'text': 'it happens with respect to your real-time processing now.', 'start': 1182.279, 'duration': 3.843}, {'end': 1188.124, 'text': 'so the same thing which I just explained you.', 'start': 1186.122, 'duration': 2.002}, {'end': 1191.286, 'text': 'so with Hadoop, you can have handle the data from multiple sources.', 'start': 1188.124, 'duration': 3.162}, {'end': 1193.468, 'text': 'you can process the data in real time.', 'start': 1191.286, 'duration': 2.182}], 'summary': 'Apache spark allows real-time processing and handling data from multiple sources.', 'duration': 24.94, 'max_score': 1168.528, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo1168528.jpg'}], 'start': 133.921, 'title': 'Big data and analytics', 'summary': 'Delves into the concept of big data, its exponential growth on social media platforms, the projected increase in demand for big data professionals, and the significance of big data analytics. 
it also covers real-time processing in various industries, the limitations of hadoop, and the advantages of apache spark for historical and real-time data processing.', 'chapters': [{'end': 394.255, 'start': 133.921, 'title': 'Understanding big data', 'summary': 'Discusses the concept of big data, emphasizing its multiple properties including volume, variety, and its impact on platforms like facebook, with a notable reference to the exponential growth of data and the challenges posed by unstructured data.', 'duration': 260.334, 'highlights': ["The exponential growth of data on platforms like Facebook is highlighted, with the CEO mentioning that the number of users is equivalent to the world's population 100 years ago. The CEO of Facebook mentioned that the platform now has a number of users equivalent to the world's population 100 years ago, showcasing the exponential growth of data.", 'The discussion on unstructured data and its challenges is presented, emphasizing the limitations of RDBMS in dealing with unstructured data. The limitations of RDBMS in handling unstructured data are emphasized, highlighting the need for big data tools like Hadoop to address unstructured data challenges.', 'The comprehensive definition of big data is provided, encompassing properties such as volume, variety, velocity, and veracity. 
The chapter provides a comprehensive definition of big data, encompassing properties like volume, variety, velocity, and veracity as essential components of big data.']}, {'end': 658.549, 'start': 394.555, 'title': 'Big data explosion', 'summary': 'Discusses the exponential growth of data on social media platforms like facebook, instagram, youtube, and twitter, the challenges faced by companies in leveraging big data, and the projected increase in demand for big data professionals by 2020.', 'duration': 263.994, 'highlights': ['The exponential growth of data on social media platforms like Facebook, Instagram, YouTube, and Twitter is discussed, with specific statistics on user activities and data volume per minute. Facebook, Instagram, YouTube, and Twitter are mentioned, with specific statistics on user activities and data volume per minute, such as 17,36,111 posts on Instagram, three hours of video uploaded on YouTube, and 3,47,222 tweets on Twitter per minute.', 'The challenges faced by companies in leveraging big data, including hesitancy to transition to big data tools such as Hadoop and concerns about support and user base, are outlined. Only 4-5% of companies working with data have realized the potential of big data, but many are hesitant to transition to big data tools like Hadoop due to concerns about support and user base.', 'The projected increase in demand for big data professionals by 2020 and the expected shortage of skilled individuals are emphasized, indicating the need for companies to adapt to the big data domain. 
It is expected that the 5% of companies currently working with big data will grow to 40% by 2020, leading to a huge demand and less supply of skilled individuals in the big data domain.']}, {'end': 835.474, 'start': 659.049, 'title': 'Importance of big data analytics', 'summary': 'Explains the significance of big data analytics, predicting that companies not adopting big data may struggle to survive in the market by 2020 due to the exponential growth of data from various sources such as iot, and discusses the challenges of analyzing large volumes of data, along with the concepts of batch and real-time analytics.', 'duration': 176.425, 'highlights': ['Companies not adopting big data may struggle to survive in the market by 2020 The speaker emphasizes the urgency of adopting big data analytics, suggesting that companies not transforming towards big data or Apache SPA may face survival challenges in the market by 2020.', 'Exponential growth of data from various sources such as IoT The transcript predicts an exponential increase in data volume from sources like IoT, indicating the need for big data analytics to handle the vast amount of generated data.', 'Challenges of analyzing large volumes of data The speaker discusses the major challenge of analyzing large volumes of data, questioning the feasibility of gaining business insights from such extensive data and highlighting the importance of understanding this domain.', 'Concepts of batch and real-time analytics The chapter introduces the concepts of batch and real-time analytics, using the analogy of processing clothes with a washing machine to explain batch processing and real-time processing, providing a clear understanding of the two types of analytics.']}, {'end': 1621.852, 'start': 836.087, 'title': 'Real-time processing and apache spark', 'summary': 'Discusses real-time processing in banking, stock market analysis, and healthcare, highlighting the importance of real-time analysis and the limitations of hadoop 
for real-time processing. it also explains the advantages of apache spark over hadoop, focusing on its ability to handle both historical and real-time data processing, its ease of use, and faster processing speed.', 'duration': 785.765, 'highlights': ['Real-time Processing in Banking, Stock Market Analysis, and Healthcare Real-time processing is crucial in banking for credit card transactions, stock market analysis by companies like Tower Research and Goldman Sachs, and healthcare for immediate patient insights and treatment.', 'Limitations of Hadoop for Real-time Processing Hadoop is limited to batch processing and cannot handle real-time processing, leading to delays in data assessment and processing, making it unsuitable for real-time data analysis.', 'Advantages of Apache Spark over Hadoop Apache Spark can handle both historical and real-time data processing, offers ease of use compared to MapReduce programming in Hadoop, and provides faster data processing, making it superior to Hadoop for real-time and batch processing.']}], 'duration': 1487.931, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo133921.jpg', 'highlights': ["The exponential growth of data on platforms like Facebook is highlighted, with the CEO mentioning that the number of users is equivalent to the world's population 100 years ago.", 'The comprehensive definition of big data is provided, encompassing properties such as volume, variety, velocity, and veracity.', 'The projected increase in demand for big data professionals by 2020 and the expected shortage of skilled individuals are emphasized, indicating the need for companies to adapt to the big data domain.', 'Facebook, Instagram, YouTube, and Twitter are mentioned, with specific statistics on user activities and data volume per minute, such as 17,36,111 posts on Instagram, three hours of video uploaded on YouTube, and 3,47,222 tweets on Twitter per minute.', 'Real-time processing is 
crucial in banking for credit card transactions, stock market analysis by companies like Tower Research and Goldman Sachs, and healthcare for immediate patient insights and treatment.', 'Advantages of Apache Spark over Hadoop Apache Spark can handle both historical and real-time data processing, offers ease of use compared to MapReduce programming in Hadoop, and provides faster data processing, making it superior to Hadoop for real-time and batch processing.', 'The limitations of RDBMS in handling unstructured data are emphasized, highlighting the need for big data tools like Hadoop to address unstructured data challenges.', 'It is expected that the 5% of companies currently working with big data will grow to 40% by 2020, leading to a huge demand and less supply of skilled individuals in the big data domain.']}, {'end': 2565.773, 'segs': [{'end': 1667.969, 'src': 'embed', 'start': 1642.256, 'weight': 1, 'content': [{'end': 1652.083, 'text': 'Now how MapReduce have solved this? How MapReduce have solving this? What is the correct solution for this? So let us see how we can solve this.', 'start': 1642.256, 'duration': 9.827}, {'end': 1656.906, 'text': 'So from where this bottleneck start? Bottleneck started because we were looking back.', 'start': 1652.103, 'duration': 4.803}, {'end': 1662.09, 'text': "How about if we remove that bottleneck? 
So let's see, so let me remove this solution from here.", 'start': 1657.407, 'duration': 4.683}, {'end': 1667.969, 'text': 'and now what we are going to do is let us give a better solution.', 'start': 1663.766, 'duration': 4.203}], 'summary': 'Using mapreduce to address bottleneck and provide a better solution.', 'duration': 25.713, 'max_score': 1642.256, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo1642256.jpg'}, {'end': 1855.261, 'src': 'embed', 'start': 1827.026, 'weight': 0, 'content': [{'end': 1831.587, 'text': 'So these are the three steps involved in MapReduce programming.', 'start': 1827.026, 'duration': 4.561}, {'end': 1835.368, 'text': 'Now this is how you will be solving your problem.', 'start': 1832.347, 'duration': 3.021}, {'end': 1840.269, 'text': 'Now, okay, we understood this part, but why MapReduce was lower?', 'start': 1835.848, 'duration': 4.421}, {'end': 1848.13, 'text': 'This is still a mystery to us, right, because we want to definitely understand that why we were talking about that MapReduce is lower.', 'start': 1840.509, 'duration': 7.621}, {'end': 1850.777, 'text': 'in order to solve this.', 'start': 1849.216, 'duration': 1.561}, {'end': 1855.261, 'text': "what we are doing is so I'm right now assuming that my replication factor is one again.", 'start': 1850.777, 'duration': 4.484}], 'summary': 'Three steps in mapreduce programming, mystery of why mapreduce was lower needs understanding.', 'duration': 28.235, 'max_score': 1827.026, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo1827026.jpg'}, {'end': 2246.345, 'src': 'embed', 'start': 2221.092, 'weight': 2, 'content': [{'end': 2225.855, 'text': 'and why that happened is because there are so much of input-output operation.', 'start': 2221.092, 'duration': 4.763}, {'end': 2226.835, 'text': 'thanks a lot.', 'start': 2225.855, 'duration': 0.98}, {'end': 2227.836, 'text': "let's 
move further.", 'start': 2226.835, 'duration': 1.001}, {'end': 2230.617, 'text': 'so this is the problem with MapReduce.', 'start': 2227.836, 'duration': 2.781}, {'end': 2234.519, 'text': "now let's see that how Apache Spark is solving this problem.", 'start': 2230.617, 'duration': 3.902}, {'end': 2245.364, 'text': 'how Apache Spark solve the problems and why it is faster, why we are saying that I will be able to give the output in faster time,', 'start': 2234.881, 'duration': 10.483}, {'end': 2246.345, 'text': "so let's understand that.", 'start': 2245.364, 'duration': 0.981}], 'summary': 'Apache spark solves input-output problems faster than mapreduce.', 'duration': 25.253, 'max_score': 2221.092, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo2221092.jpg'}, {'end': 2499.051, 'src': 'embed', 'start': 2464.463, 'weight': 3, 'content': [{'end': 2475.989, 'text': 'In Apache Spark also there is one main entry point without which none of your application will work and that entry point is called Spark context.', 'start': 2464.463, 'duration': 11.526}, {'end': 2480.792, 'text': 'We also denote Spark context as SC.', 'start': 2476.93, 'duration': 3.862}, {'end': 2487.615, 'text': 'Now this is the main entry point and this decides at the master machine.', 'start': 2481.712, 'duration': 5.903}, {'end': 2492.658, 'text': 'So we will be keeping this SC here.', 'start': 2488.416, 'duration': 4.242}, {'end': 2499.051, 'text': "okay?. 
for when you write your Java programs, let's say you have written one project.", 'start': 2492.658, 'duration': 6.393}], 'summary': 'The main entry point in apache spark is spark context, denoted as sc, which is crucial for application functionality and is located at the master machine.', 'duration': 34.588, 'max_score': 2464.463, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo2464463.jpg'}], 'start': 1621.852, 'title': 'Mapreduce and apache spark', 'summary': 'Discusses how mapreduce solves bottlenecks by improving algorithm efficiency, explains mapreduce programming steps, and introduces the inefficiency of mapreduce programs due to excessive input-output operations and how apache spark aims to improve speed and efficiency.', 'chapters': [{'end': 1667.969, 'start': 1621.852, 'title': 'Mapreduce: solving bottlenecks', 'summary': 'Discusses how mapreduce solves the bottleneck problem by eliminating the need to repeatedly check for previously encountered elements, thus improving algorithm efficiency and performance.', 'duration': 46.117, 'highlights': ['MapReduce eliminates the bottleneck by removing the need to repeatedly check for previously encountered elements, leading to improved algorithm efficiency.', "The bottleneck is caused by the need to check whether each new entry has occurred before, creating a major hindrance to the algorithm's performance."]}, {'end': 2123.005, 'start': 1667.969, 'title': 'Mapreduce programming steps', 'summary': 'Explains the steps involved in mapreduce programming, including mapper, sort and shuffle, and reducer phases, along with the challenges of input/output operations and memory management.', 'duration': 455.036, 'highlights': ['The steps involved in MapReduce programming are mapper, sort and shuffle, and reducer phases. 
This detail is the most relevant as it summarizes the main content of the chapter.', 'Input/output operations in MapReduce can lead to a degradation in performance due to disk seek and copying data to memory. This highlight explains a significant challenge in MapReduce programming and its impact on performance.', 'Memory management in MapReduce is crucial, and data is divided into 128 MB blocks to ensure efficient processing. This detail provides insight into the specific memory management practices in MapReduce.']}, {'end': 2565.773, 'start': 2124.136, 'title': 'Mapreduce input-output operations', 'summary': 'Explains the inefficiency of mapreduce programs due to excessive input-output operations, leading to slower execution times, and introduces the problem that apache spark aims to solve by improving speed and efficiency.', 'duration': 441.637, 'highlights': ['MapReduce programs are slower due to excessive input-output operations, which is evident in the example provided, causing a delay in execution times. The chapter provides an example demonstrating how the numerous input-output operations within MapReduce programs contribute to slower execution times.', "Introduction to the problem that Apache Spark aims to solve by improving speed and efficiency, setting the stage for the subsequent explanation of Apache Spark's approach. The chapter sets the stage for explaining how Apache Spark addresses the inefficiencies of MapReduce by improving speed and efficiency.", 'Explanation of how Apache Spark addresses the inefficiencies of MapReduce by improving speed and efficiency, introducing the concept of Spark context as the main entry point. 
The chapter introduces the concept of Spark context as the main entry point in Apache Spark, essential for improving speed and efficiency in contrast to MapReduce.']}], 'duration': 943.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo1621852.jpg', 'highlights': ['The steps involved in MapReduce programming are mapper, sort and shuffle, and reducer phases. This detail is the most relevant as it summarizes the main content of the chapter.', 'MapReduce eliminates the bottleneck by removing the need to repeatedly check for previously encountered elements, leading to improved algorithm efficiency.', "Introduction to the problem that Apache Spark aims to solve by improving speed and efficiency, setting the stage for the subsequent explanation of Apache Spark's approach.", 'Explanation of how Apache Spark addresses the inefficiencies of MapReduce by improving speed and efficiency, introducing the concept of Spark context as the main entry point.']}, {'end': 3583.26, 'segs': [{'end': 2600.75, 'src': 'embed', 'start': 2566.794, 'weight': 0, 'content': [{'end': 2582.682, 'text': 'What this text file API will do in Apache Spark would be whatever file you have written inside it f.txt it will go and search that file and will load it in the memory of your machine.', 'start': 2566.794, 'duration': 15.888}, {'end': 2585.043, 'text': 'what does I mean by that?', 'start': 2582.682, 'duration': 2.361}, {'end': 2594.606, 'text': 'now? 
in this case, f.txt is where, in three machine, f.txt is B1 block, B2 block, B3 block.', 'start': 2585.043, 'duration': 9.563}, {'end': 2598.748, 'text': 'so what is going to happen would be your B1 block.', 'start': 2594.606, 'duration': 4.142}, {'end': 2600.75, 'text': 'let me create this.', 'start': 2599.369, 'duration': 1.381}], 'summary': 'Apache spark text file api loads f.txt into memory across 3 machines.', 'duration': 33.956, 'max_score': 2566.794, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo2566794.jpg'}, {'end': 2662.693, 'src': 'embed', 'start': 2633.955, 'weight': 1, 'content': [{'end': 2636.036, 'text': 'Now what is going to happen?', 'start': 2633.955, 'duration': 2.081}, {'end': 2639.837, 'text': 'So we have just understood that B1, B2, B3 will be there here.', 'start': 2636.316, 'duration': 3.521}, {'end': 2646.639, 'text': 'I am assuming that my memory is big enough to hold all this data.', 'start': 2639.837, 'duration': 6.802}, {'end': 2649.139, 'text': 'Now what happens in case?', 'start': 2647.399, 'duration': 1.74}, {'end': 2651.48, 'text': "will all the blocks say it's not mandatory, sort of?", 'start': 2649.139, 'duration': 2.341}, {'end': 2656.368, 'text': 'it is not mandatory that all your block size should be same.', 'start': 2652.125, 'duration': 4.243}, {'end': 2658.95, 'text': 'it can be different as well.', 'start': 2656.368, 'duration': 2.582}, {'end': 2662.693, 'text': "okay, it doesn't matter whatever would be the block size.", 'start': 2658.95, 'duration': 3.743}], 'summary': 'B1, b2, b3 present; block size not mandatory, can be different.', 'duration': 28.738, 'max_score': 2633.955, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo2633955.jpg'}, {'end': 2740.193, 'src': 'embed', 'start': 2705.057, 'weight': 2, 'content': [{'end': 2712.121, 'text': 'What is RDD? 
RDD is the distributed data sitting in memory.', 'start': 2705.057, 'duration': 7.064}, {'end': 2720.226, 'text': 'What is the full form of RDD? The full form of RDD is resilient distributed dataset.', 'start': 2712.742, 'duration': 7.484}, {'end': 2722.427, 'text': 'Now let me ask you one question.', 'start': 2720.886, 'duration': 1.541}, {'end': 2731.993, 'text': 'Is this distributed data or not? Is it distributed data or not? Yes it is, right? It is distributed data.', 'start': 2722.888, 'duration': 9.105}, {'end': 2736.551, 'text': 'what do you understand by resilient?', 'start': 2733.169, 'duration': 3.382}, {'end': 2737.832, 'text': 'can I get an answer?', 'start': 2736.551, 'duration': 1.281}, {'end': 2739.593, 'text': 'what do you understand by this keyword?', 'start': 2737.832, 'duration': 1.761}, {'end': 2740.193, 'text': 'what does it do?', 'start': 2739.593, 'duration': 0.6}], 'summary': "RDD stands for resilient distributed dataset: distributed data held in memory that is also resilient.", 'duration': 35.136, 'max_score': 2705.057, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo2705057.jpg'}, {'end': 2909.434, 'src': 'embed', 'start': 2878.316, 'weight': 3, 'content': [{'end': 2884.638, 'text': 'so that is the way an RDD ensures that it is resilient.', 'start': 2878.316, 'duration': 6.322}, {'end': 2891.7, 'text': 'even if you lose the data or if you lose any of the machines, it does not matter, it takes care of it.', 'start': 2884.638, 'duration': 7.062}, {'end': 2894.681, 'text': 'so this is called your resilient portion.', 'start': 2891.7, 'duration': 2.981}, {'end': 2904.361, 'text': "Now let's move further. So we just understood what an RDD is and, secondly, how it is resilient.", 'start': 2896.101, 'duration': 8.26}, {'end': 2906.948, 'text': "Let's take one more step.", 'start': 2905.063, 'duration': 1.885}, {'end': 2909.434, 'text': 'So we have created our number RDD already.', 'start': 2907.349, 'duration': 2.085}], 
'summary': 'Rdd ensures resilience, even with data or machine loss.', 'duration': 31.118, 'max_score': 2878.316, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo2878316.jpg'}, {'end': 2994.952, 'src': 'embed', 'start': 2964.124, 'weight': 4, 'content': [{'end': 2965.304, 'text': 'it can be anything.', 'start': 2964.124, 'duration': 1.18}, {'end': 2969.082, 'text': 'whatever program you want to write you can mention it here.', 'start': 2965.959, 'duration': 3.123}, {'end': 2979.392, 'text': 'So whatever code you will be writing your map function will be responsible or your map API will be responsible to execute it.', 'start': 2969.382, 'duration': 10.01}, {'end': 2982.155, 'text': 'Now what we are doing here.', 'start': 2980.273, 'duration': 1.882}, {'end': 2987.326, 'text': 'One more point RDDs are always immutable.', 'start': 2982.762, 'duration': 4.564}, {'end': 2994.952, 'text': 'whenever, say I say that RDDs are immutable, that means if you have already put the block B1 into the memory,', 'start': 2987.326, 'duration': 7.626}], 'summary': 'Rdds are immutable in spark, and the map function or map api is responsible for executing program code.', 'duration': 30.828, 'max_score': 2964.124, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo2964124.jpg'}, {'end': 3230.321, 'src': 'embed', 'start': 3198.468, 'weight': 6, 'content': [{'end': 3200.649, 'text': 'but still Spark have a way.', 'start': 3198.468, 'duration': 2.181}, {'end': 3206.43, 'text': 'if your RAM is low also it can handle it and that concept is called pipelining concept.', 'start': 3200.649, 'duration': 5.781}, {'end': 3214.751, 'text': "I'm not going to cover it in this session but yes there is a way even if your memory is less Spark take care of it.", 'start': 3206.75, 'duration': 8.001}, {'end': 3216.772, 'text': 'Very interesting concept in Spark.', 'start': 3215.151, 'duration': 
1.621}, {'end': 3224.313, 'text': "Okay, yes, again, that's a very interesting concept: Spark can still cope if you have a little less memory.", 'start': 3217.112, 'duration': 7.201}, {'end': 3230.321, 'text': 'So that makes Spark a very smart framework, and that is the reason people are going for this framework.', 'start': 3225.04, 'duration': 5.281}], 'summary': 'Spark can handle low RAM using the pipelining concept, making it a smart framework.', 'duration': 31.853, 'max_score': 3198.468, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo3198468.jpg'}, {'end': 3332.09, 'src': 'embed', 'start': 3296.959, 'weight': 7, 'content': [{'end': 3298.419, 'text': 'right? On my number RDD.', 'start': 3296.959, 'duration': 1.46}, {'end': 3302.88, 'text': 'My number RDD is dependent on something? Yes, f.txt.', 'start': 3298.699, 'duration': 4.181}, {'end': 3306.02, 'text': 'So it is keeping this file.', 'start': 3303.34, 'duration': 2.68}, {'end': 3309.781, 'text': 'Now can you see, this is a graph which I just submitted here?', 'start': 3306.461, 'duration': 3.32}, {'end': 3319.766, 'text': 'This graph is maintained by Spark context as soon as you execute all these statements, and this DAG,', 'start': 3310.161, 'duration': 9.605}, {'end': 3325.868, 'text': 'this directed acyclic graph, is also called the lineage.', 'start': 3319.766, 'duration': 6.102}, {'end': 3332.09, 'text': 'So, in lineage, what happens?', 'start': 3330.329, 'duration': 1.761}], 'summary': 'The number RDD depends on f.txt; Spark context maintains this dependency as a directed acyclic graph (DAG), also called the lineage.', 'duration': 35.131, 'max_score': 3296.959, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo3296959.jpg'}, {'end': 3430.782, 'src': 'embed', 'start': 3361.636, 'weight': 8, 'content': [{'end': 3368.821, 'text': 'this B5 block got generated due to B2 block and this B6 block will be generated due 
to B3 block.', 'start': 3361.636, 'duration': 7.185}, {'end': 3376.86, 'text': 'In other terms, I can say this filter1 RDD got generated with the help of the number RDD.', 'start': 3369.815, 'duration': 7.045}, {'end': 3385.726, 'text': 'So number was also an RDD, but from that number RDD I created a new RDD, which is called the filter1 RDD.', 'start': 3377.2, 'duration': 8.526}, {'end': 3389.889, 'text': 'This step is called a transformation step.', 'start': 3386.207, 'duration': 3.682}, {'end': 3394.233, 'text': 'So we call this step a transformation step.', 'start': 3390.77, 'duration': 3.463}, {'end': 3401.59, 'text': 'Now are we printing any output here? No, we are only keeping the data in memory.', 'start': 3396.534, 'duration': 5.056}, {'end': 3406.031, 'text': 'In Java we used to use the print statement and all, right?', 'start': 3402.01, 'duration': 4.021}, {'end': 3413.433, 'text': "In Spark we don't use a print statement here; instead of that we have a collect statement.", 'start': 3406.591, 'duration': 6.842}, {'end': 3424.537, 'text': "So let's say I want to print B4, B5, B6, that means I want to print the filter1 RDD: I can write filter1.collect.", 'start': 3413.833, 'duration': 10.704}, {'end': 3430.782, 'text': 'This will print B4, B5, B6 to your console.', 'start': 3425.237, 'duration': 5.545}], 'summary': 'B6 block is generated from B3 in a transformation step in Spark, 
using collect to print output.', 'duration': 69.146, 'max_score': 3361.636, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo3361636.jpg'}, {'end': 3516.807, 'src': 'embed', 'start': 3488.285, 'weight': 5, 'content': [{'end': 3490.467, 'text': 'moving further, this is how it is done.', 'start': 3488.285, 'duration': 2.182}, {'end': 3492.868, 'text': 'so this I just discussed about this part.', 'start': 3490.467, 'duration': 2.401}, {'end': 3496.25, 'text': 'that spark provides you faster processing.', 'start': 3492.868, 'duration': 3.382}, {'end': 3498.992, 'text': 'so basically, RDD creation, start with the transformation.', 'start': 3496.25, 'duration': 2.742}, {'end': 3500.113, 'text': 'yes, yes, sir.', 'start': 3498.992, 'duration': 1.121}, {'end': 3507.244, 'text': 'Now, this faster processing is the part which we have just discussed and can you also see, it is very easy to use, right?', 'start': 3500.882, 'duration': 6.362}, {'end': 3510.785, 'text': 'it is very easy to use in comparison to my MapReduce.', 'start': 3507.244, 'duration': 3.541}, {'end': 3516.807, 'text': "if you have already done MapReduce programming or if you remember that Apple Boren's banana example right.", 'start': 3510.785, 'duration': 6.022}], 'summary': 'Spark provides faster processing and is easier to use than mapreduce.', 'duration': 28.522, 'max_score': 3488.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo3488285.jpg'}, {'end': 3562.163, 'src': 'embed', 'start': 3536.663, 'weight': 10, 'content': [{'end': 3542.065, 'text': "Now moving further let's understand Spark success story what are the things we have.", 'start': 3536.663, 'duration': 5.402}, {'end': 3544.305, 'text': 'Now Spark success stories.', 'start': 3542.465, 'duration': 1.84}, {'end': 3548.866, 'text': 'there are a lot of people who are using it these days, like if we talk about stock market.', 
'start': 3544.305, 'duration': 4.561}, {'end': 3555.768, 'text': 'stock market is using Apache Spark a lot because of faster processing capability, easier in nature,', 'start': 3548.866, 'duration': 6.902}, {'end': 3559.489, 'text': 'plus a lot of things which are available with Spark easily.', 'start': 3555.768, 'duration': 3.721}, {'end': 3562.163, 'text': 'Twitter sentiment analysis.', 'start': 3560.421, 'duration': 1.742}], 'summary': 'Apache spark is widely used, especially in stock market for faster processing and in twitter sentiment analysis.', 'duration': 25.5, 'max_score': 3536.663, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo3536663.jpg'}], 'start': 2566.794, 'title': 'Apache spark fundamentals', 'summary': 'Covers the apache spark text file api for distributing file blocks, the concept of resilient distributed data (rdd) in spark, and the success stories of spark in industries like stock market, social media analysis, and banking for fraud detection.', 'chapters': [{'end': 2662.693, 'start': 2566.794, 'title': 'Apache spark text file api', 'summary': 'Explains how the apache spark text file api loads and distributes file blocks across machine memories, where each block is copied into the memory of different machines, and the block sizes can vary.', 'duration': 95.899, 'highlights': ['The Apache Spark text file API loads file blocks into the memory of machines, with each block being copied into different machine memories.', 'Block sizes can vary and it is not mandatory for all block sizes to be the same.']}, {'end': 3268.03, 'start': 2662.693, 'title': 'Understanding resilient distributed data in spark', 'summary': 'Explains the concept of resilient distributed data (rdd) in spark, highlighting its role in distributed data storage, its resilience through replication factor, and its immutability. 
it also discusses the speed advantages of spark over mapreduce due to in-memory processing and the capability to handle low memory through pipelining.', 'duration': 605.337, 'highlights': ['The concept of Resilient Distributed Data (RDD) in Spark is explained, emphasizing its role as distributed data sitting in memory and its full form as resilient distributed data.', 'The resilience of RDD is elaborated, detailing its reliability through the replication factor, ensuring data is not lost even if a machine fails.', 'The immutability of RDDs is highlighted, explaining that once a block is in memory, it cannot be changed, and new blocks are created through processing without altering the original data.', 'The speed advantages of Spark over MapReduce are discussed, emphasizing the reduction in input-output operations and the use of in-memory processing, leading to faster output.', 'The capability of Spark to handle low memory through pipelining is mentioned, showcasing its smart framework design and its appeal for users.', "The chapter concludes by hinting at further detailed topics covered in sessions, indicating the extensive coverage of Spark's handling mechanisms and addressing reader curiosity about specific concepts."]}, {'end': 3583.26, 'start': 3268.03, 'title': 'Apache spark: transformation, action, and success', 'summary': 'Discusses the concepts of transformation and action in apache spark, emphasizing the creation of rdds, maintenance of lineage, and the simplicity and speed of spark in contrast to mapreduce. it also highlights the success stories of spark in industries such as stock market, social media analysis, and banking for fraud detection.', 'duration': 315.23, 'highlights': ['The maintenance of lineage in Apache Spark, represented by the directed acyclic graph, is essential for tracking dependencies, such as how filter one RDD is dependent on number RDD and number RDD is dependent on f.txt, providing crucial information for transformations. 
(Relevance: 5)', 'The process of transformation in Apache Spark involves creating new RDDs from existing ones, exemplified by the generation of filter one RDD from number RDD, which is a crucial step in the Spark context. (Relevance: 4)', 'The action step in Apache Spark, represented by the collect statement, is used to print output, and is a significant aspect of working with Spark, distinct from the transformation step. (Relevance: 3)', 'Apache Spark is favored for its faster processing, simplicity, and ease of use, particularly when compared to MapReduce, making it a popular choice for various industries such as stock market, social media analysis, and banking for fraud detection. (Relevance: 2)', 'Industries like stock market, social media analysis, and banking for fraud detection utilize Apache Spark due to its faster processing, ease of use, and availability of features, contributing to its widespread success. (Relevance: 1)']}], 'duration': 1016.466, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo2566794.jpg', 'highlights': ['The Apache Spark text file API loads file blocks into the memory of machines, with each block being copied into different machine memories.', 'Block sizes can vary and it is not mandatory for all block sizes to be the same.', 'The concept of Resilient Distributed Data (RDD) in Spark is explained, emphasizing its role as distributed data sitting in memory and its full form as resilient distributed data.', 'The resilience of RDD is elaborated, detailing its reliability through the replication factor, ensuring data is not lost even if a machine fails.', 'The immutability of RDDs is highlighted, explaining that once a block is in memory, it cannot be changed, and new blocks are created through processing without altering the original data.', 'The speed advantages of Spark over MapReduce are discussed, emphasizing the reduction in input-output operations and the use of in-memory 
processing, leading to faster output.', 'The capability of Spark to handle low memory through pipelining is mentioned, showcasing its smart framework design and its appeal for users.', 'The maintenance of lineage in Apache Spark, represented by the directed acyclic graph, is essential for tracking dependencies, such as how filter one RDD is dependent on number RDD and number RDD is dependent on f.txt, providing crucial information for transformations.', 'The process of transformation in Apache Spark involves creating new RDDs from existing ones, exemplified by the generation of filter one RDD from number RDD, which is a crucial step in the Spark context.', 'The action step in Apache Spark, represented by the collect statement, is used to print output, and is a significant aspect of working with Spark, distinct from the transformation step.', 'Apache Spark is favored for its faster processing, simplicity, and ease of use, particularly when compared to MapReduce, making it a popular choice for various industries such as stock market, social media analysis, and banking for fraud detection.', 'Industries like stock market, social media analysis, and banking for fraud detection utilize Apache Spark due to its faster processing, ease of use, and availability of features, contributing to its widespread success.']}, {'end': 4186.712, 'segs': [{'end': 3633.612, 'src': 'embed', 'start': 3607.284, 'weight': 0, 'content': [{'end': 3611.366, 'text': 'So now in Spark, we have already seen real-time processing and everything.', 'start': 3607.284, 'duration': 4.082}, {'end': 3614.008, 'text': 'now Apache Spark is an open source cluster.', 'start': 3611.366, 'duration': 2.642}, {'end': 3616.289, 'text': "it is available to you, that it's free of cost.", 'start': 3614.008, 'duration': 2.281}, {'end': 3618.091, 'text': 'you may not pay to work on that.', 'start': 3616.289, 'duration': 1.802}, {'end': 3621.613, 'text': 'that is also one of the very important part why Spark is 
famous.', 'start': 3618.091, 'duration': 3.522}, {'end': 3622.073, 'text': 'it can.', 'start': 3621.613, 'duration': 0.46}, {'end': 3628.077, 'text': 'you can perform real-time processing, batch kind of processing, every kind of processing you can perform on it.', 'start': 3622.073, 'duration': 6.004}, {'end': 3631.05, 'text': 'you can perform your programming part.', 'start': 3628.709, 'duration': 2.341}, {'end': 3633.612, 'text': 'okay, you can do data parallelism.', 'start': 3631.05, 'duration': 2.562}], 'summary': 'Apache Spark is free and open source, and supports real-time, batch, and other data processing.', 'duration': 26.328, 'max_score': 3607.284, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo3607284.jpg'}, {'end': 3741.09, 'src': 'embed', 'start': 3710.772, 'weight': 1, 'content': [{'end': 3713.674, 'text': 'I do not even require my HDFS to work on it.', 'start': 3710.772, 'duration': 2.902}, {'end': 3715.075, 'text': "that's the fun part, right.", 'start': 3713.674, 'duration': 1.401}, {'end': 3717.216, 'text': 'so many advantages you can make out on your own.', 'start': 3715.075, 'duration': 2.141}, {'end': 3722.058, 'text': 'now Spark is giving up to 100x faster speed.', 'start': 3717.216, 'duration': 4.842}, {'end': 3724.84, 'text': "don't you think it's an awesome speed?", 'start': 3722.058, 'duration': 2.782}, {'end': 3728.322, 'text': "100x? 
I'm not talking about double the speed or triple the speed.", 'start': 3724.84, 'duration': 3.482}, {'end': 3735.349, 'text': "I'm talking about 100x time faster, which makes now Spark very powerful.", 'start': 3728.322, 'duration': 7.027}, {'end': 3741.09, 'text': 'you might be hearing a lot that lot of companies are migrating from MapReduce to Apache Spark.', 'start': 3735.349, 'duration': 5.741}], 'summary': 'Apache spark offers 100x faster speed, leading to migration from mapreduce.', 'duration': 30.318, 'max_score': 3710.772, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo3710772.jpg'}, {'end': 3834.49, 'src': 'embed', 'start': 3807.403, 'weight': 4, 'content': [{'end': 3810.525, 'text': 'So using Hadoop through Apache Spark.', 'start': 3807.403, 'duration': 3.122}, {'end': 3812.526, 'text': "So let's see how we can do all that.", 'start': 3810.625, 'duration': 1.901}, {'end': 3822.593, 'text': 'Now Spark with HDFS makes it more powerful because you can execute your Spark applications on top of HDFS very easily.', 'start': 3812.947, 'duration': 9.646}, {'end': 3828.057, 'text': 'Now second thing is Spark with MapReduce programming.', 'start': 3823.154, 'duration': 4.903}, {'end': 3834.49, 'text': 'where Spark can be used along with MapReduce programming in the same Hadoop cluster.', 'start': 3828.487, 'duration': 6.003}], 'summary': 'Using apache spark with hdfs enhances power and ease of execution in hadoop clusters.', 'duration': 27.087, 'max_score': 3807.403, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo3807403.jpg'}, {'end': 4186.712, 'src': 'embed', 'start': 4150.452, 'weight': 2, 'content': [{'end': 4157.913, 'text': 'they stopped using this Mahout, in fact the core developer of this Mahout part.', 'start': 4150.452, 'duration': 7.461}, {'end': 4163.752, 'text': 'they themselves migrated towards the MLLib site.', 'start': 4158.447, 
'duration': 5.305}, {'end': 4175.403, 'text': 'Now, even if you talk to those co-developers of Mahot, they themselves are recommending that if you want to execute machine learning program better,', 'start': 4164.233, 'duration': 11.17}, {'end': 4186.712, 'text': 'execute it in the Spark framework, only execute it by using Spark MLLib rather than executing it in your Hadoop.', 'start': 4175.403, 'duration': 11.309}], 'summary': 'Developers migrated from mahout to mllib, recommending spark mllib for better machine learning execution.', 'duration': 36.26, 'max_score': 4150.452, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4150452.jpg'}], 'start': 3583.74, 'title': 'Apache spark in medical domain and its features', 'summary': "Discusses apache spark's use in the medical domain, emphasizing its real-time processing, batch processing, and data parallelism capabilities. it also explains the features and integration of apache spark, such as its 100x faster processing speed, compatibility with hadoop and mapreduce, and the superiority of spark's mllib framework over mahout for machine learning algorithms.", 'chapters': [{'end': 3647.54, 'start': 3583.74, 'title': 'Apache spark in medical domain', 'summary': 'Discusses the use of apache spark in the medical domain, highlighting its ability to handle real-time processing, batch processing, and data parallelism, making it a free and reliable tool for various processing needs.', 'duration': 63.8, 'highlights': ['Apache Spark is used extensively in the medical domain for various processing needs, including real-time processing and batch processing.', 'Apache Spark is an open source cluster that is free of cost, allowing users to perform different kinds of processing, such as real-time and batch processing, along with data parallelism and fault tolerance.', 'Apache Spark is renowned for its ability to handle real-time processing, batch processing, and data parallelism, 
making it a reliable and cost-effective tool for a variety of processing tasks.']}, {'end': 4186.712, 'start': 3647.54, 'title': 'Apache spark: features and integration', 'summary': "Explains the features and integration of apache spark, including its standalone use, 100x faster processing speed, compatibility with hadoop and mapreduce, and the superiority of spark's mllib framework over mahout for machine learning algorithms.", 'duration': 539.172, 'highlights': ['Spark provides 100x faster processing speed, making it very powerful compared to MapReduce.', 'Spark can be used standalone on a simple Windows machine without requiring Hadoop or HDFS, making it very flexible for developers.', "Spark's MLlib framework is recommended over Mahout for machine learning programs due to its in-memory processing, eliminating input/output operations and significantly improving execution speed.", 'Spark can be integrated with HDFS, allowing for easy execution of Spark applications on top of HDFS.', 'Spark can be used with MapReduce programming in the same Hadoop cluster, eliminating the need for separate clusters for Spark and MapReduce.', 'Spark is not intended to replace Hadoop but can be seen as an extension of the Hadoop framework, leveraging features such as HDFS and Yarn.', 'Companies are migrating from MapReduce to Spark due to its 100x faster speed and powerful features like caching and persistence.']}], 'duration': 602.972, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo3583740.jpg', 'highlights': ['Apache Spark is renowned for its ability to handle real-time processing, batch processing, and data parallelism, making it a reliable and cost-effective tool for a variety of processing tasks.', 'Spark provides 100x faster processing speed, making it very powerful compared to MapReduce.', "Spark's MLlib framework is recommended over Mahout for machine learning programs due to its in-memory processing, eliminating 
input/output operations and significantly improving execution speed.", 'Apache Spark is used extensively in the medical domain for various processing needs, including real-time processing and batch processing.', 'Spark can be integrated with HDFS, allowing for easy execution of Spark applications on top of HDFS.']}, {'end': 4855.492, 'segs': [{'end': 4246.266, 'src': 'embed', 'start': 4186.953, 'weight': 0, 'content': [{'end': 4188.995, 'text': "So that's the reason.", 'start': 4186.953, 'duration': 2.042}, {'end': 4196.141, 'text': 'in machine learning algorithms on big data, everyone is moving toward path.', 'start': 4188.995, 'duration': 7.146}, {'end': 4198.463, 'text': "and then let's see all this part in detail.", 'start': 4196.141, 'duration': 2.322}, {'end': 4209.002, 'text': 'now, when we talk about now, in the speed part right and we just were discussing about this people Spark can run 100 X time faster.', 'start': 4198.463, 'duration': 10.539}, {'end': 4211.324, 'text': 'why?. 
We already know the answer now, right?', 'start': 4209.002, 'duration': 2.322}, {'end': 4212.766, 'text': 'We have already seen that part.', 'start': 4211.545, 'duration': 1.221}, {'end': 4222.954, 'text': 'Now when we talk about polyglot, we have just discussed that you can write in Scala, Python, Java and R; so many languages are supported.', 'start': 4213.306, 'duration': 9.648}, {'end': 4226.937, 'text': 'Now next part, this is important.', 'start': 4223.534, 'duration': 3.403}, {'end': 4229.76, 'text': 'Lazy evaluation.', 'start': 4227.758, 'duration': 2.002}, {'end': 4233.438, 'text': 'let me again take you back to my PPT.', 'start': 4229.76, 'duration': 3.678}, {'end': 4237.821, 'text': 'so in this case, now what actually happens?', 'start': 4233.438, 'duration': 4.383}, {'end': 4240.602, 'text': 'How does this execution happen here?', 'start': 4238.141, 'duration': 2.461}, {'end': 4246.266, 'text': 'So, first of all, what is happening here is it is not like that.', 'start': 4241.203, 'duration': 5.063}], 'summary': 'Spark can run up to 100x faster, supports multiple languages (Scala, Python, Java, R), and uses lazy evaluation.', 'duration': 59.313, 'max_score': 4186.953, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4186953.jpg'}, {'end': 4364.84, 'src': 'embed', 'start': 4337.507, 'weight': 7, 'content': [{'end': 4342.289, 'text': 'so this thing is called lazy evaluation.', 'start': 4337.507, 'duration': 4.782}, {'end': 4352.232, 'text': 'meaning that until you hit an action, it will not print, it will not do any execution beforehand.', 'start': 4342.289, 'duration': 9.943}, {'end': 4359.316, 'text': 'so all the execution starts only at the time when you hit an action.', 'start': 4352.731, 'duration': 6.585}, {'end': 4364.84, 'text': "if you're coming from a Pig programming background you might have already seen this feature.", 'start': 4359.316, 'duration': 5.524}], 
'summary': 'Lazy evaluation delays execution until an action is triggered, common in programming.', 'duration': 27.333, 'max_score': 4337.507, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4337507.jpg'}, {'end': 4420.566, 'src': 'embed', 'start': 4391.378, 'weight': 2, 'content': [{'end': 4399.483, 'text': 'we will not be doing any execution so that the data does not remain in memory unnecessarily.', 'start': 4391.378, 'duration': 8.105}, {'end': 4401.925, 'text': 'so this is called lazy evaluation.', 'start': 4399.483, 'duration': 2.442}, {'end': 4406.099, 'text': 'Clear about this part, okay? This is called your lazy evaluation.', 'start': 4402.588, 'duration': 3.511}, {'end': 4407.723, 'text': 'Let us come back to the slides now.', 'start': 4406.259, 'duration': 1.464}, {'end': 4409.502, 'text': 'Now look at this part.', 'start': 4408.542, 'duration': 0.96}, {'end': 4412.463, 'text': 'so this is the lazy evaluation property.', 'start': 4409.502, 'duration': 2.961}, {'end': 4420.566, 'text': 'now, real-time computing: in real time, as and when the data is coming, you can immediately start processing it in memory itself.', 'start': 4412.463, 'duration': 8.103}], 'summary': 'Lazy evaluation reduces unnecessary memory usage for real-time computing.', 'duration': 29.188, 'max_score': 4391.378, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4391378.jpg'}, {'end': 4643.246, 'src': 'embed', 'start': 4560.005, 'weight': 3, 'content': [{'end': 4568.791, 'text': 'because here the algorithms which were taking minutes on your Hadoop side will take only a few seconds in Spark MLlib.', 'start': 4560.005, 'duration': 8.786}, {'end': 4571.773, 'text': "That's a major improvement in MLlib.", 'start': 4569.132, 'duration': 2.641}, {'end': 4579.057, 'text': 'that is the reason people are shifting towards this part. GraphX, where you can perform your graph 
kind of computation.', 'start': 4571.773, 'duration': 7.284}, {'end': 4582.918, 'text': 'you can picture your friend recommendations on Facebook, right.', 'start': 4579.057, 'duration': 3.861}, {'end': 4586.56, 'text': 'so there it internally generates a graph and gives you output.', 'start': 4582.918, 'duration': 3.642}, {'end': 4590.382, 'text': 'so any graph sort of computation is done using GraphX.', 'start': 4586.56, 'duration': 3.822}, {'end': 4593.743, 'text': 'SparkR: this is a newly developed framework.', 'start': 4590.402, 'duration': 3.341}, {'end': 4595.304, 'text': "they're still working on it.", 'start': 4593.743, 'duration': 1.561}, {'end': 4598.045, 'text': "that's right now in the beta phase.", 'start': 4595.304, 'duration': 2.741}, {'end': 4603.984, 'text': 'now here, R is an open source language used by analysts.', 'start': 4598.045, 'duration': 5.939}, {'end': 4606.507, 'text': 'now, what does the Spark community want?', 'start': 4603.984, 'duration': 2.523}, {'end': 4616.896, 'text': 'they want to bring all those analysts to the Spark framework, and for that they are working hard by bringing this SparkR.', 'start': 4606.507, 'duration': 10.389}, {'end': 4622.441, 'text': 'SparkR has already made it and this is going to be the next big thing in the market.', 'start': 4616.896, 'duration': 5.545}, {'end': 4625.948, 'text': 'this is how the ecosystem looks.', 'start': 4624.046, 'duration': 1.902}, {'end': 4627.81, 'text': 'so there will be multiple things.', 'start': 4625.948, 'duration': 1.862}, {'end': 4636.94, 'text': 'for example, when we talk about Spark SQL, most of the time every computation happens with respect to your RDDs, but in Spark SQL,', 'start': 4627.81, 'duration': 9.13}, {'end': 4639.202, 'text': 'we have something called a DataFrame.', 'start': 4636.94, 'duration': 2.262}, {'end': 4643.246, 'text': 'Now, a DataFrame is very analogous to your RDD,', 'start': 4639.602, 'duration': 3.644}], 'summary': 'Spark MLlib offers 
major improvement, SparkR is the next big thing in the market.', 'duration': 83.241, 'max_score': 4560.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4560005.jpg'}, {'end': 4807.262, 'src': 'heatmap', 'start': 4663.935, 'weight': 1, 'content': [{'end': 4666.136, 'text': 'instead, we call it a DataFrame.', 'start': 4663.935, 'duration': 2.201}, {'end': 4677.188, 'text': 'similarly in your machine learning also, we have something called an ML pipeline, which makes it easier to combine multiple algorithms.', 'start': 4666.88, 'duration': 10.308}, {'end': 4681.591, 'text': 'so that is what your ML pipeline does in terms of MLlib.', 'start': 4677.188, 'duration': 4.403}, {'end': 4684.053, 'text': "Now let's talk about Spark Core.", 'start': 4681.972, 'duration': 2.081}, {'end': 4688.597, 'text': 'so Spark Core, we already discussed any data which is residing in memory.', 'start': 4684.053, 'duration': 4.544}, {'end': 4695.24, 'text': 'we call that data an RDD, and this is all about your Spark Core component,', 'start': 4689.037, 'duration': 6.203}, {'end': 4702.243, 'text': 'where you will be able to work on a large-scale parallel system because all the data will finally be sitting in a distributed way,', 'start': 4695.24, 'duration': 7.003}, {'end': 4705.645, 'text': 'so all the computation will also happen in parallel.', 'start': 4702.243, 'duration': 3.402}, {'end': 4708.767, 'text': 'so this is about your Spark Core component.', 'start': 4706.005, 'duration': 2.762}, {'end': 4717.114, 'text': 'when we talk about the architecture of Spark now, you can relate this to your NameNode, where your driver program is sitting,', 'start': 4708.767, 'duration': 8.347}, {'end': 4719.576, 'text': 'which we call the master machine.', 'start': 4717.114, 'duration': 2.462}, {'end': 4723.059, 'text': 'so on your master machine, your SparkContext will be there.', 'start': 4719.576, 'duration': 3.483}, {'end': 
4726.522, 'text': 'similarly, the worker node corresponds to the data node.', 'start': 4723.059, 'duration': 3.463}, {'end': 4730.345, 'text': 'so in Spark we call these data nodes worker nodes.', 'start': 4726.522, 'duration': 3.823}, {'end': 4735.249, 'text': 'now there must be a place in memory where you will be keeping your blocks.', 'start': 4730.345, 'duration': 4.904}, {'end': 4739.991, 'text': 'that space in memory we call executors.', 'start': 4735.769, 'duration': 4.222}, {'end': 4740.812, 'text': 'as you can see,', 'start': 4739.991, 'duration': 0.821}, {'end': 4751.277, 'text': 'there are two data nodes, or worker nodes, here, and we have executors, meaning the space in your RAM where you will be keeping all the blocks will be called executors.', 'start': 4740.812, 'duration': 10.465}, {'end': 4754.079, 'text': 'Now the blocks which are residing.', 'start': 4751.837, 'duration': 2.242}, {'end': 4759.962, 'text': 'right. for example, you were doing that .map logic to get the values less than 10.', 'start': 4754.079, 'duration': 5.883}, {'end': 4767.626, 'text': 'right, now that logic, the code that you are executing on your RDD, is called a task.', 'start': 4759.962, 'duration': 7.664}, {'end': 4769.607, 'text': 'okay, so that is called a task.', 'start': 4767.626, 'duration': 1.981}, {'end': 4775.49, 'text': 'now, in the middle, there will be a cluster manager, like YARN or Mesos, whatever you want to keep.', 'start': 4769.607, 'duration': 5.883}, {'end': 4777.671, 'text': 'that will be an intermediate thing.', 'start': 4775.49, 'duration': 2.181}, {'end': 4781.513, 'text': 'now everything will be moving through this cycle: SparkContext,', 'start': 4777.671, 'duration': 3.842}, {'end': 4784.433, 'text': 'then YARN will be taking care of the execution.', 'start': 4781.929, 'duration': 2.504}, {'end': 4788.718, 'text': 'then, in your executor, your code will be sitting, where you will be performing your task.', 'start': 4784.433, 
'duration': 4.285}, {'end': 4792.624, 'text': 'You can also cache your data if you wish to, you can cache or process your data.', 'start': 4788.738, 'duration': 3.886}, {'end': 4796.854, 'text': "Let's talk about Spark Streaming.", 'start': 4793.712, 'duration': 3.142}, {'end': 4803.639, 'text': 'we have already been discussing for a good while that we have real-time processing available here.', 'start': 4796.854, 'duration': 6.785}, {'end': 4807.262, 'text': 'so what happens here is, as soon as you are getting the data,', 'start': 4803.639, 'duration': 3.623}], 'summary': 'DataFrame and ML pipeline in Spark, with distributed data and executors for computation.', 'duration': 143.327, 'max_score': 4663.935, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4663935.jpg'}, {'end': 4831.398, 'src': 'embed', 'start': 4807.262, 'weight': 4, 'content': [{'end': 4814.788, 'text': 'you will be splitting the data into small batches and you will immediately do processing on it in memory.', 'start': 4807.262, 'duration': 7.526}, {'end': 4817.83, 'text': 'that is done with the help of Spark Streaming.', 'start': 4814.788, 'duration': 3.042}, {'end': 4823.651, 'text': "and the micro batches of data that you're creating are also called DStreams.", 'start': 4818.306, 'duration': 5.345}, {'end': 4831.398, 'text': "Now, right now we're just talking at a very high level about all these topics, because we just want to give you an idea about how things work,", 'start': 4823.891, 'duration': 7.507}], 'summary': 'Data is split into small batches and processed in memory using Spark Streaming.', 'duration': 24.136, 'max_score': 4807.262, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4807262.jpg'}], 'start': 4186.953, 'title': 'Machine learning algorithms on big data', 'summary': "Discusses the shift towards machine learning algorithms on big 
data, emphasizing spark's 100x faster speed, support for multiple languages like scala, python, java, and r, and the concept of lazy evaluation. it also explains lazy evaluation in spark, enabling real-time processing and machine learning capabilities, and discusses components like spark core, spark sql, spark streaming, spark mllib, and spark r, as well as the architecture of spark and its ecosystem.", 'chapters': [{'end': 4246.266, 'start': 4186.953, 'title': 'Machine learning algorithms on big data', 'summary': 'Discusses the reasons for the shift towards machine learning algorithms on big data, emphasizing the 100x faster speed of spark and the support for multiple languages like scala, python, java, and r, as well as the concept of lazy evaluation.', 'duration': 59.313, 'highlights': ['Spark can run 100x faster, contributing to the shift towards machine learning algorithms on big data.', 'Support for multiple languages like Scala, Python, Java, and R is an important aspect in the discussion of machine learning algorithms on big data.', 'Lazy evaluation is an important concept to understand in the context of machine learning algorithms on big data.']}, {'end': 4855.492, 'start': 4246.266, 'title': 'Lazy evaluation in spark', 'summary': 'Explains the concept of lazy evaluation in spark, which delays execution until an action is triggered, enabling real-time processing and machine learning capabilities. it also discusses components like spark core, spark sql, spark streaming, spark mllib, and spark r, as well as the architecture of spark and the ecosystem.', 'duration': 609.226, 'highlights': ["Spark MLLib is a replacement for Mahout and allows algorithms that took minutes in Hadoop to take only seconds in Spark MLLib. 
Spark MLLib's efficiency is highlighted, with algorithms that previously took minutes in Hadoop now only taking seconds, showcasing significant improvement in processing speed.", "Spark Streaming enables real-time processing by splitting data into small batches and processing it in memory. Spark Streaming's capability to split data into small batches and process it in memory for real-time processing is emphasized, showcasing its ability to handle data streams effectively.", 'Spark SQL allows writing SQL queries, which are internally converted for in-memory computation, and it also introduces the concept of DataFrames for tabular data representation. The introduction of DataFrames in Spark SQL for tabular data representation, along with the ability to write SQL queries that are internally converted for in-memory computation, is explained.', 'Spark R aims to bring analysts to the Spark framework by providing an open source language for analysis, which is anticipated to be a significant advancement in the market. The significance of Spark R in attracting analysts to the Spark framework and its potential impact on the market is highlighted as a major advancement.', 'The concept of lazy evaluation in Spark is explained, emphasizing its role in delaying execution until an action is triggered to avoid unnecessary memory burden. 
The concept of lazy evaluation in Spark is emphasized for delaying execution until an action is triggered, thus avoiding unnecessary memory burden and displaying data only when needed.']}], 'duration': 668.539, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4186953.jpg', 'highlights': ['Spark can run 100x faster, contributing to the shift towards machine learning algorithms on big data.', 'Support for multiple languages like Scala, Python, Java, and R is an important aspect in the discussion of machine learning algorithms on big data.', 'Lazy evaluation is an important concept to understand in the context of machine learning algorithms on big data.', "Spark MLlib's efficiency is highlighted, with algorithms that previously took minutes in Hadoop now only taking seconds, showcasing significant improvement in processing speed.", "Spark Streaming's capability to split data into small batches and process it in memory for real-time processing is emphasized, showcasing its ability to handle data streams effectively.", 'The introduction of DataFrames in Spark SQL for tabular data representation, along with the ability to write SQL queries that are internally converted for in-memory computation, is explained.', 'The significance of Spark R in attracting analysts to the Spark framework and its potential impact on the market is highlighted as a major advancement.', 'The concept of lazy evaluation in Spark is emphasized for delaying execution until an action is triggered, thus avoiding unnecessary memory burden and displaying data only when needed.']}, {'end': 5414.619, 'segs': [{'end': 4881.337, 'src': 'embed', 'start': 4855.492, 'weight': 0, 'content': [{'end': 4859.874, 'text': 'it is converting your things to the RDD way and helping you to process the data.', 'start': 4855.492, 'duration': 4.382}, {'end': 4863.243, 'text': 'Okay, that is the role of Spark Core.', 'start': 4860.441, 'duration': 2.802}, 
{'end': 4868.027, 'text': 'Now, similarly, when we talk about Spark Streaming, as I was talking about,', 'start': 4863.663, 'duration': 4.364}, {'end': 4872.47, 'text': 'you can get the real-time data, and now, where can the data be pulled from?', 'start': 4868.027, 'duration': 4.443}, {'end': 4874.211, 'text': 'it can be from multiple sources.', 'start': 4872.47, 'duration': 1.741}, {'end': 4876.853, 'text': 'you can use Kafka, you can use HBase.', 'start': 4874.211, 'duration': 2.642}, {'end': 4880.056, 'text': 'it can be pulled from Parquet format, any sort of data,', 'start': 4876.853, 'duration': 3.203}, {'end': 4881.337, 'text': 'in real time.', 'start': 4880.416, 'duration': 0.921}], 'summary': 'Spark Core converts data to RDDs for real-time data processing from various sources.', 'duration': 25.845, 'max_score': 4855.492, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4855492.jpg'}, {'end': 5054.44, 'src': 'embed', 'start': 5021.675, 'weight': 1, 'content': [{'end': 5025.236, 'text': 'So all those things are possible in your Spark SQL.', 'start': 5021.675, 'duration': 3.561}, {'end': 5029.758, 'text': 'The performance, if I compare it with your Hive, is very high.', 'start': 5025.476, 'duration': 4.282}, {'end': 5031.587, 'text': 'in the Hadoop system.', 'start': 5030.345, 'duration': 1.242}, {'end': 5035.433, 'text': 'the red mark, whichever is there, is your Hadoop system.', 'start': 5031.587, 'duration': 3.846}, {'end': 5042.323, 'text': 'you can easily see that we are taking so much less time in comparison to your Hadoop systems.', 'start': 5035.433, 'duration': 6.89}, {'end': 5046.717, 'text': 'So that is the major advantage when using this Spark SQL.', 'start': 5042.776, 'duration': 3.941}, {'end': 5054.44, 'text': 'Now it uses the JDBC driver, which is a Java driver, or the ODBC driver, which is the Open Database Connectivity driver, for your connection, for creating 
connection.', 'start': 5047.037, 'duration': 7.403}], 'summary': 'Spark SQL offers significantly higher performance than Hive in the Hadoop system, with much less time taken for processing.', 'duration': 32.765, 'max_score': 5021.675, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo5021675.jpg'}, {'end': 5135.222, 'src': 'embed', 'start': 5110.686, 'weight': 2, 'content': [{'end': 5118.892, 'text': 'convert it to a Spark way, doing the computation, your Spark SQL services will be running and in the end you will be doing the output.', 'start': 5110.686, 'duration': 8.206}, {'end': 5123.155, 'text': 'So this is a high level picture of how Spark SQL works.', 'start': 5119.192, 'duration': 3.963}, {'end': 5126.057, 'text': "Now let's talk about MLlib,", 'start': 5123.675, 'duration': 2.382}, {'end': 5128.637, 'text': 'which is the machine learning library.', 'start': 5126.835, 'duration': 1.802}, {'end': 5130.798, 'text': 'there are two kinds of algorithms.', 'start': 5128.637, 'duration': 2.161}, {'end': 5132.84, 'text': 'one is supervised algorithms.', 'start': 5130.798, 'duration': 2.042}, {'end': 5135.222, 'text': 'the second is unsupervised algorithms.', 'start': 5132.84, 'duration': 2.382}], 'summary': 'Spark SQL uses computation to output data. 
MLlib includes supervised and unsupervised algorithms.', 'duration': 24.536, 'max_score': 5110.686, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo5110686.jpg'}, {'end': 5402.282, 'src': 'embed', 'start': 5377.316, 'weight': 3, 'content': [{'end': 5384.646, 'text': 'Now that computation, everything that is happening to compute all the graphs and check which will take less time,', 'start': 5377.316, 'duration': 7.33}, {'end': 5387.991, 'text': 'right, the computation of all that is done with the help of GraphX.', 'start': 5384.646, 'duration': 3.345}, {'end': 5391.415, 'text': 'Similarly, there are a lot of examples for fraud detection.', 'start': 5388.431, 'duration': 2.984}, {'end': 5393.218, 'text': 'also, banks are using GraphX.', 'start': 5391.415, 'duration': 1.803}, {'end': 5396.459, 'text': 'you can also see this on Twitter or LinkedIn.', 'start': 5393.778, 'duration': 2.681}, {'end': 5398.5, 'text': 'they give you recommendations of friends.', 'start': 5396.459, 'duration': 2.041}, {'end': 5402.282, 'text': 'right, all those examples can be done with the help of graphs.', 'start': 5398.5, 'duration': 3.782}], 'summary': 'Computation for graphs is time-efficient and widely used in fraud detection and social media recommendations.', 'duration': 24.966, 'max_score': 5377.316, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo5377316.jpg'}], 'start': 4855.492, 'title': 'Spark ecosystem overview', 'summary': 'Provides an introduction to the Spark ecosystem, including Spark Core for data processing, Spark Streaming for real-time processing, Spark SQL for fast execution of SQL queries, MLlib for machine learning, and the GraphX library for graph problem-solving.', 'chapters': [{'end': 5414.619, 'start': 4855.492, 'title': 'Spark ecosystem overview', 'summary': 'Introduces the Spark ecosystem, covering Spark Core for data processing, Spark 
Streaming for real-time data processing from multiple sources like Kafka and HBase, Spark SQL for fast execution of SQL queries on structured and semi-structured data, and MLlib for handling supervised and unsupervised machine learning algorithms with examples of use cases. The GraphX library is also discussed for solving problems with graphs.', 'duration': 559.127, 'highlights': ['Spark Streaming allows real-time data processing from multiple sources like Kafka and HBase, enabling immediate cross-sync and data storage in various formats. Real-time data can be pulled from sources like Kafka, HBase, and various data formats, enabling immediate data processing and storage in formats like HDFS, databases, or UI dashboards.', 'Spark SQL enables fast execution of SQL queries on structured and semi-structured data, outperforming Hive in terms of performance. Spark SQL can handle structured and semi-structured data, perform SQL queries, and outperforms Hive in terms of performance in Hadoop systems.', 'MLlib can handle both supervised and unsupervised machine learning algorithms, such as classification, regression, and collaborative filtering, with examples of use cases in spam email detection, news grouping, and recommendation systems. MLlib supports supervised and unsupervised machine learning algorithms like classification, regression, and collaborative filtering, with examples in spam email detection, news grouping, and recommendation systems.', 'The GraphX library provides powerful tools for solving problems with graphs, such as in Google Maps for optimal route computation and fraud detection in banks, and in social media platforms for friend recommendations. 
The GraphX library is used for solving problems with graphs, such as optimal route computation in Google Maps and fraud detection in banks, and for friend recommendations in social media platforms.']}], 'duration': 559.127, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo4855492.jpg', 'highlights': ['Spark Streaming enables real-time data processing from multiple sources like Kafka and HBase, allowing immediate cross-sync and storage in various formats.', 'Spark SQL performs fast execution of SQL queries on structured and semi-structured data, outperforming Hive in terms of performance.', 'MLlib supports supervised and unsupervised machine learning algorithms like classification, regression, and collaborative filtering, with examples in spam email detection, news grouping, and recommendation systems.', 'The GraphX library provides powerful tools for solving problems with graphs, such as optimal route computation in Google Maps and fraud detection in banks.']}, {'end': 6955.19, 'segs': [{'end': 5439.821, 'src': 'embed', 'start': 5415.1, 'weight': 5, 'content': [{'end': 5423.731, 'text': 'Now, before I move to the project, I want to show you a practical part: how we will be executing Spark things.', 'start': 5415.1, 'duration': 8.631}, {'end': 5428.759, 'text': 'So let me take you to the VM which will be provided by Edureka.', 'start': 5424.178, 'duration': 4.581}, {'end': 5431.479, 'text': 'So all these machines are also provided by Edureka.', 'start': 5429.119, 'duration': 2.36}, {'end': 5438.04, 'text': 'So you need not worry about where I will be getting the software from, what I would be doing with my control there.', 'start': 5431.719, 'duration': 6.321}, {'end': 5439.821, 'text': 'everything is taken care of by Edureka.', 'start': 5438.04, 'duration': 1.781}], 'summary': 'Edureka provides VMs and software for executing Spark projects.', 'duration': 24.721, 'max_score': 5415.1, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo5415100.jpg'}, {'end': 6284.354, 'src': 'embed', 'start': 6254.09, 'weight': 0, 'content': [{'end': 6261.056, 'text': 'so they collected the data, immediately processed it, and as soon as they detected that earthquake, they immediately reported it.', 'start': 6254.09, 'duration': 6.966}, {'end': 6263.038, 'text': 'in fact, this happened in 2011.', 'start': 6261.056, 'duration': 1.982}, {'end': 6271.646, 'text': 'now they have been using it very frequently, because Japan is one of the areas which is very frequently affected by all this.', 'start': 6263.038, 'duration': 8.608}, {'end': 6276.09, 'text': 'So, as I said, the main thing is we should be able to process the data in real time.', 'start': 6272.026, 'duration': 4.064}, {'end': 6276.831, 'text': "that's the major thing.", 'start': 6276.09, 'duration': 0.741}, {'end': 6284.354, 'text': 'you should be able to handle the data from multiple sources, because data may be coming from multiple sources, maybe from different sources.', 'start': 6277.27, 'duration': 7.084}], 'summary': 'In 2011, data was processed in real time to detect earthquakes in Japan, a frequently affected area.', 'duration': 30.264, 'max_score': 6254.09, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo6254090.jpg'}, {'end': 6554.075, 'src': 'embed', 'start': 6527.765, 'weight': 2, 'content': [{'end': 6537.845, 'text': "Now, what we are going to do: let's come out of this, go to your main project folder, and from here you will be typing sbt package.", 'start': 6527.765, 'duration': 10.08}, {'end': 6540.227, 'text': 'It will start downloading.', 'start': 6538.386, 'duration': 1.841}, {'end': 6543.369, 'text': 'with respect to your SBT, it will check your program.', 'start': 6540.227, 'duration': 3.142}, {'end': 6551.593, 'text': 'whatever dependency you require for Spark Core, Spark Streaming, 
Spark MLlib, it will download and install it.', 'start': 6543.369, 'duration': 8.224}, {'end': 6554.075, 'text': 'It will just download and install it.', 'start': 6552.074, 'duration': 2.001}], 'summary': 'Using sbt to download and install Spark dependencies in the main project folder.', 'duration': 26.31, 'max_score': 6527.765, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo6527765.jpg'}, {'end': 6807.678, 'src': 'embed', 'start': 6782.603, 'weight': 4, 'content': [{'end': 6790.927, 'text': 'for example, I can create a model, just like how Walmart does it, how Walmart may be creating one for whatever sales are happening.', 'start': 6782.603, 'duration': 8.324}, {'end': 6791.868, 'text': 'with respect to that,', 'start': 6790.927, 'duration': 0.941}, {'end': 6798.771, 'text': "they are using Apache Spark and at the end they are making you visualize the output of whatever analytics they're doing.", 'start': 6791.868, 'duration': 6.903}, {'end': 6800.472, 'text': 'So that is all done in Spark.', 'start': 6799.191, 'duration': 1.281}, {'end': 6807.678, 'text': 'so all those things we will be walking through when we do the course sessions; you will learn all these things there and feel that all these projects are easy.', 'start': 6800.472, 'duration': 7.206}], 'summary': 'Using Apache Spark, we create Walmart-like sales models and visualize analytics output.', 'duration': 25.075, 'max_score': 6782.603, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo6782603.jpg'}], 'start': 5415.1, 'title': 'Spark in earthquake prediction', 'summary': 'Covers practical execution of Spark on the VM provided by Edureka, including using Spark and PySpark, working with RDDs, and demonstrating the use of Spark in earthquake detection. 
it also discusses how japan uses apache spark for real-time earthquake prediction, citing the 2011 earthquake incident and the subsequent implementation of real-time alerts, saving millions of lives by sending alerts in 60 seconds. additionally, the chapter explains setting up a spark project using sbt, creating a folder structure, building and running the project, and visualizing output, emphasizing the importance of apache spark and its applications, such as earthquake alert systems and analytics for companies like walmart.', 'chapters': [{'end': 6065.209, 'start': 5415.1, 'title': 'Executing spark practical and use case', 'summary': 'Covers practical execution of spark on vm provided by edureka, including using spark and pyspark, working with rdds, and demonstrating the use of spark in earthquake detection.', 'duration': 650.109, 'highlights': ['The chapter covers practical execution of Spark on VM provided by Edureka, including using Spark and PySpark, working with RDDs, and demonstrating the use of Spark in earthquake detection.', 'The speaker demonstrates how to work with Spark on a VM provided by Edureka, ensuring the availability of necessary software and control.', "The speaker explains how to work with Scala to execute Spark programs, demonstrating the process of opening the terminal and using the 'Spark-shell' command.", "The speaker demonstrates the process of creating and reading RDDs in Scala, highlighting the use of 'sc.textFile' to create an RDD and 'number.collect' to read the data.", "The speaker explains the use of 'val' and 'var' in Scala and their differences, emphasizing that 'val' is a constant and 'var' is a variable.", "The speaker demonstrates how to load a file into memory from the local disk using the 'file' keyword in Scala, showcasing the difference between loading from HDFS and the local disk.", 'The speaker explains the concept of lazy evaluation in Spark, demonstrating its impact on error handling and the need for connecting the 
correct path before execution.', 'The speaker briefly introduces the use case of earthquake detection with Apache Spark, providing a glimpse of the problem-solving capabilities of Spark.']}, {'end': 6382.154, 'start': 6065.209, 'title': 'Real-time earthquake prediction with apache spark', 'summary': 'Discusses how japan uses apache spark for real-time earthquake prediction, citing the 2011 earthquake incident and the subsequent implementation of real-time alerts, saving millions of lives by sending alerts in 60 seconds.', 'duration': 316.945, 'highlights': ['Japan uses Apache Spark for real-time earthquake prediction, sending alerts in 60 seconds, saving millions of lives. Japan utilized Apache Spark for real-time earthquake prediction, successfully sending alerts within 60 seconds, ultimately saving millions of lives during the 2011 earthquake incident.', 'The importance of processing data in real time and handling data from multiple sources for accurate earthquake prediction. Emphasizes the significance of processing data in real time and managing data from various sources to ensure accurate earthquake prediction and early warnings.', 'Differentiating between primary and secondary waves in earthquakes and the potential damage caused by secondary waves. 
Explains the distinction between primary and secondary waves in earthquakes, highlighting the severe impact of secondary waves that can lead to maximum damage.']}, {'end': 6955.19, 'start': 6382.154, 'title': 'Setting up a spark project', 'summary': 'Explains setting up a spark project using sbt, creating a folder structure, building and running the project, and visualizing output, emphasizing the importance of apache spark and its applications, such as earthquake alert systems and analytics for companies like walmart.', 'duration': 573.036, 'highlights': ['Setting up project dependencies using SBT Explains using SBT to manage project dependencies, specifying dependencies like Spark core and Spark streaming with versions, e.g., using Spark 1.5.2, and the ease of writing the build.sbt file compared to Maven.', "Creating a folder structure and writing program in Scala Describes creating a folder structure with SRC/main/scala and writing a program in Scala, showcasing an example of streaming data to a Scala network and building the project using 'SBT package'.", 'Visualizing output and applications of Apache Spark Highlights the visualization of output, such as ROC, in the context of an earthquake alert system for Japan, and mentions the broader applications of Apache Spark in analytics for companies like Walmart.']}], 'duration': 1540.09, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9mELEARcxJo/pics/9mELEARcxJo5415100.jpg', 'highlights': ['Japan utilized Apache Spark for real-time earthquake prediction, successfully sending alerts within 60 seconds, ultimately saving millions of lives during the 2011 earthquake incident.', 'Emphasizes the significance of processing data in real time and managing data from various sources to ensure accurate earthquake prediction and early warnings.', 'Explains using SBT to manage project dependencies, specifying dependencies like Spark core and Spark streaming with versions, e.g., using Spark 1.5.2, and 
the ease of writing the build.sbt file compared to Maven.', "Describes creating a folder structure with src/main/scala and writing a program in Scala, showcasing an example of streaming data to a Scala network and building the project using 'sbt package'.", 'Highlights the visualization of output, such as ROC, in the context of an earthquake alert system for Japan, and mentions the broader applications of Apache Spark in analytics for companies like Walmart.', 'Covers practical execution of Spark on the VM provided by Edureka, including using Spark and PySpark, working with RDDs, and demonstrating the use of Spark in earthquake detection.']}], 'highlights': ['Japan utilized Apache Spark for real-time earthquake prediction, successfully sending alerts within 60 seconds, ultimately saving millions of lives during the 2011 earthquake incident.', 'Spark provides up to 100x faster processing speed than MapReduce, making it very powerful.', 'Apache Spark is renowned for its ability to handle real-time processing, batch processing, and data parallelism, making it a reliable and cost-effective tool for a variety of processing tasks.', 'The phases of a MapReduce program are map, shuffle and sort, and reduce. 
', 'Spark can run up to 100x faster than MapReduce, contributing to the shift towards machine learning algorithms on big data.', 'Spark Streaming enables real-time data processing from multiple sources like Kafka and HBase, allowing immediate cross-sync and storage in various formats.', 'The comprehensive definition of big data is provided, encompassing properties such as volume, variety, velocity, and veracity.', 'The projected increase in demand for big data professionals by 2020 and the expected shortage of skilled individuals are emphasized, indicating the need for companies to adapt to the big data domain.', "Apache Spark's significance and features The chapter explores the reasons behind the buzz around Apache Spark, emphasizing its status as the next big thing in the world of technology and delving into the features that contribute to its prominence.", "The exponential growth of data on platforms like Facebook is highlighted, with the CEO mentioning that the number of users is equivalent to the world's population 100 years ago."]}