title
What Is Apache Spark? | Apache Spark Tutorial | Apache Spark For Beginners | Simplilearn

description
πŸ”₯Post Graduate Program In Data Engineering: https://www.simplilearn.com/pgp-data-engineering-certification-training-course?utm_campaign=Hadoop-znBa13Earms&utm_medium=DescriptionFirstFold&utm_source=youtube πŸ”₯Big Data Engineer Masters Program (Discount Code - YTBE15): https://www.simplilearn.com/big-data-engineer-masters-program?utm_campaign=Hadoop-znBa13Earms&utm_medium=DescriptionFirstFold&utm_source=youtube This video on What Is Apache Spark? covers all the basics of Apache Spark that a beginner needs to know. In this introduction to Apache Spark video, we will discuss what is Apache Spark, the history of Spark, Hadoop vs Spark, Spark features, components of Apache Spark, Spark core, Spark SQL, Spark streaming, applications of Spark, etc. Below topics are explained in this Apache Spark Tutorial: 00.00 Introduction 00:41 History of Spark 01:22 What is Spark? 02:26 Hadoop vs Spark 05:29 Spark Features 08:27 Components of Apache Spark 10:24 Spark Core 11:28 Resilient Distributed Dataset 18:08 Spark SQL 21:28 Spark Streaming 24:57 Spark MLlib 25:54 GraphX 27:20 Spark architecture 32:16 Spark Cluster Managers 33:59 Applications of Spark 36:01 Spark use case 38:02 Conclusion To learn more about Spark, subscribe to our YouTube channel: https://www.youtube.com/user/Simplilearn?sub_confirmation=1 To access the slides, click here: https://www.slideshare.net/Simplilearn/what-is-apache-spark-introduction-to-apache-spark-apache-spark-tutorial-simplilearn/Simplilearn/what-is-apache-spark-introduction-to-apache-spark-apache-spark-tutorial-simplilearn Watch more videos on Spark Training: https://www.youtube.com/playlist?list=PLEiEAq2VkUUK3tuBXyd01meHuDj7RLjHv #WhatIsApacheSpark #ApacheSpark #ApacheSparkTutorial #SparkTutorialForBeginners #SimplilearnApacheSpark #SparkTutorial #Simplilearn Introduction to Apache Spark: Apache Spark Is an open-source cluster computing framework that was initially developed at UC Berkeley in the AMPLab. As compared to the disk-based, two-stage MapReduce of Hadoop, Spark provides up to 100 times faster performance for a few applications with in-memory primitives. This makes it suitable for machine learning algorithms, as it allows programs to load data into the memory of a cluster and query the data constantly. A Spark project contains various components such as Spark Core and Resilient Distributed Datasets or RDDs, Spark SQL, Spark Streaming, Machine Learning Library or Mllib, and GraphX. πŸ”₯ Enroll for FREE Big Data Hadoop Spark Course & Get your Completion Certificate: https://www.simplilearn.com/learn-hadoop-spark-basics-skillup?utm_campaign=Hadoop&utm_medium=Description&utm_source=youtube ➑️ About Post Graduate Program In Data Engineering This Data Engineering course is ideal for professionals, covering critical topics like the Hadoop framework, Data Processing using Spark, Data Pipelines with Kafka, Big Data on AWS, and Azure cloud infrastructures. This program is delivered via live sessions, industry projects, IBM hackathons, and Ask Me Anything sessions. βœ… Key Features Post Graduate Program Certificate and Alumni Association membership - Exclusive Master Classes and Ask me Anything sessions by IBM - 8X higher live interaction in live Data Engineering online classes by industry experts - Capstone from 3 domains and 14+ Projects with Industry datasets from YouTube, Glassdoor, Facebook etc. - Simplilearn's JobAssist helps you get noticed by top hiring companies βœ… Skills Covered - Real-Time Data Processing - Data Pipelining - Big Data Analytics - Data Visualization - Provisioning data storage services - Apache Hadoop - Ingesting Streaming and Batch Data - Transforming Data - Implementing Security Requirements - Data Protection - Encryption Techniques - Data Governance and Compliance Controls πŸ‘‰ Learn More At: https://www.simplilearn.com/pgp-data-engineering-certification-training-course?utm_campaign=Hadoop-znBa13Earms&utm_medium=Description&utm_source=youtube πŸ”₯πŸ”₯ Interested in Attending Live Classes? Call Us: IN - 18002127688 / US - +18445327688

detail
{'title': 'What Is Apache Spark? | Apache Spark Tutorial | Apache Spark For Beginners | Simplilearn', 'heatmap': [{'end': 693.089, 'start': 666.218, 'weight': 0.706}, {'end': 1679.861, 'start': 1623.933, 'weight': 1}, {'end': 1731.95, 'start': 1700.459, 'weight': 0.795}], 'summary': "Introduces apache spark, highlighting its history, components, and in-demand status. it compares spark and hadoop, emphasizes its 100 times faster data processing, and discusses its use cases like walmart and macy's. additionally, it covers spark's features, rdds, fault tolerance, spark sql, machine learning algorithms, streaming applications, real-time customer analysis, spark streaming, mllib, graphx, and its architecture, with examples from various industries.", 'chapters': [{'end': 332.046, 'segs': [{'end': 112.594, 'src': 'embed', 'start': 57.089, 'weight': 0, 'content': [{'end': 70.003, 'text': 'In 2013, Spark became an Apache top-level project and in 2014, used by Databricks to sort large-scale datasets and it set a new world record.', 'start': 57.089, 'duration': 12.914}, {'end': 72.186, 'text': "So that's how Apache Spark started.", 'start': 70.124, 'duration': 2.062}, {'end': 79.914, 'text': 'And today it is one of the most in-demand processing framework, or I would say in-memory computing framework,', 'start': 72.746, 'duration': 7.168}, {'end': 82.437, 'text': 'which is used across the big data industry.', 'start': 79.914, 'duration': 2.523}, {'end': 85.34, 'text': "So what is Apache Spark? Let's learn about this.", 'start': 82.777, 'duration': 2.563}, {'end': 92.948, 'text': 'Apache Spark is a open source in-memory computing framework or you could say data processing engine.', 'start': 85.9, 'duration': 7.048}, {'end': 100.91, 'text': 'which is used to process data in batch and also in real time across various cluster computers,', 'start': 93.228, 'duration': 7.682}, {'end': 104.931, 'text': 'and it has a very simple programming language behind the scenes.', 'start': 100.91, 'duration': 4.021}, {'end': 112.594, 'text': 'that is Scala, which is used, although if users would want to work on Spark, they can work with Python, they can work with Scala,', 'start': 104.931, 'duration': 7.663}], 'summary': 'Apache spark, started in 2013, is a popular in-memory computing framework used for processing big data in batch and real-time, with support for scala and python.', 'duration': 55.505, 'max_score': 57.089, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms57089.jpg'}, {'end': 189.942, 'src': 'embed', 'start': 166.894, 'weight': 1, 'content': [{'end': 176.717, 'text': 'If you talk about Spark, Spark can process the same data 100 times faster than MapReduce as it is a in-memory computing framework.', 'start': 166.894, 'duration': 9.823}, {'end': 179.518, 'text': 'Well, there can always be conflicting ideas,', 'start': 176.957, 'duration': 2.561}, {'end': 188.081, 'text': 'saying what if my Spark application is not really efficiently coded and my MapReduce application has been very efficiently coded?', 'start': 179.518, 'duration': 8.563}, {'end': 189.942, 'text': "Well then it's a different case.", 'start': 188.221, 'duration': 1.721}], 'summary': 'Spark processes data 100x faster than mapreduce due to in-memory computing.', 'duration': 23.048, 'max_score': 166.894, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms166894.jpg'}, {'end': 317.374, 'src': 'embed', 'start': 284.868, 'weight': 3, 'content': [{'end': 286.289, 'text': 'as it is implemented in Scala.', 'start': 284.868, 'duration': 1.421}, {'end': 291.271, 'text': 'And Scala is a statically typed, dynamically inferred language.', 'start': 286.309, 'duration': 4.962}, {'end': 293.252, 'text': "It's very, very concise.", 'start': 291.732, 'duration': 1.52}, {'end': 299.035, 'text': 'And the benefit is it has features from both functional programming and object-oriented language.', 'start': 293.732, 'duration': 5.303}, {'end': 306.879, 'text': 'And in case of Scala, whatever code is written, that is converted into bytecodes, and then it runs in the JVM.', 'start': 299.455, 'duration': 7.424}, {'end': 310.408, 'text': 'Now, Hadoop supports Kerberos authentication.', 'start': 307.185, 'duration': 3.223}, {'end': 312.59, 'text': 'There are different kinds of authentication mechanisms.', 'start': 310.448, 'duration': 2.142}, {'end': 317.374, 'text': 'Kerberos is one of the well-known ones and it can really get difficult to manage.', 'start': 312.69, 'duration': 4.684}], 'summary': 'Scala is concise, supports functional and object-oriented programming, and runs on jvm. hadoop supports kerberos authentication, which can be difficult to manage.', 'duration': 32.506, 'max_score': 284.868, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms284868.jpg'}], 'start': 7.329, 'title': "Apache spark's in-demand status", 'summary': "Introduces apache spark, emphasizing its history, components, and its in-demand status as an in-memory computing framework. apache spark, incepted in 2009 at uc berkeley amp labs, open sourced in 2010, became an apache top-level project in 2013, and set a new world record in 2014. additionally, it compares apache spark and hadoop, highlighting spark's 100 times faster data processing, support for both batch and real-time processing, and its use cases like retail giants walmart and macy's.", 'chapters': [{'end': 139.27, 'start': 7.329, 'title': 'Apache spark: in-demand processing framework', 'summary': 'Introduces apache spark, highlighting its history, components, and its in-demand status as an in-memory computing framework, used across the big data industry, beginning with its inception in 2009 at uc berkeley amp labs, open sourcing in 2010, becoming an apache top-level project in 2013, and setting a new world record in 2014.', 'duration': 131.941, 'highlights': ['Apache Spark is one of the most in-demand processing frameworks in the big data world. Apache Spark is highlighted as one of the most in-demand processing frameworks in the big data world.', 'In 2014, Apache Spark was used by Databricks to sort large-scale datasets and it set a new world record. In 2014, Apache Spark, used by Databricks, set a new world record in sorting large-scale datasets.', 'Apache Spark is an open source in-memory computing framework or data processing engine, used to process data in batch and real time across various cluster computers. Apache Spark is defined as an open source in-memory computing framework, used for batch and real-time data processing across cluster computers.']}, {'end': 332.046, 'start': 139.27, 'title': 'Apache spark vs hadoop', 'summary': "Compares apache spark and hadoop, highlighting spark's 100 times faster data processing, support for both batch and real-time processing, and concise code implementation in scala, making it a preferred choice for real-time data processing in use cases like retail giants walmart and macy's.", 'duration': 192.776, 'highlights': ['Spark can process the same data 100 times faster than MapReduce, making it a preferred choice for efficient data processing.', 'Spark can perform both batch and real-time processing, catering to various use cases like real-time or near real-time processing, which is crucial for industries like retail.', "Spark's implementation in Scala results in concise code with fewer lines, offering the benefits of both functional programming and object-oriented language, making it a preferred choice for efficient coding and execution."]}], 'duration': 324.717, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms7329.jpg', 'highlights': ['Apache Spark is an open source in-memory computing framework, used to process data in batch and real time across various cluster computers.', 'Spark can process the same data 100 times faster than MapReduce, making it a preferred choice for efficient data processing.', 'In 2014, Apache Spark, used by Databricks, set a new world record in sorting large-scale datasets.', "Spark's implementation in Scala results in concise code with fewer lines, offering the benefits of both functional programming and object-oriented language, making it a preferred choice for efficient coding and execution.", 'Apache Spark is highlighted as one of the most in-demand processing frameworks in the big data world.']}, {'end': 737.879, 'segs': [{'end': 382.063, 'src': 'embed', 'start': 332.506, 'weight': 1, 'content': [{'end': 337.548, 'text': 'when we talk about spark features, one of the key features is fast processing.', 'start': 332.506, 'duration': 5.042}, {'end': 341.59, 'text': 'so spark contains resilient, distributed data sets.', 'start': 337.548, 'duration': 4.042}, {'end': 347.232, 'text': 'so rdds are the building blocks for spark, and we learn more about rdds later.', 'start': 341.59, 'duration': 5.642}, {'end': 354.535, 'text': 'so spark contains rdds, which saves huge time taken in reading and writing operations.', 'start': 347.232, 'duration': 7.303}, {'end': 359.317, 'text': 'so it can be 100 times, or you can say 10 to 100 times faster than hadoop.', 'start': 354.535, 'duration': 4.782}, {'end': 367.559, 'text': 'when we say in-memory computing, here i would like to make a note that there is a difference between caching and in-memory computing.', 'start': 359.617, 'duration': 7.942}, {'end': 368.439, 'text': 'think about it.', 'start': 367.559, 'duration': 0.88}, {'end': 377.402, 'text': 'caching is mainly to support read ahead mechanism where you have your data pre-loaded so that it can benefit further queries.', 'start': 368.439, 'duration': 8.963}, {'end': 382.063, 'text': 'however, when we say in-memory computing, we are talking about lazy valuation.', 'start': 377.402, 'duration': 4.661}], 'summary': 'Spark features include resilient distributed datasets (rdds), saving significant time in operations, and achieving 10 to 100 times faster processing than hadoop.', 'duration': 49.557, 'max_score': 332.506, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms332506.jpg'}, {'end': 471.978, 'src': 'embed', 'start': 445.009, 'weight': 2, 'content': [{'end': 452.613, 'text': 'So failure of one worker node in the cluster will really not affect the RDDs because that portion can be recomputed.', 'start': 445.009, 'duration': 7.604}, {'end': 461.015, 'text': 'so it ensures loss of data, it ensures that there is no data loss and it is absolutely fault tolerant.', 'start': 452.993, 'duration': 8.022}, {'end': 463.016, 'text': 'it is for better analytics.', 'start': 461.015, 'duration': 2.001}, {'end': 471.978, 'text': 'so spark has rich set of sql queries, machine learning algorithms, complex analytics, all of this supported by various spark components,', 'start': 463.016, 'duration': 8.962}], 'summary': 'Spark is fault-tolerant, ensuring no data loss, supporting rich sql queries, ml algorithms, and complex analytics.', 'duration': 26.969, 'max_score': 445.009, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms445009.jpg'}, {'end': 539.741, 'src': 'embed', 'start': 511.612, 'weight': 3, 'content': [{'end': 521.957, 'text': 'And Spark SQL internally has components or features like data frames and data sets, which can be used to process your structured data in a much,', 'start': 511.612, 'duration': 10.345}, {'end': 523.337, 'text': 'much faster way.', 'start': 521.957, 'duration': 1.38}, {'end': 524.977, 'text': 'you have spark streaming.', 'start': 523.337, 'duration': 1.64}, {'end': 530.859, 'text': "now that's again an important component of spark, which allows you to create your spark streaming applications,", 'start': 524.977, 'duration': 5.882}, {'end': 539.741, 'text': 'which not only works on data which is being streamed in or data which is constantly getting generated, but you would also, or you could also,', 'start': 530.859, 'duration': 8.882}], 'summary': 'Spark sql includes data frames and data sets for faster processing, along with spark streaming for real-time data applications.', 'duration': 28.129, 'max_score': 511.612, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms511612.jpg'}, {'end': 693.089, 'src': 'heatmap', 'start': 666.218, 'weight': 0.706, 'content': [{'end': 671.102, 'text': "now that storage could be your sdfs, that is, hadoop's distributed file system.", 'start': 666.218, 'duration': 4.884}, {'end': 679.048, 'text': 'it could be a database like no sql database, such as hbase, or it could be any other database, say rdbms,', 'start': 671.102, 'duration': 7.946}, {'end': 686.81, 'text': 'from where you could connect your Spark and then fetch the data, extract the data, process it, analyze it.', 'start': 679.048, 'duration': 7.762}, {'end': 693.089, 'text': "So let's learn a little bit about your RDDs, Resilient Distributed Datasets.", 'start': 687.344, 'duration': 5.745}], 'summary': 'Learn about connecting spark to various databases and processing data using rdds.', 'duration': 26.871, 'max_score': 666.218, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms666218.jpg'}, {'end': 720.748, 'src': 'embed', 'start': 693.589, 'weight': 0, 'content': [{'end': 701.055, 'text': 'Now Spark Core, which is the base engine or the core engine, is embedded with the building blocks of Spark,', 'start': 693.589, 'duration': 7.466}, {'end': 704.658, 'text': 'which is nothing but your resilient distributed data set.', 'start': 701.055, 'duration': 3.603}, {'end': 706.639, 'text': 'So as the name says, it is resilient.', 'start': 704.698, 'duration': 1.941}, {'end': 711.663, 'text': 'So it is existing for a shorter period of time distributed.', 'start': 706.879, 'duration': 4.784}, {'end': 713.725, 'text': 'So it is distributed across nodes.', 'start': 711.964, 'duration': 1.761}, {'end': 720.748, 'text': 'And it is a data set where the data will be loaded or where the data will be existing for processing.', 'start': 714.284, 'duration': 6.464}], 'summary': 'Spark core is embedded with resilient distributed data set, distributed across nodes for data processing.', 'duration': 27.159, 'max_score': 693.589, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms693589.jpg'}], 'start': 332.506, 'title': 'Spark features and apache spark overview', 'summary': "Discusses spark's fast processing capabilities, with rdds that can be 10 to 100 times faster than hadoop, and covers apache spark's fault tolerance, multiple language support, spark sql, machine learning algorithms, streaming applications, and spark core as the base engine for parallel and distributed data processing.", 'chapters': [{'end': 404.77, 'start': 332.506, 'title': 'Spark features: fast processing and in-memory computing', 'summary': 'Discusses the key features of spark, emphasizing its fast processing capabilities with resilient distributed datasets (rdds) that can be 10 to 100 times faster than hadoop, and the distinction between caching and in-memory computing, where data is stored in ram and loaded only when a specific action is invoked.', 'duration': 72.264, 'highlights': ['Spark contains Resilient Distributed Datasets (RDDs) which can be 10 to 100 times faster than Hadoop, serving as the building blocks for Spark.', 'In-memory computing involves data being loaded into memory only when a specific action is invoked, utilizing RAM not only for processing but also for storage.', 'Caching is primarily for supporting read-ahead mechanism, while in-memory computing involves lazy evaluation and data stored in RAM.']}, {'end': 737.879, 'start': 404.77, 'title': 'Apache spark overview', 'summary': "Covers the key features of apache spark, including fault tolerance, multiple language support, rich set of sql queries, machine learning algorithms, spark components, and the base engine spark core. it emphasizes that spark ensures fault tolerance, supports structured data processing through spark sql, allows streaming applications, machine learning algorithm building through mllib, and graph-based processing through graphics. spark core forms the base engine for large-scale parallel and distributed data processing with rdds as its building blocks, and relies on external storage systems such as hadoop's distributed file system or databases like hbase or rdbms.", 'duration': 333.109, 'highlights': ['Fault Tolerance and RDDs Spark ensures fault tolerance through Resilient Distributed Datasets (RDDs) distributed across nodes, preventing data loss and allowing recomputation in case of worker node failure.', 'Spark Components and Functionality Spark offers a rich set of SQL queries, machine learning algorithms, complex analytics, and various components including Spark Core, Spark SQL, Spark Streaming, MLlib, and Graphics for structured data processing, streaming applications, machine learning algorithm building, and graph-based processing.', 'Spark Core and RDDs Management Spark Core is the base engine for large-scale parallel and distributed data processing, responsible for handling RDDs as resilient distributed collections of objects, with operations performed on them and relying on external storage systems for data access.']}], 'duration': 405.373, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms332506.jpg', 'highlights': ['Spark Core serves as the base engine for parallel and distributed data processing.', 'Resilient Distributed Datasets (RDDs) in Spark can be 10 to 100 times faster than Hadoop.', 'Spark ensures fault tolerance through distributed RDDs, preventing data loss and allowing recomputation in case of failure.', 'Spark offers a rich set of SQL queries, machine learning algorithms, and various components for structured data processing and machine learning.', 'In-memory computing involves lazy evaluation and data stored in RAM, utilizing RAM for processing and storage.']}, {'end': 1053.207, 'segs': [{'end': 790.144, 'src': 'embed', 'start': 738.179, 'weight': 0, 'content': [{'end': 745.162, 'text': 'Now here I could write a simple code in Scala and that would basically mean something like this.', 'start': 738.179, 'duration': 6.983}, {'end': 755.406, 'text': 'So if I say val, which is to declare a variable, I would say val x, and then I could use what we call a spark context,', 'start': 745.242, 'duration': 10.164}, {'end': 759.388, 'text': 'which is basically the most important entry point of your application.', 'start': 755.406, 'duration': 3.982}, {'end': 762.727, 'text': 'So then I could use a method of Spark context.', 'start': 759.724, 'duration': 3.003}, {'end': 765.109, 'text': 'For example, that is text file.', 'start': 763.207, 'duration': 1.902}, {'end': 768.391, 'text': 'And then I could point it to a particular file.', 'start': 765.529, 'duration': 2.862}, {'end': 775.998, 'text': 'So this is just a method of your Spark context and Spark context is the entry point of your application.', 'start': 768.611, 'duration': 7.387}, {'end': 779.621, 'text': 'Now here I could just give a path in this method.', 'start': 776.038, 'duration': 3.583}, {'end': 783.564, 'text': 'So what does this step do? It does not do any valuation.', 'start': 779.901, 'duration': 3.663}, {'end': 787.187, 'text': "So when I say val x, I'm creating an immutable variable.", 'start': 783.724, 'duration': 3.463}, {'end': 790.144, 'text': "and to that variable i'm assigning a file.", 'start': 787.683, 'duration': 2.461}], 'summary': 'Using scala, declaring variable x with spark context, reading text file, and pointing to a specific file.', 'duration': 51.965, 'max_score': 738.179, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms738179.jpg'}, {'end': 917.768, 'src': 'embed', 'start': 891.377, 'weight': 5, 'content': [{'end': 897.603, 'text': 'Now this second step is again creating an RDD, a resilient distributed data set.', 'start': 891.377, 'duration': 6.226}, {'end': 900.045, 'text': 'You can say second step in my DAG.', 'start': 897.883, 'duration': 2.162}, {'end': 907.724, 'text': 'Okay And here you have a external RDD, one more RDD created, which depends on the first RDD.', 'start': 900.461, 'duration': 7.263}, {'end': 914.507, 'text': 'So my first RDD becomes the base RDD or parent RDD and the resultant RDD becomes the child RDD.', 'start': 907.884, 'duration': 6.623}, {'end': 917.768, 'text': 'Then we can go further and we could say val Z.', 'start': 914.787, 'duration': 2.981}], 'summary': 'Creating rdds in a dag, with first rdd as base and resultant rdd as child.', 'duration': 26.391, 'max_score': 891.377, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms891377.jpg'}, {'end': 974.704, 'src': 'embed', 'start': 947.168, 'weight': 3, 'content': [{'end': 952.953, 'text': 'So this is my tag, which is a series of steps which will be executed.', 'start': 947.168, 'duration': 5.785}, {'end': 954.395, 'text': 'Now here.', 'start': 953.234, 'duration': 1.161}, {'end': 955.876, 'text': 'when does the execution happen??', 'start': 954.395, 'duration': 1.481}, {'end': 956.977, 'text': 'When the data get?', 'start': 955.936, 'duration': 1.041}, {'end': 960.4, 'text': 'when will the data get loaded into these RDDs?', 'start': 956.977, 'duration': 3.423}, {'end': 971.79, 'text': 'So all of this that is using a method, using a transformation like map, using a transformation like filter or flat map or anything else.', 'start': 960.86, 'duration': 10.93}, {'end': 974.704, 'text': 'these are your transformations.', 'start': 971.79, 'duration': 2.914}], 'summary': 'Explanation of steps and timing for rdd data transformations.', 'duration': 27.536, 'max_score': 947.168, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms947168.jpg'}, {'end': 1027.973, 'src': 'embed', 'start': 1001.19, 'weight': 4, 'content': [{'end': 1008.976, 'text': 'So those are actions which will actually trigger the execution of this DAG right from the beginning.', 'start': 1001.19, 'duration': 7.786}, {'end': 1020.324, 'text': "So if I here say z.count, where I would want to just count the number of words which I'm filtering, this is an action which is invoked.", 'start': 1009.276, 'duration': 11.048}, {'end': 1024.851, 'text': 'And this will trigger the execution of DAG right from the beginning.', 'start': 1020.928, 'duration': 3.923}, {'end': 1027.973, 'text': 'So this is what happens in a Spark.', 'start': 1025.29, 'duration': 2.683}], 'summary': 'Actions trigger dag execution in spark.', 'duration': 26.783, 'max_score': 1001.19, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1001190.jpg'}], 'start': 738.179, 'title': 'Using spark context in scala and understanding rdd in spark', 'summary': "Demonstrates using spark context in scala to declare a variable 'x' and utilize the text file method. it also explains the creation of rdd, its usage, and the distinction between transformations and actions, emphasizing delayed execution logic and the trigger of execution via actions.", 'chapters': [{'end': 790.144, 'start': 738.179, 'title': 'Using spark context in scala', 'summary': "Demonstrates using the spark context in scala to declare a variable 'x' and utilize the text file method to point to a specific file, creating an immutable variable without any valuation.", 'duration': 51.965, 'highlights': ["The chapter demonstrates using the Spark context in Scala to declare a variable 'x' and utilize the text file method to point to a specific file, creating an immutable variable without any valuation.", "The Spark context serves as the most important entry point of the application, and the text file method is used to point to a particular file, with 'val x' creating an immutable variable.", "The 'val' keyword is used to declare a variable in Scala, and the 'text file' method is utilized within the Spark context to specify a file."]}, {'end': 1053.207, 'start': 790.144, 'title': 'Understanding rdd in spark', 'summary': 'Explains the creation of rdd, its usage in spark, and the distinction between transformations and actions, emphasizing the delayed execution logic and the trigger of execution via actions.', 'duration': 263.063, 'highlights': ['RDD creation and its usage in Spark The chapter explains the process of creating RDDs in Spark and their usage in executing logical data sets across nodes, demonstrating the concept of base and child RDDs.', 'Concept of transformations and actions in RDD The chapter distinguishes between transformations (e.g., map, filter) and actions, highlighting that transformations only create execution logic without data evaluation, while actions trigger the execution of the Directed Acyclic Graph (DAG).', 'Triggering execution via actions It is emphasized that invoking actions, such as count and print, triggers the execution of the DAG, causing the data to be loaded and the operations to be evaluated, providing a clear understanding of RDD functionality.']}], 'duration': 315.028, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms738179.jpg', 'highlights': ["The chapter demonstrates using the Spark context in Scala to declare a variable 'x' and utilize the text file method to point to a specific file, creating an immutable variable without any valuation.", "The 'val' keyword is used to declare a variable in Scala, and the 'text file' method is utilized within the Spark context to specify a file.", "The Spark context serves as the most important entry point of the application, and the text file method is used to point to a particular file, with 'val x' creating an immutable variable.", 'Concept of transformations and actions in RDD The chapter distinguishes between transformations (e.g., map, filter) and actions, highlighting that transformations only create execution logic without data evaluation, while actions trigger the execution of the Directed Acyclic Graph (DAG).', 'Triggering execution via actions It is emphasized that invoking actions, such as count and print, triggers the execution of the DAG, causing the data to be loaded and the operations to be evaluated, providing a clear understanding of RDD functionality.', 'RDD creation and its usage in Spark The chapter explains the process of creating RDDs in Spark and their usage in executing logical data sets across nodes, demonstrating the concept of base and child RDDs.']}, {'end': 1303.018, 'segs': [{'end': 1134.125, 'src': 'embed', 'start': 1053.628, 'weight': 0, 'content': [{'end': 1056.49, 'text': 'So mainly in Spark, there are two kinds of operations.', 'start': 1053.628, 'duration': 2.862}, {'end': 1061.534, 'text': 'One is your transformations and one is your actions.', 'start': 1057.031, 'duration': 4.503}, {'end': 1070.382, 'text': 'Transformations or using a method of Spark context will always and always create an RTD or you could say a step in the tag.', 'start': 1061.955, 'duration': 8.427}, {'end': 1075.466, 'text': 'Actions are something which will invoke the executions.', 'start': 1070.822, 'duration': 4.644}, {'end': 1082.789, 'text': 'which will invoke the execution from the first RDD till the last RDD where you can get your result.', 'start': 1075.785, 'duration': 7.004}, {'end': 1085.591, 'text': 'So, this is how your RDDs work.', 'start': 1083.15, 'duration': 2.441}, {'end': 1091.375, 'text': "Now, when we talk about components of Spark, let's learn a little bit about Spark SQL.", 'start': 1086.072, 'duration': 5.303}, {'end': 1099.36, 'text': 'So, Spark SQL is a component type processing framework which is used for structured and semi-structured data processing.', 'start': 1091.655, 'duration': 7.705}, {'end': 1109.832, 'text': 'so usually people might have their structured data stored in rdbms or in files where data is structured with particular delimiters and has a pattern.', 'start': 1099.6, 'duration': 10.232}, {'end': 1120.321, 'text': 'and if one wants to process this structured data, if one wants to use spark to do in-memory processing and work on this structured data,', 'start': 1109.832, 'duration': 10.489}, {'end': 1122.342, 'text': 'they would prefer to use ParkSQL.', 'start': 1120.321, 'duration': 2.021}, {'end': 1127.623, 'text': 'So you can work on different data formats, say CSV, JSON.', 'start': 1122.602, 'duration': 5.021}, {'end': 1134.125, 'text': 'You can even work on smarter formats like Avro, Parquet, even your binary files or sequence files.', 'start': 1127.843, 'duration': 6.282}], 'summary': 'Spark has two kinds of operations: transformations and actions. spark sql is used for structured and semi-structured data processing, working with various data formats like csv, json, avro, parquet, and binary files.', 'duration': 80.497, 'max_score': 1053.628, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1053628.jpg'}, {'end': 1184.068, 'src': 'embed', 'start': 1157.052, 'weight': 1, 'content': [{'end': 1162.154, 'text': 'data frames, in short, you can visualize or imagine as rows and columns,', 'start': 1157.052, 'duration': 5.102}, {'end': 1168.136, 'text': 'or if your data can be represented in the form of rows and columns with some column headings.', 'start': 1162.154, 'duration': 5.982}, {'end': 1172.439, 'text': 'So DataFrame API allows you to create DataFrames.', 'start': 1168.516, 'duration': 3.923}, {'end': 1177.483, 'text': 'So, like my previous example, when you work on a file, when you want to process it,', 'start': 1173.02, 'duration': 4.463}, {'end': 1184.068, 'text': 'you would convert that into an RDD using a method of Spark context or by doing some transformations.', 'start': 1177.483, 'duration': 6.585}], 'summary': 'Dataframe api allows you to create dataframes for processing data efficiently.', 'duration': 27.016, 'max_score': 1157.052, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1157052.jpg'}, {'end': 1261.713, 'src': 'embed', 'start': 1205.824, 'weight': 5, 'content': [{'end': 1211.708, 'text': 'now, in case of data frames instead of sc, you would be using, say, spark dot, something.', 'start': 1205.824, 'duration': 5.884}, {'end': 1217.351, 'text': 'so spark context is available for your data frames api to be used.', 'start': 1211.708, 'duration': 5.643}, {'end': 1224.013, 'text': 'in older versions like spark 1.6 and so on, we were using hive context or sql context.', 'start': 1217.351, 'duration': 6.662}, {'end': 1231.576, 'text': 'so if you were working with spark 1.6, you would be saying val x equals sql context dot.', 'start': 1224.013, 'duration': 7.563}, {'end': 1234.297, 'text': 'here we would be using spark dot.', 'start': 1231.576, 'duration': 2.721}, {'end': 1240.341, 'text': 'So DataFrame API basically allows you to create data frames out of your structured data,', 'start': 1234.857, 'duration': 5.484}, {'end': 1244.723, 'text': 'which also lets Spark know that data is already in a particular structure.', 'start': 1240.341, 'duration': 4.382}, {'end': 1250.147, 'text': "It follows a format and based on that, your Spark's backend DAG scheduler.", 'start': 1244.984, 'duration': 5.163}, {'end': 1256.17, 'text': 'Right So when I say about that, I talk about your sequence of steps.', 'start': 1250.467, 'duration': 5.703}, {'end': 1261.713, 'text': 'So Spark is already aware of what are the different steps involved in your application.', 'start': 1256.65, 'duration': 5.063}], 'summary': 'Data frames api in spark allows creating structured data frames and enables backend dag scheduler.', 'duration': 55.889, 'max_score': 1205.824, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1205824.jpg'}, {'end': 1311.487, 'src': 'embed', 'start': 1282.984, 'weight': 7, 'content': [{'end': 1289.287, 'text': 'so to learn more about data frames, follow in the next sessions when you talk about spark streaming.', 'start': 1282.984, 'duration': 6.303}, {'end': 1294.169, 'text': 'now, this is very interesting for organizations who would want to work on streaming data.', 'start': 1289.287, 'duration': 4.882}, {'end': 1300.696, 'text': "Imagine a store like Macy's, where they would want to have machine learning algorithms.", 'start': 1294.629, 'duration': 6.067}, {'end': 1303.018, 'text': 'Now, what would these machine learning algorithms do?', 'start': 1300.976, 'duration': 2.042}, {'end': 1311.487, 'text': 'Suppose you have a lot of customers walking in the store and they are searching for a particular product or particular item.', 'start': 1303.298, 'duration': 8.189}], 'summary': "Learn about data frames in upcoming sessions for spark streaming, beneficial for organizations like macy's implementing machine learning algorithms for streaming data.", 'duration': 28.503, 'max_score': 1282.984, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1282984.jpg'}], 'start': 1053.628, 'title': 'Spark operations, sql, and dataframes', 'summary': 'Covers spark operations, transformations, actions, spark sql for structured and semi-structured data, supporting various formats. it also explores dataframes in spark, including methods to convert rdds, significance in structuring data, working with streaming data, and implementing machine learning algorithms.', 'chapters': [{'end': 1177.483, 'start': 1053.628, 'title': 'Spark operations and spark sql', 'summary': 'Explains the operations in spark, including transformations and actions, and introduces spark sql for processing structured and semi-structured data, supporting various formats like csv, json, avro, parquet, and binary files.', 'duration': 123.855, 'highlights': ['Spark SQL is a component type processing framework used for structured and semi-structured data processing. Spark SQL is a component type processing framework used for structured and semi-structured data processing, allowing for in-memory processing of structured data stored in RDBMS or files with specific delimiters and patterns.', 'Data frames in Spark SQL can be visualized as rows and columns, and the DataFrame API allows the creation of DataFrames. Data frames in Spark SQL can be visualized as rows and columns, and the DataFrame API facilitates the creation of DataFrames, enabling efficient processing of structured data.', 'Spark supports various data formats such as CSV, JSON, Avro, Parquet, and binary files, and allows data extraction from RDBMS using JDBC connection. Spark supports various data formats such as CSV, JSON, Avro, Parquet, and binary files, and provides the capability to extract data from RDBMS using JDBC connection for diverse data processing needs.', 'Actions in Spark invoke the execution from the first RDD till the last RDD to obtain the result. Actions in Spark invoke the execution from the first RDD till the last RDD to obtain the result, providing a clear understanding of the execution flow in Spark.', 'Transformations in Spark always create an RDD or a step in the DAG, using the method of Spark context. Transformations in Spark always create an RDD or a step in the DAG, using the method of Spark context, forming the fundamental building blocks for data processing in Spark.']}, {'end': 1303.018, 'start': 1177.483, 'title': 'Data frames in spark', 'summary': "Explains the usage of dataframes in spark, including the methods to convert rdds to dataframes, the use of spark context and sql context, and the significance of dataframe api in structuring data for spark's backend dag scheduler. it also touches upon the relevance of dataframes in working with streaming data and implementing machine learning algorithms for organizations like macy's.", 'duration': 125.535, 'highlights': ["DataFrames can be created from structured data, indicating the data format and allowing Spark to understand the data's structure. This feature of DataFrames enables Spark to efficiently handle structured data, optimizing the backend DAG scheduler for better performance and resource management.", "Usage of Spark context or SQL context for working with DataFrames, with examples from earlier versions like using Hive context or SQL context in Spark 1.6. Understanding the historical context of using different context types for DataFrames provides insights into the evolution of Spark's data processing capabilities.", "Relevance of DataFrames in working with streaming data, particularly in the context of organizations like Macy's implementing machine learning algorithms. Highlighting the practical implications of using DataFrames for real-time data processing and advanced analytics, demonstrating its significance in modern data-driven organizations."]}], 'duration': 249.39, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1053628.jpg', 'highlights': ['Spark SQL is a component type processing framework used for structured and semi-structured data processing, allowing for in-memory processing of structured data stored in RDBMS or files with specific delimiters and patterns.', 'Data frames in Spark SQL can be visualized as rows and columns, and the DataFrame API facilitates the creation of DataFrames, enabling efficient processing of structured data.', 'Spark supports various data formats such as CSV, JSON, Avro, Parquet, and binary files, and provides the capability to extract data from RDBMS using JDBC connection for diverse data processing needs.', 'Actions in Spark invoke the execution from the first RDD till the last RDD to obtain the result, providing a clear understanding of the execution flow in Spark.', 'Transformations in Spark always create an RDD or a step in the DAG, using the method of Spark context, forming the fundamental building blocks for data processing in Spark.', "DataFrames can be created from structured data, indicating the data format and allowing Spark to understand the data's structure. This feature of DataFrames enables Spark to efficiently handle structured data, optimizing the backend DAG scheduler for better performance and resource management.", "Usage of Spark context or SQL context for working with DataFrames, with examples from earlier versions like using Hive context or SQL context in Spark 1.6. Understanding the historical context of using different context types for DataFrames provides insights into the evolution of Spark's data processing capabilities.", "Relevance of DataFrames in working with streaming data, particularly in the context of organizations like Macy's implementing machine learning algorithms. Highlighting the practical implications of using DataFrames for real-time data processing and advanced analytics, demonstrating its significance in modern data-driven organizations."]}, {'end': 1607.063, 'segs': [{'end': 1356.46, 'src': 'embed', 'start': 1322.995, 'weight': 0, 'content': [{'end': 1330.26, 'text': 'now, once camera captures this information, this information can be streamed in to be processed by algorithms,', 'start': 1322.995, 'duration': 7.265}, {'end': 1336.925, 'text': 'and those algorithms will see which product or which series of product customers might be interested in.', 'start': 1330.26, 'duration': 6.665}, {'end': 1345.652, 'text': 'And if this algorithm in real time can process based on the number of customers, based on the available product in the store,', 'start': 1337.545, 'duration': 8.107}, {'end': 1356.46, 'text': 'it can come up with a attractive alternative price so that which the price can be displayed on the screen and probably customers would buy the product.', 'start': 1345.652, 'duration': 10.808}], 'summary': 'Camera captures data for real-time algorithmic pricing, boosting customer interest and sales.', 'duration': 33.465, 'max_score': 1322.995, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1322995.jpg'}, {'end': 1435.449, 'src': 'embed', 'start': 1388.896, 'weight': 1, 'content': [{'end': 1397.419, 'text': 'Based on the clicks, based on customer history, based on customer behavior, algorithms can come up with recommendation of products or, better,', 'start': 1388.896, 'duration': 8.523}, {'end': 1400.22, 'text': 'altered price so that the sale happens.', 'start': 1397.419, 'duration': 2.801}, {'end': 1410.028, 'text': 'Now, in this case, we would be seeing the essence of real time processing only in a fixed or in a particular duration of time.', 'start': 1400.78, 'duration': 9.248}, {'end': 1416.194, 'text': 'And this also means that you should have something which can process the data as it comes in.', 'start': 1410.269, 'duration': 5.925}, {'end': 1426.124, 'text': 'So Spark Streaming is a lightweight API that allows developers to perform batch processing and also real-time streaming and processing of data.', 'start': 1416.539, 'duration': 9.585}, {'end': 1430.367, 'text': 'So it provides secure, reliable, fast processing of live data streams.', 'start': 1426.304, 'duration': 4.063}, {'end': 1435.449, 'text': 'So what happens here in Spark Streaming in brief? So you have an input data stream.', 'start': 1430.907, 'duration': 4.542}], 'summary': 'Using spark streaming, algorithms recommend products based on customer behavior, enabling real-time processing of live data streams.', 'duration': 46.553, 'max_score': 1388.896, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1388896.jpg'}, {'end': 1540.277, 'src': 'embed', 'start': 1513.76, 'weight': 3, 'content': [{'end': 1521.811, 'text': "we are talking about machine learning algorithms which can be built using MLlib's libraries and then Spark can be used for processing.", 'start': 1513.76, 'duration': 8.051}, {'end': 1528.399, 'text': 'So MLlib eases the deployment and development of scalable machine learning algorithms.', 'start': 1521.971, 'duration': 6.428}, {'end': 1531.825, 'text': 'I mean think about your clustering techniques.', 'start': 1528.92, 'duration': 2.905}, {'end': 1540.277, 'text': 'so think about your classification, where you would want to classify the data, where you would want to do supervised or unsupervised learning.', 'start': 1531.825, 'duration': 8.452}], 'summary': 'Mllib simplifies building scalable machine learning algorithms for clustering and classification using spark.', 'duration': 26.517, 'max_score': 1513.76, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1513760.jpg'}, {'end': 1582.192, 'src': 'embed', 'start': 1552.35, 'weight': 4, 'content': [{'end': 1557.094, 'text': "graphics is spark's own graph computation engine.", 'start': 1552.35, 'duration': 4.744}, {'end': 1561.658, 'text': 'so this is mainly if you are interested in doing a graph based processing.', 'start': 1557.094, 'duration': 4.564}, {'end': 1563.4, 'text': 'think about facebook.', 'start': 1561.658, 'duration': 1.742}, {'end': 1572.45, 'text': 'think about linkedin, where you can have your data which can be stored and that data has some kind of network connections,', 'start': 1563.4, 'duration': 9.05}, {'end': 1575.05, 'text': 'or you could say it is well networked.', 'start': 1572.45, 'duration': 2.6}, {'end': 1582.192, 'text': 'I could say X is connected to Y, Y is connected to Z, Z is connected to A.', 'start': 1575.47, 'duration': 6.722}], 'summary': "Spark's graphics engine is ideal for graph-based processing, as seen in social networks like facebook and linkedin.", 'duration': 29.842, 'max_score': 1552.35, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1552350.jpg'}], 'start': 1303.298, 'title': 'Real-time customer analysis and spark streaming', 'summary': 'Discusses real-time customer analysis through cameras in stores and online portals, leveraging algorithms for product recommendations and price adjustments, along with an overview of spark streaming, mllib, and graphx for real-time data processing and machine learning.', 'chapters': [{'end': 1410.028, 'start': 1303.298, 'title': 'Real-time customer analysis', 'summary': 'Discusses the implementation of real-time customer analysis using cameras in stores and online portals, where algorithms process customer data to provide product recommendations and alter prices to drive sales.', 'duration': 106.73, 'highlights': ['Cameras in stores monitor customer movement to identify areas of interest, enabling real-time algorithm processing to provide product recommendations and alter prices for increased sales.', 'Real-time processing through machine learning algorithms on online shopping portals analyzes customer behavior and history to offer product recommendations and adjusted prices to drive sales.', 'The essence of machine learning and real-time processing is highlighted in scenarios where customer data is processed to provide product recommendations and alter prices for increased sales.']}, {'end': 1607.063, 'start': 1410.269, 'title': 'Spark streaming, mllib, and graphx', 'summary': 'Covers spark streaming for real-time data processing, mllib for machine learning algorithms, and graphx for graph computation, emphasizing their capabilities and use cases.', 'duration': 196.794, 'highlights': ['Spark Streaming provides secure, reliable, fast processing of live data streams, allowing real-time streaming and processing of data. Spark Streaming enables real-time processing of live data streams, ensuring secure, reliable, and fast processing.', 'MLlib eases the deployment and development of scalable machine learning algorithms for clustering, classification, collaborative filtering, and other data science techniques. MLlib simplifies the deployment and development of scalable machine learning algorithms for various data science techniques.', "GraphX is Spark's graph computation engine, suitable for graph-based processing, such as network connections and relationships. GraphX serves as a graph computation engine, ideal for processing network connections and relationships in graph-based scenarios."]}], 'duration': 303.765, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1303298.jpg', 'highlights': ['Cameras in stores enable real-time algorithm processing for product recommendations and price adjustments.', 'Real-time processing through machine learning algorithms on online portals analyzes customer behavior for product recommendations and price adjustments.', 'Spark Streaming provides secure, reliable, and fast processing of live data streams for real-time data processing.', 'MLlib simplifies the deployment and development of scalable machine learning algorithms for various data science techniques.', 'GraphX serves as a graph computation engine, ideal for processing network connections and relationships in graph-based scenarios.']}, {'end': 2297.302, 'segs': [{'end': 1679.861, 'src': 'heatmap', 'start': 1623.933, 'weight': 1, 'content': [{'end': 1633.576, 'text': 'so if you have data which can be represented in the form of graph, then graphics can be used to do ETL, that is, extraction, transformation, load,', 'start': 1623.933, 'duration': 9.643}, {'end': 1638.578, 'text': 'to do your data analysis and also do interactive graph computation.', 'start': 1633.576, 'duration': 5.002}, {'end': 1640.359, 'text': 'so graphics is quite powerful.', 'start': 1638.578, 'duration': 1.781}, {'end': 1648.104, 'text': 'now, when you talk about spark, Your Spark can work with your different clustering technologies.', 'start': 1640.359, 'duration': 7.745}, {'end': 1650.265, 'text': 'So it can work with Apache Mesos.', 'start': 1648.184, 'duration': 2.081}, {'end': 1656.328, 'text': "That's how Spark came in, where it was initially to prove the credibility of Apache Mesos.", 'start': 1650.585, 'duration': 5.743}, {'end': 1661.63, 'text': 'Spark can work with YARN, which is usually you will see in different working environments.', 'start': 1656.748, 'duration': 4.882}, {'end': 1663.671, 'text': 'Spark can also work as standalone.', 'start': 1661.67, 'duration': 2.001}, {'end': 1668.693, 'text': 'That means without Hadoop, Spark can have its own setup with master and worker processes.', 'start': 1663.811, 'duration': 4.882}, {'end': 1674.525, 'text': 'So usually or you can say technically Spark uses a master slave architecture.', 'start': 1669.213, 'duration': 5.312}, {'end': 1679.861, 'text': 'Now that consists of a driver program that can run on a master node.', 'start': 1674.765, 'duration': 5.096}], 'summary': 'Graphics can be used for etl and data analysis. spark works with various clustering technologies and can function with or without hadoop.', 'duration': 55.928, 'max_score': 1623.933, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1623933.jpg'}, {'end': 1679.861, 'src': 'embed', 'start': 1650.585, 'weight': 1, 'content': [{'end': 1656.328, 'text': "That's how Spark came in, where it was initially to prove the credibility of Apache Mesos.", 'start': 1650.585, 'duration': 5.743}, {'end': 1661.63, 'text': 'Spark can work with YARN, which is usually you will see in different working environments.', 'start': 1656.748, 'duration': 4.882}, {'end': 1663.671, 'text': 'Spark can also work as standalone.', 'start': 1661.67, 'duration': 2.001}, {'end': 1668.693, 'text': 'That means without Hadoop, Spark can have its own setup with master and worker processes.', 'start': 1663.811, 'duration': 4.882}, {'end': 1674.525, 'text': 'So usually or you can say technically Spark uses a master slave architecture.', 'start': 1669.213, 'duration': 5.312}, {'end': 1679.861, 'text': 'Now that consists of a driver program that can run on a master node.', 'start': 1674.765, 'duration': 5.096}], 'summary': 'Spark can work with yarn, can operate standalone, and uses a master-slave architecture.', 'duration': 29.276, 'max_score': 1650.585, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1650585.jpg'}, {'end': 1752.706, 'src': 'heatmap', 'start': 1700.459, 'weight': 2, 'content': [{'end': 1711.004, 'text': 'so your spark, every spark application will have a driver program And that driver program has a inbuilt or internally used Spark context,', 'start': 1700.459, 'duration': 10.545}, {'end': 1716.125, 'text': 'which is basically your entry point of application for any Spark functionality.', 'start': 1711.004, 'duration': 5.121}, {'end': 1721.767, 'text': 'So your driver or your driver program interacts with your cluster manager.', 'start': 1716.545, 'duration': 5.222}, {'end': 1724.408, 'text': 'Now when I say interacts with cluster manager.', 'start': 1721.967, 'duration': 2.441}, {'end': 1731.95, 'text': 'so you have your Spark context, which is the entry point that takes your application request to the cluster manager.', 'start': 1724.408, 'duration': 7.542}, {'end': 1740.517, 'text': 'now, as i said, your cluster manager could be, say, apache mesos, it could be yarn, it could be spark standalone master itself.', 'start': 1732.33, 'duration': 8.187}, {'end': 1746.081, 'text': 'so your cluster manager, in terms of yarn, is your resource manager.', 'start': 1740.517, 'duration': 5.564}, {'end': 1752.706, 'text': 'so your spark application internally runs as series or set of tasks and processes.', 'start': 1746.081, 'duration': 6.625}], 'summary': 'Every spark application has a driver program with an internally used spark context, interacting with the cluster manager, such as apache mesos, yarn, or spark standalone master, which runs tasks and processes.', 'duration': 52.247, 'max_score': 1700.459, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1700459.jpg'}, {'end': 1799.27, 'src': 'embed', 'start': 1773.374, 'weight': 4, 'content': [{'end': 1783.337, 'text': 'So at high level we can say a job is split into multiple tasks and those tasks will be distributed over the slave nodes or worker nodes.', 'start': 1773.374, 'duration': 9.963}, {'end': 1788.421, 'text': 'So whenever you do some kind of transformation, or you use a method of Spark,', 'start': 1783.837, 'duration': 4.584}, {'end': 1794.766, 'text': 'context and RDD is created and this RDD is distributed across multiple nodes.', 'start': 1788.421, 'duration': 6.345}, {'end': 1799.27, 'text': 'As I explained earlier, worker nodes are the slaves that run different tasks.', 'start': 1794.886, 'duration': 4.384}], 'summary': 'Jobs are split into tasks distributed over slave nodes; rdds are distributed across multiple nodes.', 'duration': 25.896, 'max_score': 1773.374, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1773374.jpg'}, {'end': 1870.724, 'src': 'embed', 'start': 1844.383, 'weight': 5, 'content': [{'end': 1848.767, 'text': 'Now what does the resource manager do? Resource manager makes a request.', 'start': 1844.383, 'duration': 4.384}, {'end': 1857.553, 'text': 'So resource manager makes requests to the node manager of the machines wherever the relevant data resides, asking for containers.', 'start': 1849.187, 'duration': 8.366}, {'end': 1863.418, 'text': 'So your resource manager is negotiating or asking for containers from node manager, saying hey,', 'start': 1857.974, 'duration': 5.444}, {'end': 1867.341, 'text': 'can I have a container of one GB RAM and one CPU core?', 'start': 1863.418, 'duration': 3.923}, {'end': 1870.724, 'text': 'can I have a container of one GB RAM and one CPU core?', 'start': 1867.642, 'duration': 3.082}], 'summary': 'Resource manager negotiates for containers, requesting 1gb ram and 1 cpu core.', 'duration': 26.341, 'max_score': 1844.383, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1844383.jpg'}, {'end': 1943.106, 'src': 'embed', 'start': 1911.351, 'weight': 6, 'content': [{'end': 1914.732, 'text': 'So it is within this container which can take care of execution.', 'start': 1911.351, 'duration': 3.381}, {'end': 1917.914, 'text': 'So what is a container? A combination of RAM and CPU core.', 'start': 1914.752, 'duration': 3.162}, {'end': 1923.476, 'text': 'So it is within this container, we will have a executor process which would run.', 'start': 1918.114, 'duration': 5.362}, {'end': 1928.778, 'text': 'And this executor process is taking care of your application related tasks.', 'start': 1923.796, 'duration': 4.982}, {'end': 1933.52, 'text': "so that's how overall spark works in integration with yarn.", 'start': 1929.258, 'duration': 4.262}, {'end': 1937.423, 'text': "now let's learn about this spark cluster managers.", 'start': 1933.52, 'duration': 3.903}, {'end': 1943.106, 'text': 'as i said, spark can work in a standalone mode, so that is without hadoop.', 'start': 1937.423, 'duration': 5.683}], 'summary': 'A container combines ram and cpu core to execute an executor process, which handles application tasks in spark.', 'duration': 31.755, 'max_score': 1911.351, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1911351.jpg'}, {'end': 2049.775, 'src': 'embed', 'start': 2002.884, 'weight': 0, 'content': [{'end': 2006.045, 'text': "You can have Spark working with Hadoop's YARN.", 'start': 2002.884, 'duration': 3.161}, {'end': 2010.747, 'text': 'This is something which widely you will see in different working environments.', 'start': 2006.185, 'duration': 4.562}, {'end': 2019.091, 'text': 'So YARN which takes care of your processing and can take care of different processing frameworks also supports Spark.', 'start': 2011.107, 'duration': 7.984}, {'end': 2021.293, 'text': 'you could have kubernetes now.', 'start': 2019.571, 'duration': 1.722}, {'end': 2024.556, 'text': "that is something which is making a lot of news in today's world.", 'start': 2021.293, 'duration': 3.263}, {'end': 2031.764, 'text': 'it is a open source system for automating deployment, scaling and management of containerized applications,', 'start': 2024.556, 'duration': 7.208}, {'end': 2036.809, 'text': 'so where you could have multiple docker based images which can be connecting to each other.', 'start': 2031.764, 'duration': 5.045}, {'end': 2038.912, 'text': 'so spark also works with kubernetes.', 'start': 2036.809, 'duration': 2.103}, {'end': 2043.113, 'text': "Now let's look at some applications of Spark.", 'start': 2039.471, 'duration': 3.642}, {'end': 2049.775, 'text': 'So JPMorgan Chase and company uses Spark to detect fraudulent transactions,', 'start': 2043.573, 'duration': 6.202}], 'summary': "Spark works with hadoop's yarn and kubernetes, used by jpmorgan chase for fraud detection.", 'duration': 46.891, 'max_score': 2002.884, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms2002884.jpg'}, {'end': 2089.342, 'src': 'embed', 'start': 2063.241, 'weight': 10, 'content': [{'end': 2077.114, 'text': 'due to its real-time processing capabilities and in-memory faster processing where they could be working on fraud detection or credit analysis or pattern identification and many other use cases.', 'start': 2063.241, 'duration': 13.873}, {'end': 2082.498, 'text': 'Alibaba Group that uses also Spark to analyze large data,', 'start': 2077.655, 'duration': 4.843}, {'end': 2089.342, 'text': 'sets of data such as real-time transaction details now that might be based online or in the stores,', 'start': 2082.498, 'duration': 6.844}], 'summary': 'Alibaba group uses spark for real-time processing, fraud detection, credit analysis, and pattern identification in analyzing large sets of data, including real-time transaction details.', 'duration': 26.101, 'max_score': 2063.241, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms2063241.jpg'}, {'end': 2153.67, 'src': 'embed', 'start': 2113.314, 'weight': 11, 'content': [{'end': 2119.879, 'text': 'So there is a lot of work happening in the healthcare industry where real-time processing is finding a lot of importance.', 'start': 2113.314, 'duration': 6.565}, {'end': 2124.383, 'text': 'real time and faster processing is what is required.', 'start': 2120.76, 'duration': 3.623}, {'end': 2128.887, 'text': 'so healthcare industry and iqvi is also using spark.', 'start': 2124.383, 'duration': 4.504}, {'end': 2132.75, 'text': 'you have netflix, which is known, and you have riot games.', 'start': 2128.887, 'duration': 3.863}, {'end': 2143.979, 'text': 'so entertainment and gaming companies like netflix and riot games use apache spark to showcase relevant advertisements to their users based on the videos that they have watched,', 'start': 2132.75, 'duration': 11.229}, {'end': 2145.8, 'text': 'shared or liked.', 'start': 2143.979, 'duration': 1.821}, {'end': 2153.67, 'text': 'so these are few domains which find use cases of spark, that is, banking, e-commerce, healthcare, entertainment,', 'start': 2145.8, 'duration': 7.87}], 'summary': 'Healthcare, entertainment, banking, and e-commerce industries use real-time processing with apache spark.', 'duration': 40.356, 'max_score': 2113.314, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms2113314.jpg'}, {'end': 2225.118, 'src': 'embed', 'start': 2201.528, 'weight': 13, 'content': [{'end': 2210.235, 'text': 'Conviva collects data about video streaming quality to give their customers visibility into the end user experience they are delivering.', 'start': 2201.528, 'duration': 8.707}, {'end': 2213.776, 'text': 'Now, how do they do it? Apache Spark again.', 'start': 2210.635, 'duration': 3.141}, {'end': 2215.736, 'text': 'Using Apache Spark.', 'start': 2214.256, 'duration': 1.48}, {'end': 2225.118, 'text': 'Conviva delivers a better quality of service to its customers by removing the screen buffering and learning in detail about network conditions in real time.', 'start': 2215.736, 'duration': 9.382}], 'summary': 'Conviva uses apache spark to improve video streaming quality and real-time network condition monitoring.', 'duration': 23.59, 'max_score': 2201.528, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms2201528.jpg'}], 'start': 1607.063, 'title': 'Apache spark architecture and usage', 'summary': 'Delves into the architecture of apache spark, its integration with yarn, and its applications in various industries, highlighting its compatibility with different clustering technologies, resource allocation process, and real-time processing capabilities. examples from banking, e-commerce, healthcare, and entertainment industries are provided to showcase its impact on fraud detection, credit analysis, user recommendations, and video streaming quality improvement.', 'chapters': [{'end': 1799.27, 'start': 1607.063, 'title': 'Spark architecture and processing', 'summary': 'Discusses the power of using graphics for etl, data analysis, and interactive graph computation with spark, including its compatibility with different clustering technologies and its architecture of driver program, master slave, and cluster manager.', 'duration': 192.207, 'highlights': ["Spark can work with different clustering technologies like Apache Mesos and YARN. Spark's compatibility with various clustering technologies, such as Apache Mesos and YARN, provides flexibility and scalability for processing big data.", 'Spark can work as standalone, having its own setup with master and worker processes. The standalone capability of Spark allows it to operate independently without Hadoop, providing its own setup with master and worker processes for efficient data processing.', 'Spark architecture consists of a driver program that interacts with the cluster manager, which can be YARN or Spark standalone master. The architecture of Spark involves a driver program that interacts with the cluster manager, enabling communication with YARN or Spark standalone master for effective application execution.', "Spark context serves as the entry point for any Spark functionality and interacts with the cluster manager, such as YARN's resource manager. The Spark context functions as the entry point for Spark functionality and communicates with the cluster manager, such as YARN's resource manager, to facilitate application execution.", 'Worker nodes run different tasks and distribute RDD created by Spark context across multiple nodes. The worker nodes act as slaves, running various tasks and distributing the Resilient Distributed Datasets (RDD) created by Spark context across multiple nodes for efficient data processing.']}, {'end': 2049.775, 'start': 1799.85, 'title': 'Spark architecture and integration with yarn', 'summary': "Introduces the architecture of spark and its interaction with yarn, explaining the process of resource allocation, execution, and the various cluster managers that spark can work with, including apache mesos, hadoop's yarn, and kubernetes.", 'duration': 249.925, 'highlights': ['Spark interacts with YARN for resource allocation and execution, with the resource manager making requests to node managers for containers based on RAM and CPU core requirements. The resource manager negotiates with node managers for containers, specifying RAM and CPU core requirements, and the node manager approves or denies based on processing needs.', 'The interaction involves the allocation of containers, execution of app master, and usage of executor processes within containers to run tasks, illustrating the integration of Spark with YARN. The process involves the allocation of containers, execution of app master, and usage of executor processes within containers to run tasks, showcasing the integration of Spark with YARN.', "Spark can work with various cluster managers, including standalone mode, Apache Mesos, Hadoop's YARN, and Kubernetes, showcasing its flexibility and adaptability. Spark's compatibility with standalone mode, Apache Mesos, Hadoop's YARN, and Kubernetes highlights its flexibility and adaptability to different cluster management systems.", 'The application of Spark is exemplified through its usage by JPMorgan Chase and company for detecting fraudulent transactions, showcasing its real-world utility. The real-world application of Spark is demonstrated through its usage by JPMorgan Chase and company for detecting fraudulent transactions, highlighting its practical utility.']}, {'end': 2297.302, 'start': 2049.775, 'title': 'Apache spark in different industries', 'summary': 'Discusses the diverse uses of apache spark in industries such as banking, e-commerce, healthcare, entertainment, and video streaming, with examples including alibaba, netflix, iqvia, and conviva, showcasing real-time processing capabilities and its impact on fraud detection, credit analysis, pattern identification, user recommendations, and video streaming quality improvement.', 'duration': 247.527, 'highlights': ['The use of Apache Spark in industries such as banking, e-commerce, healthcare, entertainment, and video streaming has diverse applications, including real-time processing capabilities for fraud detection, credit analysis, pattern identification, and user recommendations, with examples from Alibaba, Netflix, IQVIA, and Conviva. (Relevance: 5)', 'Alibaba Group utilizes Spark for analyzing real-time transaction details and browsing history to provide recommendations, demonstrating the relevance of Spark in e-commerce. (Relevance: 4)', "IQVIA, a leading healthcare company, leverages Spark to analyze patients' data, identify possible health issues, and diagnose based on medical history, showcasing the importance of real-time processing in the healthcare industry. (Relevance: 3)", 'Netflix and Riot Games employ Apache Spark to deliver relevant advertisements to users based on their video interactions, highlighting the significance of Spark in entertainment and gaming industries. (Relevance: 2)', 'Conviva, a prominent video streaming company, uses Apache Spark to enhance video streaming quality by collecting and analyzing data to ensure a better user experience, showcasing the impact of Spark on improving video streaming services. (Relevance: 1)']}], 'duration': 690.239, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/znBa13Earms/pics/znBa13Earms1607063.jpg', 'highlights': ["Spark's compatibility with various clustering technologies, such as Apache Mesos and YARN, provides flexibility and scalability for processing big data.", 'The standalone capability of Spark allows it to operate independently without Hadoop, providing its own setup with master and worker processes for efficient data processing.', 'The architecture of Spark involves a driver program that interacts with the cluster manager, enabling communication with YARN or Spark standalone master for effective application execution.', "The Spark context functions as the entry point for Spark functionality and communicates with the cluster manager, such as YARN's resource manager, to facilitate application execution.", 'The worker nodes act as slaves, running various tasks and distributing the Resilient Distributed Datasets (RDD) created by Spark context across multiple nodes for efficient data processing.', 'The resource manager negotiates with node managers for containers, specifying RAM and CPU core requirements, and the node manager approves or denies based on processing needs.', 'The process involves the allocation of containers, execution of app master, and usage of executor processes within containers to run tasks, showcasing the integration of Spark with YARN.', "Spark's compatibility with standalone mode, Apache Mesos, Hadoop's YARN, and Kubernetes highlights its flexibility and adaptability to different cluster management systems.", 'The real-world application of Spark is demonstrated through its usage by JPMorgan Chase and company for detecting fraudulent transactions, highlighting its practical utility.', 'The application of Spark is exemplified through its usage by JPMorgan Chase and company for detecting fraudulent transactions, showcasing its real-world utility.', 'Alibaba Group utilizes Spark for analyzing real-time transaction details and browsing history to provide recommendations, demonstrating the relevance of Spark in e-commerce.', "IQVIA, a leading healthcare company, leverages Spark to analyze patients' data, identify possible health issues, and diagnose based on medical history, showcasing the importance of real-time processing in the healthcare industry.", 'Netflix and Riot Games employ Apache Spark to deliver relevant advertisements to users based on their video interactions, highlighting the significance of Spark in entertainment and gaming industries.', 'Conviva, a prominent video streaming company, uses Apache Spark to enhance video streaming quality by collecting and analyzing data to ensure a better user experience, showcasing the impact of Spark on improving video streaming services.']}], 'highlights': ['Apache Spark is an open source in-memory computing framework, used to process data in batch and real time across various cluster computers.', 'Spark can process the same data 100 times faster than MapReduce, making it a preferred choice for efficient data processing.', 'In 2014, Apache Spark, used by Databricks, set a new world record in sorting large-scale datasets.', "Spark's implementation in Scala results in concise code with fewer lines, offering the benefits of both functional programming and object-oriented language, making it a preferred choice for efficient coding and execution.", 'Apache Spark is highlighted as one of the most in-demand processing frameworks in the big data world.', 'Resilient Distributed Datasets (RDDs) in Spark can be 10 to 100 times faster than Hadoop.', 'Spark ensures fault tolerance through distributed RDDs, preventing data loss and allowing recomputation in case of failure.', 'Spark offers a rich set of SQL queries, machine learning algorithms, and various components for structured data processing and machine learning.', 'In-memory computing involves lazy evaluation and data stored in RAM, utilizing RAM for processing and storage.', 'Spark SQL is a component type processing framework used for structured and semi-structured data processing, allowing for in-memory processing of structured data stored in RDBMS or files with specific delimiters and patterns.', 'Data frames in Spark SQL can be visualized as rows and columns, and the DataFrame API facilitates the creation of DataFrames, enabling efficient processing of structured data.', 'Spark supports various data formats such as CSV, JSON, Avro, Parquet, and binary files, and provides the capability to extract data from RDBMS using JDBC connection for diverse data processing needs.', 'Actions in Spark invoke the execution from the first RDD till the last RDD to obtain the result, providing a clear understanding of the execution flow in Spark.', 'Transformations in Spark always create an RDD or a step in the DAG, using the method of Spark context, forming the fundamental building blocks for data processing in Spark.', "DataFrames can be created from structured data, indicating the data format and allowing Spark to understand the data's structure. This feature of DataFrames enables Spark to efficiently handle structured data, optimizing the backend DAG scheduler for better performance and resource management.", "Usage of Spark context or SQL context for working with DataFrames, with examples from earlier versions like using Hive context or SQL context in Spark 1.6. Understanding the historical context of using different context types for DataFrames provides insights into the evolution of Spark's data processing capabilities.", "Relevance of DataFrames in working with streaming data, particularly in the context of organizations like Macy's implementing machine learning algorithms. Highlighting the practical implications of using DataFrames for real-time data processing and advanced analytics, demonstrating its significance in modern data-driven organizations.", 'Cameras in stores enable real-time algorithm processing for product recommendations and price adjustments.', 'Real-time processing through machine learning algorithms on online portals analyzes customer behavior for product recommendations and price adjustments.', 'Spark Streaming provides secure, reliable, and fast processing of live data streams for real-time data processing.', 'MLlib simplifies the deployment and development of scalable machine learning algorithms for various data science techniques.', 'GraphX serves as a graph computation engine, ideal for processing network connections and relationships in graph-based scenarios.', "Spark's compatibility with various clustering technologies, such as Apache Mesos and YARN, provides flexibility and scalability for processing big data.", 'The standalone capability of Spark allows it to operate independently without Hadoop, providing its own setup with master and worker processes for efficient data processing.', 'The architecture of Spark involves a driver program that interacts with the cluster manager, enabling communication with YARN or Spark standalone master for effective application execution.', "The Spark context functions as the entry point for Spark functionality and communicates with the cluster manager, such as YARN's resource manager, to facilitate application execution.", 'The worker nodes act as slaves, running various tasks and distributing the Resilient Distributed Datasets (RDD) created by Spark context across multiple nodes for efficient data processing.', 'The resource manager negotiates with node managers for containers, specifying RAM and CPU core requirements, and the node manager approves or denies based on processing needs.', 'The process involves the allocation of containers, execution of app master, and usage of executor processes within containers to run tasks, showcasing the integration of Spark with YARN.', "Spark's compatibility with standalone mode, Apache Mesos, Hadoop's YARN, and Kubernetes highlights its flexibility and adaptability to different cluster management systems.", 'The real-world application of Spark is demonstrated through its usage by JPMorgan Chase and company for detecting fraudulent transactions, highlighting its practical utility.', 'Alibaba Group utilizes Spark for analyzing real-time transaction details and browsing history to provide recommendations, demonstrating the relevance of Spark in e-commerce.', "IQVIA, a leading healthcare company, leverages Spark to analyze patients' data, identify possible health issues, and diagnose based on medical history, showcasing the importance of real-time processing in the healthcare industry.", 'Netflix and Riot Games employ Apache Spark to deliver relevant advertisements to users based on their video interactions, highlighting the significance of Spark in entertainment and gaming industries.', 'Conviva, a prominent video streaming company, uses Apache Spark to enhance video streaming quality by collecting and analyzing data to ensure a better user experience, showcasing the impact of Spark on improving video streaming services.']}