title
Apache Spark Full Course - Learn Apache Spark in 8 Hours | Apache Spark Tutorial | Edureka

description
Edureka Apache Spark Training (Use Code: YOUTUBE20): https://www.edureka.co/apache-spark-scala-certification-training
This Edureka Spark Full Course video will help you understand and learn Apache Spark in detail. This Spark tutorial is ideal for both beginners and professionals who want to master Apache Spark concepts. Below are the topics covered in this Spark tutorial for beginners:
00:00 Agenda
2:44 Introduction to Apache Spark
3:49 What is Spark?
5:34 Spark Eco-System
7:44 Why RDD?
16:44 RDD Operations
18:59 Yahoo Use-Case
21:09 Apache Spark Architecture
24:24 RDD
26:59 Spark Architecture
31:09 Demo
39:54 Spark RDD
41:09 Spark Applications
41:59 Need For RDDs
43:34 What are RDDs?
44:24 Sources of RDDs
45:04 Features of RDDs
46:39 Creation of RDDs
50:19 Operations Performed On RDDs
50:49 Narrow Transformations
51:04 Wide Transformations
51:29 Actions
51:44 RDDs Using Spark Pokemon Use-Case
1:05:19 Spark DataFrame
1:06:54 What is a DataFrame?
1:08:24 Why Do We Need Dataframes?
1:09:54 Features of DataFrames
1:11:09 Sources Of DataFrames
1:11:34 Creation Of DataFrame
1:24:44 Spark SQL
1:25:14 Why Spark SQL?
1:27:09 Spark SQL Advantages Over Hive
1:31:54 Spark SQL Success Story
1:33:24 Spark SQL Features
1:37:15 Spark SQL Architecture
1:39:40 Spark SQL Libraries
1:42:15 Querying Using Spark SQL
1:45:50 Adding Schema To RDDs
1:55:05 Hive Tables
1:57:50 Use Case: Stock Market Analysis with Spark SQL
2:16:50 Spark Streaming
2:18:10 What is Streaming?
2:25:46 Spark Streaming Overview
2:27:56 Spark Streaming workflow
2:31:21 Streaming Fundamentals
2:33:36 DStream
2:38:56 Input DStreams
2:40:11 Transformations on DStreams
2:43:06 DStreams Window
2:47:11 Caching/Persistence
2:48:11 Accumulators
2:49:06 Broadcast Variables
2:49:56 Checkpoints
2:51:11 Use-Case Twitter Sentiment Analysis
3:00:26 Spark MLlib
3:00:31 MLlib Techniques
3:01:46 Demo
3:11:51 Use Case: Earthquake Detection Using Spark
3:24:01 Visualizing Result
3:25:11 Spark GraphX
3:26:01 Basics of Graph
3:27:56 Types of Graph
3:38:56 GraphX
3:40:42 Property Graph
3:48:37 Creating & Transforming Property Graph
3:56:17 Graph Builder
4:02:22 Vertex RDD
4:07:07 Edge RDD
4:11:37 Graph Operators
4:24:37 GraphX Demo
4:34:24 Graph Algorithms
4:34:40 PageRank
4:38:29 Connected Components
4:40:39 Triangle Counting
4:44:09 Spark GraphX Demo
4:57:54 MapReduce vs Spark
5:13:03 Kafka with Spark Streaming
5:23:38 Messaging System
5:21:15 Kafka Components
2:23:45 Kafka Cluster
5:24:15 Demo
5:48:56 Kafka Spark Streaming Demo
6:17:16 PySpark Tutorial
6:21:26 PySpark Installation
6:47:06 Spark Interview Questions
PG in Big Data Engineering with NIT Rourkela: https://www.edureka.co/post-graduate/big-data-engineering (450+ Hrs || 9 Months || 20+ Projects & 100+ Case studies)
Instagram: https://www.instagram.com/edureka_learning
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Got a question on the topic? Please share it in the comment section below and our experts will answer it for you. For more information, please write back to us at sales@edureka.in or call us at IND: 9606058406 / US: 18338555775 (toll-free).

detail
{'title': 'Apache Spark Full Course - Learn Apache Spark in 8 Hours | Apache Spark Tutorial | Edureka', 'heatmap': [{'end': 1126.443, 'start': 841.138, 'weight': 0.73}, {'end': 1968.803, 'start': 1398.527, 'weight': 0.854}, {'end': 3376.437, 'start': 2806.003, 'weight': 0.858}, {'end': 4502.801, 'start': 4215.311, 'weight': 0.705}], 'summary': 'This full course on apache spark covers its dominance in big data and ai, 100 times faster than mapreduce, comprehensive crash course with 12 modules, real-time tweet sentiment analysis, real-time earthquake detection saving lives, and pyspark mllib overview with heart disease prediction test errors of 0.2297 and 0.168.', 'chapters': [{'end': 38.581, 'segs': [{'end': 30.738, 'src': 'embed', 'start': 7.005, 'weight': 0, 'content': [{'end': 14.029, 'text': 'For the past five years, Spark has been on an absolute tear, becoming one of the most widely used technologies in big data and AI.', 'start': 7.005, 'duration': 7.024}, {'end': 21.673, 'text': "Today's cutting-edge companies like Facebook, Apple, Netflix, Uber and many more have deployed Spark at massive scale,", 'start': 14.569, 'duration': 7.104}, {'end': 30.738, 'text': 'processing petabytes of data to deliver innovations ranging from detecting fraudulent behavior to delivering personalized experiences in real time,', 'start': 21.673, 'duration': 9.065}], 'summary': 'Spark has become a widely used technology in big data and ai, with companies like facebook, apple, netflix, and uber processing petabytes of data at massive scale.', 'duration': 23.733, 'max_score': 7.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g7005.jpg'}], 'start': 7.005, 'title': "Spark's dominance in big data and ai", 'summary': "Highlights spark's widespread adoption by leading companies like facebook, apple, netflix, and uber for processing petabytes of data, delivering innovations such as fraud detection and real-time personalized 
experiences.", 'chapters': [{'end': 38.581, 'start': 7.005, 'title': "Spark's dominance in big data and ai", 'summary': "Highlights spark's widespread adoption by leading companies like facebook, apple, netflix, and uber for processing petabytes of data, delivering innovations such as fraud detection and real-time personalized experiences.", 'duration': 31.576, 'highlights': ['Leading companies like Facebook, Apple, Netflix, and Uber have deployed Spark at massive scale for processing petabytes of data.', 'Spark has been widely used in big data and AI technologies for the past five years, indicating its dominance in the industry.', 'Spark enables innovations such as detecting fraudulent behavior and delivering personalized experiences in real time, transforming multiple industries.']}], 'duration': 31.576, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g7005.jpg', 'highlights': ['Leading companies like Facebook, Apple, Netflix, and Uber have deployed Spark at massive scale for processing petabytes of data.', 'Spark enables innovations such as detecting fraudulent behavior and delivering personalized experiences in real time, transforming multiple industries.', 'Spark has been widely used in big data and AI technologies for the past five years, indicating its dominance in the industry.']}, {'end': 1358.216, 'segs': [{'end': 615.49, 'src': 'embed', 'start': 591.756, 'weight': 4, 'content': [{'end': 598.861, 'text': 'That is they are able to recover quickly from any issues as the same data chunks are replicated across multiple executor nodes.', 'start': 591.756, 'duration': 7.105}, {'end': 603.383, 'text': 'Thus even if one executor fails another will still process the data.', 'start': 599.401, 'duration': 3.982}, {'end': 610.107, 'text': 'This allows you to perform functional calculations against your data set very quickly by harnessing the power of multiple nodes.', 'start': 603.804, 'duration': 6.303}, 
{'end': 612.108, 'text': 'So this is all about RDD.', 'start': 610.687, 'duration': 1.421}, {'end': 615.49, 'text': "Now, let's have a look at some of the important features of RDDs.", 'start': 612.428, 'duration': 3.062}], 'summary': 'Rdds replicate data across nodes for quick recovery and processing, enabling fast functional calculations.', 'duration': 23.734, 'max_score': 591.756, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g591756.jpg'}, {'end': 1126.443, 'src': 'heatmap', 'start': 841.138, 'weight': 0.73, 'content': [{'end': 844.78, 'text': 'Now, let me show you how to create an RDD from the existing RDD.', 'start': 841.138, 'duration': 3.642}, {'end': 846.455, 'text': 'Okay Here.', 'start': 845.18, 'duration': 1.275}, {'end': 853.754, 'text': "I'll create an array called even and assign numbers 1 to 10 1 2 3 4 5 6 7.", 'start': 846.495, 'duration': 7.259}, {'end': 864.709, 'text': 'Okay, so I got the result here that is I have created an integer array of 1 to 10 and now I will parallelize this array, a1.', 'start': 853.761, 'duration': 10.948}, {'end': 872.706, 'text': 'Sorry, I got an error.', 'start': 871.565, 'duration': 1.141}, {'end': 877.108, 'text': 'It is sc.parallelize of a1.', 'start': 873.326, 'duration': 3.782}, {'end': 882.671, 'text': 'Okay, so I created an RDD called ParallelCollectionRDD. Cool.', 'start': 878.289, 'duration': 4.382}, {'end': 892.737, 'text': 'Now, I will create a new RDD from the existing RDD that is val newRDD is equal to a1.map.', 'start': 883.011, 'duration': 9.726}, {'end': 895.75, 'text': 'Data present in an RDD.', 'start': 894.15, 'duration': 1.6}, {'end': 898.871, 'text': 'I will create a new RDD from existing RDD.', 'start': 896.271, 'duration': 2.6}, {'end': 907.774, 'text': 'So here I will take even as a reference and map the data and multiply that data by 2.', 'start': 899.432, 'duration': 8.342}, {'end': 918.377, 'text': 'So what should be your 
output if I map the data present in an RDD and multiply it by 2? It would be like 2, 4, 6, 8, up to 20, correct?', 'start': 907.774, 'duration': 10.603}, {'end': 919.837, 'text': "So, let's see how it works.", 'start': 918.797, 'duration': 1.04}, {'end': 922.258, 'text': 'Yes, we got the output.', 'start': 921.198, 'duration': 1.06}, {'end': 927.17, 'text': 'that is, 1 to 10 multiplied by 2, that is 2, 4, 6, 8, up to 20.', 'start': 922.808, 'duration': 4.362}, {'end': 933.793, 'text': 'So this is one of the methods of creating a new RDD from an old RDD and I have one more method that is from external file sources.', 'start': 927.17, 'duration': 6.623}, {'end': 939.615, 'text': 'So what I will do here is I will give var test is equal to sc.textFile.', 'start': 934.273, 'duration': 5.342}, {'end': 945.417, 'text': 'Here I will give the path to HDFS file location and link the path.', 'start': 940.956, 'duration': 4.461}, {'end': 947.258, 'text': 'that is HDFS.', 'start': 945.417, 'duration': 1.841}, {'end': 954.132, 'text': 'localhost 9000 is the path and I have a folder called example and in that I have a file called sample.', 'start': 947.258, 'duration': 6.874}, {'end': 961.313, 'text': 'Cool. So I got one more RDD created here.', 'start': 957.312, 'duration': 4.001}, {'end': 966.534, 'text': 'Now, let me show you this file that I have already kept in HDFS directory.', 'start': 962.233, 'duration': 4.301}, {'end': 973.175, 'text': 'I will browse the file system and I will show you the slash example directory that I have created.', 'start': 968.254, 'duration': 4.921}, {'end': 978.676, 'text': 'So here you can see the example that I have created as a directory.', 'start': 974.995, 'duration': 3.681}, {'end': 982.506, 'text': 'and here I have sample as the input file that I have given.', 'start': 979.384, 'duration': 3.122}, {'end': 985.388, 'text': 'So here you can see the same path location.', 'start': 983.086, 'duration': 2.302}, {'end': 990.13, 'text': 'So this is how I can create an 
RDD from external file sources. In this case,', 'start': 985.828, 'duration': 4.302}, {'end': 992.832, 'text': 'I have used HDFS as an external file source.', 'start': 990.33, 'duration': 2.502}, {'end': 1000.977, 'text': 'So this is how we can create RDDs in three different ways, that is, from parallelized collections, from external data sources, and from existing RDDs.', 'start': 993.452, 'duration': 7.525}, {'end': 1004.819, 'text': "So let's move further and see the various RDD operations.", 'start': 1001.897, 'duration': 2.922}, {'end': 1012.757, 'text': "RDD supports two main operations, namely transformations and actions. As I've already said, RDDs are immutable.", 'start': 1005.792, 'duration': 6.965}, {'end': 1017.3, 'text': 'So once you create an RDD, you cannot change any content in the RDD.', 'start': 1013.157, 'duration': 4.143}, {'end': 1021.603, 'text': 'So you might be wondering how RDD applies those transformations, correct?', 'start': 1017.7, 'duration': 3.903}, {'end': 1028.079, 'text': 'When you run any transformations, it runs those transformations on the old RDD and creates a new RDD.', 'start': 1022.356, 'duration': 5.723}, {'end': 1036.305, 'text': 'This is basically done for optimization reasons. Transformations are the operations which are applied on an RDD to create a new RDD.', 'start': 1028.48, 'duration': 7.825}, {'end': 1040.907, 'text': 'Now these transformations work on the principle of lazy evaluation.', 'start': 1036.845, 'duration': 4.062}, {'end': 1042.348, 'text': 'So what does it mean?', 'start': 1041.428, 'duration': 0.92}, {'end': 1045.69, 'text': 'It means that when we call some operation in RDD,', 'start': 1042.789, 'duration': 2.901}, {'end': 1051.234, 'text': 'it does not execute immediately and Spark maintains the record of the operation that is being called.', 'start': 1045.69, 'duration': 5.544}, {'end': 1053.858, 'text': 'since transformations are lazy in nature.', 'start': 1051.937, 'duration': 1.921}, {'end': 1058.179, 'text': 'So we can 
execute the operation anytime by calling an action on the data.', 'start': 1054.198, 'duration': 3.981}, {'end': 1063.28, 'text': 'Hence, in lazy evaluation, data is not loaded until it is necessary.', 'start': 1058.759, 'duration': 4.521}, {'end': 1072.243, 'text': 'Now these actions analyze the RDD and produce a result. A simple action can be count, which will count the rows in the RDD and then produce a result.', 'start': 1063.74, 'duration': 8.503}, {'end': 1079.85, 'text': 'So I can say that transformations produce new RDDs and actions produce results. Before moving further with the discussion,', 'start': 1072.865, 'duration': 6.985}, {'end': 1088.875, 'text': 'let me tell you about the three different workloads that Spark caters to. They are batch mode, interactive mode, and streaming mode. In case of batch mode,', 'start': 1080.21, 'duration': 8.665}, {'end': 1089.916, 'text': 'we run a batch job.', 'start': 1089.036, 'duration': 0.88}, {'end': 1091.877, 'text': 'You write a job and then schedule it.', 'start': 1090.116, 'duration': 1.761}, {'end': 1096.26, 'text': 'It works through a queue or a batch of separate jobs without manual intervention.', 'start': 1092.137, 'duration': 4.123}, {'end': 1101.964, 'text': 'Then, in case of interactive mode, it is an interactive shell where you go and execute the commands one by one.', 'start': 1096.7, 'duration': 5.264}, {'end': 1108.749, 'text': 'So you will execute one command, check the result and then execute another command based on the output result and so on.', 'start': 1102.504, 'duration': 6.245}, {'end': 1110.471, 'text': 'It works similar to the SQL shell.', 'start': 1108.749, 'duration': 1.722}, {'end': 1116.595, 'text': 'So shell is the one which executes the driver program and in the shell mode, you can run it on the cluster mode.', 'start': 1110.851, 'duration': 5.744}, {'end': 1121.339, 'text': 'It is generally used for development work or it is used for ad hoc queries.', 'start': 1117.036, 'duration': 4.303}, {'end': 1126.443, 
'text': 'Then comes the streaming mode, where the program is continuously running, as and when the data comes.', 'start': 1121.339, 'duration': 5.104}], 'summary': 'Demonstrated creating rdd from array and external file, discussed rdd operations and workloads in spark.', 'duration': 285.305, 'max_score': 841.138, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g841138.jpg'}, {'end': 1012.757, 'src': 'embed', 'start': 979.384, 'weight': 2, 'content': [{'end': 982.506, 'text': 'and here I have sample as the input file that I have given.', 'start': 979.384, 'duration': 3.122}, {'end': 985.388, 'text': 'So here you can see the same path location.', 'start': 983.086, 'duration': 2.302}, {'end': 990.13, 'text': 'So this is how I can create an RDD from external file sources. In this case,', 'start': 985.828, 'duration': 4.302}, {'end': 992.832, 'text': 'I have used HDFS as an external file source.', 'start': 990.33, 'duration': 2.502}, {'end': 1000.977, 'text': 'So this is how we can create RDDs in three different ways, that is, from parallelized collections, from external data sources, and from existing RDDs.', 'start': 993.452, 'duration': 7.525}, {'end': 1004.819, 'text': "So let's move further and see the various RDD operations.", 'start': 1001.897, 'duration': 2.922}, {'end': 1012.757, 'text': "RDD supports two main operations, namely transformations and actions. As I've already said, RDDs are immutable.", 'start': 1005.792, 'duration': 6.965}], 'summary': 'Creating rdds from external file sources and exploring rdd operations in spark.', 'duration': 33.373, 'max_score': 979.384, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g979384.jpg'}, {'end': 1088.875, 'src': 'embed', 'start': 1063.74, 'weight': 1, 'content': [{'end': 1072.243, 'text': 'Now these actions analyze the RDD and produce a result. A simple action can be count, which will count the rows in the RDD and then produce 
a result.', 'start': 1063.74, 'duration': 8.503}, {'end': 1079.85, 'text': 'So I can say that transformations produce new RDDs and actions produce results. Before moving further with the discussion,', 'start': 1072.865, 'duration': 6.985}, {'end': 1088.875, 'text': 'let me tell you about the three different workloads that Spark caters to. They are batch mode, interactive mode, and streaming mode. In case of batch mode,', 'start': 1080.21, 'duration': 8.665}], 'summary': 'Actions like count can analyze rdd and produce results, with three different workloads: batch, interactive, and streaming mode.', 'duration': 25.135, 'max_score': 1063.74, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g1063740.jpg'}, {'end': 1261.905, 'src': 'embed', 'start': 1233.851, 'weight': 0, 'content': [{'end': 1240.735, 'text': 'and also for categorizing the news stories to find out what kind of users would be interested in reading each category of news.', 'start': 1233.851, 'duration': 6.884}, {'end': 1246.016, 'text': 'and Spark runs over Hadoop YARN to use existing data and clusters,', 'start': 1241.534, 'duration': 4.482}, {'end': 1252.36, 'text': 'and the extensive API of Spark and its machine learning library eases the development of machine learning algorithms,', 'start': 1246.016, 'duration': 6.344}, {'end': 1256.642, 'text': 'and Spark reduces the latency of model training via in-memory RDD.', 'start': 1252.36, 'duration': 4.282}, {'end': 1261.905, 'text': 'So this is how Spark has helped Yahoo to improve the performance and achieve the targets.', 'start': 1257.062, 'duration': 4.843}], 'summary': 'Spark improves performance and achieves targets by reducing model training latency and using existing data and clusters.', 'duration': 28.054, 'max_score': 1233.851, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g1233851.jpg'}], 'start': 38.581, 'title': 'Apache spark and its ecosystem', 
'summary': 'Covers a comprehensive crash course on apache spark with 12 modules, highlights spark as 100 times faster than mapreduce, features of spark like in-memory execution and high-level apis, introduces spark ecosystem, rdds, their features, creating and using rdds, spark workloads, and a yahoo use case handling over 150 petabytes of data with significant reduction in code size and training time.', 'chapters': [{'end': 121.137, 'start': 38.581, 'title': 'Apache spark crash course', 'summary': 'Covers a comprehensive crash course on apache spark, divided into 12 modules, including an introduction to spark, its components, data frames, sql queries, streaming, machine learning algorithms, and graph processing.', 'duration': 82.556, 'highlights': ['The crash course is divided into 12 modules, covering the introduction to Spark, its components, data frames, SQL queries, streaming, machine learning algorithms, and graph processing.', 'Module seven focuses on executing different machine learning algorithms using Spark machine learning library.', 'Module six covers performing streaming on live data streams using Spark.', 'Module five discusses different ways that Spark provides to perform SQL queries for accessing and processing data.', 'Module four is all about data frames and how to perform different operations in data frames.']}, {'end': 429.872, 'start': 121.957, 'title': 'Apache spark overview', 'summary': "Covers the key differences between mapreduce and spark, integration of spark and kafka, pyspark's exposure of spark programming model to python, and frequently asked interview questions on spark. 
apache spark is highlighted as being 100 times faster than mapreduce and a go-to tool for big data solutions, with features like in-memory execution, high-level apis in multiple languages, and components such as spark sql, spark streaming, machine learning library, and graphx component.", 'duration': 307.915, 'highlights': ['Apache Spark is highlighted as being 100 times faster than MapReduce, a go-to tool for big data solutions, and having features like in-memory execution and high-level APIs in multiple languages.', "The chapter also covers the integration of Spark and Kafka, as well as PySpark's exposure of the Spark programming model to Python.", 'Lastly, the module includes frequently asked interview questions on Spark, which will help in interview preparations.', "Apache Spark's components such as Spark SQL, Spark Streaming, machine learning library, and GraphX component are explained in detail, showcasing its power in leveraging declarative queries, performing batch processing and streaming data, and enabling scalable machine learning pipelines."]}, {'end': 675.577, 'start': 430.312, 'title': 'Spark ecosystem and rdd features', 'summary': 'Introduces the various components of the spark ecosystem, emphasizing the support for different programming languages and data storage options. it also explains the concept of rdds and their features, highlighting their role as the backbone of spark and their provisions for fault tolerance, in-memory computation, lazy evaluation, immutability, partitioning, and persistence.', 'duration': 245.265, 'highlights': ['The chapter explains the support for both Python and Scala in the Spark ecosystem, with a focus on the increasing popularity of Python for data analysis and machine learning, along with the added support for Java. 
It also highlights the various data storage options such as HDFS, local file system, Amazon S3 cloud, and support for SQL and NoSQL databases.', 'It discusses the need for iterative distributed computing and the challenges faced in earlier frameworks like Hadoop, emphasizing the goal of reducing the number of IO operations through in-memory data sharing, which is 10 to 100 times faster than network and disk sharing.', 'The concept of RDDs (resilient distributed datasets) is introduced, highlighting their role as the backbone of Spark and their ability to handle both structured and unstructured data. The process of reading, transforming, and performing actions on RDDs is explained, emphasizing their immutable nature, logical partitioning, fault tolerance through data lineage tracking, and provisions for in-memory computation, lazy evaluation, and persistence.']}, {'end': 1079.85, 'start': 675.957, 'title': 'Creating and using rdds in spark', 'summary': 'Covers three ways to create rdds - from parallelized collections, existing rdds, and external data sources like hdfs, with a detailed demonstration of each method and insights into rdd operations, transformations, and actions.', 'duration': 403.893, 'highlights': ['The chapter covers three ways to create RDDs - from parallelized collections, existing RDDs, and external data sources like HDFS, with a detailed demonstration of each method and insights into RDD operations, transformations, and actions.', 'Creating an RDD from parallelized collections involves using methods like sc.parallelize to parallelize an existing collection, with the demonstration showing the creation of a parallelized collection of the numbers 1 to 100 in five partitions and displaying the stages and visualizations in the web UI of Spark, providing insights into the task division and execution metrics.', 'Creating an RDD from an existing RDD is demonstrated by creating a new RDD from the existing RDD, showcasing the process of mapping the data and multiplying it 
by 2, providing a clear demonstration of how to create a new RDD from an old one.', 'Creating an RDD from external file sources, such as HDFS, is demonstrated by using methods like sc.textFile to link the path to the HDFS file location and displaying the process of creating an RDD from an external file source, providing a practical example of utilizing HDFS as an external file source.', 'Insights into RDD operations, transformations, and actions are provided, explaining how RDDs support two main operations - transformations and actions, with a detailed explanation of how transformations work on the principle of lazy evaluations and how actions analyze the RDD and produce results.']}, {'end': 1358.216, 'start': 1080.21, 'title': 'Spark workloads and yahoo use case', 'summary': 'Explains the three workloads in apache spark: batch mode, interactive mode, and streaming mode, as well as how yahoo overcame its challenges using spark, such as updating relevance models frequently and handling over 150 petabytes of data on a 35,000 node hadoop cluster, resulting in a 15,000 lines of c++ code machine learning algorithm being reduced to 120 lines of scala code and trained on a hundred million data sets in just 30 minutes.', 'duration': 278.006, 'highlights': ['The three different workloads in Apache Spark are batch mode, interactive mode, and streaming mode, with Yahoo utilizing Spark to handle over 150 petabytes of data on a 35,000 node Hadoop cluster, resulting in a 15,000 lines of C++ code machine learning algorithm being reduced to 120 lines of Scala code and trained on a hundred million data sets in just 30 minutes.', "Yahoo's properties are highly personalized, needing frequent updates to relevance models, and the use of Spark to improve the performance of its iterative model training, reducing a 15,000 lines of C++ code machine learning algorithm to 120 lines of Scala code and trained on a hundred million data sets in just 30 minutes.", 'The architecture of Apache 
Spark is based on two main abstractions: Resilient Distributed Datasets (RDD) and a Directed Acyclic Graph (DAG), and the Spark ecosystem includes components like Spark SQL, Spark Streaming, the machine learning library, GraphX, SparkR, and the core API component, providing various functionalities such as leveraging declarative queries, performing batch processing and streaming, and easing the development and deployment of scalable machine learning pipelines.']}], 'duration': 1319.635, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g38581.jpg', 'highlights': ['Apache Spark is 100 times faster than MapReduce, with in-memory execution and high-level APIs', 'The crash course covers 12 modules, including Spark introduction, components, data frames, SQL queries, streaming, and machine learning algorithms', 'RDDs are the backbone of Spark, handling structured and unstructured data, with provisions for in-memory computation and fault tolerance', 'Yahoo utilized Spark to handle over 150 petabytes of data, reducing 15,000 lines of C++ code to 120 lines of Scala code, trained on a hundred million data sets in just 30 minutes', 'Apache Spark supports Python and Scala, with increasing popularity of Python for data analysis and machine learning, and supports various data storage options']}, {'end': 3135.759, 'segs': [{'end': 1968.803, 'src': 'heatmap', 'start': 1398.527, 'weight': 0.854, 'content': [{'end': 1407.974, 'text': 'the entire Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python,', 'start': 1398.527, 'duration': 9.447}, {'end': 1408.755, 'text': 'R, and Java.', 'start': 1407.974, 'duration': 0.781}, {'end': 1414.922, 'text': 'Now let me tell you about the programming languages. At first, Spark supports Scala.', 'start': 1409.415, 'duration': 5.507}, {'end': 1418.446, 'text': 'Scala is a functional programming language in 
which Spark is written,', 'start': 1414.922, 'duration': 3.524}, {'end': 1424.846, 'text': 'and Spark supports Scala as an interface. Then Spark also supports a Python interface.', 'start': 1419.182, 'duration': 5.664}, {'end': 1429.41, 'text': 'You can write a program in Python and execute it over Spark. Again,', 'start': 1425.247, 'duration': 4.163}, {'end': 1438.197, 'text': 'if you see the code in Scala and Python, both are very similar. Then coming to R, it is very famous for data analysis and machine learning.', 'start': 1429.69, 'duration': 8.507}, {'end': 1443.341, 'text': 'So Spark has also added support for R, and it also supports Java.', 'start': 1438.777, 'duration': 4.564}, {'end': 1447.764, 'text': 'So you can go ahead and write the Java code and execute it over Spark.', 'start': 1443.721, 'duration': 4.043}, {'end': 1455.902, 'text': 'Again, Spark also provides you interactive shells for Scala, Python, and R, where you can go ahead and execute the commands one by one.', 'start': 1448.394, 'duration': 7.508}, {'end': 1459.265, 'text': 'So this is all about the Spark ecosystem. Next,', 'start': 1456.322, 'duration': 2.943}, {'end': 1465.832, 'text': "let's discuss the fundamental data structure of Spark, that is RDD, called resilient distributed datasets.", 'start': 1459.706, 'duration': 6.126}, {'end': 1469.866, 'text': 'So in Spark, anything you do is around RDD.', 'start': 1466.884, 'duration': 2.982}, {'end': 1476.171, 'text': "You're reading the data in Spark, then it is read into an RDD. Again, when you're transforming the data,", 'start': 1470.427, 'duration': 5.744}, {'end': 1480.654, 'text': "then you're performing transformations on an old RDD and creating a new one.", 'start': 1476.171, 'duration': 4.483}, {'end': 1488.28, 'text': 'Then at last, you will perform some actions on the data and store that data set present in an RDD to a persistent storage.', 'start': 1480.654, 'duration': 7.626}, {'end': 1492.803, 'text': 'A resilient distributed 
data set is an immutable distributed collection of objects.', 'start': 1488.28, 'duration': 4.523}, {'end': 1498.522, 'text': 'Your objects can be anything like strings, lines, rows, objects, collections, etc.', 'start': 1493.419, 'duration': 5.103}, {'end': 1502.704, 'text': 'Now, talking about the distributed environment,', 'start': 1499.823, 'duration': 2.881}, {'end': 1509.688, 'text': 'each data set in RDD is divided into logical partitions which may be computed on different nodes of the cluster.', 'start': 1502.704, 'duration': 6.984}, {'end': 1517.052, 'text': "Due to this, you can perform transformations and actions on the complete data in parallel and you don't have to worry about the distribution,", 'start': 1509.688, 'duration': 7.364}, {'end': 1518.733, 'text': 'because Spark takes care of that.', 'start': 1517.052, 'duration': 1.681}, {'end': 1521.976, 'text': 'Next, as I said, RDDs are immutable.', 'start': 1519.333, 'duration': 2.643}, {'end': 1526.219, 'text': 'So once you create an RDD, you cannot change any content in the RDD.', 'start': 1522.416, 'duration': 3.803}, {'end': 1530.883, 'text': 'So you might be wondering how RDD applies those transformations, correct?', 'start': 1526.7, 'duration': 4.183}, {'end': 1537.669, 'text': 'When you run any transformations, it runs those transformations on the old RDD and creates a new RDD.', 'start': 1531.764, 'duration': 5.905}, {'end': 1541.215, 'text': 'This is basically done for optimization reasons.', 'start': 1538.473, 'duration': 2.742}, {'end': 1546.377, 'text': 'So let me tell you one thing here: RDDs can be cached and persisted.', 'start': 1541.935, 'duration': 4.442}, {'end': 1552.781, 'text': 'If you want to save an RDD for future work, you can cache it, and it will improve Spark performance.', 'start': 1546.377, 'duration': 6.404}, {'end': 1559.164, 'text': 'An RDD is a fault-tolerant collection of elements that can be operated on in parallel. If an RDD is lost,', 'start': 1552.781, 'duration': 6.383}, 
{'end': 1563.146, 'text': 'it will automatically be recomputed by using the original transformations.', 'start': 1559.344, 'duration': 3.802}, {'end': 1566.028, 'text': 'This is how Spark provides fault tolerance.', 'start': 1563.667, 'duration': 2.361}, {'end': 1573.256, 'text': 'There are two ways to create RDDs: the first is by parallelizing an existing collection in your driver program,', 'start': 1566.668, 'duration': 6.588}, {'end': 1580.765, 'text': 'and the second is by referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc.', 'start': 1573.256, 'duration': 7.509}, {'end': 1586.818, 'text': 'Now, transformations are the operations that you perform on an RDD, which will create a new RDD.', 'start': 1581.673, 'duration': 5.145}, {'end': 1591.442, 'text': 'For example, you can perform a filter on an RDD and create a new RDD.', 'start': 1587.138, 'duration': 4.304}, {'end': 1596.146, 'text': 'Then there are actions, which analyze the RDD and produce a result.', 'start': 1591.962, 'duration': 4.184}, {'end': 1601.231, 'text': 'A simple action can be count, which will count the rows in an RDD and produce a result.', 'start': 1596.146, 'duration': 5.085}, {'end': 1605.715, 'text': 'So I can say that transformations produce new RDDs and actions produce results.', 'start': 1601.231, 'duration': 4.484}, {'end': 1609.432, 'text': 'So this is all about the fundamental data structure of Spark,', 'start': 1606.411, 'duration': 3.021}, {'end': 1610.793, 'text': 'that is, the RDD.', 'start': 1609.432, 'duration': 1.361}, {'end': 1615.675, 'text': "Now let's dive into the core topic of today's discussion, that is, the Spark architecture.", 'start': 1610.793, 'duration': 4.882}, {'end': 1619.416, 'text': 'So this is the Spark architecture. In your master node,', 'start': 1616.315, 'duration': 3.101}, {'end': 1622.198, 'text': 'you have the driver program, which drives your application.', 'start': 1619.456, 'duration': 2.742}, {'end': 1628.42, 'text': "So 
the code that you're writing behaves as a driver program or, if you are using the interactive shell,", 'start': 1622.658, 'duration': 5.762}, {'end': 1632.042, 'text': 'the shell acts as a driver program. Inside the driver program,', 'start': 1628.42, 'duration': 3.622}, {'end': 1635.283, 'text': 'the first thing that you do is create a Spark context.', 'start': 1632.162, 'duration': 3.121}, {'end': 1642.266, 'text': 'Assume that the Spark context is a gateway to all Spark functionality, similar to your database connection.', 'start': 1635.923, 'duration': 6.343}, {'end': 1647.248, 'text': 'So any command you execute in your database goes through the database connection.', 'start': 1642.706, 'duration': 4.542}, {'end': 1651.991, 'text': 'Similarly, anything you do on Spark goes through the Spark context.', 'start': 1647.669, 'duration': 4.322}, {'end': 1657.834, 'text': 'Now this Spark context works with the cluster manager to manage various jobs.', 'start': 1652.731, 'duration': 5.103}, {'end': 1663.137, 'text': 'The driver program and the Spark context take care of executing the job across the cluster.', 'start': 1657.834, 'duration': 5.303}, {'end': 1668.4, 'text': 'A job is split into tasks, and then these tasks are distributed over the worker nodes.', 'start': 1663.137, 'duration': 5.263}, {'end': 1675.764, 'text': 'So anytime you create an RDD in the Spark context, that RDD can be distributed across various nodes and can be cached there.', 'start': 1668.76, 'duration': 7.004}, {'end': 1680.667, 'text': 'So the RDD is said to be partitioned and distributed across various nodes.', 'start': 1676.264, 'duration': 4.403}, {'end': 1685.869, 'text': 'Now, worker nodes are the slave nodes whose job is basically to execute the tasks.', 'start': 1681.267, 'duration': 4.602}, {'end': 1694.013, 'text': 'The tasks are then executed on the partitioned RDDs in the worker nodes, and the results are returned to the Spark context.', 'start': 1686.429, 'duration': 7.584}, 
{'end': 1703.697, 'text': 'The Spark context takes the job, breaks the job into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs,', 'start': 1694.013, 'duration': 9.684}, {'end': 1709.8, 'text': 'perform whatever operations you wanted to perform, and then collect the result and give it back to the main Spark context.', 'start': 1703.697, 'duration': 6.103}, {'end': 1717.61, 'text': 'If you increase the number of workers, then you can divide jobs into more partitions and execute them in parallel over multiple systems.', 'start': 1710.575, 'duration': 7.035}, {'end': 1720.116, 'text': 'This will actually be a lot faster.', 'start': 1718.252, 'duration': 1.864}, {'end': 1729.504, 'text': 'Also, if you increase the number of workers, it will also increase your memory, and you can cache the jobs so that they can be executed much faster.', 'start': 1720.874, 'duration': 8.63}, {'end': 1731.826, 'text': 'So this is all about the Spark architecture.', 'start': 1729.984, 'duration': 1.842}, {'end': 1736.151, 'text': 'Now, let me give you an infographic idea of the Spark architecture.', 'start': 1732.287, 'duration': 3.864}, {'end': 1739.275, 'text': 'It follows a master-slave architecture. Here,', 'start': 1736.692, 'duration': 2.583}, {'end': 1742.078, 'text': 'the client submits the Spark user application code.', 'start': 1739.455, 'duration': 2.623}, {'end': 1744.822, 'text': 'When the application code is submitted,', 'start': 1742.598, 'duration': 2.224}, {'end': 1753.414, 'text': 'the driver implicitly converts the user code that contains transformations and actions into a logical directed acyclic graph called a DAG. At this stage,', 'start': 1744.822, 'duration': 8.592}, {'end': 1757.641, 'text': 'it also performs optimizations such as pipelining transformations.', 'start': 1753.735, 'duration': 3.906}, {'end': 1766.806, 'text': 'Then it converts the logical graph, the DAG, into a physical execution plan with many stages. After converting into a physical 
execution plan,', 'start': 1758.281, 'duration': 8.525}, {'end': 1770.908, 'text': 'it creates physical execution units called tasks under each stage.', 'start': 1767.106, 'duration': 3.802}, {'end': 1775.13, 'text': 'Then these tasks are bundled and sent to the cluster.', 'start': 1771.368, 'duration': 3.762}, {'end': 1782.214, 'text': 'Now the driver talks to the cluster manager and negotiates the resources, and the cluster manager launches the needed executors.', 'start': 1775.13, 'duration': 7.084}, {'end': 1791.759, 'text': 'At this point, the driver also sends the tasks to the executors based on data placement. When executors start, they register themselves with the driver,', 'start': 1782.874, 'duration': 8.885}, {'end': 1803.226, 'text': 'so that the driver has a complete view of the executors, and the executors now start executing the tasks that are assigned by the driver program. At any point of time when the application is running,', 'start': 1791.759, 'duration': 11.467}, {'end': 1810.77, 'text': 'the driver program monitors the set of executors that run, and the driver node also schedules future tasks based on data placement.', 'start': 1803.226, 'duration': 7.544}, {'end': 1815.192, 'text': 'So this is how the internal working takes place in the Spark architecture.', 'start': 1811.25, 'duration': 3.942}, {'end': 1822.195, 'text': 'There are three different types of workloads that Spark can cater to. First, batch mode: in case of batch mode,', 'start': 1815.732, 'duration': 6.463}, {'end': 1823.956, 'text': 'we run a batch job. Here,', 'start': 1822.315, 'duration': 1.641}, {'end': 1831.719, 'text': 'you write a job and then schedule it. It works through a queue or batch of separate jobs without manual intervention. Next, interactive mode.', 'start': 1824.076, 'duration': 7.643}, {'end': 1835.881, 'text': 'This is an interactive shell where you go and execute the commands one by one.', 'start': 1831.999, 'duration': 3.882}, {'end': 1843.124, 'text': 'So you will execute one command, 
check the result and then execute the other command based on the output result and so on.', 'start': 1836.421, 'duration': 6.703}, {'end': 1844.824, 'text': 'It works similarly to a SQL shell.', 'start': 1843.124, 'duration': 1.7}, {'end': 1847.946, 'text': 'So the shell is the one which executes the driver program.', 'start': 1845.185, 'duration': 2.761}, {'end': 1853.268, 'text': 'So it is generally used for development work or it is also used for ad hoc queries.', 'start': 1848.426, 'duration': 4.842}, {'end': 1856.389, 'text': 'Then comes the streaming mode, where the program is continuously running.', 'start': 1853.268, 'duration': 3.121}, {'end': 1863.852, 'text': 'As and when the data comes, it takes the data, does some transformations and actions on the data, and then produces output results.', 'start': 1857.029, 'duration': 6.823}, {'end': 1871.836, 'text': "So these are the three different types of workloads that Spark caters to. Now, let's move ahead and see a simple demo.", 'start': 1864.432, 'duration': 7.404}, {'end': 1876.498, 'text': "Here, let's understand how to create a Spark application in the Spark shell using Scala.", 'start': 1872.076, 'duration': 4.422}, {'end': 1882.542, 'text': "So let's understand how to create a Spark application in the Spark shell using Scala.", 'start': 1877.216, 'duration': 5.326}, {'end': 1888.468, 'text': 'Assume that we have a text file in the HDFS directory and we are counting the number of words in that text file.', 'start': 1882.542, 'duration': 5.926}, {'end': 1889.97, 'text': "So let's see how to do it.", 'start': 1888.869, 'duration': 1.101}, {'end': 1894.595, 'text': 'So before I start running, let me first check whether all my daemons are running or not.', 'start': 1890.611, 'duration': 3.984}, {'end': 1896.755, 'text': "So I'll type sudo jps.", 'start': 1895.234, 'duration': 1.521}, {'end': 1905.862, 'text': 'So all my Spark daemons and Hadoop daemons are running: I have Master and Worker as Spark daemons, and NameNode, 
ResourceManager, NodeManager,', 'start': 1897.396, 'duration': 8.466}, {'end': 1906.942, 'text': 'everything as Hadoop daemons.', 'start': 1905.862, 'duration': 1.08}, {'end': 1915.548, 'text': 'So the first thing that I do here is run the Spark shell. It takes a bit of time to start. In the meanwhile, let me tell you,', 'start': 1907.403, 'duration': 8.145}, {'end': 1920.476, 'text': 'the web UI for the Spark shell is at localhost:4040.', 'start': 1915.548, 'duration': 4.928}, {'end': 1925.889, 'text': 'So this is the web UI for Spark. If you click on Jobs right now, we have not executed anything,', 'start': 1920.476, 'duration': 5.413}, {'end': 1928.475, 'text': 'so there are no details over here.', 'start': 1926.35, 'duration': 2.125}, {'end': 1931.459, 'text': 'So you have Jobs and Stages.', 'start': 1929.438, 'duration': 2.021}, {'end': 1937.142, 'text': "So once you execute the jobs, you'll have the records of the tasks that you have executed here.", 'start': 1932.259, 'duration': 4.883}, {'end': 1941.364, 'text': 'So here you can see the status of various jobs and tasks executed.', 'start': 1937.682, 'duration': 3.682}, {'end': 1945.406, 'text': "So now let's check whether our Spark shell has started or not.", 'start': 1941.384, 'duration': 4.022}, {'end': 1949.297, 'text': 'Yes. So you have your Spark version as 2.1.', 'start': 1945.946, 'duration': 3.351}, {'end': 1952.029, 'text': '1, and you have a Scala shell over here.', 'start': 1949.297, 'duration': 2.732}, {'end': 1958.412, 'text': "So before I start the code, let's check the content that is present in the input text file by running this command.", 'start': 1952.829, 'duration': 5.583}, {'end': 1968.803, 'text': "So I'll write var test = sc.textFile, because I have saved a text file over there, and I'll give the HDFS path location.", 'start': 1959.036, 'duration': 9.767}], 'summary': 'Spark ecosystem supports scala, python, and java. rdd is a fundamental data structure. 
spark architecture follows master-slave architecture. spark caters to batch, interactive, and streaming workloads.', 'duration': 570.276, 'max_score': 1398.527, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g1398527.jpg'}, {'end': 1782.214, 'src': 'embed', 'start': 1753.735, 'weight': 3, 'content': [{'end': 1757.641, 'text': 'It also performs optimizations such as pipelining transformations.', 'start': 1753.735, 'duration': 3.906}, {'end': 1766.806, 'text': 'Then it converts the logical graph, the DAG, into a physical execution plan with many stages. After converting into a physical execution plan,', 'start': 1758.281, 'duration': 8.525}, {'end': 1770.908, 'text': 'it creates physical execution units called tasks under each stage.', 'start': 1767.106, 'duration': 3.802}, {'end': 1775.13, 'text': 'Then these tasks are bundled and sent to the cluster.', 'start': 1771.368, 'duration': 3.762}, {'end': 1782.214, 'text': 'Now the driver talks to the cluster manager and negotiates the resources, and the cluster manager launches the needed executors.', 'start': 1775.13, 'duration': 7.084}], 'summary': 'Spark optimizes and executes the dag with multiple stages, creating tasks and negotiating resources with the cluster manager.', 'duration': 28.479, 'max_score': 1753.735, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g1753735.jpg'}, {'end': 2612.475, 'src': 'embed', 'start': 2584.721, 'weight': 0, 'content': [{'end': 2591.985, 'text': 'RDDs are effortless to create and the mind-blowing property which solved the problem was its in-memory data processing capability.', 'start': 2584.721, 'duration': 7.264}, {'end': 2595.467, 'text': 'An RDD is not a distributed file system. Instead,', 'start': 2592.566, 'duration': 2.901}, {'end': 2603.251, 'text': 'it is a distributed collection of memory where the data needed is always stored and kept available in RAM, and because of this 
property,', 'start': 2595.788, 'duration': 7.463}, {'end': 2606.813, 'text': 'the elevation it gave to the memory accessing speed was unbelievable.', 'start': 2603.251, 'duration': 3.562}, {'end': 2612.475, 'text': 'The RDDs are fault-tolerant and this property brought it a dignity of a whole new level.', 'start': 2607.313, 'duration': 5.162}], 'summary': 'Rdds offer in-memory data processing, enhancing speed and fault-tolerance.', 'duration': 27.754, 'max_score': 2584.721, 'thumbnail': ''}, {'end': 2696.634, 'src': 'embed', 'start': 2663.499, 'weight': 4, 'content': [{'end': 2667.802, 'text': 'So let us see the different sources from which the data can be ingested into an RDD.', 'start': 2663.499, 'duration': 4.303}, {'end': 2675.227, 'text': 'The data can be loaded from any source like HDFS, HBase, Hive, SQL; you name it, they have got it.', 'start': 2668.222, 'duration': 7.005}, {'end': 2681.624, 'text': 'Hence the collected data is dropped into an RDD and, guess what, the RDDs are free-spirited.', 'start': 2675.74, 'duration': 5.884}, {'end': 2683.825, 'text': 'They can process any type of data.', 'start': 2682.044, 'duration': 1.781}, {'end': 2689.029, 'text': "They won't care if the data is structured, unstructured or semi-structured.", 'start': 2684.226, 'duration': 4.803}, {'end': 2696.634, 'text': 'Now let me walk you through the features of RDDs, which give it an edge over the other alternatives in memory computation.', 'start': 2689.029, 'duration': 7.605}], 'summary': 'Rdds can ingest data from various sources like hdfs, hbase, hive, sql, and process any type of structured, unstructured, or semi-structured data, making them versatile for in-memory computation.', 'duration': 33.135, 'max_score': 2663.499, 'thumbnail': ''}, {'end': 2945.562, 'src': 'embed', 'start': 2917.346, 'weight': 2, 'content': [{'end': 2926.111, 'text': "I'm creating a new RDD by the name Spark file, where I'll be loading a text document into the RDD from an external storage which is 
HDFS,", 'start': 2917.346, 'duration': 8.765}, {'end': 2929.052, 'text': 'and this is the location where my text file is located.', 'start': 2926.111, 'duration': 2.941}, {'end': 2932.914, 'text': 'So the new RDD Spark file is successfully created.', 'start': 2930.013, 'duration': 2.901}, {'end': 2936.816, 'text': "Now, let's display the data which is present in a Spark file RDD.", 'start': 2933.334, 'duration': 3.482}, {'end': 2945.562, 'text': 'So the data which was present in a Spark file RDD is a collection of alphabets starting from A to Z.', 'start': 2938.88, 'duration': 6.682}], 'summary': "Created rdd 'spark file' from text document in hdfs, containing alphabet collection from a to z.", 'duration': 28.216, 'max_score': 2917.346, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g2917346.jpg'}, {'end': 3007.943, 'src': 'embed', 'start': 2984.051, 'weight': 5, 'content': [{'end': 2991.197, 'text': 'So here we are applying my transformation in order to display the first letter of each and every word which is stored in the RDD words.', 'start': 2984.051, 'duration': 7.146}, {'end': 2992.698, 'text': "Now, let's continue.", 'start': 2991.657, 'duration': 1.041}, {'end': 2995.533, 'text': 'The transformation has been applied successfully.', 'start': 2993.371, 'duration': 2.162}, {'end': 2999.977, 'text': "Now, let's display the contents which are present in our new RDD, which is word pair.", 'start': 2995.913, 'duration': 4.064}, {'end': 3007.943, 'text': 'So as explained we have displayed the starting letter of each and every word as S is the starting letter of spark.', 'start': 3001.778, 'duration': 6.165}], 'summary': 'Applied transformation to display first letter of each word in rdd', 'duration': 23.892, 'max_score': 2984.051, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g2984051.jpg'}], 'start': 1358.96, 'title': 'Apache spark and its 
components', 'summary': 'Covers apache spark, emphasizing sparkr, spark core, rdds, spark ecosystem, and architecture. it discusses features, data structures, machine learning, and distributed data frame implementations.', 'chapters': [{'end': 1398.527, 'start': 1358.96, 'title': 'SparkR and spark core', 'summary': 'Discusses the sparkr r package, which provides a lightweight front-end to use apache spark for working with graph and non-graph sources, supporting distributed data frame implementation and machine learning. it also highlights the importance of the spark core component in the spark ecosystem, responsible for basic io functions, scheduling, and monitoring.', 'duration': 39.567, 'highlights': ['The Spark Core component is the most essential in the Spark ecosystem, responsible for basic IO functions, scheduling, and monitoring.', 'SparkR is an R package that provides a lightweight front-end to use Apache Spark, enabling data scientists to work with graph and non-graph sources, supporting distributed data frame implementation and machine learning.']}, {'end': 1663.137, 'start': 1398.527, 'title': 'Spark ecosystem and rdds', 'summary': 'Introduces the spark ecosystem, highlighting its support for scala, python, r, and java, and emphasizes the significance of resilient distributed datasets (rdds) in spark, discussing their immutability, fault-tolerance, creation methods, and operations.', 'duration': 264.61, 'highlights': ['Spark supports Scala, Python, R, and Java as programming languages, facilitating data analysis and machine learning.', 'Resilient Distributed Datasets (RDDs) are the fundamental data structure in Spark, representing an immutable distributed collection of objects, enabling parallel transformations and actions on the data.', 'RDDs provide fault tolerance by automatically recomputing if lost, and can be created either by parallelizing an existing collection or referencing external storage systems like HDFS.', 'Transformations and actions 
are key operations on RDDs, with transformations creating new RDDs and actions producing results, showcasing the core functionality of Spark.', 'The Spark architecture involves a driver program or interactive shell interacting with the Spark context, which serves as a gateway to all Spark functionality and works with the cluster manager to execute jobs across the cluster.']}, {'end': 2566.091, 'start': 1663.137, 'title': 'Understanding spark architecture', 'summary': 'Discusses the architecture of apache spark, which includes the distribution of tasks across worker nodes, the impact of increasing the number of workers, the different types of workloads spark caters to, and a demonstration of creating a spark application and spark shell using scala.', 'duration': 902.954, 'highlights': ['The chapter explains the distribution of tasks across worker nodes in Spark architecture, highlighting that increasing the number of workers can divide jobs into more partitions and execute them faster, and also increase memory for faster execution.', 'The different types of workloads that Spark caters to are discussed, including batch mode for running batch jobs, interactive mode for executing commands one by one, and streaming mode for continuously running programs.', 'A demonstration of creating a Spark application and Spark shell using Scala is provided, including steps to check the status of Spark demons, running the Spark shell, reading and processing a text file to perform word count, and parallelizing numbers and dividing tasks into partitions to demonstrate task parallelization.']}, {'end': 3135.759, 'start': 2566.491, 'title': 'Spark: rdds and their features', 'summary': 'Introduces spark as an advanced big data processing framework, highlighting its resilient distributed data set (rdd) as a fundamental and crucial data structure, with features like fault tolerance, in-memory data processing, and various methods for rdd creation.', 'duration': 569.268, 'highlights': ['Spark 
is introduced as an advanced big data processing framework, with Resilient Distributed Data Set (RDD) being a fundamental and crucial data structure, featuring fault tolerance, in-memory data processing capability, and effortless creation. (relevance: 5)', 'The RDDs store data amongst multiple computers, enabling fault tolerance and roll back of lost partitions by applying simple transformations, without the need for hard disk or secondary storage. (relevance: 4)', 'The data can be ingested into an RDD from sources like HDFS, HBase, Hive, SQL, and RDDs are capable of processing any type of data, be it structured, unstructured, or semi-structured. (relevance: 3)', 'RDDs possess features like in-memory computation, lazy evaluations, fault tolerance, immutability, partitioning, persistence, and coarse-grained operations, providing an edge over other alternatives in memory computation. (relevance: 2)', 'RDDs can be created using three methods: parallelized collections, external storage like HDFS, HBase, Hive, and using an existing RDD, with detailed practical examples provided for each method. 
(relevance: 1)']}], 'duration': 1776.799, 'thumbnail': '', 'highlights': ['The Spark Core component is essential in the Spark ecosystem, responsible for basic IO functions, scheduling, and monitoring.', 'SparkR is an R package providing a lightweight front-end to use Apache Spark, enabling data scientists to work with graph and non-graph sources, supporting distributed data frame implementation and machine learning.', 'Resilient Distributed Datasets (RDDs) are the fundamental data structure in Spark, representing an immutable distributed collection of objects, enabling parallel transformations and actions on the data.', 'Spark supports Scala, Python, R, and Java as programming languages, facilitating data analysis and machine learning.', 'The Spark architecture involves a driver program or interactive shell interacting with the Spark context, which serves as a gateway to all Spark functionality and works with the cluster manager to execute jobs across the cluster.', 'The chapter explains the distribution of tasks across worker nodes in Spark architecture, highlighting that increasing the number of workers can divide jobs into more partitions and execute them faster, and also increase memory for faster execution.', 'The different types of workloads that Spark caters to are discussed, including batch mode for running batch jobs, interactive mode for executing commands one by one, and streaming mode for continuously running programs.', 'Spark is introduced as an advanced big data processing framework, with Resilient Distributed Data Set (RDD) being a fundamental and crucial data structure, featuring fault tolerance, in-memory data processing capability, and effortless creation.']}, {'end': 3862.315, 'segs': [{'end': 3223.391, 'src': 'embed', 'start': 3184.137, 'weight': 2, 'content': [{'end': 3192.282, 'text': "I'm about to split the second column of my CSV file, which contains the information regarding the states which conducted the IPL matches.", 'start': 3184.137, 
'duration': 8.145}, {'end': 3197.405, 'text': "So I'm using this operation in order to display the states where the matches were conducted.", 'start': 3192.903, 'duration': 4.502}, {'end': 3204.33, 'text': 'So the transformation has been successfully applied and the data has been stored into the new RDD which is States.', 'start': 3198.886, 'duration': 5.444}, {'end': 3209.573, 'text': "Now, let's display the data which is stored in our States RDD using the collection action command.", 'start': 3204.87, 'duration': 4.703}, {'end': 3213.067, 'text': 'So these were the states where the matches were being conducted.', 'start': 3210.586, 'duration': 2.481}, {'end': 3218.229, 'text': "Now, let's find out the city which conducted the maximum number of IPL matches.", 'start': 3214.288, 'duration': 3.941}, {'end': 3223.391, 'text': "Here I'm creating a new RDD again, which is states count,", 'start': 3219.65, 'duration': 3.741}], 'summary': 'Split second column to display ipl match states, find city with max matches', 'duration': 39.254, 'max_score': 3184.137, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g3184137.jpg'}, {'end': 3355.25, 'src': 'embed', 'start': 3324.217, 'weight': 1, 'content': [{'end': 3328.078, 'text': 'We shall use filter transformation for this operation.', 'start': 3324.217, 'duration': 3.861}, {'end': 3333.481, 'text': 'The transformation has been applied successfully and the data has been stored into the fil RDD.', 'start': 3328.879, 'duration': 4.602}, {'end': 3336.182, 'text': 'Now, let us display the data which is present there.', 'start': 3333.921, 'duration': 2.261}, {'end': 3344.917, 'text': 'We shall use collect action command and now we have the data of all the matches which were played especially in the year 2017.', 'start': 3337.442, 'duration': 7.475}, {'end': 3355.25, 'text': 'Similarly, we can find out the matches which were played in the year 2016 and we can save the same data 
into the new RDD, which is fil2.', 'start': 3344.917, 'duration': 10.333}], 'summary': 'Successfully applied filter transformation to store and display data of matches played in 2017 and 2016.', 'duration': 31.033, 'max_score': 3324.217, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g3324217.jpg'}, {'end': 3545.428, 'src': 'embed', 'start': 3515.931, 'weight': 3, 'content': [{'end': 3523.535, 'text': 'So the steps to be performed in Pokemon use case are loading the Pokemon data.csv file from an external storage into an RDD,', 'start': 3515.931, 'duration': 7.604}, {'end': 3529.838, 'text': 'removing the schema from the Pokemon data.csv file and finding out the total number of water type Pokemons.', 'start': 3523.535, 'duration': 6.303}, {'end': 3531.879, 'text': 'finding the total number of fire type Pokemons.', 'start': 3529.838, 'duration': 2.041}, {'end': 3533.62, 'text': "I know it's getting interesting.", 'start': 3532.319, 'duration': 1.301}, {'end': 3536.501, 'text': 'So let me explain you each and every step practically.', 'start': 3534.04, 'duration': 2.461}, {'end': 3545.428, 'text': "So here I'm creating a new RDD by name Pokemon data RDD 1 and I'm loading my CSV file from an external storage.", 'start': 3537.862, 'duration': 7.566}], 'summary': 'Loading pokemon data.csv, removing schema, finding water and fire type pokemons', 'duration': 29.497, 'max_score': 3515.931, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g3515931.jpg'}, {'end': 3605.119, 'src': 'embed', 'start': 3578.515, 'weight': 6, 'content': [{'end': 3589.161, 'text': 'So we have index of the Pokemon, name of the Pokemon, its type, total points, HP attack points, defense points, special attack, special defense,', 'start': 3578.515, 'duration': 10.646}, {'end': 3594.044, 'text': 'speed generation, and we can also find if a particular Pokemon is legendary or not.', 'start': 
3589.161, 'duration': 4.883}, {'end': 3605.119, 'text': "Here I'm creating a new RDD which is no header, and I'm using filter operation in order to remove the schema of our Pokemon data dot CSV file.", 'start': 3595.814, 'duration': 9.305}], 'summary': 'Creating a new rdd without headers and filtering the pokemon data csv file.', 'duration': 26.604, 'max_score': 3578.515, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g3578515.jpg'}, {'end': 3850.228, 'src': 'embed', 'start': 3823.178, 'weight': 0, 'content': [{'end': 3830.36, 'text': 'So, in order to find out the Pokemon with least defense strength, I have created a new RDD by name minimum defense Pokemon,', 'start': 3823.178, 'duration': 7.182}, {'end': 3844.825, 'text': 'and I have applied distinct and sort by transformations onto the defense list RDD in order to extract the least defense points present in the defense list and I have used take action command in order to display the data which is present in minimum defense Pokemon RDD.', 'start': 3830.36, 'duration': 14.465}, {'end': 3850.228, 'text': 'So according to the results we have five points as the least different strength of a particular Pokemon.', 'start': 3845.565, 'duration': 4.663}], 'summary': 'Identified pokemon with least defense strength as 5 points.', 'duration': 27.05, 'max_score': 3823.178, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g3823178.jpg'}], 'start': 3136.653, 'title': 'Ipl and pokemon data analysis', 'summary': 'Covers analyzing 636 ipl match records from 2008 to 2017, identifying cities with the most matches, and analyzing pokemon data, including filtering, counting, and extracting maximum and minimum defense points (230 and 5 points).', 'chapters': [{'end': 3559.199, 'start': 3136.653, 'title': 'Analyzing ipl match records', 'summary': 'Covers the analysis of ipl match records, including 636 total rows of data from 
2008 to 2017, operations to display schema, states, and cities, as well as identifying cities with the maximum number of matches and players with the most man-of-the-match awards.', 'duration': 422.546, 'highlights': ['The transformation successfully applied to display the states where the matches were conducted, resulting in 636 total rows of data from 2008 to 2017.', 'The operation to find the city which conducted the maximum number of IPL matches, revealing Mumbai as the city with 85 matches from 2008 to 2017.', 'The application of filter transformation to exclude matches conducted in the city Hyderabad from the data, resulting in a new RDD named filrdd.', 'The successful application of union transformation to combine match data from 2016 and 2017 into a new RDD named Union RDD.', 'The identification of the player BWS with the maximum number of man-of-the-match awards, which is 15.', 'The loading of the Pokemon data.csv file into a new RDD named Pokemon data RDD 1, and using collect action command to display the data.']}, {'end': 3862.315, 'start': 3559.999, 'title': 'Pokemon data analysis', 'summary': 'Introduces the process of analyzing pokemon data using apache spark, including displaying data schema, filtering, counting, and extracting maximum and minimum defense points, with the maximum defense strength being 230 points and the minimum defense strength being 5 points.', 'duration': 302.316, 'highlights': ['The maximum defense strength among all the pokemons is 230 points, belonging to Steelix, Steelix Mega, Shuckle, Aggron, and Aggron Mega.', 'There are 112 water type Pokemons and 52 fire type Pokemons present in the Pokemon data.csv file.', 'The least defense strength of a particular Pokemon is 5 points.']}], 'duration': 725.662, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g3136653.jpg', 'highlights': ['Mumbai conducted the maximum number of IPL matches, 85 from 2008 to 2017.', 'The 
transformation successfully applied to display the states where the matches were conducted, resulting in 636 total rows of data from 2008 to 2017.', 'The maximum defense strength among all the Pokemons is 230 points, belonging to Steelix, Steelix Mega, Shuckle, Aggron, and Aggron Mega.', 'The identification of the player BWS with the maximum number of man-of-the-match awards, which is 15.', 'There are 112 water type Pokemons and 52 fire type Pokemons present in the Pokemon data.csv file.', 'The application of filter transformation to exclude matches conducted in the city Hyderabad from the data, resulting in a new RDD named filrdd.', 'The successful application of union transformation to combine match data from 2016 and 2017 into a new RDD named Union RDD.', 'The least defense strength of a particular Pokemon is 5 points.', 'The loading of the Pokemon data.csv file into a new RDD named Pokemon data RDD 1, and using collect action command to display the data.']}, {'end': 5574.972, 'segs': [{'end': 3931.62, 'src': 'embed', 'start': 3907.277, 'weight': 2, 'content': [{'end': 3919.064, 'text': "We have two Pokemons which come under the category of having five points as their defense strength: Chansey and Happiny are the two Pokemons which are having the least defense strength.", 'start': 3907.277, 'duration': 11.787}, {'end': 3931.62, 'text': 'The world of information technology and big data processing started to see multiple potentialities with Spark coming into action.', 'start': 3924.269, 'duration': 7.351}], 'summary': 'The two Pokemons with 5 defense points are Chansey and Happiny, while Spark is making an impact in big data processing.', 'duration': 24.343, 'max_score': 3907.277, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g3907277.jpg'}, {'end': 4130.997, 'src': 'embed', 'start': 4104.895, 'weight': 1, 'content': [{'end': 4112.477, 'text': 'the IT Industries required a 
powerful and an integrated data structure which could support multiple programming languages and at the same time,', 'start': 4104.895, 'duration': 7.582}, {'end': 4115.078, 'text': 'without the requirement of additional API.', 'start': 4112.477, 'duration': 2.601}, {'end': 4120.599, 'text': 'data frame was the one-stop solution, which supported multiple languages along with a single API.', 'start': 4115.078, 'duration': 5.521}, {'end': 4127.241, 'text': 'the most popular languages that a data frame could support are our Python Scala, Java and many more.', 'start': 4120.599, 'duration': 6.642}, {'end': 4130.997, 'text': 'The next requirement was to support the multiple data sources.', 'start': 4127.975, 'duration': 3.022}], 'summary': 'It industry needed a versatile and integrated data structure, data frame, supporting multiple languages and data sources.', 'duration': 26.102, 'max_score': 4104.895, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g4104895.jpg'}, {'end': 4502.801, 'src': 'heatmap', 'start': 4215.311, 'weight': 0.705, 'content': [{'end': 4219.992, 'text': 'the term immutable depicts that the data, once stored into a data frame, will not be altered.', 'start': 4215.311, 'duration': 4.681}, {'end': 4226.114, 'text': 'The only way to alter the data present in a data frame would be by applying simple transformation operations onto them.', 'start': 4220.492, 'duration': 5.622}, {'end': 4229.906, 'text': 'So the next feature is lazy evaluation.', 'start': 4226.784, 'duration': 3.122}, {'end': 4234.408, 'text': 'lazy evaluation is the key to the remarkable performance offered by spark.', 'start': 4229.906, 'duration': 4.502}, {'end': 4241.412, 'text': 'similar to the RDDs, data frames in spark will not throw any output onto the screen until and unless an action command is encountered.', 'start': 4234.408, 'duration': 7.004}, {'end': 4243.913, 'text': 'The next feature is fault tolerance.', 'start': 
4242.092, 'duration': 1.821}, {'end': 4247.655, 'text': 'There is no way that the sparks data frames can lose their data.', 'start': 4244.494, 'duration': 3.161}, {'end': 4253.478, 'text': 'They follow the principle of being fault tolerant to the unexpected calamities which tend to destroy the available data.', 'start': 4248.056, 'duration': 5.422}, {'end': 4256.1, 'text': 'The next feature is distributed storage.', 'start': 4254.079, 'duration': 2.021}, {'end': 4262.2, 'text': 'Sparks data frame distribute the data amongst multiple locations so that, in case of a node failure,', 'start': 4256.738, 'duration': 5.462}, {'end': 4266.082, 'text': 'the next available node can take its place to continue the data processing.', 'start': 4262.2, 'duration': 3.882}, {'end': 4270.204, 'text': 'the next stage will be about the multiple data source that the spark data frame can support.', 'start': 4266.082, 'duration': 4.122}, {'end': 4280.614, 'text': 'The spark API can integrate itself with multiple programming languages, such as Scala, Java, Python are my sequel and many more,', 'start': 4271.306, 'duration': 9.308}, {'end': 4291.483, 'text': 'making itself capable to handle a variety of data sources such as Hadoop, hive, edge space, Cassandra, Jason files, CSV files my sequel and many more.', 'start': 4280.614, 'duration': 10.869}, {'end': 4298.95, 'text': 'So this was the theory part and now let us move into the practical part where the creation of a data frame happens to be a first step.', 'start': 4292.444, 'duration': 6.506}, {'end': 4307.036, 'text': 'So before we begin the practical part, let us load the libraries which were required in order to process the data in data frames.', 'start': 4300.353, 'duration': 6.683}, {'end': 4313.078, 'text': 'So these were the few libraries which were required before we process the data using our data frames.', 'start': 4308.356, 'duration': 4.722}, {'end': 4318.9, 'text': 'Now that we have loaded all the libraries which were 
required to process the data using the data frames.', 'start': 4314.419, 'duration': 4.481}, {'end': 4321.601, 'text': 'Let us begin with the creation of our data frame.', 'start': 4319.421, 'duration': 2.18}, {'end': 4327.764, 'text': 'So we shall create a new data frame with the name employee and load the data of the employees present in an organization.', 'start': 4322.082, 'duration': 5.682}, {'end': 4334.528, 'text': 'The details of the employees will consist the first name the last name and their mail ID along with their salary.', 'start': 4328.524, 'duration': 6.004}, {'end': 4337.47, 'text': 'So the first data frame has been successfully created.', 'start': 4334.988, 'duration': 2.482}, {'end': 4340.512, 'text': 'Now, let us design the schema for this data frame.', 'start': 4338.271, 'duration': 2.241}, {'end': 4345.215, 'text': 'So the schema for this data frame has been described as shown.', 'start': 4341.653, 'duration': 3.562}, {'end': 4351.119, 'text': 'the first name is of string data type and similarly the last name is a string data type, along with the mail address.', 'start': 4345.215, 'duration': 5.904}, {'end': 4356.583, 'text': 'And finally the salary is integer data type or you can give float data type also.', 'start': 4351.58, 'duration': 5.003}, {'end': 4359.048, 'text': 'So the schema has been successfully delivered.', 'start': 4357.188, 'duration': 1.86}, {'end': 4363.749, 'text': 'Now, let us create the data frame using create data frame function here.', 'start': 4359.568, 'duration': 4.181}, {'end': 4372.371, 'text': "I'm creating a new data frame by starting a spark context and using the create data frame method and loading the data from employee and employee schema.", 'start': 4363.829, 'duration': 8.542}, {'end': 4374.611, 'text': 'The data frame is successfully created.', 'start': 4372.971, 'duration': 1.64}, {'end': 4379.072, 'text': "Now, let's print the data which is existing in the data frame EMPDF.", 'start': 4374.671, 
'duration': 4.401}, {'end': 4382.112, 'text': "I'm using show method here.", 'start': 4380.452, 'duration': 1.66}, {'end': 4388.989, 'text': 'So the data which is present in EMPDF has been successfully printed Now, let us move on to the next step.', 'start': 4382.132, 'duration': 6.857}, {'end': 4394.913, 'text': "So the next step for our today's discussion is working with an example related to the FIFA data set.", 'start': 4390.01, 'duration': 4.903}, {'end': 4400.657, 'text': 'So the first step in our FIFA example would be loading the schema for the CSV file.', 'start': 4396.254, 'duration': 4.403}, {'end': 4403.939, 'text': 'We are working with so the schema has been successfully loaded now.', 'start': 4400.677, 'duration': 3.262}, {'end': 4411.444, 'text': 'Now, let us load the CSV file from our external storage, which is HDFS into our data frame, which is FIFA DF.', 'start': 4404.639, 'duration': 6.805}, {'end': 4416.579, 'text': 'The CSV file has been successfully loaded into our new data frame, which is FIFA DF.', 'start': 4412.276, 'duration': 4.303}, {'end': 4420.401, 'text': 'Now, let us print the schema of our data frame using the print schema command.', 'start': 4416.959, 'duration': 3.442}, {'end': 4428.566, 'text': 'So the schema has been successfully displayed here and we have the following credentials of each and every player in our CSV file.', 'start': 4422.102, 'duration': 6.464}, {'end': 4431.748, 'text': "Now, let's move on to our further operations on our data frame.", 'start': 4429.106, 'duration': 2.642}, {'end': 4438.572, 'text': 'We shall count the total number of records of the players we have in our CSV file using count command.', 'start': 4433.349, 'duration': 5.223}, {'end': 4443.914, 'text': 'So we have a total of 18, 000 207 players in our CSV files.', 'start': 4439.449, 'duration': 4.465}, {'end': 4447.998, 'text': 'Now, let us find out the details of the columns on which we are working with.', 'start': 4444.535, 'duration': 3.463}, 
{'end': 4455.907, 'text': 'So these were the columns which we are working with which consists the idea of the player name age nationality potential and many more.', 'start': 4448.719, 'duration': 7.188}, {'end': 4463.35, 'text': 'Now, let us use the column value, which has the value of each and every player for a particular team,', 'start': 4457.325, 'duration': 6.025}, {'end': 4469.454, 'text': 'and let us use describe command in order to see the highest value and the least value provided to a player.', 'start': 4463.35, 'duration': 6.104}, {'end': 4480.363, 'text': 'So we have a count of a total number of 18, 000 207 players and the minimum worth given to a player is 0 and the maximum is given as 9 million pounds.', 'start': 4470.035, 'duration': 10.328}, {'end': 4486.597, 'text': 'Now let us use the select command in order to extract the column name and the nationality,', 'start': 4481.275, 'duration': 5.322}, {'end': 4490.378, 'text': 'to find out the name of each and every player along with his nationality.', 'start': 4486.597, 'duration': 3.781}, {'end': 4498.46, 'text': 'So here we have we can display the top 20 rows of each and every player which we have in our CSV file along with this nationality.', 'start': 4491.198, 'duration': 7.262}, {'end': 4502.801, 'text': 'Similarly, let us find out the players playing for a particular club.', 'start': 4499.22, 'duration': 3.581}], 'summary': 'Spark data frames offer features like lazy evaluation, fault tolerance, distributed storage, and support for multiple data sources, enabling efficient data processing and manipulation.', 'duration': 287.49, 'max_score': 4215.311, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g4215311.jpg'}, {'end': 4291.483, 'src': 'embed', 'start': 4254.079, 'weight': 9, 'content': [{'end': 4256.1, 'text': 'The next feature is distributed storage.', 'start': 4254.079, 'duration': 2.021}, {'end': 4262.2, 'text': 'Sparks data frame 
distribute the data amongst multiple locations so that, in case of a node failure,', 'start': 4256.738, 'duration': 5.462}, {'end': 4266.082, 'text': 'the next available node can take its place to continue the data processing.', 'start': 4262.2, 'duration': 3.882}, {'end': 4270.204, 'text': 'the next stage will be about the multiple data source that the spark data frame can support.', 'start': 4266.082, 'duration': 4.122}, {'end': 4280.614, 'text': 'The spark API can integrate itself with multiple programming languages, such as Scala, Java, Python are my sequel and many more,', 'start': 4271.306, 'duration': 9.308}, {'end': 4291.483, 'text': 'making itself capable to handle a variety of data sources such as Hadoop, hive, edge space, Cassandra, Jason files, CSV files my sequel and many more.', 'start': 4280.614, 'duration': 10.869}], 'summary': "Spark's distributed storage ensures fault tolerance and supports multiple data sources and programming languages.", 'duration': 37.404, 'max_score': 4254.079, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g4254079.jpg'}, {'end': 5172.806, 'src': 'embed', 'start': 5150.314, 'weight': 0, 'content': [{'end': 5158.018, 'text': 'and since MapReduce is going to be slower in nature, then definitely your overall Hive query is going to be slower in nature.', 'start': 5150.314, 'duration': 7.704}, {'end': 5159.318, 'text': 'So that was one challenge.', 'start': 5158.138, 'duration': 1.18}, {'end': 5167.583, 'text': "So if you have, let's say, less than 200 GB of data, or if you have a smaller set of data this was actually a big challenge that in Hive,", 'start': 5159.358, 'duration': 8.225}, {'end': 5170.184, 'text': 'your performance was not that great.', 'start': 5167.583, 'duration': 2.601}, {'end': 5172.806, 'text': 'It also do not have any resuming capability.', 'start': 5170.624, 'duration': 2.182}], 'summary': 'Mapreduce slowdown impacts hive query speed, especially with 
<200 GB data, lacking resuming capability.', 'duration': 22.492, 'max_score': 5150.314, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g5150314.jpg'}, {'end': 5319.856, 'src': 'embed', 'start': 5290.866, 'weight': 8, 'content': [{'end': 5298.69, 'text': "But don't you think that if these companies have been working in Hive for, let's say, the past 10 years, they must have already written a lot of code in Hive?", 'start': 5290.866, 'duration': 7.824}, {'end': 5307.794, 'text': 'Now if you ask them to migrate to Spark SQL, will it be an easy task? No, right? Definitely it is not going to be an easy task.', 'start': 5299.35, 'duration': 8.444}, {'end': 5319.856, 'text': 'Why? Because Hive syntax and Spark SQL syntax, though they both tackle the SQL way of writing the things, but at the same time it is always a very.', 'start': 5308.194, 'duration': 11.662}], 'summary': 'Migrating from Hive to Spark SQL after 10 years will not be an easy task due to syntax differences.', 'duration': 28.99, 'max_score': 5290.866, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g5290866.jpg'}, {'end': 5501.01, 'src': 'embed', 'start': 5471.755, 'weight': 7, 'content': [{'end': 5474.836, 'text': "Now you can ask me then how it is faster if they're using the same metastore.", 'start': 5471.755, 'duration': 3.081}, {'end': 5476.916, 'text': 'Remember the processing part.', 'start': 5475.216, 'duration': 1.7}, {'end': 5480.137, 'text': 'Why was Hive slower? 
Because of its processing way.', 'start': 5477.276, 'duration': 2.861}, {'end': 5486.299, 'text': 'Because it is converting everything to the MapReduce and thus it was making the processing very, very slow.', 'start': 5480.477, 'duration': 5.822}, {'end': 5494.285, 'text': 'Here in this case, since the processing is going to be in-memory computation, so in Spark SQL case, it is always going to be the faster.', 'start': 5487.379, 'duration': 6.906}, {'end': 5495.646, 'text': 'Now, definitely it.', 'start': 5494.465, 'duration': 1.181}, {'end': 5501.01, 'text': 'just because of the metastore side, we are only able to fetch the data and all but at the same time.', 'start': 5495.646, 'duration': 5.364}], 'summary': "Hive was slower due to mapreduce processing, while spark sql's in-memory computation makes it faster.", 'duration': 29.255, 'max_score': 5471.755, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g5471755.jpg'}, {'end': 5585.599, 'src': 'embed', 'start': 5558.679, 'weight': 4, 'content': [{'end': 5564.824, 'text': "So, in this session, as you are noticing what we are doing, we just want to kind of show that, once you're streaming the data in the real time,", 'start': 5558.679, 'duration': 6.145}, {'end': 5567.026, 'text': 'you can also do a processing using Spark SQL.', 'start': 5564.824, 'duration': 2.202}, {'end': 5570.348, 'text': 'Thus, you are doing all the processing at the real time.', 'start': 5567.526, 'duration': 2.822}, {'end': 5574.972, 'text': 'Similarly, in the stock market analysis, you can use Spark SQL, a lot of queries you can adopt there.', 'start': 5570.508, 'duration': 4.464}, {'end': 5578.114, 'text': 'In the banking, fraud case transactions and all, you can use that.', 'start': 5575.152, 'duration': 2.962}, {'end': 5585.599, 'text': "So let's say, your credit card currently is getting swiped in India and in next 10 minutes, if your credit card is getting swiped in, let's say in 
US.", 'start': 5578.414, 'duration': 7.185}], 'summary': 'Real-time data processing with Spark SQL for stock market and banking fraud analysis.', 'duration': 26.92, 'max_score': 5558.679, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g5558679.jpg'}], 'start': 3862.315, 'title': 'Data frames, operations, Spark SQL, and advantages over Hive', 'summary': 'Explores the creation and functionalities of data frames in Spark, covering data frame operations, examples, Spark SQL for Game of Thrones analysis, and advantages of Spark SQL over Hive, with insights from practical examples, including a FIFA dataset with 18,207 players and Game of Thrones data analysis, and success stories such as Twitter sentiment analysis and stock market analysis.', 'chapters': [{'end': 4334.528, 'start': 3862.315, 'title': 'Data frames and Spark', 'summary': 'Explores the creation and functionalities of data frames in Spark, including the explanation of a data frame, its requirements, and important features, along with practical examples and the technicalities of data frames in Spark, with a focus on support for multiple programming languages, processing of structured and unstructured data, and slicing and dicing of data.', 'duration': 472.213, 'highlights': ['The data frames in Spark are a distributed collection of data organized under named columns, providing operations to filter, group, process, and aggregate the available data.', 'Support for multiple programming languages was a crucial requirement for data frames, which support languages such as Python, Scala, and Java.', 'Data frames can process both structured and unstructured data, offering the ability to store a huge collection of data in a tabular format along with its schema.', 'Immutability, lazy evaluation, fault tolerance, and distributed memory storage are important features of data frames, ensuring data integrity and fault tolerance.', 'The Spark API can 
integrate itself with multiple programming languages and handle a variety of data sources, such as Hadoop, Hive, Cassandra, JSON files, and CSV files.']}, {'end': 4595.476, 'start': 4334.988, 'title': 'Data frame operations and examples', 'summary': 'Covers the creation of data frames, schema design, data loading, and various operations on data frames, including a fifa dataset example with 18,207 players and a game of thrones dataset schema design and loading.', 'duration': 260.488, 'highlights': ['A total of 18,207 players are present in the FIFA dataset.', 'The maximum worth given to a player is 9 million pounds.', 'Players playing for their respective clubs along with their names can be displayed.', 'Details of players whose age is less than 30 years, along with their club, nationality, and Jersey numbers, are obtained.']}, {'end': 5127.501, 'start': 4595.796, 'title': 'Using spark sql for game of thrones analysis', 'summary': 'Demonstrates the use of spark sql for analyzing game of thrones data, including loading data into data frames, printing schemas, analyzing battles, tactics, houses, and characters, resulting in insights such as the deadliest house and king, defending houses and kings, and the number of noble and common characters.', 'duration': 531.705, 'highlights': ['The deadliest house in the Game of Thrones story is found to be Lannister, with a total of 18 battles waged, followed by Stark and Baratheon.', 'Joffrey is identified as the deadliest King, having fought a total of 14 battles.', 'Lannister house is found to have defended the most number of wars, while Robb Stark is the king who defended the most number of battles against him.', 'There are a total of 430 noble characters and 487 common characters in the entire Game of Thrones story.', 'The chapter concludes with insights about notable characters carried out until the last book and details of the wars fought in the last years of Game of Thrones.']}, {'end': 5574.972, 'start': 5127.942, 
'title': 'Spark sql advantages over hive', 'summary': 'Discusses how spark sql addresses the challenges of apache hive, highlighting its faster processing speed, compatibility with hive queries, and ability to perform real-time processing, with a focus on success stories such as twitter sentiment analysis and stock market analysis.', 'duration': 447.03, 'highlights': ['Spark SQL provides faster processing speed compared to Hive, reducing query execution time from 10 minutes in Hive to less than one minute in Spark SQL.', 'Spark SQL offers compatibility with existing Hive queries, allowing seamless execution of Hive queries directly through Spark SQL without the need for migration.', 'Spark SQL enables real-time processing, as demonstrated in the Twitter sentiment analysis example, showcasing its ability to process streaming data in real time.', 'The use of the same metastore for Spark SQL as used in Hive eliminates the need for creating a new metastore, simplifying data management and storage.']}], 'duration': 1712.657, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g3862315.jpg', 'highlights': ['The deadliest house in Game of Thrones is Lannister, with 18 battles waged.', 'Spark SQL provides faster processing speed compared to Hive, reducing query execution time from 10 minutes to less than one minute.', 'Supportability for multiple programming languages was crucial for data frames, as it supports Python, Scala, and Java.', 'Immutability, lazy evaluation, fault tolerance, and distributed memory storage are important features of data frames.', 'A total of 18,207 players are present in the FIFA dataset.', 'Spark SQL offers compatibility with existing Hive queries, allowing seamless execution of Hive queries directly through Spark SQL.', 'The maximum worth given to a player in the FIFA dataset is 9 million pounds.', 'Details of players whose age is less than 30 years, along with their club, nationality, and 
jersey numbers, are obtained.', 'Spark SQL enables real-time processing, as demonstrated in the Twitter sentiment analysis example.', 'Data frames can process both structured and unstructured data, offering the ability to store a huge collection of data in a tabular format along with its schema.']}, {'end': 8596.105, 'segs': [{'end': 6138.915, 'src': 'embed', 'start': 6110.916, 'weight': 2, 'content': [{'end': 6116.68, 'text': 'So now, in this case, whatever processing you have done, right, in terms of transformations and all of that,', 'start': 6110.916, 'duration': 5.764}, {'end': 6124.685, 'text': 'so you can say that your Spark SQL service is an entry point for working with structured data in Apache Spark.', 'start': 6116.68, 'duration': 8.005}, {'end': 6130.769, 'text': 'So it is going to kind of help you to fetch the results from your optimized data or maybe whatever you have interpreted before.', 'start': 6124.985, 'duration': 5.784}, {'end': 6132.03, 'text': "So that is what it's doing.", 'start': 6131.069, 'duration': 0.961}, {'end': 6134.612, 'text': 'So this kind of completes this whole diagram.', 'start': 6132.05, 'duration': 2.562}, {'end': 6138.915, 'text': "Now let's see how we can perform our queries using Spark SQL.", 'start': 6135.052, 'duration': 3.863}], 'summary': 'Spark SQL is the entry point for working with structured data in Apache Spark, helping to fetch results from optimized data and perform queries.', 'duration': 27.999, 'max_score': 6110.916, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g6110916.jpg'}, {'end': 6862.397, 'src': 'embed', 'start': 6822.582, 'weight': 1, 'content': [{'end': 6824.083, 'text': 'Then we can print the schema.', 'start': 6822.582, 'duration': 1.501}, {'end': 6829.707, 'text': 'What does this do? 
This is going to print the schema of my employee data frame.', 'start': 6824.243, 'duration': 5.464}, {'end': 6833.37, 'text': 'So we are going to use this print schema to print up all the values.', 'start': 6830.007, 'duration': 3.363}, {'end': 6837.017, 'text': 'Then we can create a temporary view of this data frame.', 'start': 6833.854, 'duration': 3.163}, {'end': 6838.218, 'text': 'So we are doing that.', 'start': 6837.177, 'duration': 1.041}, {'end': 6840.059, 'text': 'See, create or replace temp view.', 'start': 6838.378, 'duration': 1.681}, {'end': 6842.341, 'text': 'We are creating that which we have seen it last time also.', 'start': 6840.219, 'duration': 2.122}, {'end': 6844.763, 'text': 'Now after that, we can execute our SQL query.', 'start': 6842.621, 'duration': 2.142}, {'end': 6850.207, 'text': "So let's say we are executing our SQL query from employee where age is between 18 and 30.", 'start': 6844.783, 'duration': 5.424}, {'end': 6852.909, 'text': "So this kind of SQL query, let's say we want to do, we can get that.", 'start': 6850.207, 'duration': 2.702}, {'end': 6854.831, 'text': 'And in the end, we can see the output also.', 'start': 6853.209, 'duration': 1.622}, {'end': 6855.852, 'text': "Let's see this execution.", 'start': 6854.871, 'duration': 0.981}, {'end': 6862.397, 'text': "So you can see that all the employees whose age are let's say between 18 and 30, that is showing up in the output.", 'start': 6856.232, 'duration': 6.165}], 'summary': 'Printing the schema, creating temporary view, and executing sql query for employee data frame, showing employees between 18 and 30.', 'duration': 39.815, 'max_score': 6822.582, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g6822582.jpg'}, {'end': 7287.695, 'src': 'embed', 'start': 7257.223, 'weight': 8, 'content': [{'end': 7259.485, 'text': 'Then we are going to define our parse RDD.', 'start': 7257.223, 'duration': 2.262}, {'end': 7267.752, 
'text': 'So in parse RDD, if you notice, so here we are creating this parse RDD, right? So we are going to create all of that by using this RDD first.', 'start': 7259.926, 'duration': 7.826}, {'end': 7269.854, 'text': 'We are going to remove the header row also from it.', 'start': 7267.792, 'duration': 2.062}, {'end': 7275.218, 'text': 'Then we are going to read our CSV file into stocks AA on DF data frame.', 'start': 7270.094, 'duration': 5.124}, {'end': 7277.48, 'text': 'So we are going to read this sc.txt file.', 'start': 7275.399, 'duration': 2.081}, {'end': 7281.624, 'text': 'You can see we are reading this file and we are going to convert it into a data frame.', 'start': 7277.5, 'duration': 4.124}, {'end': 7283.005, 'text': 'So we are parsing it as an RDD.', 'start': 7281.644, 'duration': 1.361}, {'end': 7287.695, 'text': 'Once we are done, then if you want to print the output, we can do it with the help of the show API.', 'start': 7283.565, 'duration': 4.13}], 'summary': 'Creating parse RDD, removing headers, reading sc.txt file, and converting to data frame for further processing.', 'duration': 30.472, 'max_score': 7257.223, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g7257223.jpg'}, {'end': 7365.982, 'src': 'embed', 'start': 7334.061, 'weight': 0, 'content': [{'end': 7335.662, 'text': 'So we are hitting the SQL query now.', 'start': 7334.061, 'duration': 1.601}, {'end': 7340.943, 'text': 'On this you can notice the SQL query which we are hitting on the stocks MSFT.', 'start': 7336.162, 'duration': 4.781}, {'end': 7343.344, 'text': 'This is the data frame we have created.', 'start': 7341.083, 'duration': 2.261}, {'end': 7348.885, 'text': 'Now on this we are doing that and we are putting our query so that my condition is true.', 'start': 7343.564, 'duration': 5.321}, {'end': 7351.886, 'text': 'Means where my closing price and my opening price.', 'start': 7349.185, 'duration': 
2.701}, {'end': 7359.418, 'text': "Because, let's say, at the closing price, the stock price bar, let's say 100 US dollar and at the time in the morning when it opened with the,", 'start': 7351.946, 'duration': 7.472}, {'end': 7360.519, 'text': "let's say, 98 US dollar.", 'start': 7359.418, 'duration': 1.101}, {'end': 7365.982, 'text': 'So wherever it is going to be having a difference of two or greater than two, that only output we want to get.', 'start': 7360.859, 'duration': 5.123}], 'summary': 'Analyzing sql query results on stock msft for price difference of 2 or greater', 'duration': 31.921, 'max_score': 7334.061, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g7334061.jpg'}], 'start': 5575.152, 'title': 'Spark sql and data analysis', 'summary': 'Covers the applications of spark sql in banking fraud detection and medical domain, its performance comparison with hadoop, support for various data formats, creation of user-defined functions, and the architecture of spark sql. it explains the creation and transformation of dataframes, working with data frames and hive tables, analyzing stock data from 10 companies, including computing the average closing price per year, and using spark sql for data analysis, including joining operations, saving data in parquet file format, executing sql queries, creating temporary tables, performing transformations, and computing correlations.', 'chapters': [{'end': 6068.307, 'start': 5575.152, 'title': 'Spark sql and data processing', 'summary': 'Discusses the applications of spark sql in banking fraud detection and medical domain, its performance comparison with hadoop, support for various data formats, creation of user-defined functions, and the architecture of spark sql. 
it also covers the data source api, dataframe api, and interpreter and optimizer steps.', 'duration': 493.155, 'highlights': ['Spark SQL performance comparison with Hadoop, where Spark SQL outperforms Hadoop in terms of running time and number of iterations.', 'Support for various data formats in Spark SQL, including Parquet, JSON, Avro, Hive, and Cassandra, and the ability to convert SQL queries to RDD way for performing transformation steps.', 'The process of creating user-defined functions (UDF) in Spark SQL, including the example of creating an UDF for converting values to uppercase and its execution.', 'The architecture of Spark SQL, which involves the data source API for reading and storing structured and unstructured data, the DataFrame API for converting data into named column and row, and the interpreter and optimizer steps for processing the results.']}, {'end': 6641.081, 'start': 6068.307, 'title': 'Spark sql and dataframes', 'summary': 'Explains how to work with dataframes and spark sql, covering the creation and transformation of dataframes, as well as the use of spark sql to query and manipulate structured data in apache spark.', 'duration': 572.774, 'highlights': ['The chapter explains the process of creating DataFrames and performing transformation and action steps using the DataFrame API and RDD, highlighting the role of the interpreter and optimizer steps.', 'It details the use of Spark SQL to fetch outputs from optimized data, serving as an entry point for working with structured data in Apache Spark.', 'The chapter demonstrates the process of executing Spark SQL queries using Spark shell and Eclipse, emphasizing the use of Spark session and builder API for reading and outputting JSON files as DataFrames.', 'It explains the creation and mapping of DataSets, highlighting the use of case classes and the advantages of DataSets over DataFrames in terms of performance and encoding mechanisms.', 'The process of adding a schema to RDD and using Spark 
SQL for executing SQL queries on temporary views is explained, emphasizing the importance of temporary views for executing SQL queries on DataFrames.', 'The transformation steps for mapping DataFrames to DataSets and transforming RDD into rowRDD are demonstrated, showcasing the process of splitting, mapping, and transforming data.']}, {'end': 7093.264, 'start': 6641.442, 'title': 'Working with data frames and hive tables in spark', 'summary': 'Covers working with data frames and hive tables in spark, including creating data frames from json files, reading and writing parquet files, working with rdds, creating hive tables, and performing sql operations on data sets.', 'duration': 451.822, 'highlights': ['Creating and manipulating data frames from RDDs and JSON files for data analysis.', 'Reading and writing Parquet files to store data, with the ability to read and execute SQL queries on the data.', 'Working with RDD operations and creating temporary views for data analysis and executing SQL queries on data sets.']}, {'end': 7386.359, 'start': 7093.684, 'title': 'Stock analysis using spark sql', 'summary': 'Outlines the process of using spark sql to analyze stock data from 10 companies, including computing the average closing price per year, identifying companies with the highest closing prices, executing spark sql queries, and comparing closing prices for specific conditions.', 'duration': 292.675, 'highlights': ['The chapter explains the process of using Spark SQL to analyze stock data from 10 companies, including computing the average closing price per year and identifying companies with the highest closing prices.', 'It outlines the steps to initialize Spark SQL, import required libraries, start a Spark session, create a case class, define a stock schema, and parse RDD to remove header files and read CSV file into a data frame.', 'The transcript details the process of executing Spark SQL queries to display the average closing price per month, identify stock price 
increases greater than two, and perform a join operation to compare closing prices of different stocks.']}, {'end': 8596.105, 'start': 7386.659, 'title': 'Spark sql and data analysis', 'summary': 'Highlights the process of using Spark SQL for data analysis, including joining operations, saving data in Parquet file format, executing SQL queries, creating temporary tables, performing transformations, and computing correlations, all of which are demonstrated with practical code examples and explanations. It also covers the importance of Spark Streaming in the Spark ecosystem and its applications in real-time data processing.', 'duration': 1209.446, 'highlights': ['The chapter demonstrates the process of using Spark SQL for data analysis, including joining operations, saving data in Parquet file format, executing SQL queries, creating temporary tables, performing transformations, and computing correlations, with practical code examples and explanations.', 'Spark Streaming is highlighted as an important component in the Spark ecosystem, enabling analytical and interactive applications for live streaming data, and driving big data and Internet of Things applications.', 'The chapter provides an overview of Spark ecosystem components, such as Spark SQL, Spark Streaming, MLlib for machine learning, GraphX for graph analysis, and SparkR for supporting the R language for data analysis.']}], 'duration': 3020.953, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g5575152.jpg', 'highlights': ['Spark SQL performance comparison with Hadoop, outperforming Hadoop in running time and iterations', 'Support for various data formats in Spark SQL, including Parquet, JSON, Avro, Hive, and Cassandra', 'Creation of user-defined functions (UDF) in Spark SQL, with an example of converting values to uppercase', 'Architecture of Spark SQL involving data source API, DataFrame API, interpreter, and optimizer steps', 'Process of creating 
DataFrames and performing transformation and action steps using the DataFrame API and RDD', 'Execution of Spark SQL queries using Spark shell and Eclipse, emphasizing the use of Spark session and builder API', 'Creation and mapping of DataSets, highlighting the use of case classes and the advantages of DataSets over DataFrames', 'Reading and writing Parquet files to store data, with the ability to read and execute SQL queries on the data', 'Using Spark SQL to analyze stock data from 10 companies, computing the average closing price per year', 'Demonstration of using Spark SQL for data analysis, including joining operations, saving data in Parquet file format, executing SQL queries, creating temporary tables, performing transformations, and computing correlations', 'Importance of temporary views for executing SQL queries on DataFrames', 'Process of using Spark SQL for data analysis, with practical code examples and explanations', 'Overview of Spark ecosystem components, such as Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR']}, {'end': 9762.141, 'segs': [{'end': 8750.801, 'src': 'embed', 'start': 8722.356, 'weight': 0, 'content': [{'end': 8730.437, 'text': 'integration with your batch and real-time processing is possible, and it can also be used for your business analytics,', 'start': 8722.356, 'duration': 8.081}, {'end': 8734.878, 'text': 'which is used to track the behavior of your customer.', 'start': 8730.437, 'duration': 4.441}, {'end': 8737.659, 'text': 'so, as you can see, this is so powerful.', 'start': 8734.878, 'duration': 2.781}, {'end': 8744.12, 'text': 'right, in each slide we are kind of getting to know so many interesting things about this Spark Streaming.', 'start': 8737.659, 'duration': 6.461}, {'end': 8750.801, 'text': "Now let's quickly have an overview so that we can get some basics of Spark Streaming.", 'start': 8744.96, 'duration': 5.841}], 'summary': 'Integration of batch and real-time processing for powerful business analytics and 
customer behavior tracking in Spark Streaming.', 'duration': 28.445, 'max_score': 8722.356, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g8722356.jpg'}, {'end': 8798.634, 'src': 'embed', 'start': 8771.051, 'weight': 8, 'content': [{'end': 8778.157, 'text': 'in fact, Spark Streaming is kind of adding a lot of advantage to the Spark community,', 'start': 8771.051, 'duration': 7.106}, {'end': 8783.642, 'text': 'because a lot of people are joining the Spark community only to use this Spark Streaming.', 'start': 8778.157, 'duration': 5.485}, {'end': 8791.929, 'text': "it's so powerful that everyone wants to come and use it, because all the other frameworks which we already have, which are existing,", 'start': 8783.642, 'duration': 8.287}, {'end': 8798.634, 'text': 'are not as good in terms of performance and all, and the ease of using Spark Streaming is also great.', 'start': 8791.929, 'duration': 6.705}], 'summary': 'Spark Streaming adds significant advantage to the community and attracts many users due to its superior performance and ease of use.', 'duration': 27.583, 'max_score': 8771.051, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g8771051.jpg'}, {'end': 8893.228, 'src': 'embed', 'start': 8862.514, 'weight': 9, 'content': [{'end': 8868.238, 'text': 'because definitely you are doing processing on some part of the data, right, even if it is coming in real time,', 'start': 8862.514, 'duration': 5.724}, {'end': 8872.001, 'text': 'and that is what we are going to call a micro-batch.', 'start': 8868.238, 'duration': 3.763}, {'end': 8876.644, 'text': 'Moving further now.', 'start': 8873.602, 'duration': 3.042}, {'end': 8879.606, 'text': "let's see a few more details on it now.", 'start': 8876.644, 'duration': 2.962}, {'end': 8882.443, 'text': 'from where can you get all your data?', 'start': 8880.362, 'duration': 2.081}, {'end': 8884.244, 'text': 
'what can be your data sources here?', 'start': 8882.443, 'duration': 1.801}, {'end': 8893.228, 'text': 'So if we talk about data sources here now, we can stream the data from multiple sources like Parquet, Kafka, Kafka.', 'start': 8884.624, 'duration': 8.604}], 'summary': 'Processing real-time data in micro batches from multiple sources like parquet, kafka.', 'duration': 30.714, 'max_score': 8862.514, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g8862514.jpg'}, {'end': 8937.295, 'src': 'embed', 'start': 8909.517, 'weight': 1, 'content': [{'end': 8914.5, 'text': 'you can do the processing with the help of your Spark SQL and then give the output.', 'start': 8909.517, 'duration': 4.983}, {'end': 8924.727, 'text': 'So this is a very strong thing that you are bringing the data using Spark streaming but processing you can do by using some other frameworks as well.', 'start': 8915.06, 'duration': 9.667}, {'end': 8929.55, 'text': "like machine learning, you can apply on the data what you're getting at a real time.", 'start': 8925.367, 'duration': 4.183}, {'end': 8934.013, 'text': "you can also apply your Spark SQL on the data which you're getting at the real time.", 'start': 8929.55, 'duration': 4.463}, {'end': 8937.295, 'text': 'Moving further.', 'start': 8935.474, 'duration': 1.821}], 'summary': 'Utilize spark sql for processing streaming data and applying machine learning in real time.', 'duration': 27.778, 'max_score': 8909.517, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g8909517.jpg'}, {'end': 9032.443, 'src': 'embed', 'start': 9003.481, 'weight': 3, 'content': [{'end': 9006.364, 'text': 'because now we are getting real stream data right.', 'start': 9003.481, 'duration': 2.883}, {'end': 9009.007, 'text': "so let's say in today, right now I got one second.", 'start': 9006.364, 'duration': 2.643}, {'end': 9012.491, 'text': 'maybe now I got one second 
in one second I got more data.', 'start': 9009.007, 'duration': 3.484}, {'end': 9014.533, 'text': 'now I got more data in the next lot right.', 'start': 9012.491, 'duration': 2.042}, {'end': 9016.094, 'text': "So that is what we're talking about.", 'start': 9014.833, 'duration': 1.261}, {'end': 9017.495, 'text': 'so we are creating data.', 'start': 9016.094, 'duration': 1.401}, {'end': 9019.676, 'text': 'we are getting from time 0 to time 1.', 'start': 9017.495, 'duration': 2.181}, {'end': 9022.817, 'text': "we let's say that we have an RDD at the rate of time 1.", 'start': 9019.676, 'duration': 3.141}, {'end': 9027.58, 'text': 'similarly, it is this proceeding with the time the RDD is getting proceeded here.', 'start': 9022.817, 'duration': 4.763}, {'end': 9032.443, 'text': "now, in the next thing, it's extracting the words from an input stream.", 'start': 9027.58, 'duration': 4.863}], 'summary': 'Real-time stream data processing with increasing data volume.', 'duration': 28.962, 'max_score': 9003.481, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g9003481.jpg'}, {'end': 9352.138, 'src': 'embed', 'start': 9326.077, 'weight': 5, 'content': [{'end': 9330.541, 'text': "it's a representation by a continuous series of RDD.", 'start': 9326.077, 'duration': 4.464}, {'end': 9332.462, 'text': 'can you see so many RDDs getting far?', 'start': 9330.541, 'duration': 1.921}, {'end': 9337.566, 'text': "because, let's say right now, in one second, what data I got collected, I executed it.", 'start': 9332.462, 'duration': 5.104}, {'end': 9340.949, 'text': 'in the second second, this data is happening here.', 'start': 9337.566, 'duration': 3.383}, {'end': 9342.871, 'text': 'okay, sorry for that.', 'start': 9340.949, 'duration': 1.922}, {'end': 9346.974, 'text': 'now, in the second time also, it is happening here.', 'start': 9342.871, 'duration': 4.103}, {'end': 9349.896, 'text': 'third second also, it is happening here.', 
'start': 9346.974, 'duration': 2.922}, {'end': 9352.138, 'text': "no problem, I'm not going to do it now.", 'start': 9349.896, 'duration': 2.242}], 'summary': 'The transcript discusses the execution of data collection over time through a continuous series of RDDs.', 'duration': 26.061, 'max_score': 9326.077, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g9326077.jpg'}], 'start': 8596.105, 'title': 'Spark streaming in the spark ecosystem', 'summary': 'Provides an in-depth discussion on the Spark ecosystem, focusing on the benefits of executing queries for better performance and big data processing. It also explores the core engine, significance of Spark Streaming for live data analytics, scalability to hundreds of nodes, integration with batch processing, and data processing operations like map, flatMap, filter, and reduce.', 'chapters': [{'end': 8700.977, 'start': 8596.105, 'title': 'Spark ecosystem and streaming features', 'summary': 'Discusses the Spark ecosystem, highlighting the benefits of executing queries in a Spark environment for better performance and working with big data, followed by an emphasis on the core engine and the significance of Spark Streaming for enabling analytical and interactive apps for live streaming data, particularly in use cases such as Twitter sentiment analysis for marketing purposes.', 'duration': 104.872, 'highlights': ['Spark Streaming is gaining popularity due to its performance, beating other platforms, and is being used for Twitter sentiment analysis for marketing purposes.', 'Executing queries using Spark environment provides better performance and the ability to work on big data.', 'Spark core engine defines all the basics of Apache Spark and RDD related functionalities.']}, {'end': 9003.481, 'start': 8700.977, 'title': 'Spark streaming overview', 'summary': 'Provides an overview of Spark Streaming, highlighting its scalability to hundreds of nodes, integration with 
batch time and real-time processing, and its capability to stream and process data with high speed and fault tolerance.', 'duration': 302.504, 'highlights': ['Spark streaming can scale to even multiple nodes, running till hundreds of nodes, ensuring quick streaming and processing of data, and fault tolerance to prevent data loss.', 'Integration with batch time and real-time processing is possible, allowing for business analytics to track customer behavior, making it a powerful addition to the Spark community.', 'Spark streaming enables high throughput and fault tolerance, with DStream as its fundamental unit to process real-time data, and it supports data streaming from various sources like Parquet, Kafka, HBase, MongoDB, etc.', 'Additionally, Spark streaming allows for processing data using machine learning and Spark SQL, and it can publish output to various UI dashboards, such as Tableau and AngularJS.']}, {'end': 9762.141, 'start': 9003.481, 'title': 'Spark streaming data processing', 'summary': 'Outlines the process of spark streaming, including the role of streaming context as the main entry point, initialization of streaming context, and operations on d stream, such as map, flat map, filter, and reduce.', 'duration': 758.66, 'highlights': ['The role of streaming context as the main entry point for Spark streaming and its dependency on Spark context.', 'Initialization of streaming context for processing data, including the use of default implementations of sources like Twitter and ZeroMQ.', "The continuous series of RDDs representing D stream, with each second's data being processed, is important for understanding the data processing flow.", 'Explanation of operations on D stream, such as applying operations on input data and generating words D stream, and the functions of map, flat map, filter, and reduce in transforming D stream data.']}], 'duration': 1166.036, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g8596105.jpg', 'highlights': ['Executing queries using Spark environment provides better performance and the ability to work on big data.', 'Spark Streaming can scale across multiple nodes, running on up to hundreds of nodes, ensuring quick streaming and processing of data, and fault tolerance to prevent data loss.', 'Spark Streaming enables high throughput and fault tolerance, with DStream as its fundamental unit to process real-time data, and it supports data streaming from various sources like Parquet, Kafka, HBase, MongoDB, etc.', 'Spark Streaming is gaining popularity due to its performance, beating other platforms, and is being used for Twitter sentiment analysis for marketing purposes.', 'Integration with batch and real-time processing is possible, allowing for business analytics to track customer behavior, making it a powerful addition to the Spark community.', 'Spark core engine defines all the basics of Apache Spark and RDD related functionalities.', 'The role of streaming context as the main entry point for Spark Streaming and its dependency on Spark context.', 'Initialization of streaming context for processing data, including the use of default implementations of sources like Twitter and ZeroMQ.', "The continuous series of RDDs representing a DStream, with each second's data being processed, is important for understanding the data processing flow.", 'Explanation of operations on a DStream, such as applying operations on input data and generating a words DStream, and the functions of map, flatMap, filter, and reduce in transforming DStream data.', 'Additionally, Spark Streaming allows for processing data using machine learning and Spark SQL, and it can publish output to various UI dashboards, such as Tableau and AngularJS.']}, {'end': 11036.166, 'segs': [{'end': 11036.166, 'src': 'embed', 'start': 10940.74, 'weight': 0, 'content': [{'end': 10944.282, 'text': 
'Now what you are going to do in order to start working.', 'start': 10940.74, 'duration': 3.542}, {'end': 10948.384, 'text': 'you will be opening this terminal by clicking on this black option now.', 'start': 10944.282, 'duration': 4.102}, {'end': 10954.127, 'text': 'after that, what you can do is you can now go to your Spark.', 'start': 10948.384, 'duration': 5.743}, {'end': 10956.589, 'text': 'now how can I work with Spark?', 'start': 10954.127, 'duration': 2.462}, {'end': 10960.411, 'text': 'in order to execute any program in Spark', 'start': 10956.589, 'duration': 3.822}, {'end': 10967.377, 'text': 'by using a Scala program, you will be entering it as spark-shell.', 'start': 10960.411, 'duration': 6.966}, {'end': 10976.101, 'text': 'if you type spark-shell, it will take you to the Scala prompt where you can write your Spark program,', 'start': 10967.377, 'duration': 8.724}, {'end': 10981.043, 'text': 'but by using the Scala programming language. you can notice this.', 'start': 10976.101, 'duration': 4.942}, {'end': 10984.485, 'text': 'Now can you see this part? it is also giving me 1.5.2,', 'start': 10982.424, 'duration': 2.061}, {'end': 10988.947, 'text': 'so that is the version of your Spark.', 'start': 10984.485, 'duration': 4.462}, {'end': 10991.595, 'text': 'Now you can see here.', 'start': 10990.034, 'duration': 1.561}, {'end': 10995.438, 'text': 'you can also see this part, Spark context available as sc,', 'start': 10991.595, 'duration': 3.843}, {'end': 10997.78, 'text': 'when you get connected to your Spark shell.', 'start': 10995.438, 'duration': 2.342}, {'end': 10999.281, 'text': 'you can just see.', 'start': 10997.78, 'duration': 1.501}, {'end': 11001.543, 'text': 'this will be by default available to you.', 'start': 10999.281, 'duration': 2.262}, {'end': 11002.623, 'text': 'let this get connected.', 'start': 11001.543, 'duration': 1.08}, {'end': 11003.324, 'text': 'it takes some time.', 'start': 11002.623, 'duration': 0.701}, 
{'end': 11020.813, 'text': 'Now we got connected.', 'start': 11019.332, 'duration': 1.481}, {'end': 11023.435, 'text': 'so we got connected to this Scala prompt.', 'start': 11020.813, 'duration': 2.622}, {'end': 11027.659, 'text': 'now, if I want to come out of it, I will just type exit.', 'start': 11023.435, 'duration': 4.224}, {'end': 11030.101, 'text': 'it will just let me come out of this prompt.', 'start': 11027.659, 'duration': 2.442}, {'end': 11036.166, 'text': 'Now, secondly, I can also write my programs with Python.', 'start': 11031.062, 'duration': 5.104}], 'summary': 'Learn to work with Spark by using Scala and Python programs, accessing Spark version 1.5.2 and Spark context, and connecting to the Spark shell.', 'duration': 95.426, 'max_score': 10940.74, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g10940740.jpg'}], 'start': 9762.141, 'title': 'Real-time tweet sentiment analysis', 'summary': 'Discusses real-time sentiment analysis of tweets using Spark Streaming, capable of analyzing and storing positive, negative, and neutral tweets, and its potential for diverse use cases. 
it also provides practical guidance on executing spark programs using scala and python.', 'chapters': [{'end': 9812.251, 'start': 9762.141, 'title': 'Understanding group by and stream windowing', 'summary': 'Discusses the concept of group by in stream processing and provides an example of windowing operations using twitter data, with tweets occurring at an initial rate of 10 per second.', 'duration': 50.11, 'highlights': ['Group by combines all common values, as demonstrated by the example where names starting with the same letter are grouped together.', 'The concept of stream windowing is explained using Twitter data, where the initial rate of tweets is 10 per second, and the example illustrates the occurrence of tweets with specific hashtags.']}, {'end': 10411.711, 'start': 9812.251, 'title': 'Windowing operations in spark streaming', 'summary': 'Discusses the concept of windowing operations in spark streaming, including the consideration of current and previous windows, output operations, caching and persistence, accumulators, broadcast variables, and checkpoints, and concludes with a brief overview of twitter sentiment analysis as a use case.', 'duration': 599.46, 'highlights': ['The chapter explains the concept of windowing operations in Spark Streaming, which involves analyzing data over specific time intervals, such as the last 30 seconds, and considering both current and previous windows for computation.', 'It details the output operations in Spark Streaming, including pushing data out to external systems such as file systems, databases, and other storage, and the supported methods for output, such as save as text file, object file, Hadoop file, and for each RDD function.', 'The section on caching and persistence in Spark Streaming highlights the ability to store data in memory for longer periods, the default persistence level for input streams, and the replication of data to multiple nodes.', 'The explanation of accumulators in Spark Streaming compares 
them to counters in Hadoop, emphasizing their use in associative and commutative operations, tracking through UI, and their role in debugging and analysis.', 'It covers broadcast variables in Spark Streaming, outlining their function of caching read-only variables on all machines, efficient distribution to reduce communication costs, and their role in distributing large input data sets.', 'The discussion of checkpoints in Spark Streaming likens them to gaming checkpoints, emphasizing their role in making the system resilient to failures unrelated to the application logic, and the saving of metadata and generated RDD to reliable storage.', 'The chapter concludes with a brief overview of Twitter sentiment analysis as a use case, highlighting its potential for real-time analysis, various use cases such as crisis management, target marketing, and its significance in politics, sports, and business decision-making.']}, {'end': 10671.876, 'start': 10411.711, 'title': 'Effective customer targeting and sentiment analysis', 'summary': 'Discusses the importance of targeting specific customers for advertisements to reduce costs and increase effectiveness, followed by a practical demonstration of sentiment analysis using spark and twitter data to analyze negative tweets for trump.', 'duration': 260.165, 'highlights': ['The importance of targeting specific customers for advertisements to reduce costs and increase effectiveness, as trying to target everyone will be very costly', 'Demonstration of practical sentiment analysis using Spark and Twitter data to analyze negative tweets for Trump and the process of setting up authentication tokens for Twitter', 'Providing pre-installed configuration machines and necessary tools like Eclipse and Spark for training, to avoid installation issues and facilitate ease of use for the users']}, {'end': 10810.379, 'start': 10672.336, 'title': 'Real-time tweet sentiment analysis', 'summary': 'Demonstrates a program for real-time sentiment 
analysis of tweets using spark streaming, with the capability to analyze and store positive, negative, and neutral tweets, allowing flexibility to change the hashtags for analysis, and potential for application in diverse use cases.', 'duration': 138.043, 'highlights': ['The program demonstrates real-time sentiment analysis of tweets using Spark streaming, capable of analyzing and storing positive, negative, and neutral tweets, providing flexibility to change the hashtags for analysis, and potential for application in diverse use cases.', 'The output results are stored at a specified location, allowing easy access for analysis and review of the sentiment analysis outcomes.', 'The program showcases the capability to analyze and store tweets based on different hashtags, offering the potential for application in various scenarios, such as analyzing tweets related to events like the FIFA World Cup or cricket matches.']}, {'end': 11036.166, 'start': 10810.379, 'title': 'Introduction to spark streaming and practical implementation', 'summary': 'Covers an introduction to spark streaming, including use cases such as classification, regression, and collaborative filtering, and provides practical guidance on executing spark programs using scala and python.', 'duration': 225.787, 'highlights': ['Spark streaming use cases include classification for spam email detection, regression for price optimization, and collaborative filtering for personalized recommendations, showcasing real-world examples of its application.', 'Demonstration of practical implementation by accessing the VM machine provided by Edureka, eliminating the concern of software availability and setup, and executing Spark programs using Scala and Python for data processing and analysis.', 'Explanation of accessing the Spark shell using Scala programming language, including the version details, demonstration of Spark context availability, and guidance on entering and exiting the Spark shell for program 
execution.']}], 'duration': 1274.025, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g9762141.jpg', 'highlights': ['The program demonstrates real-time sentiment analysis of tweets using Spark streaming, capable of analyzing and storing positive, negative, and neutral tweets, providing flexibility to change the hashtags for analysis, and potential for application in diverse use cases.', 'Spark streaming use cases include classification for spam email detection, regression for price optimization, and collaborative filtering for personalized recommendations, showcasing real-world examples of its application.', 'The chapter concludes with a brief overview of Twitter sentiment analysis as a use case, highlighting its potential for real-time analysis, various use cases such as crisis management, target marketing, and its significance in politics, sports, and business decision-making.', 'The concept of stream windowing is explained using Twitter data, where the initial rate of tweets is 10 per second, and the example illustrates the occurrence of tweets with specific hashtags.', 'Demonstration of practical implementation by accessing the VM machine provided by Edureka, eliminating the concern of software availability and setup, and executing Spark programs using Scala and Python for data processing and analysis.']}, {'end': 13915.261, 'segs': [{'end': 11374.902, 'src': 'embed', 'start': 11343.324, 'weight': 0, 'content': [{'end': 11346.727, 'text': 'and after that you need to give your local path, for example.', 'start': 11343.324, 'duration': 3.403}, {'end': 11347.488, 'text': 'what is this path?', 'start': 11346.727, 'duration': 0.761}, {'end': 11350.049, 'text': 'Slash home, slash Edureka.', 'start': 11347.988, 'duration': 2.061}, {'end': 11357.235, 'text': 'this is a local path, not as HBFS path, so you will be writing slash home, slash Edureka a dot txt.', 'start': 11350.049, 'duration': 7.186}, {'end': 
11365.72, 'text': 'Now, if you give this, this will be loading the file into memory, but not from your HDFS.', 'start': 11357.575, 'duration': 8.145}, {'end': 11373.401, 'text': 'instead, what this did is this loaded it from your local disk.', 'start': 11365.72, 'duration': 7.681}, {'end': 11374.902, 'text': 'so that is the difference here.', 'start': 11373.401, 'duration': 1.501}], 'summary': "Use local path 'home/edureka' to load file into memory, not from hdfs", 'duration': 31.578, 'max_score': 11343.324, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g11343324.jpg'}, {'end': 11474.296, 'src': 'embed', 'start': 11447.266, 'weight': 4, 'content': [{'end': 11451.31, 'text': 'suppose I want to use Spark in production unit, but not on top of Hadoop.', 'start': 11447.266, 'duration': 4.044}, {'end': 11452.131, 'text': 'is it possible?', 'start': 11451.31, 'duration': 0.821}, {'end': 11452.772, 'text': 'yes, you can do that.', 'start': 11452.131, 'duration': 0.641}, {'end': 11457.087, 'text': "you can do that something, but usually that's not what you do.", 'start': 11453.325, 'duration': 3.762}, {'end': 11458.728, 'text': 'but yes, if you want, you can do that.', 'start': 11457.087, 'duration': 1.641}, {'end': 11460.729, 'text': 'there are lot of things which you can do.', 'start': 11458.728, 'duration': 2.001}, {'end': 11464.091, 'text': 'you can also deploy it on your Amazon clusters as well.', 'start': 11460.729, 'duration': 3.362}, {'end': 11465.692, 'text': 'lot of things you can do that.', 'start': 11464.091, 'duration': 1.601}, {'end': 11467.613, 'text': 'how will it provide the distribute?', 'start': 11465.692, 'duration': 1.921}, {'end': 11470.014, 'text': "in that case we'll be using some other distribution system.", 'start': 11467.613, 'duration': 2.401}, {'end': 11472.255, 'text': 'so in that case you are not using this part.', 'start': 11470.014, 'duration': 2.241}, {'end': 11473.036, 'text': 'you 
can deploy.', 'start': 11472.255, 'duration': 0.781}, {'end': 11474.296, 'text': 'it will be just definitely.', 'start': 11473.036, 'duration': 1.26}], 'summary': 'Using spark in production without hadoop, possible; can deploy on amazon clusters.', 'duration': 27.03, 'max_score': 11447.266, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g11447266.jpg'}, {'end': 12202.219, 'src': 'embed', 'start': 12174.491, 'weight': 3, 'content': [{'end': 12179.955, 'text': 'there is one important thing Japan is already of affected area of your earthquake.', 'start': 12174.491, 'duration': 5.464}, {'end': 12187.762, 'text': "and now the problem here is that, whatever it's not like, even for a minor earthquake, I should start sending the alert right.", 'start': 12179.955, 'duration': 7.807}, {'end': 12190.384, 'text': "I don't want to do all that for the minor affection.", 'start': 12187.762, 'duration': 2.622}, {'end': 12193.835, 'text': 'in fact the buildings and the infrastructure.', 'start': 12191.354, 'duration': 2.481}, {'end': 12202.219, 'text': 'what is created is Japan is in such a way, if any earthquake below six magnitude comes there,', 'start': 12193.835, 'duration': 8.384}], 'summary': 'Japan is an affected area of earthquakes, where alerts are not sent for minor tremors below a magnitude of six.', 'duration': 27.728, 'max_score': 12174.491, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g12174491.jpg'}, {'end': 12344.455, 'src': 'embed', 'start': 12313.738, 'weight': 2, 'content': [{'end': 12319.165, 'text': 'Graphs are very attractive when it comes to modeling real world data because they are intuitive,', 'start': 12313.738, 'duration': 5.427}, {'end': 12323.611, 'text': 'flexible and the theory supporting them has been maturing for centuries.', 'start': 12319.165, 'duration': 4.446}, {'end': 12327.115, 'text': "Welcome everyone in today's session on 
Spark GraphX.", 'start': 12324.292, 'duration': 2.823}, {'end': 12330.48, 'text': "So without any further delay, let's look at the agenda first.", 'start': 12327.736, 'duration': 2.744}, {'end': 12335.988, 'text': "We'll start by understanding the basics of graph theory and different types of graphs.", 'start': 12331.644, 'duration': 4.344}, {'end': 12344.455, 'text': 'Then we look at the features of GraphX, further will understand what is property graph and look at various graph operations moving ahead.', 'start': 12336.288, 'duration': 8.167}], 'summary': 'Graphs are intuitive, flexible, and well-supported for modeling real-world data. this session will cover graph theory, types of graphs, features of graphx, property graphs, and graph operations.', 'duration': 30.717, 'max_score': 12313.738, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g12313738.jpg'}, {'end': 12864.008, 'src': 'embed', 'start': 12834.572, 'weight': 1, 'content': [{'end': 12839.155, 'text': "So we'll represent the cycle as five, then, using double arrows,", 'start': 12834.572, 'duration': 4.583}, {'end': 12848.361, 'text': "will go to seven and then we'll move to eight and then we'll move to six and at last will come back to five.", 'start': 12839.155, 'duration': 9.206}, {'end': 12851.563, 'text': 'Now we have edge-labeled graph.', 'start': 12849.361, 'duration': 2.202}, {'end': 12857.627, 'text': 'So basically edge-labeled graph is a graph where the edges are associated with labels.', 'start': 12852.003, 'duration': 5.624}, {'end': 12864.008, 'text': 'So one can basically indicate this by making the edge set be a set of triplets.', 'start': 12858.365, 'duration': 5.643}], 'summary': 'Cycle represented as 5, 7, 8, 6, 5 in edge-labeled graph.', 'duration': 29.436, 'max_score': 12834.572, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g12834572.jpg'}], 'start': 11036.166, 'title': 
'Using pyspark and apache spark for earthquake analysis', 'summary': 'Covers using pyspark to connect python with spark, transfer files to hdfs, and work with data in scala, detailing the real-time earthquake detection using apache spark for early warnings, and discussing earthquake wave analysis, types of graphs, and modeling bipartite graphs, resulting in saving millions of lives and simplifying analytics tasks.', 'chapters': [{'end': 11193.563, 'start': 11036.166, 'title': 'Using pyspark for python programming in spark', 'summary': 'Explains how to use pyspark to connect python with spark, transfer files to hdfs, and work with data in scala, highlighting the process of connecting to pyspark, transferring a file to hdfs, and working with data types in scala.', 'duration': 157.397, 'highlights': ["The process of connecting to PySpark involves typing 'PySpark' to connect Python with Spark and accessing Spark's environment (PySpark).", 'Transferring a file to HDFS is demonstrated by creating a file, checking its presence, and putting the file in the default location of HDFS, with the file successfully transferred and accessible in the Hadoop file system.', "Working with data types in Scala is illustrated by the difference in defining data types between Java and Scala, emphasizing Scala's automatic identification of data types and error handling when attempting to assign incompatible data types."]}, {'end': 11800.248, 'start': 11194.063, 'title': 'Spark for real-time earthquake detection', 'summary': 'Explains the use of apache spark for real-time earthquake detection, detailing how japan utilized it for early warnings, resulting in saving millions of lives, and emphasizes the importance of processing data in real-time and handling data from multiple sources.', 'duration': 606.185, 'highlights': ['Japan utilized Apache Spark for real-time earthquake detection, sending early warnings and saving millions of lives by informing schools, factories, and transportation 
systems in 60 seconds.', 'The importance of processing data in real-time and handling data from multiple sources is emphasized for effective prediction and early warnings for natural disasters like earthquakes.', 'The system requires easy usability and efficient alert messaging, which is effectively handled by Apache Spark.']}, {'end': 12470.339, 'start': 11800.248, 'title': 'Earthquake waves and apache spark for data analysis', 'summary': 'The chapter discusses the primary and secondary waves of earthquakes, the process of creating roc using apache spark for earthquake analysis, and the basics of graph theory and graph processing algorithms.', 'duration': 670.091, 'highlights': ['The primary and secondary waves of earthquakes are explained, with the secondary wave being more severe and causing maximum damage.', 'The process of creating ROC using Apache Spark for earthquake analysis is detailed, covering the use of SBT for dependency management, project structure, and the execution of Spark code for data analysis.', 'The basics of graph theory and different types of graphs are explained, including vertices, edges, and the representation of vertices and edges in graph sets.']}, {'end': 13386.765, 'start': 12471.099, 'title': 'Types of graphs and spark graphx', 'summary': 'Covers different types of graphs, including undirected, directed, vertex-labeled, cyclic, edge-labeled, and weighted graphs, along with insights into spark graphx, which extends the spark rdd with a directed multigraph abstraction, fundamental operators, graph algorithms, and builders to simplify analytics tasks.', 'duration': 915.666, 'highlights': ['The chapter discusses various types of graphs such as undirected, directed, vertex-labeled, cyclic, edge-labeled, and weighted graphs, along with examples and explanations for each type.', 'It 
explains the concept of a directed multigraph in the context of property graph, highlighting the ability to support multiple parallel edges and user-defined objects attached to each vertex and edge.', 'It introduces Spark GraphX, which extends the Spark RDD by introducing a directed multigraph abstraction with properties attached to each vertex and edge, and provides fundamental operators, graph algorithms, and builders to simplify analytics tasks.']}, {'end': 13915.261, 'start': 13387.125, 'title': 'Modeling bipartite graph and property graph', 'summary': 'The chapter discusses modeling a bipartite graph for user and product properties, using inheritance and property graph efficiency. it also covers fault tolerance and functionality provided by vertex and edge rdds.', 'duration': 528.136, 'highlights': ['The chapter discusses modeling a bipartite graph for user and product properties, using inheritance and property graph efficiency.', 'Property graph provides fault tolerance, and the graph is partitioned across executors using a range of vertex partitioning rules.', 'Vertex RDD and edge RDD extend and optimize RDD, providing additional functionality for graph computation.', 'The vertex property might contain the username and occupation, and edges can be annotated with a string describing the relationship between the users.', 'The chapter explains constructing a property graph from raw files or RDDs using the graph object, with a code example.', 'The chapter demonstrates how to specify vertices, edges, and create a graph using Spark context and RDDs.']}], 'duration': 2879.095, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g11036166.jpg', 'highlights': ['Japan utilized Apache Spark for real-time earthquake detection, sending early warnings and saving millions of lives by informing schools, factories, and transportation systems in 60 seconds.', "The process of connecting to PySpark involves typing 'PySpark' to connect 
Python with Spark and accessing Spark's environment (PySpark).", 'Transferring a file to HDFS is demonstrated by creating a file, checking its presence, and putting the file in the default location of HDFS, with the file successfully transferred and accessible in the Hadoop file system.', 'The chapter discusses various types of graphs such as undirected, directed, vertex-labeled, cyclic, edge-labeled, and weighted graphs, along with examples and explanations for each type.', 'The basics of graph theory and different types of graphs are explained, including vertices, edges, and the representation of vertices and edges in graph sets.']}, {'end': 15347.226, 'segs': [{'end': 14037.588, 'src': 'embed', 'start': 14012.816, 'weight': 1, 'content': [{'end': 14022.186, 'text': 'So the triplet view logically joins the vertex and edge properties, yielding an RDD edge triplet with vertex property and your edge property.', 'start': 14012.816, 'duration': 9.37}, {'end': 14031.765, 'text': 'So, as you can see, it gives an RDD with s triplet and then it has vertex property as well as edge property associated with it,', 'start': 14022.819, 'duration': 8.946}, {'end': 14034.626, 'text': 'and it contains an instance of edge triplet class.', 'start': 14031.765, 'duration': 2.861}, {'end': 14037.588, 'text': "Now, I'm taking example of a join.", 'start': 14035.247, 'duration': 2.341}], 'summary': 'The triplet view combines vertex and edge properties, yielding an rdd edge triplet with vertex and edge properties associated with it.', 'duration': 24.772, 'max_score': 14012.816, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g14012816.jpg'}, {'end': 14198.688, 'src': 'embed', 'start': 14169.945, 'weight': 4, 'content': [{'end': 14173.108, 'text': 'So I hope that you guys are clear with the concepts of triplet.', 'start': 14169.945, 'duration': 3.163}, {'end': 14176.829, 'text': "So now let's quickly take a look at graph builders.", 
'start': 14174.747, 'duration': 2.082}, {'end': 14185.057, 'text': 'So as I already told you that GraphX provides several ways of building a graph from a collection of vertices and edges.', 'start': 14177.45, 'duration': 7.607}, {'end': 14188.44, 'text': 'Either it can be stored in RDD or it can be stored on disks.', 'start': 14185.457, 'duration': 2.983}, {'end': 14192.004, 'text': 'So in this graph object, first we have this apply method.', 'start': 14189.021, 'duration': 2.983}, {'end': 14198.688, 'text': 'So, basically, this apply method allows creating a graph from RDD of vertices and edges,', 'start': 14192.584, 'duration': 6.104}], 'summary': 'Graphx provides methods for building graphs from vertices and edges stored in rdd or on disks.', 'duration': 28.743, 'max_score': 14169.945, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g14169945.jpg'}, {'end': 14256.982, 'src': 'embed', 'start': 14215.037, 'weight': 2, 'content': [{'end': 14219.92, 'text': 'then we are providing the edge RDD and then we are providing the default vertex attribute.', 'start': 14215.037, 'duration': 4.883}, {'end': 14223.735, 'text': 'So it will create the vertex which we have specified.', 'start': 14220.552, 'duration': 3.183}, {'end': 14233.463, 'text': 'then it will create the edges which are specified and if there is a vertex which is being referred by the edge but it is not present in this vertex,', 'start': 14223.735, 'duration': 9.728}, {'end': 14233.803, 'text': 'RDD.', 'start': 14233.463, 'duration': 0.34}, {'end': 14240.289, 'text': 'So what it does it creates that vertex and assigns them the value of this default vertex attribute.', 'start': 14234.244, 'duration': 6.045}, {'end': 14242.19, 'text': 'Next we have from edges.', 'start': 14240.849, 'duration': 1.341}, {'end': 14249.236, 'text': 'So graph dot from edges allows creating a graph only from the RDD of edges,', 'start': 14242.45, 'duration': 6.786}, {'end': 
14254.88, 'text': 'which automatically creates any vertices mentioned in the edges and assigns them the default value.', 'start': 14249.236, 'duration': 5.644}, {'end': 14256.982, 'text': 'So what happens over here?', 'start': 14255.781, 'duration': 1.201}], 'summary': 'Creating graphs from rdds and default vertex attribute, auto-creating missing vertices.', 'duration': 41.945, 'max_score': 14215.037, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g14215037.jpg'}, {'end': 14710.96, 'src': 'embed', 'start': 14685.129, 'weight': 0, 'content': [{'end': 14693.212, 'text': 'It does not change the IDs of the vertices, and it helps in transforming those values. Now, talking about the minus method.', 'start': 14685.129, 'duration': 8.083}, {'end': 14698.455, 'text': 'It shows only vertices unique in the set based on their vertex IDs.', 'start': 14693.773, 'duration': 4.682}, {'end': 14706.018, 'text': 'So what happens if you are providing two sets of vertices: the first contains v1, v2 and v3, and the second one contains v3.', 'start': 14698.755, 'duration': 7.263}, {'end': 14710.96, 'text': 'So it will return v1 and v2 because they are unique to the first set.', 'start': 14706.438, 'duration': 4.522}], 'summary': 'The minus method returns the vertices unique to the first set, based on their vertex ids.', 'duration': 25.831, 'max_score': 14685.129, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g14685129.jpg'}], 'start': 13916.107, 'title': 'Graphx and property graph optimization', 'summary': 'Covers deconstructing property graph using case classes in scala, introducing graphx triplets and graph builders, graph loading and optimization, optimization techniques, including partitioning and structural operators, for efficient data manipulation and joining, aiming to reduce communication and storage overhead.', 'chapters': [{'end': 14088.395, 'start': 13916.107, 'title': 'Deconstructing property graph 
using case class', 'summary': 'The chapter covers deconstructing property graph using case classes in scala to filter and count edges where source id is greater than destination id, and utilizing the triplet view to join vertex and edge properties, resulting in an rdd edge triplet.', 'duration': 172.288, 'highlights': ['Using Scala case expression to deconstruct the tuple and filter edges where source ID is greater than destination ID, then counting those edges.', 'Utilizing case class type constructor to deconstruct the result and filter edges based on specified properties, then counting them.', 'The triplet view in GraphX logically joins vertex and edge properties, yielding an RDD edge triplet with vertex and edge properties, and contains an instance of edge triplet class.', 'Performing left join by selecting source ID, destination ID, source attribute, destination attribute, and edge attribute, ensuring the edge source ID is equal to source ID and the edge destination ID is equal to destination ID.']}, {'end': 14301.958, 'start': 14088.395, 'title': 'Graphx: triplets and graph builders', 'summary': 'The chapter introduces the concept of triplets in graphx, including the rendering of relationships between users and the creation of graphs from rdds or disks using various methods, such as apply method and from edges, with a focus on handling duplicate vertices and edges.', 'duration': 213.563, 'highlights': ['The concept of triplets in GraphX involves rendering relationships between users using the triplet view of a graph, which consists of vertices denoting users and edges defining the relationships between them.', 'GraphX provides multiple methods for building a graph from a collection of vertices and edges, including the apply, fromEdges, and fromEdgeTuples methods, each with its own handling of duplicate vertices and edges.', 'The apply method in GraphX allows the creation of a graph from RDD of vertices and edges, where duplicate vertices are picked arbitrarily and default attributes are 
assigned to vertices that the edge RDD references but the vertex RDD lacks.', 'The fromEdges method in GraphX creates a graph only from the RDD of edges, automatically creating any vertices mentioned in the edges and assigning them default values.', 'The fromEdgeTuples method in GraphX supports deduplicating edges and allows the creation of a graph from only the RDD of edge tuples, automatically creating specified vertices and allocating default values to them.']}, {'end': 14935.42, 'start': 14301.958, 'title': 'Graphx: graph loading and optimization', 'summary': 'The chapter discusses the necessity of co-locating identical edges, graph loading from the file system, and the functionalities of vertex and edge rdds, providing an optimized data structure and additional functionalities for efficient data manipulation and joining.', 'duration': 633.462, 'highlights': ['The GraphLoader object is used to load the graph from the file system, and graph.groupEdges requires the graph to be repartitioned to co-locate identical edges, creating a graph from specified edges.', 'The canonical orientation allows reorienting edges in the positive direction required by the connected components algorithm, and the minimum edge partition specifies the minimum number of edge partitions to generate, providing additional functionalities for efficient data structuring and manipulation.', 'The vertex RDD uses a reusable HashMap data structure, allowing constant-time joins without hash evaluations, and exposes multiple additional functionalities like filter, map values, minus, left join, inner join, and aggregate using index functions for optimized data manipulation and joining operations.', 'The edge RDD organizes the edge in block partition using various partitioning strategies, and provides additional functionalities like map values, reverse, and inner join for efficient edge attribute transformation and joining.']}, {'end': 15164.245, 'start': 14935.42, 'title': 'Graphx optimization and partitioning', 'summary': 'The chapter discusses 
the optimization techniques in graphx, including the adoption of a vertex cut approach for graph partitioning, the use of partition strategies to assign edges to machines, and the efficient joining of vertex attributes with edges, aiming to reduce communication and storage overhead while performing operations. it also covers the basic operators of property graphs, such as map vertices, map edges, and map triplets, which yield new graphs with modified properties based on user-defined map functions.', 'duration': 228.825, 'highlights': ['GraphX adopts a vertex cut approach for graph partitioning, aiming to reduce communication and storage overhead while performing operations.', 'The partition strategy decides how to assign edges to different machines or partitions, and users can choose between different strategies using the graph dot partition by operator.', 'Efficient joining of vertex attributes with edges is a main challenge due to real-world graphs typically having more edges than vertices, and the optimization involves moving vertex attributes to the edges and maintaining a routing table for broadcasting vertices and implementing the required joins.', 'The property graph contains operators such as map vertices, map edges, and map triplets, each yielding a new graph with modified properties based on user-defined map functions.']}, {'end': 15347.226, 'start': 15164.505, 'title': 'Graph transformation and structural operators', 'summary': "Explains graph transformation operators such as map vertices, map edges, and map triplets, which modify the vertices, edges, and triplets, allowing the resulting graph to reuse the original graph's structural indices, along with a discussion on structural operators including reverse, subgraph, mask, and group edges operators, enabling efficient manipulation of graph structures.", 'duration': 182.721, 'highlights': ["The map vertices, map edges, and map triplets operators allow for the transformation of vertices, edges, and 
triplets, respectively, preserving the original graph's structure and enabling the creation of new graphs with modified properties.", 'The reverse operator in the structural operators category efficiently reverses the edge directions in a graph without modifying vertex or edge properties, facilitating data manipulation without data movement or duplication.', 'The subgraph operator filters the graph based on vertex and edge predicates, returning a graph containing only the vertices and edges that satisfy the specified conditions, enabling efficient subsetting of the graph structure.']}], 'duration': 1431.119, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g13916107.jpg', 'highlights': ['Utilizing case class type constructor to deconstruct the result and filter edges based on specified properties, then counting them.', 'The triplet view in GraphX logically joins vertex and edge properties, yielding an RDD edge triplet with vertex and edge properties, and contains an instance of edge triplet class.', 'The concept of triplets in GraphX involves rendering relationships between users using the triplet view of a graph, which consists of vertices denoting users and edges defining the relationships between them.', 'The apply method in GraphX allows the creation of a graph from RDD of vertices and edges, where duplicate vertices are picked arbitrarily and default attributes are assigned to vertices not found in the edge RDD.', 'The graph loader object is used to load the graph from the file system, and graph dot group edges requires the graph to be repartitioned to co-locate identical edges, creating a graph from specified edges.', 'The vertex RDD uses a reusable HashMap data structure, allowing constant-time joins without hash evaluations, and exposes multiple additional functionalities like filter, map values, minus, left join, inner join, and aggregate using index functions for optimized data manipulation and 
joining operations.', 'GraphX adopts a vertex cut approach for graph partitioning, aiming to reduce communication and storage overhead while performing operations.', "The map vertices, map edges, and map triplets operators allow for the transformation of vertices, edges, and triplets, respectively, preserving the original graph's structure and enabling the creation of new graphs with modified properties."]}, {'end': 16755.524, 'segs': [{'end': 15534.408, 'src': 'embed', 'start': 15511.164, 'weight': 1, 'content': [{'end': 15522.489, 'text': 'So the join vertices operator joins the vertices with the input RDD and returns a new graph with the vertex properties obtained after applying the user-defined map function.', 'start': 15511.164, 'duration': 11.325}, {'end': 15528.251, 'text': 'Now the vertices without a matching value in the RDD basically retains their original value.', 'start': 15522.909, 'duration': 5.342}, {'end': 15530.892, 'text': 'Now talking about outer join vertices.', 'start': 15528.751, 'duration': 2.141}, {'end': 15534.408, 'text': 'So it behaves similar to join vertices,', 'start': 15531.625, 'duration': 2.783}], 'summary': 'Join vertices operator merges vertex properties with input rdd.', 'duration': 23.244, 'max_score': 15511.164, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g15511164.jpg'}, {'end': 15789.312, 'src': 'embed', 'start': 15766.654, 'weight': 4, 'content': [{'end': 15774.721, 'text': 'So it basically applies the map function to all the triplets and then they aggregate those messages using this user defined reduce function.', 'start': 15766.654, 'duration': 8.067}, {'end': 15780.345, 'text': 'Now the newer version of this map reduce triplets operator is the aggregate messages.', 'start': 15775.221, 'duration': 5.124}, {'end': 15784.489, 'text': "Now moving ahead, let's talk about computing degree information operator.", 'start': 15780.946, 'duration': 3.543}, {'end': 
15789.312, 'text': 'So one of the common aggregation task is computing the degree of each vertex.', 'start': 15784.909, 'duration': 4.403}], 'summary': 'Map function applied to triplets, newer version is aggregate messages, computing degree of each vertex.', 'duration': 22.658, 'max_score': 15766.654, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g15766654.jpg'}, {'end': 16123.135, 'src': 'embed', 'start': 16093.723, 'weight': 0, 'content': [{'end': 16098.828, 'text': 'so you have to provide the name of the application, and this is get or create method.', 'start': 16093.723, 'duration': 5.105}, {'end': 16102.832, 'text': 'now. next you are initializing the spark context as sc.', 'start': 16098.828, 'duration': 4.004}, {'end': 16104.594, 'text': 'now coming to the code.', 'start': 16102.832, 'duration': 1.762}, {'end': 16106.495, 'text': 'so we are specifying a graph.', 'start': 16104.594, 'duration': 1.901}, {'end': 16109.058, 'text': 'then this graph is containing double and int.', 'start': 16106.495, 'duration': 2.563}, {'end': 16112.947, 'text': 'Now I just told you that we are importing graph generator.', 'start': 16109.864, 'duration': 3.083}, {'end': 16118.051, 'text': 'So this graph generator is to generate a random graph for simplicity.', 'start': 16112.987, 'duration': 5.064}, {'end': 16123.135, 'text': 'So you would have multiple number of edges and vertices then you are using this log normal graph.', 'start': 16118.091, 'duration': 5.044}], 'summary': 'Code initializes a graph using a graph generator to create a random graph with multiple edges and vertices.', 'duration': 29.412, 'max_score': 16093.723, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g16093723.jpg'}], 'start': 15347.506, 'title': 'Graph operations, predicates, aggregation, and algorithm implementation in spark', 'summary': 'Covers graph operations such as subgraph and mask 
operators, aggregation operators like merging duplicate edges and neighborhood aggregation, usage of graphx in spark including collect neighbors id, and graph algorithm implementations such as calculating average age, pagerank, and connected components in spark.', 'chapters': [{'end': 15449.489, 'start': 15347.506, 'title': 'Graph operations and predicates', 'summary': 'The chapter discusses graph operations and predicates, including the subgraph operator to restrict graphs based on specified criteria, the mask operator for comparing and returning common subgraphs, and the group edges operator for merging parallel edges in a multigraph.', 'duration': 101.983, 'highlights': ['The subgraph operator can be used to restrict the graph based on specified predicates for vertices and edges, yielding a connected graph.', 'The mask operator returns a subgraph containing vertices and edges common to both the input graph and another related graph, allowing for graph restriction based on properties in the related graph.', 'The group edges operator merges parallel edges in a multigraph.']}, {'end': 15830.752, 'start': 15449.489, 'title': 'Graph aggregation operators', 'summary': 'The chapter discusses graph aggregation operators, including merging duplicate edges, joining vertices, and neighborhood aggregation, with examples and explanations of aggregate messages, mapreduce triplet transition, and computing degree information.', 'duration': 381.263, 'highlights': ['The aggregate messages operator applies a user-defined function to each edge triplet in the graph, and then uses a merge message function to aggregate those messages at the destination vertex, returning a vertex RDD with the aggregated messages.', 'The MapReduce triplet transition operator, used in older versions of GraphX, applies a user-defined map function to each triplet and aggregates messages using a user-defined reduce function.', 'The graph ops class contains operators to compute the degrees of each vertex, including maximum 
in-degree, maximum out-degree, and maximum degree, which are important in the context of directed graphs.', 'Graph aggregation operators can help in reducing the size of the graph by merging duplicate edges and performing aggregation, as well as joining data from external collections and aggregating information about the neighborhood of each vertex.']}, {'end': 16253.914, 'start': 15831.292, 'title': 'Using graphx in spark', 'summary': 'The chapter introduces the usage of the collectNeighborIds and collectNeighbors operators in spark to easily collect neighboring vertices and their attributes, and demonstrates the process of starting hadoop and spark daemons, exploring default data in spark, and using graphx and aggregate messages operator to compute average age of followers.', 'duration': 422.622, 'highlights': ['The collectNeighborIds operator takes the edge direction as the parameter and returns a vertex RDD containing the array of vertex IDs neighboring to the particular vertex, while the collectNeighbors operator returns an array with the vertex ID and the vertex attribute both.', 'The daemons of Hadoop and Spark are started and checked, with all Hadoop daemons such as name node, data node, secondary name node, node manager, and resource manager being up.', 'The process of using GraphX and the aggregate messages operator to compute the average age of the more senior followers of each user is explained, including the usage of graph generator to generate a random graph with a hundred vertices, mapping IDs to doubles, and performing aggregate messages to calculate the average age of older followers.']}, {'end': 16755.524, 'start': 16253.914, 'title': 'Graph algorithm implementation in spark', 'summary': 'The chapter discusses the implementation of graph algorithms in spark, including calculating average age of followers, executing pagerank to measure vertex importance, and labeling connected components of the graph.', 'duration': 501.61, 'highlights': ['The chapter details the process of calculating the 
average age of followers by dividing the total age by count and using the reduce function, resulting in averages such as 99.0, 76.8, and 57.55.', 'The implementation of PageRank in Spark is explained, including the static and dynamic methods for running iterations until convergence, and the process of loading and joining data to obtain vertex ranks, with examples such as Barack Obama having a rank of 1.45.', 'The connected components algorithm is discussed, focusing on labeling each connected component of the graph with the ID of its lowest numbered vertex, with practical examples of implementation.', 'The transcript also includes practical instructions for executing the examples, including navigating directories, executing jars, and accessing files in HDFS.']}], 'duration': 1408.018, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g15347506.jpg', 'highlights': ['The process of using GraphX and the aggregate messages operator to compute the average age of the more senior followers of each user is explained, including the usage of graph generator to generate a random graph with a hundred vertices, mapping IDs to doubles, and performing aggregate messages to calculate the average age of older followers.', 'The implementation of PageRank in Spark is explained, including the static and dynamic methods for running iterations until convergence, and the process of loading and joining data to obtain vertex ranks, with examples such as Barack Obama having a rank of 1.45.', 'The connected components algorithm is discussed, focusing on labeling each connected component of the graph with the ID of its lowest numbered vertex, with practical examples of implementation.', 'The chapter details the process of calculating the average age of followers by dividing the total age by count and using the reduce function, resulting in averages such as 99.0, 76.8, and 57.55.', 'The aggregate messages operator applies a user-defined 
function to each edge triplet in the graph, and then uses a merge message function to aggregate those messages at the destination vertex, returning a vertex RDD with the aggregated messages.']}, {'end': 18810.196, 'segs': [{'end': 16926.157, 'src': 'embed', 'start': 16903.182, 'weight': 1, 'content': [{'end': 16912.168, 'text': 'now we are again using this graph loader class and we are loading the followers.txt, which contains the edges.', 'start': 16903.182, 'duration': 8.986}, {'end': 16920.473, 'text': 'as you can see, here we are using this partition by argument and we are passing the random vertex cut, which is the partition strategy.', 'start': 16912.168, 'duration': 8.305}, {'end': 16926.157, 'text': 'so this is how you can go ahead and you can implement a partition strategy.', 'start': 16921.073, 'duration': 5.084}], 'summary': 'Using graph loader class to load followers.txt with random vertex cut partition strategy.', 'duration': 22.975, 'max_score': 16903.182, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g16903182.jpg'}, {'end': 17153.186, 'src': 'embed', 'start': 17125.343, 'weight': 2, 'content': [{'end': 17128.565, 'text': 'So I have successfully created this Spark SQL context.', 'start': 17125.343, 'duration': 3.222}, {'end': 17132.327, 'text': 'So this is basically for running SQL queries over the data frames.', 'start': 17129.105, 'duration': 3.222}, {'end': 17136.669, 'text': 'Now let me go ahead and import the data.', 'start': 17134.288, 'duration': 2.381}, {'end': 17140.38, 'text': "So I'm loading the data in data frame.", 'start': 17137.939, 'duration': 2.441}, {'end': 17146.923, 'text': 'So the format of file is CSV then an option the header is already added.', 'start': 17140.8, 'duration': 6.123}, {'end': 17148.224, 'text': "So that's why it's true.", 'start': 17147.023, 'duration': 1.201}, {'end': 17153.186, 'text': 'Then it will automatically infer the schema and then in the load 
parameter.', 'start': 17149.004, 'duration': 4.182}], 'summary': 'Successfully created spark sql context for running sql queries over data frames and importing csv data with header option set to true.', 'duration': 27.843, 'max_score': 17125.343, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g17125343.jpg'}, {'end': 17328.525, 'src': 'embed', 'start': 17301.143, 'weight': 0, 'content': [{'end': 17305.667, 'text': 'So as you can see it has one column that is value and it has data type long.', 'start': 17301.143, 'duration': 4.524}, {'end': 17311.191, 'text': 'So I have taken all these start and end station ID and using this flat map.', 'start': 17306.207, 'duration': 4.984}, {'end': 17324.542, 'text': 'I have iterated over all the start and end station ID and then using the distinct function and taking the unique values and converting it to data frames so I can use these stations and using the station.', 'start': 17311.411, 'duration': 13.131}, {'end': 17328.525, 'text': "I'll basically keep each of these stations in a vertex.", 'start': 17324.602, 'duration': 3.923}], 'summary': 'Data processed to create unique station vertices for analysis.', 'duration': 27.382, 'max_score': 17301.143, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g17301143.jpg'}, {'end': 17393.352, 'src': 'embed', 'start': 17361.466, 'weight': 6, 'content': [{'end': 17369.088, 'text': 'We are joining it with just stations at a station value should be equal to just station station ID.', 'start': 17361.466, 'duration': 7.622}, {'end': 17380.263, 'text': 'So as we have created stations and just station, so we are joining it and then we are selecting the station ID and start station name.', 'start': 17369.608, 'duration': 10.655}, {'end': 17386.568, 'text': 'Then we are mapping row 0 and Row 1.', 'start': 17380.484, 'duration': 6.084}, {'end': 17393.352, 'text': 'So your row 0 will 
basically be your vertex ID and Row 1 will be the string that is the name of your station.', 'start': 17386.568, 'duration': 6.784}], 'summary': 'Joining stations and just station by station id to map vertex id with station name.', 'duration': 31.886, 'max_score': 17361.466, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g17361466.jpg'}, {'end': 17596.715, 'src': 'embed', 'start': 17529.736, 'weight': 4, 'content': [{'end': 17532.698, 'text': 'So let us quickly go ahead and check the number of vertices.', 'start': 17529.736, 'duration': 2.962}, {'end': 17551.108, 'text': 'So these are the number of vertices again, we can check the number of edges as well.', 'start': 17544.705, 'duration': 6.403}, {'end': 17560.273, 'text': 'So these are the number of edges and to get a sanity check.', 'start': 17555.711, 'duration': 4.562}, {'end': 17565.036, 'text': "So let's go ahead and check the number of records that are present in the data frame.", 'start': 17560.614, 'duration': 4.422}, {'end': 17576.363, 'text': 'So as you can see that the number of edges in our graph and the count in our data frame is similar or you can see the same.', 'start': 17568.234, 'duration': 8.129}, {'end': 17586.193, 'text': "So now let's go ahead and run page rank on our data so we can either run a set number of iterations or we can run it until the convergence.", 'start': 17576.863, 'duration': 9.33}, {'end': 17589.797, 'text': "So in my case, I'll run it till convergence.", 'start': 17586.774, 'duration': 3.023}, {'end': 17594.852, 'text': "So it's rank then station graph then page rank.", 'start': 17591.886, 'duration': 2.966}, {'end': 17596.715, 'text': 'So I specified a double value.', 'start': 17595.233, 'duration': 1.482}], 'summary': 'Checked vertices, edges, and records. 
Ran PageRank till convergence.', 'duration': 66.979, 'max_score': 17529.736, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g17529736.jpg'}, {'end': 17860.617, 'src': 'embed', 'start': 17829.897, 'weight': 3, 'content': [{'end': 17835.838, 'text': 'and again, on the contrary, you can find out the stations where there are more edges,', 'start': 17829.897, 'duration': 5.941}, {'end': 17841.599, 'text': 'or you can say trips, leaving those stations but there are fewer trips coming into those stations.', 'start': 17835.838, 'duration': 5.761}, {'end': 17845.32, 'text': 'So I guess you guys are now clear with Spark GraphX.', 'start': 17842.339, 'duration': 2.981}, {'end': 17848.247, 'text': 'Then we discussed the different types of graphs.', 'start': 17845.985, 'duration': 2.262}, {'end': 17851.129, 'text': 'Then moving ahead, we discussed the features of GraphX.', 'start': 17848.567, 'duration': 2.562}, {'end': 17853.471, 'text': 'Then we discussed something about property graph.', 'start': 17851.249, 'duration': 2.222}, {'end': 17860.617, 'text': 'We understood what a property graph is, how you can create vertices, how you can create edges, how to use vertex RDD, edge RDD.', 'start': 17853.511, 'duration': 7.106}], 'summary': 'Discussed stations with more outgoing edges, types of graphs, and features of GraphX.', 'duration': 30.72, 'max_score': 17829.897, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g17829897.jpg'}], 'start': 16755.864, 'title': 'Graph analysis and comparison of Hadoop and Spark', 'summary': "Covers graph analysis using connected components and triangle counting algorithms, Spark graph operations, and a comparison of Hadoop and Spark. 
It also highlights Spark's in-memory processing for near real-time analytics and its strength in real-time data analysis and iterative algorithms.", 'chapters': [{'end': 16797.937, 'start': 16755.864, 'title': 'Graph analysis: connected components and user join', 'summary': 'Covers the use of a connected components algorithm to find and join connected components with user names, similar to the process discussed in PageRank, and then prints the connected components by username.', 'duration': 42.073, 'highlights': ['Using a connected components algorithm to find connected components', 'Joining the connected components with the username from the user file', 'Similar process to PageRank, involving fields 0 and 1 of the user.txt file', 'Printing the connected components by username using ccByUsername.collect']}, {'end': 17361.406, 'start': 16798.797, 'title': 'Spark GraphX demo and triangle counting', 'summary': 'Discusses the triangle counting algorithm in Spark GraphX, which determines the number of triangles passing through each vertex, providing a measure of clustering, and then demonstrates the use of Ford bike trip history data in Spark.', 'duration': 562.609, 'highlights': ['The triangle counting algorithm in Spark GraphX determines the number of triangles passing through each vertex, providing a measure of clustering, and is implemented in the TriangleCount object.', 'Spark is used to analyze the Ford bike trip history data, involving the import of data into a data frame and the creation of a data frame for stations with unique values.', 'The process involves creating a data frame for stations with unique values, using flatMap to iterate over start and end station IDs, and converting the values to data frames for defining vertices in the graph.']}, {'end': 17829.897, 'start': 17361.466, 'title': 'Graph operations in Spark', 'summary': 'Demonstrates the creation of a graph, execution of the PageRank algorithm, and analysis of station vertices 
and edges, highlighting the top 10 stations with the highest PageRank values and the inbound and outbound trip counts of various stations.', 'duration': 468.431, 'highlights': ['The top 10 stations with the highest PageRank values are displayed, indicating stations with more incoming trips as a measure of importance.', 'The count of trips from San Francisco Ferry Building to various stations, such as San Francisco and others, is presented, providing insights into common destinations.', 'The in-degree and out-degree of stations are calculated, revealing the top 10 stations with the most incoming and outgoing trips, aiding in understanding station traffic.']}, {'end': 18376.018, 'start': 17829.897, 'title': 'Hadoop vs. Spark comparison', 'summary': "Compares Hadoop and Spark based on parameters such as performance, ease of use, and cost, highlighting that Spark's in-memory processing delivers near real-time analytics and its reduced system requirement can lead to lower cost per unit of computation.", 'duration': 546.121, 'highlights': ['The in-memory processing of Spark delivers near real-time analytics, making it suitable for credit card processing systems, machine learning, security analytics, and processing data for IoT sensors.', "Hadoop's MapReduce uses batch processing and was not built for real-time processing, while YARN is designed for parallel processing over distributed datasets.", 'Spark provides user-friendly APIs for Scala, Java, Python, and Spark SQL, with an interactive shell for querying and performing actions, making it easier for developers to learn and use.', 'Hadoop allows easy data ingestion using the shell or by integrating with tools like Sqoop and Flume, and can be integrated with multiple tools like Hive and Pig for analytics, making it user-friendly in its own way.', 'Hadoop and Spark are both Apache open source projects, with no cost for the software, and are designed to run on commodity hardware with low total cost of ownership.', 'Due to its 
in-memory processing, Spark requires a lot of memory but can deal with a standard speed and amount of disk, while Hadoop requires a lot of disk space as well as faster transfer speed.']}, {'end': 18810.196, 'start': 18376.678, 'title': 'Batch vs stream processing and hadoop vs spark', 'summary': 'Discusses the differences between batch and stream processing, the comparison of hadoop and spark in terms of fault tolerance and security, and the use cases where hadoop and spark fit best, highlighting the strength of spark in real-time data analysis and iterative algorithms.', 'duration': 433.518, 'highlights': ['Stream processing is a current trend in big data world, providing speed and real-time information, allowing businesses to quickly react to changing business needs. It has seen a rapid growth in demand.', 'Spark is designed to cover a wide range of workloads, including batch application, iterative algorithms, interactive queries, and streaming, providing fault tolerance through RDDs and processing data 100 times faster than MapReduce.', 'Hadoop and Spark complement each other well, with Hadoop bringing huge datasets under control and Spark providing real-time in-memory processing, high processing speed, and advanced analytics, resulting in the best outcomes.']}], 'duration': 2054.332, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g16755864.jpg', 'highlights': ['Spark provides in-memory processing for near real-time analytics', 'Spark covers a wide range of workloads, processing data 100 times faster than MapReduce', 'Hadoop and Spark complement each other well, bringing huge datasets under control and providing real-time in-memory processing', 'Using a connected components algorithm to find connected components', 'The triangle counting algorithm in Spark Graphics determines the number of triangles passing through each vertex', 'The top 10 stations with the highest page rank values are displayed', 'Stream 
processing is a current trend in big data world, providing speed and real-time information', 'Hadoop and Spark are both Apache open source projects, designed to run on commodity hardware with low total cost of ownership']}, {'end': 20756.243, 'segs': [{'end': 19093.096, 'src': 'embed', 'start': 19063.91, 'weight': 3, 'content': [{'end': 19070.352, 'text': 'the Kafka cluster is distributed and have multiple machines running in parallel, and this is the reason why Kafka is fast,', 'start': 19063.91, 'duration': 6.442}, {'end': 19071.953, 'text': 'scalable and fault-tolerant.', 'start': 19070.352, 'duration': 1.601}, {'end': 19077.395, 'text': 'Now, let me tell you that Kafka is developed at LinkedIn and later it became a part of Apache project.', 'start': 19072.593, 'duration': 4.802}, {'end': 19080.556, 'text': 'Now, let us look at some of the important terminologies.', 'start': 19078.115, 'duration': 2.441}, {'end': 19083.002, 'text': "So we'll first start with topic.", 'start': 19081.261, 'duration': 1.741}, {'end': 19093.096, 'text': 'So topic is a category or feed name to which records are published and topic in Kafka are always multi subscriber, that is a topic can have 0,', 'start': 19083.562, 'duration': 9.534}], 'summary': 'Kafka is fast, scalable, and fault-tolerant due to its distributed cluster. developed at linkedin and now part of apache project. 
Topics in Kafka are multi-subscriber.', 'duration': 29.186, 'max_score': 19063.91, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g19063910.jpg'}, {'end': 19320.348, 'src': 'embed', 'start': 19290.293, 'weight': 0, 'content': [{'end': 19296.6, 'text': 'So if the replication factor is 3, it will have three copies which will reside on different brokers.', 'start': 19290.293, 'duration': 6.307}, {'end': 19308.172, 'text': 'So one replica is on broker 2, next is on broker 3 and next is on broker 5 and as you can see our EPL 5 so this 5 is from this broker 5.', 'start': 19297.121, 'duration': 11.051}, {'end': 19313.438, 'text': 'So the ID of the replica is the same as the ID of the broker that hosts it.', 'start': 19308.172, 'duration': 5.266}, {'end': 19320.348, 'text': 'Now moving ahead, one of the replicas of partition one will serve as the leader replica.', 'start': 19314.162, 'duration': 6.186}], 'summary': 'Replication factor 3: three copies on different brokers, replica ID same as broker ID, one replica serves as leader.', 'duration': 30.055, 'max_score': 19290.293, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g19290293.jpg'}, {'end': 20685.662, 'src': 'embed', 'start': 20649.982, 'weight': 5, 'content': [{'end': 20656.067, 'text': 'in which case Spark will keep all the elements around on the cluster for faster access,', 'start': 20649.982, 'duration': 6.085}, {'end': 20662.85, 'text': 'and whenever you execute the query next time over the data, then the query will be executed quickly and it will give you an instant result,', 'start': 20656.067, 'duration': 6.783}, {'end': 20668.653, 'text': 'right? 
So I hope that you guys are clear how Spark helps in interactive streaming analytics.', 'start': 20662.85, 'duration': 5.803}, {'end': 20671.743, 'text': "Now, let's talk about data integration.", 'start': 20669.66, 'duration': 2.083}, {'end': 20673.465, 'text': 'So, basically, as you know,', 'start': 20672.264, 'duration': 1.201}, {'end': 20685.662, 'text': 'that in large organizations data is basically produced on different systems across the business and basically you need a framework which can actually integrate different data sources.', 'start': 20673.465, 'duration': 12.197}], 'summary': 'Spark enables faster query execution for instant results in data integration.', 'duration': 35.68, 'max_score': 20649.982, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g20649982.jpg'}, {'end': 20735.85, 'src': 'embed', 'start': 20712.428, 'weight': 6, 'content': [{'end': 20719.973, 'text': 'So Spark basically gives you multiple options where you can go ahead and pick the data from, and again you can go ahead and write the data into.', 'start': 20712.428, 'duration': 7.545}, {'end': 20725.487, 'text': "Now let's quickly move ahead and we'll talk about different Spark components.", 'start': 20720.726, 'duration': 4.761}, {'end': 20728.628, 'text': 'So you can see here we have the Spark Core engine.', 'start': 20726.007, 'duration': 2.621}, {'end': 20735.85, 'text': 'So, basically, this is the core engine and on top of this core engine, you have Spark SQL, Spark Streaming, then MLlib,', 'start': 20728.668, 'duration': 7.182}], 'summary': 'Apache Spark offers various components such as Spark Core, Spark SQL, and Spark Streaming for processing and analyzing data.', 'duration': 23.422, 'max_score': 20712.428, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g20712428.jpg'}], 'start': 18810.756, 'title': 'Kafka messaging systems and Apache Spark', 'summary': "Delves into 
kafka's role in simplifying data pipelines, introducing apache kafka's features, kafka cluster overview, setting up kafka topics, and understanding apache spark's processing capabilities, including handling workloads such as streaming and processing, machine learning, interactive streaming analytics, and data integration.", 'chapters': [{'end': 18995.37, 'start': 18810.756, 'title': 'Kafka messaging system', 'summary': 'Discusses the complexity of data pipelines in real-time scenarios, explains the origin of messaging systems, and highlights how kafka simplifies communication between systems and decouples data pipelines, allowing for asynchronous and reliable communication.', 'duration': 184.614, 'highlights': ['Kafka decouples data pipelines and solves complexity problem, simplifying communication between systems and providing asynchronous and reliable communication.', 'Messaging systems reduce the complexity of data pipelines and make communication between systems simpler and manageable.', 'In a real-time scenario, different systems or services communicate with each other, creating complex data pipelines with an increase in the number of systems.', 'E-commerce websites and organizations can have multiple servers at the front end and back end, leading to complex data pipelines and difficulties in managing them.']}, {'end': 19219.533, 'start': 18996.21, 'title': 'Apache kafka messaging system', 'summary': 'Introduces apache kafka, a distributed publish-subscribe messaging system developed at linkedin and later became a part of apache project, allowing multiple consumers to subscribe to topics and consume data in parallel, with partitions enabling parallel reading from topics and consumer groups ensuring effective consumption of records.', 'duration': 223.323, 'highlights': ['Apache Kafka is a distributed publish subscribe messaging system, allowing multiple consumers to subscribe to topics and consume data in parallel, with partitions enabling parallel reading from 
topics and consumer groups ensuring effective consumption of records.', 'Kafka cluster is distributed and have multiple machines running in parallel, making it fast, scalable, and fault-tolerant.', 'Topic in Kafka are always multi subscriber, allowing 0, 1, or multiple consumers to subscribe to the topic and consume the data written to it, thus segregating messages and enabling consumer subscription to only the required topics.', 'Kafka topics are divided into a number of partitions, allowing parallel consumption of data by splitting the data in a particular topic across multiple brokers and enabling multiple consumers to read from a topic parallelly.']}, {'end': 19733.894, 'start': 19219.533, 'title': 'Kafka cluster overview', 'summary': 'Explains the concept of consumers, brokers, partitions, and replicas in a kafka cluster, along with their roles and functionalities, and demonstrates the process of setting up kafka and zookeeper.', 'duration': 514.361, 'highlights': ['The leader replica of a partition serves consumer messages, ensuring fault tolerance through replica maintenance and parallel message consumption from different partition leaders.', 'Zookeeper manages metadata information related to the Kafka cluster, while brokers act as single machines within the cluster, with each broker hosting replicas and serving specific functionalities.', 'The chapter demonstrates the process of starting Zookeeper and Kafka servers, creating and configuring properties files for brokers, and initiating multiple brokers within the Kafka cluster.']}, {'end': 20265.531, 'start': 19741.577, 'title': 'Setting up kafka topics and understanding apache spark', 'summary': 'Covers the creation and listing of kafka topics, along with the setup of a console producer and consumer. 
it also provides an introduction to apache spark, highlighting its features and benefits, including near real-time processing, implicit data parallelism, fault tolerance, and in-memory cluster computing.', 'duration': 523.954, 'highlights': ["Kafka: Creation of the topic 'Kafka-Spark' with replication factor 3 and 3 partitions, along with listing and description of the topic. (3 brokers, 3 partitions, 3 replication factor)", 'Console Producer and Consumer: Demonstration of producing and consuming messages using console producer and consumer for testing Kafka cluster. (Testing and monitoring Kafka cluster functionality)', 'Apache Spark: Introduction to Apache Spark, highlighting its features including near real-time processing, implicit data parallelism, fault tolerance, and in-memory cluster computing. (Open-source cluster computing framework, in-memory cluster computing)']}, {'end': 20756.243, 'start': 20265.531, 'title': 'Apache spark: processing, workloads, and components', 'summary': "Discusses apache spark's ability to perform batch processing 100 times faster than mapreduce, using rdds for fault tolerance and fast operations, and handling workloads such as streaming and processing, machine learning, interactive streaming analytics, and data integration.", 'duration': 490.712, 'highlights': ['Apache Spark can perform batch processing 100 times faster than MapReduce, making it the go-to tool for big data processing in the industry.', 'RDDs are resilient distributed data sets, providing fault tolerance and efficient operations by storing intermediate results in distributed memory.', 'Apache Spark can handle workloads like streaming and processing, machine learning, interactive streaming analytics, and data integration by integrating different data sources and offering diverse workloads for streaming SQL and machine learning.']}], 'duration': 1945.487, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g18810756.jpg', 'highlights': ['Apache Spark: Introduction to Apache Spark, highlighting its features including near real-time processing, implicit data parallelism, fault tolerance, and in-memory cluster computing. (Open-source cluster computing framework, in-memory cluster computing)', 'Apache Spark can perform batch processing up to 100 times faster than MapReduce, making it the go-to tool for big data processing in the industry.', 'Apache Spark can handle workloads like streaming and processing, machine learning, interactive streaming analytics, and data integration by integrating different data sources and offering diverse workloads for streaming, SQL, and machine learning.', "Kafka: Creation of the topic 'Kafka-Spark' with replication factor 3 and 3 partitions, along with listing and description of the topic. (3 brokers, 3 partitions, replication factor 3)", 'The Kafka cluster is distributed and has multiple machines running in parallel, making it fast, scalable, and fault-tolerant.', 'Kafka decouples data pipelines and solves the complexity problem, simplifying communication between systems and providing asynchronous and reliable communication.', 'Messaging systems reduce the complexity of data pipelines and make communication between systems simpler and manageable.', 'In a real-time scenario, different systems or services communicate with each other, creating complex data pipelines with an increase in the number of systems.']}, {'end': 22716.906, 'segs': [{'end': 20855.451, 'src': 'embed', 'start': 20816.677, 'weight': 2, 'content': [{'end': 20827.64, 'text': "Spark SQL is a new module in Spark which integrates relational processing with Spark's functional programming API and it supports querying data either via SQL or HQL,", 'start': 20816.677, 'duration': 10.963}, {'end': 20829.08, 'text': 'that is, Hive Query Language.', 'start': 20827.64, 'duration': 1.44}, {'end': 20833.642, 
'text': 'So, basically, for those of you who are familiar with RDBMS,', 'start': 20829.64, 'duration': 4.002}, {'end': 20841.467, 'text': 'Spark SQL is an easy transition from your earlier tool where you can go ahead and extend the boundaries of traditional relational data processing.', 'start': 20833.642, 'duration': 7.825}, {'end': 20843.339, 'text': 'Now talking about GraphX.', 'start': 20841.977, 'duration': 1.362}, {'end': 20848.504, 'text': 'So GraphX is the Spark API for graphs and graph-parallel computation.', 'start': 20843.759, 'duration': 4.745}, {'end': 20855.451, 'text': 'It extends the Spark RDD with a resilient distributed property graph, talking at a high level.', 'start': 20849.085, 'duration': 6.366}], 'summary': 'Spark SQL integrates relational processing with support for SQL and HQL, while Spark GraphX extends the RDD with a resilient distributed property graph.', 'duration': 38.774, 'max_score': 20816.677, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g20816677.jpg'}, {'end': 20937.62, 'src': 'embed', 'start': 20902.01, 'weight': 0, 'content': [{'end': 20905.894, 'text': 'collaborative filtering techniques, then cluster analysis methods.', 'start': 20902.01, 'duration': 3.884}, {'end': 20913.16, 'text': 'then you have dimensionality reduction techniques, you have feature extraction and transformation functions, then you have optimization algorithms.', 'start': 20905.894, 'duration': 7.266}, {'end': 20918.565, 'text': 'It is basically an ML package, or you can say machine learning package, on top of Spark.', 'start': 20913.62, 'duration': 4.945}, {'end': 20926.291, 'text': 'Then you also have something called PySpark, which is the Python package for Spark, so that you can go ahead and leverage Python over Spark.', 'start': 20919.145, 'duration': 7.146}, {'end': 20930.335, 'text': 'So I hope that you guys are clear with different Spark components.', 'start': 20926.912, 'duration': 3.423}, {'end': 
20937.62, 'text': 'So before moving to Kafka Spark streaming demo, so I have just given you a brief intro to Apache Spark.', 'start': 20931.275, 'duration': 6.345}], 'summary': 'Apache spark offers ml package with pyspark for python. also, includes collaborative filtering, cluster analysis, dimensionality reduction, feature extraction, and optimization algorithms.', 'duration': 35.61, 'max_score': 20902.01, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g20902010.jpg'}, {'end': 21048.706, 'src': 'embed', 'start': 21022.752, 'weight': 1, 'content': [{'end': 21028.495, 'text': 'This is Kafka transaction producer and the next one is the spark streaming Kafka master.', 'start': 21022.752, 'duration': 5.743}, {'end': 21040.781, 'text': "So first we will be producing messages from Kafka transaction producer and then we'll be streaming those records which is basically produced by this producer using the spark streaming Kafka master.", 'start': 21029.115, 'duration': 11.666}, {'end': 21044.683, 'text': 'So first, let me take you through this Kafka transaction producer.', 'start': 21041.261, 'duration': 3.422}, {'end': 21046.885, 'text': 'So this is our pom.xml file.', 'start': 21045.124, 'duration': 1.761}, {'end': 21048.706, 'text': 'Let me open it with gedit.', 'start': 21047.305, 'duration': 1.401}], 'summary': 'Introduction to kafka transaction producer and spark streaming kafka master.', 'duration': 25.954, 'max_score': 21022.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g21022752.jpg'}, {'end': 21284.992, 'src': 'embed', 'start': 21261.93, 'weight': 4, 'content': [{'end': 21269.476, 'text': 'Then we have get product, set product, get price, set price and all the getter and setter methods for each of the variables.', 'start': 21261.93, 'duration': 7.546}, {'end': 21274.285, 'text': 'this is the constructor.', 'start': 21271.983, 'duration': 2.302}, 
{'end': 21279.729, 'text': 'so here we are taking all the parameters, like transaction date, product price,', 'start': 21274.285, 'duration': 5.444}, {'end': 21284.992, 'text': 'and then we are setting the value of each of the variables using this operator.', 'start': 21279.729, 'duration': 5.263}], 'summary': 'Code includes getter, setter methods for variables, and constructor with parameters.', 'duration': 23.062, 'max_score': 21261.93, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g21261930.jpg'}, {'end': 21473.87, 'src': 'embed', 'start': 21444.707, 'weight': 5, 'content': [{'end': 21447.429, 'text': 'So, from all these Kafka properties object,', 'start': 21444.707, 'duration': 2.722}, {'end': 21453.453, 'text': 'we are calling those getter methods and retrieving those values and setting those values in this property object.', 'start': 21447.429, 'duration': 6.024}, {'end': 21456.735, 'text': 'So then we have partitioner class.', 'start': 21454.173, 'duration': 2.562}, {'end': 21465.201, 'text': 'So we are basically implementing this default partitioner which is present in org Apache Kafka client producer internals package.', 'start': 21457.175, 'duration': 8.026}, {'end': 21473.87, 'text': 'Then we are creating a producer over here and we are passing this props object which will set the properties.', 'start': 21465.861, 'duration': 8.009}], 'summary': 'Implementing default partitioner in kafka producer using getter and setter methods.', 'duration': 29.163, 'max_score': 21444.707, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g21444707.jpg'}], 'start': 20756.243, 'title': 'Apache spark and kafka components', 'summary': 'Covers the significance of spark core engine, its components such as spark streaming, spark sql, and graphics, and provides an overview of apache spark components including rdd, r, mllib, and pyspark. 
It also includes a detailed demonstration of setting up Kafka, Spark Streaming, and real-time analytics.', 'chapters': [{'end': 20855.451, 'start': 20756.243, 'title': 'Spark core engine and its components', 'summary': 'Discusses the significance of the Spark core engine in managing memory, fault recovery, scheduling, and job distribution on a cluster, along with its various components such as Spark Streaming, Spark SQL, and GraphX for different data processing functionalities.', 'duration': 99.208, 'highlights': ['The Spark core engine manages memory, fault recovery, scheduling, job distribution, and storage system interaction, serving as the heart of Spark.', 'Spark Streaming enables high-throughput and fault-tolerant stream processing for live data streams.', 'Spark SQL integrates relational processing and supports querying data via SQL or HQL, serving as an easy transition from traditional relational data processing tools.', 'GraphX is the Spark API for graphs and graph-parallel computation, extending the Spark RDD with a resilient distributed property graph at a high level.']}, {'end': 21602.582, 'start': 20855.451, 'title': 'Apache Spark and Kafka demo', 'summary': 'Provides an overview of Apache Spark components including Spark RDD, SparkR, Spark MLlib, and PySpark, along with a detailed demonstration of setting up Kafka and Spark Streaming for producing and streaming messages, showcasing the implementation of Kafka transaction producer and Spark streaming Kafka master.', 'duration': 747.131, 'highlights': ['The chapter provides an overview of Apache Spark components including Spark RDD, SparkR, Spark MLlib, and PySpark, showcasing its capabilities in machine learning and data processing.', 'A detailed demonstration of setting up Kafka and Spark Streaming for producing and streaming messages, including the implementation of Kafka transaction producer and Spark streaming Kafka master, is provided.', "The transcript delves into the technical details of 
setting up Kafka transaction producer, configuring properties, creating event producer API, constants, and performance monitoring for optimizing the Kafka producer's performance."]}, {'end': 22716.906, 'start': 21603.262, 'title': 'Kafka producer, spark streaming and pyspark', 'summary': 'Covers the implementation of kafka producer to send records to kafka, creation of custom json serializer, dispatching events using kafka producer, spark streaming application to consume data from kafka and demonstrate real-time analytics, and the overview of a techreview.com project involving real-time twitter data consumption, spark streaming, and cassandra database for trend comparison between technologies.', 'duration': 1113.644, 'highlights': ['The chapter covers the implementation of Kafka producer to send records to Kafka, including reading records from a file, creating a custom JSON serializer to write values as bytes, initializing and dispatching events using Kafka producer, and verifying the produced data in Kafka topics and consuming the records using the console consumer.', 'The chapter also elaborates on the Spark streaming application to consume data from Kafka, create a conf object, set up Kafka parameters, create a Java pair input DStream, continuously iterate over the RDD, print record details, start the Spark streaming context, and execute the Spark streaming application, followed by producing messages and displaying the consumed records.', 'Furthermore, the overview of a techreview.com project is provided, which involves consuming real-time Twitter feeds, performing spark streaming to identify minutely trends between different technologies, writing aggregated minutely count to Cassandra, and the solution strategy of continuously streaming data from Twitter, storing tweets in a Kafka topic, applying spark streaming analytics, and writing data to a Cassandra database.', 'The chapter concludes with an introduction to PySpark, highlighting its ecosystem components 
such as Spark SQL, Spark streaming, MLlib, GraphX, and the core API component, and the topics to be covered, including the installation, fundamental concepts, and a real-life use case demo.']}], 'duration': 1960.663, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g20756243.jpg', 'highlights': ['The Spark core engine manages memory, fault recovery, scheduling, job distribution, and storage system interaction, serving as the heart of Spark.', 'Spark streaming enables high throughput and fault-tolerant stream processing for live data streams.', 'Spark SQL integrates relational processing and supports querying data via SQL or HQL, serving as an easy transition from traditional relational data processing tools.', 'The chapter provides an overview of Apache Spark components including Spark RDD, Spark R, Spark MLlib, and PySpark, showcasing its capabilities in machine learning and data processing.', 'The chapter covers the implementation of Kafka producer to send records to Kafka, including reading records from a file, creating a custom JSON serializer to write values as bytes, initializing and dispatching events using Kafka producer, and verifying the produced data in Kafka topics and consuming the records using the console consumer.', 'The chapter also elaborates on the Spark streaming application to consume data from Kafka, create a conf object, set up Kafka parameters, create a Java pair input DStream, continuously iterate over the RDD, print record details, start the Spark streaming context, and execute the Spark streaming application, followed by producing messages and displaying the consumed records.', 'The overview of a techreview.com project is provided, which involves consuming real-time Twitter feeds, performing spark streaming to identify minutely trends between different technologies, writing aggregated minutely count to Cassandra, and the solution strategy of continuously streaming data from Twitter, 
storing tweets in a Kafka topic, applying spark streaming analytics, and writing data to a Cassandra database.', 'GraphX is the Spark API for graphs and graph-parallel computation, extending the Spark RDD with a resilient distributed property graph at a high level.', 'The chapter concludes with an introduction to PySpark, highlighting its ecosystem components such as Spark SQL, Spark streaming, MLlib, GraphX, and the core API component, and the topics to be covered, including the installation, fundamental concepts, and a real-life use case demo.']}, {'end': 23580.994, 'segs': [{'end': 22860.889, 'src': 'embed', 'start': 22832.25, 'weight': 0, 'content': [{'end': 22836.833, 'text': 'and talking about the readability of code, maintenance and familiarity with the Python API,', 'start': 22832.25, 'duration': 4.583}, {'end': 22839.755, 'text': 'for Apache Spark is far better than other programming languages.', 'start': 22836.833, 'duration': 2.922}, {'end': 22845.499, 'text': 'Python also provides various options for visualization, which is not possible using Scala or Java.', 'start': 22840.435, 'duration': 5.064}, {'end': 22849.501, 'text': 'Moreover, you can conveniently call R directly from Python.', 'start': 22845.919, 'duration': 3.582}, {'end': 22855.085, 'text': 'On top of this, Python comes with a wide range of libraries like NumPy, pandas, scikit-learn, Seaborn,', 'start': 22849.501, 'duration': 5.584}, {'end': 22860.889, 'text': 'Matplotlib, and these libraries aid in data analysis and also provide mature and time-tested statistics.', 'start': 22855.085, 'duration': 5.804}], 'summary': "Python's readability, visualization, and library support make it superior for apache spark, with a wide range of libraries like numpy, pandas, scikit-learn, seaborn, and matplotlib.", 'duration': 28.639, 'max_score': 22832.25, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g22832250.jpg'}, {'end': 22914.063, 'src': 
'embed', 'start': 22886.42, 'weight': 1, 'content': [{'end': 22890.807, 'text': 'So in order to install PySpark first make sure that you have Hadoop installed in your system.', 'start': 22886.42, 'duration': 4.387}, {'end': 22896.696, 'text': 'So, if you want to know more about how to install Hadoop, please check out our Hadoop playlist on YouTube,', 'start': 22891.127, 'duration': 5.569}, {'end': 22898.939, 'text': 'or you can check out our blog on Edureka website.', 'start': 22896.696, 'duration': 2.243}, {'end': 22906.519, 'text': 'First of all, you need to go to the Apache Spark official website, which is spark.apache.org, and the download section.', 'start': 22899.915, 'duration': 6.604}, {'end': 22914.063, 'text': 'You can download the latest version of spark release, which supports the latest version of Hadoop or Hadoop version 2.7 or above, now,', 'start': 22906.539, 'duration': 7.524}], 'summary': 'To install pyspark, ensure hadoop is installed. download latest spark release supporting hadoop 2.7+.', 'duration': 27.643, 'max_score': 22886.42, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g22886420.jpg'}, {'end': 23006.449, 'src': 'embed', 'start': 22981.462, 'weight': 2, 'content': [{'end': 22989.786, 'text': 'Through a Spark context object, you can create RDDs, accumulators, and broadcast variables, access Spark services, run jobs, and much more.', 'start': 22981.462, 'duration': 8.324}, {'end': 22997.929, 'text': 'The Spark context allows the Spark driver application to access the cluster through a resource manager, which can be YARN or the Spark cluster manager.', 'start': 22990.246, 'duration': 7.683}, {'end': 23006.449, 'text': 'The driver program then runs the operations inside the executors on the worker nodes and Spark context uses Py4J to launch a JVM,', 'start': 22998.429, 'duration': 8.02}], 'summary': 'Spark context enables access to cluster, run jobs, and much more', 'duration': 24.987, 
'max_score': 22981.462, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g22981462.jpg'}, {'end': 23345.432, 'src': 'embed', 'start': 23317.995, 'weight': 4, 'content': [{'end': 23323.779, 'text': 'So you can see here all the words are in the lower case and all of them are separated with the help of a space bar.', 'start': 23317.995, 'duration': 5.784}, {'end': 23330.383, 'text': "Now there's another transformation, which is known as the flat map, to give you a flattened output,", 'start': 23324.959, 'duration': 5.424}, {'end': 23333.164, 'text': "and I'm passing the same function which I created earlier.", 'start': 23330.383, 'duration': 2.781}, {'end': 23336.326, 'text': "So let's go ahead and have a look at the output for this one.", 'start': 23333.505, 'duration': 2.821}, {'end': 23345.432, 'text': 'So as you can see here, we got the first five elements, which are the same ones as we got here: contracts, transactions, and, the, records.', 'start': 23336.767, 'duration': 8.665}], 'summary': 'Demonstrating transformation functions in code, showing output of first five elements.', 'duration': 27.437, 'max_score': 23317.995, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23317995.jpg'}, {'end': 23416.315, 'src': 'embed', 'start': 23373.213, 'weight': 5, 'content': [{'end': 23382.076, 'text': 'and now we are using here the filter transformation and with the help of Lambda function, in which we have specified x as x not in stop words,', 'start': 23373.213, 'duration': 8.863}, {'end': 23387.278, 'text': 'and we have created another RDD, which is RDD 3, which will take the input from RDD 2.', 'start': 23382.076, 'duration': 5.202}, {'end': 23391.399, 'text': "So let's go ahead and see whether 'and' and 'the' are removed or not.", 'start': 23387.278, 'duration': 4.121}, {'end': 23395.281, 'text': 'So you can see contracts transaction records of them.', 
'start': 23391.94, 'duration': 3.341}, {'end': 23402.126, 'text': "If you look at output 5, we have contracts and transaction, and 'and', 'the' and 'in' are not in this list.", 'start': 23395.802, 'duration': 6.324}, {'end': 23407.289, 'text': 'Suppose I want to group the data according to the first three characters of any element.', 'start': 23402.546, 'duration': 4.743}, {'end': 23411.412, 'text': "So for that I'll use the group by and I'll use the Lambda function again.", 'start': 23407.609, 'duration': 3.803}, {'end': 23413.453, 'text': "So let's have a look at the output.", 'start': 23411.872, 'duration': 1.581}, {'end': 23416.315, 'text': 'So you can see we have edg and edges.', 'start': 23414.074, 'duration': 2.241}], 'summary': "Using filter transformation and lambda function to remove stop words, group data by first three characters, resulting in 'edg' and 'edges'.", 'duration': 43.102, 'max_score': 23373.213, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23373213.jpg'}, {'end': 23464.534, 'src': 'embed', 'start': 23439.66, 'weight': 3, 'content': [{'end': 23447.006, 'text': 'All I need to do is initialize another RDD, which is the num_RDD, and we use sc.parallelize,', 'start': 23439.66, 'duration': 7.346}, {'end': 23453.131, 'text': 'and the range we have given is 1 to 10,000, and we will use the reduce action here to see the output.', 'start': 23447.006, 'duration': 6.125}, {'end': 23458.452, 'text': 'As you can see here, we have the sum of the numbers ranging from 1 to 10,000.', 'start': 23453.851, 'duration': 4.601}, {'end': 23460.733, 'text': 'Now, this was all about RDD.', 'start': 23458.452, 'duration': 2.281}, {'end': 23464.534, 'text': 'The next topic that we have on our list is broadcast and accumulators.', 'start': 23460.753, 'duration': 3.781}], 'summary': 'Initialized num rdd, used parallelize with range 1 to 10,000, and performed reduce action to get sum of numbers. 
next, discussed broadcast and accumulators.', 'duration': 24.874, 'max_score': 23439.66, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23439660.jpg'}], 'start': 22716.906, 'title': 'Pyspark and spark for data processing', 'summary': 'Introduces pyspark, highlighting its advantages and fundamental concepts, and covers spark framework installation, rdd operations, transformations, and actions, including broadcast and accumulators.', 'chapters': [{'end': 23187.117, 'start': 22716.906, 'title': 'Introduction to pyspark: simplifying machine learning with python and apache spark', 'summary': 'Introduces the pyspark, a python api for apache spark, highlighting its advantages such as ease of use, flexibility, resilience, and integration with python libraries. it also outlines the installation process and covers fundamental concepts like spark context and rdds.', 'duration': 470.211, 'highlights': ['The PySpark API combines the power of Apache Spark with the simplicity of Python, making it easy to learn and use, with support for multiple data types and comprehensive API, and better readability, maintenance, and visualization options compared to other languages like Scala or Java.', 'The installation process for PySpark involves ensuring Hadoop is installed, downloading the latest version of Spark supporting Hadoop 2.7 or above, extracting it, adding the path to the bash RC file, and installing pip and Jupyter notebook with pip version 10 or above.', 'The Spark context, a vital component of any Spark application, sets up internal services, establishes connections, and allows access to cluster resources, while RDDs (resilient distributed data sets) serve as the building blocks for parallel processing on a cluster, offering fault tolerance, immutability, and support for transformations and actions.']}, {'end': 23580.994, 'start': 23187.137, 'title': 'Using spark for data processing', 'summary': 'Covers spark framework 
installation, rdd operations, transformations, and actions, including broadcast and accumulators.', 'duration': 393.857, 'highlights': ['The flat map transformation is used to split text data into individual words and produce a flat output, demonstrated by the example with the blockchain data.', 'The filter transformation is utilized to remove stop words from the text data using a predefined list, showcasing the removal of specified stop words from the RDD.', 'Group by transformation is used to group data based on the first three characters of each element, demonstrated by grouping elements with the same initial three characters.', 'The reduce action is employed to calculate the sum of the first 10,000 numbers, showcasing the summing of a large set of numbers using parallel processing in Spark.']}], 'duration': 864.088, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g22716906.jpg', 'highlights': ['The PySpark API combines the power of Apache Spark with the simplicity of Python, offering better readability and visualization options compared to other languages.', 'The installation process for PySpark involves ensuring Hadoop is installed, downloading the latest version of Spark supporting Hadoop 2.7 or above, and setting up the Spark context.', 'The Spark context sets up internal services, establishes connections, and allows access to cluster resources, while RDDs offer fault tolerance, immutability, and support for transformations and actions.', 'The reduce action is employed to calculate the sum of the first 10,000 numbers, showcasing the summing of a large set of numbers using parallel processing in Spark.', 'The flat map transformation is used to split text data into individual words and produce a flat output, demonstrated by the example with the blockchain data.', 'The filter transformation is utilized to remove stop words from the text data using a predefined list, showcasing the removal of specified stop 
words from the RDD.', 'Group by transformation is used to group data based on the first three characters of each element, demonstrated by grouping elements with the same initial three characters.']}, {'end': 24411.559, 'segs': [{'end': 23635.952, 'src': 'embed', 'start': 23598.294, 'weight': 2, 'content': [{'end': 23604.378, 'text': 'A few characteristics of data frames: they are immutable in nature, that is, you can create a data frame, but you cannot change it.', 'start': 23598.294, 'duration': 6.084}, {'end': 23611.202, 'text': 'It allows lazy evaluation, that is, the task is not executed unless and until an action is triggered and, moreover,', 'start': 23604.838, 'duration': 6.364}, {'end': 23616.885, 'text': 'data frames are distributed in nature, which are designed for processing large collections of structured or semi-structured data.', 'start': 23611.202, 'duration': 5.683}, {'end': 23625.868, 'text': 'It can be created using different data formats like loading the data from source files such as JSON or CSV or you can load it from an existing RDD.', 'start': 23617.405, 'duration': 8.463}, {'end': 23628.749, 'text': 'You can use databases like Hive or Cassandra.', 'start': 23625.908, 'duration': 2.841}, {'end': 23630.09, 'text': 'You can use Parquet files.', 'start': 23628.809, 'duration': 1.281}, {'end': 23632.671, 'text': 'You can use CSV or XML files.', 'start': 23630.43, 'duration': 2.241}, {'end': 23635.952, 'text': 'There are many sources through which you can create a particular RDD.', 'start': 23632.971, 'duration': 2.981}], 'summary': 'Data frames are immutable, support lazy evaluation, distributed in nature, and can be created from various data formats and sources.', 'duration': 37.658, 'max_score': 23598.294, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23598294.jpg'}, {'end': 23747.302, 'src': 'embed', 'start': 23721.001, 'weight': 0, 'content': [{'end': 23727.306, 'text': 'We have the carrier as 
string the tail number as string the origin string destination string and so on.', 'start': 23721.001, 'duration': 6.305}, {'end': 23732.53, 'text': 'Now suppose I want to know how many records are there in my database or the data frame.', 'start': 23727.886, 'duration': 4.644}, {'end': 23736.194, 'text': "I'd rather say So you need the count function for this one.", 'start': 23732.711, 'duration': 3.483}, {'end': 23738.135, 'text': 'It will provide you with the results.', 'start': 23736.874, 'duration': 1.261}, {'end': 23746.001, 'text': 'So as you can see here, we have 3.3 million records here.', 'start': 23739.576, 'duration': 6.425}, {'end': 23747.302, 'text': '3, 036, 776 to be exact.', 'start': 23746.281, 'duration': 1.021}], 'summary': 'The database contains 3.3 million records.', 'duration': 26.301, 'max_score': 23721.001, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23721001.jpg'}, {'end': 23795.086, 'src': 'embed', 'start': 23770.785, 'weight': 1, 'content': [{'end': 23778.15, 'text': 'suppose I want to check the what is the lowest count or the highest count in the particular distance column.', 'start': 23770.785, 'duration': 7.365}, {'end': 23780.052, 'text': 'I need to use the describe function here.', 'start': 23778.39, 'duration': 1.662}, {'end': 23782.553, 'text': "So I'll show you what the summary looks like.", 'start': 23780.692, 'duration': 1.861}, {'end': 23787.437, 'text': 'So the distance the count is the number of rows total number of rows.', 'start': 23783.694, 'duration': 3.743}, {'end': 23789.561, 'text': 'We have the mean the standard deviation.', 'start': 23788.12, 'duration': 1.441}, {'end': 23795.086, 'text': 'We have the minimum value, which is 17, and the maximum value, which is 4983.', 'start': 23789.581, 'duration': 5.505}], 'summary': 'Using the describe function, the distance column has a minimum value of 17 and a maximum value of 4983.', 'duration': 24.301, 'max_score': 
23770.785, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23770785.jpg'}, {'end': 23867.645, 'src': 'embed', 'start': 23842.213, 'weight': 4, 'content': [{'end': 23846.973, 'text': 'So here instead of filter we can also use a where clause which will give us the same output.', 'start': 23842.213, 'duration': 4.76}, {'end': 23855.758, 'text': 'Now we can also pass on multiple parameters, or rather, multiple conditions.', 'start': 23849.254, 'duration': 6.504}, {'end': 23863.883, 'text': 'So suppose I want the day of the flight should be 7th and the origin should be JFK and the arrival delay should be less than 0.', 'start': 23856.219, 'duration': 7.664}, {'end': 23867.645, 'text': 'I mean, that is, none of the postponed flights.', 'start': 23863.883, 'duration': 3.762}], 'summary': "Using 'where' clause, multiple parameters can be passed to specify conditions, e.g., day=7, origin=jfk, arrival delay<0.", 'duration': 25.432, 'max_score': 23842.213, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23842213.jpg'}, {'end': 23918.022, 'src': 'embed', 'start': 23894.115, 'weight': 6, 'content': [{'end': 23902.241, 'text': 'If someone is not good or is not acquainted with all these transformations and actions and would rather use SQL queries on the data,', 'start': 23894.115, 'duration': 8.126}, {'end': 23907.539, 'text': 'they can use this registerTempTable to create a table for their particular data frame.', 'start': 23902.621, 'duration': 4.918}, {'end': 23914.861, 'text': "What I'll do is convert the NYC flights underscore DF data frame into NYC underscore flight table, which can be used later,", 'start': 23907.899, 'duration': 6.962}, {'end': 23918.022, 'text': 'and SQL queries can be performed on this particular table.', 'start': 23914.861, 'duration': 3.161}], 'summary': 'Introduces using registerTempTable to create a table for data 
frames, enabling sql queries on the table.', 'duration': 23.907, 'max_score': 23894.115, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23894115.jpg'}, {'end': 23970.842, 'src': 'embed', 'start': 23939.807, 'weight': 8, 'content': [{'end': 23943.569, 'text': 'We pass all the SQL queries in the SQL context or sql function.', 'start': 23939.807, 'duration': 3.762}, {'end': 23944.55, 'text': 'So you can see here.', 'start': 23943.909, 'duration': 0.641}, {'end': 23946.811, 'text': 'We have the minimum airtime as 20.', 'start': 23944.57, 'duration': 2.241}, {'end': 23951.813, 'text': 'now to have a look at the records in which the airtime is minimum 20.', 'start': 23946.811, 'duration': 5.002}, {'end': 23954.394, 'text': 'now we can also use nested SQL queries.', 'start': 23951.813, 'duration': 2.581}, {'end': 23959.157, 'text': 'or suppose, if I want to check which all flights have the minimum airtime as 20.', 'start': 23954.394, 'duration': 4.763}, {'end': 23962.078, 'text': 'now that cannot be done in a simple SQL query.', 'start': 23959.157, 'duration': 2.921}, {'end': 23963.939, 'text': 'We need a nested query for that one.', 'start': 23962.538, 'duration': 1.401}, {'end': 23970.842, 'text': 'So selecting asterisk from New York flights where the airtime is in and inside that we have another query,', 'start': 23964.301, 'duration': 6.541}], 'summary': 'Using sql queries to find flights with minimum airtime of 20.', 'duration': 31.035, 'max_score': 23939.807, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23939807.jpg'}, {'end': 24108.309, 'src': 'embed', 'start': 24077.017, 'weight': 5, 'content': [{'end': 24080.039, 'text': 'This has been an active research topic in data mining for years.', 'start': 24077.017, 'duration': 3.022}, {'end': 24082.68, 'text': 'We have the linear algebra.', 'start': 24080.659, 'duration': 2.021}, {'end': 
24087.142, 'text': 'Now, Spark MLlib provides utilities for linear algebra.', 'start': 24083.24, 'duration': 3.902}, {'end': 24088.863, 'text': 'We have collaborative filtering.', 'start': 24087.462, 'duration': 1.401}, {'end': 24092.345, 'text': 'We have classification for binary classification.', 'start': 24089.603, 'duration': 2.742}, {'end': 24097.107, 'text': 'Various methods are available in the spark.mllib package, such as multi-class classification,', 'start': 24092.405, 'duration': 4.702}, {'end': 24100.706, 'text': 'as well as regression analysis in classification.', 'start': 24097.705, 'duration': 3.001}, {'end': 24108.309, 'text': 'Some of the most popular algorithms used are naive Bayes, random forest, decision tree, and so on, and finally we have the linear regression.', 'start': 24100.746, 'duration': 7.563}], 'summary': 'Active research in data mining with spark mllib supporting various algorithms including linear regression and decision tree.', 'duration': 31.292, 'max_score': 24077.017, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g24077017.jpg'}, {'end': 24163.093, 'src': 'embed', 'start': 24122.635, 'weight': 7, 'content': [{'end': 24126.637, 'text': "Let's now try to implement all the concepts which we have learned in this PySpark tutorial session.", 'start': 24122.635, 'duration': 4.002}, {'end': 24133.358, 'text': 'Now here we are going to use a heart disease prediction model and we are going to predict it using the decision tree,', 'start': 24127.449, 'duration': 5.909}, {'end': 24136.102, 'text': 'with the help of classification as well as regression.', 'start': 24133.358, 'duration': 2.744}, {'end': 24139.487, 'text': 'Now, these are all part of the MLlib library here.', 'start': 24136.462, 'duration': 3.025}, {'end': 24142.732, 'text': "Let's see how we can perform these types of functions and queries.", 'start': 24139.807, 'duration': 2.925}, {'end': 24163.093, 
'text': 'First of all, what we need to do is initialize the Spark context.', 'start': 24159.991, 'duration': 3.102}], 'summary': "Using a heart disease prediction model, we will implement classification and regression with spark's mllib library.", 'duration': 40.458, 'max_score': 24122.635, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g24122635.jpg'}], 'start': 23580.994, 'title': 'Pyspark data frames and mllib overview', 'summary': 'Covers the features and operations of data frames in pyspark, including creation from different data formats, actions such as count and select, filtering using functions and where clauses, and the use of temporary tables for sql queries. additionally, it provides an overview of pyspark mllib, including storage levels, mllib api for machine learning in python, and practical implementation of heart disease prediction using decision tree for classification and regression with test errors of 0.2297 and 0.168.', 'chapters': [{'end': 23984.105, 'start': 23580.994, 'title': 'Data frames in pyspark', 'summary': 'Covers the features and operations of data frames in pyspark, including their characteristics, creation from different data formats, actions such as count and select, filtering using functions and where clauses, and the use of temporary tables for sql queries.', 'duration': 403.111, 'highlights': ['The data frames in PySpark are immutable in nature, allowing lazy evaluation and designed for processing large collections of structured or semi-structured data.', 'A data frame can be created using different data formats like loading from source files such as JSON or CSV or from an existing RDD, and supports databases like Hive, Cassandra, and various file types like Parquet files, CSV, and XML.', "The count function on the data frame 'NYC flights_TF' shows that there are 3,036,776 records in the database.", "The describe function provides a summary of the 'distance' column, 
including the count, mean, standard deviation, minimum value (17), and maximum value (4983).", 'The filter function is used to filter out data based on specific conditions, such as filtering for flights with a distance of 17 or originating from EWR airport.', 'The where clause is used for filtering, and multiple conditions can be passed using the and symbol, for example, filtering for flights with a day of 7th, origin of JFK, and arrival delay less than 0.', "The chapter demonstrates the creation of a temporary table for SQL queries using the 'registerTempTable' method, allowing users to perform SQL queries on the data frame.", 'The use of nested SQL queries is shown, where a nested query is used to find flights with the minimum airtime as 20, showcasing the capability of performing complex operations using SQL.']}, {'end': 24411.559, 'start': 23984.545, 'title': 'Pyspark mllib overview', 'summary': 'Covers an overview of pyspark mllib, including topics such as storage levels, mllib api for machine learning in python, and practical implementation of heart disease prediction using decision tree for classification and regression with test errors of 0.2297 and 0.168.', 'duration': 427.014, 'highlights': ['The chapter covers an overview of PySpark MLlib, including topics such as storage levels, MLlib API for machine learning in Python, and practical implementation of heart disease prediction using decision tree for classification and regression with test errors of 0.2297 and 0.168.', 'MLlib is a machine learning API provided by Spark that supports various algorithms such as model-based collaborative filtering, clustering, frequent pattern matching, linear algebra, collaborative filtering, classification, and regression.', 'Practical implementation of heart disease prediction involves data cleaning, creating a data frame, initializing the spark context, creating label points, splitting data into training and testing sets, training the decision tree model for 
classification, evaluating the model, and performing regression with test errors of 0.2297 for classification and 0.168 for regression.']}], 'duration': 830.565, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g23580994.jpg', 'highlights': ["The count function on the data frame 'NYC flights_TF' shows 3,036,776 records", "The describe function provides a summary of the 'distance' column", 'The data frames in PySpark are immutable and designed for processing large data', 'A data frame can be created from different data formats like JSON or CSV', 'The where clause is used for filtering based on multiple conditions', 'The chapter covers an overview of PySpark MLlib and its practical implementation', "The creation of a temporary table for SQL queries using the 'registerTempTable' method", 'Practical implementation of heart disease prediction using decision tree for classification and regression with test errors of 0.2297 and 0.168', 'The use of nested SQL queries to find flights with the minimum airtime as 20']}, {'end': 26325.293, 'segs': [{'end': 24695.696, 'src': 'embed', 'start': 24666.341, 'weight': 0, 'content': [{'end': 24673.307, 'text': 'So, Spark, whenever you do iterative computing, again and again do the processing on the same data, especially in machine learning, deep learning,', 'start': 24666.341, 'duration': 6.966}, {'end': 24675.95, 'text': 'all we will be using the iterative computing.', 'start': 24673.307, 'duration': 2.643}, {'end': 24677.631, 'text': 'Here, Spark performs much better.', 'start': 24676.19, 'duration': 1.441}, {'end': 24682.03, 'text': 'you will see the Spark performance improvement 100 times faster than MapReduce.', 'start': 24678.069, 'duration': 3.961}, {'end': 24688.173, 'text': 'But if it is one time processing and fire and forget that type of processing Spark,', 'start': 24682.751, 'duration': 5.422}, {'end': 24695.696, 'text': 'literally it may be the same latency you will 
be getting it than MapReduce, maybe like some improvements because of the building block of Spark.', 'start': 24688.173, 'duration': 7.523}], 'summary': 'Spark outperforms mapreduce by 100x in iterative computing.', 'duration': 29.355, 'max_score': 24666.341, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g24666341.jpg'}, {'end': 24799.489, 'src': 'embed', 'start': 24756.543, 'weight': 1, 'content': [{'end': 24760.165, 'text': 'So we need to give all the steps as well as what final result I want.', 'start': 24756.543, 'duration': 3.622}, {'end': 24768.14, 'text': "It's going to calculate the optimal cycle, or optimal calculation, what all steps needs to be calculated, or what all steps needs to be executed.", 'start': 24760.754, 'duration': 7.386}, {'end': 24769.941, 'text': 'Only those steps it will be executing it.', 'start': 24768.38, 'duration': 1.561}, {'end': 24772.143, 'text': "So basically it's a lazy execution.", 'start': 24770.562, 'duration': 1.581}, {'end': 24776.266, 'text': 'Only if the results needs to be processed, it will be processing that specific result.', 'start': 24772.804, 'duration': 3.462}, {'end': 24778.428, 'text': 'And it supports real-time computing.', 'start': 24777.187, 'duration': 1.241}, {'end': 24780.09, 'text': "It's through Spark streaming.", 'start': 24778.869, 'duration': 1.221}, {'end': 24783.172, 'text': 'There is a component called Spark streaming which supports real-time computing.', 'start': 24780.37, 'duration': 2.802}, {'end': 24786.715, 'text': 'And it gels with Hadoop ecosystem very well.', 'start': 24783.632, 'duration': 3.083}, {'end': 24792.125, 'text': 'It can run on top of Hadoop YARN, or it can leverage the HDFS to do the processing.', 'start': 24787.215, 'duration': 4.91}, {'end': 24799.489, 'text': 'So, when it leverages the HDFS, the Hadoop cluster container can be used to do the distributed computing,', 'start': 24792.646, 'duration': 6.843}], 
'summary': 'Lazy execution for optimal cycle using spark streaming in hadoop ecosystem.', 'duration': 42.946, 'max_score': 24756.543, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g24756543.jpg'}, {'end': 25388.832, 'src': 'embed', 'start': 25361.038, 'weight': 4, 'content': [{'end': 25364.239, 'text': 'to do machine learning, to do graph computing, to do streaming.', 'start': 25361.038, 'duration': 3.201}, {'end': 25366.46, 'text': 'We have a number of other components.', 'start': 25364.8, 'duration': 1.66}, {'end': 25374.704, 'text': 'So the most commonly used components are Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR.', 'start': 25367.221, 'duration': 7.483}, {'end': 25376.984, 'text': 'At a high level, we will see what these components are.', 'start': 25374.704, 'duration': 2.28}, {'end': 25383.427, 'text': "Spark SQL, especially, it's designed to do the processing against structured data.", 'start': 25377.905, 'duration': 5.522}, {'end': 25385.408, 'text': 'So we can write SQL queries.', 'start': 25383.887, 'duration': 1.521}, {'end': 25388.832, 'text': 'and we can handle or we can do the processing.', 'start': 25386.571, 'duration': 2.261}], 'summary': 'Spark is utilized for machine learning, graph computing, and streaming, with major components including spark sql, spark streaming, mllib, graphx, and sparkr.', 'duration': 27.794, 'max_score': 25361.038, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g25361038.jpg'}, {'end': 26077.681, 'src': 'embed', 'start': 26051.33, 'weight': 2, 'content': [{'end': 26058.033, 'text': 'So the vector RDD will be used to represent the vector directly and that will be used extensively while doing the machine learning.', 'start': 26051.33, 'duration': 6.703}, {'end': 26059.933, 'text': 'Yeah Jason, thank you.', 'start': 26058.633, 'duration': 1.3}, {'end': 26062.274, 'text': "And 
there's another question: what is RDD lineage?", 'start': 26060.294, 'duration': 1.98}, {'end': 26068.297, 'text': 'So here, any data processing, any transformations that we do, it maintains something called a lineage.', 'start': 26063.135, 'duration': 5.162}, {'end': 26071.198, 'text': 'So how is the data getting transformed?', 'start': 26068.957, 'duration': 2.241}, {'end': 26077.681, 'text': 'Well, the data is available in a partition form in multiple systems, and when we do the transformation it will undergo multiple steps.', 'start': 26071.318, 'duration': 6.363}], 'summary': 'Vector rdd extensively used in machine learning. rdd lineage maintains transformation steps.', 'duration': 26.351, 'max_score': 26051.33, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g26051330.jpg'}], 'start': 24412.42, 'title': 'Apache spark overview', 'summary': 'Provides an overview of apache spark covering its features, limitations, ecosystem, rdds, lineage, cluster computing, real-time processing, and core functionalities, including its improved performance compared to mapreduce, support for multiple languages and data formats, and its role in distributed environments.', 'chapters': [{'end': 24495.395, 'start': 24412.42, 'title': 'Spark interview questions overview', 'summary': 'Provides an overview of a spark interview question session covering basic and spark core technologies, including the depth and nodes of a regression tree model, and the components of spark core such as streaming, graphx, mllib, and sql.', 'duration': 82.975, 'highlights': ['The regression tree model has been created to the depth of three, with 15 nodes, and the features and classification of the tree are discussed.', 'The session covers commonly asked interview questions related to Spark technology, including an introduction to Spark Core technologies such as Streaming, GraphX, MLlib, and SQL.']}, {'end': 24776.266, 'start': 24495.916, 'title': 
'Apache spark: cluster computing & real-time processing', 'summary': 'Discusses apache spark, an active open-source project for cluster computing and in-memory processing, with improved performance compared to mapreduce, supporting multiple languages and data formats, and featuring lazy evaluation for optimal execution.', 'duration': 280.35, 'highlights': ['Apache Spark is an open-source project used for cluster computing and in-memory processing, with an active community and multiple releases.', "Spark's in-memory computing and support for real-time processing differentiate it from other cluster computing projects.", "Spark's performance in iterative computing, such as in machine learning, is approximately 100 times faster than MapReduce, while latency-wise, it is much lower due to data caching.", 'Spark supports various programming languages like Python, Java, R, and Scala, and accepts input in different data formats like JSON and Parquet.', 'The key selling point of Spark is its lazy evaluation, calculating only the necessary steps for achieving the final result, leading to optimal execution.']}, {'end': 25276.566, 'start': 24777.187, 'title': 'Spark features and limitations', 'summary': 'Discusses the key features of apache spark, including real-time computing, machine learning algorithms, graph theory support, resource management with yarn, and file system support. it also highlights limitations such as higher storage utilization, network transfer constraints, and cost inefficiencies of in-memory computing. 
spark outperforms hadoop in real-time processing and iterative computing.', 'duration': 499.379, 'highlights': ['Spark supports real-time computing through Spark streaming, and it can run on top of Hadoop YARN and leverage HDFS for processing, enabling distributed computing and resource management.', 'Apache Spark features a wide range of machine learning algorithms within MLlib and supports graph processing through its GraphX component.', 'YARN, as the resource manager in the Hadoop ecosystem, provides the resource management platform across all clusters, while Spark focuses on data processing and needs to be installed on all nodes within the cluster.', 'Spark can directly use Hadoop Distributed File System (HDFS) for distributed data processing, which provides data locality advantages, and it also supports file systems like S3 and Alluxio.', 'Limitations of Spark include higher storage utilization, network transfer constraints, cost inefficiencies of in-memory computing, and the need for careful app execution in distributed environments.', 'Spark outperforms Hadoop in real-time processing and iterative computing, providing faster performance and leveraging iterative processing for improved data retrieval from memory.']}, {'end': 26050.449, 'start': 25277.226, 'title': 'Spark ecosystem and core functionalities', 'summary': 'Discusses the components of spark ecosystem, including spark sql, spark streaming, mllib, graphx, and sparkr, along with the core functionalities of spark such as rdds, their fault tolerance, and partitioning. 
it also explains the use of spark alongside hadoop and the creation of rdds, executor memory, and partitions in spark applications.', 'duration': 773.223, 'highlights': ['The chapter explains the major components of the Spark ecosystem including Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR, along with their respective functionalities and use cases.', 'It details the core functionalities of Spark, focusing on RDDs as the building blocks, their fault tolerance, partitioning, and the balance between batch and stream processing when using Spark alongside Hadoop.', 'The discussion also covers the creation of RDDs using SparkContext and loading data from external sources, defining executor memory in Spark applications, and the concept of partitions in Apache Spark.']}, {'end': 26325.293, 'start': 26051.33, 'title': 'Spark rdd, lineage, and cluster managers', 'summary': 'Covers the importance of rdd lineage in handling data transformations and failures in distributed systems, the role of spark driver in coordinating the execution of jobs, and the types of cluster managers in spark, such as yarn, mesos, and standalone, as well as the function of worker nodes in distributed environments.', 'duration': 273.963, 'highlights': ['RDD lineage is crucial in maintaining the history of data transformations and enables restoration of a lost partition in case of failures, ensuring data integrity and fault tolerance.', 'Spark driver serves as the coordinator for the complete execution of jobs, connecting to the Spark master, delivering the RDD graph, and coordinating the tasks in the distributed environment, enabling parallel processing and actions against the RDD.', 'Various types of cluster managers in Spark include YARN, standalone, and Mesos, each catering to specific resource management needs and compatibility with different computing platforms or frameworks.', 'Worker nodes in a distributed environment are responsible for actual data processing, resource 
utilization reporting, and task execution, maintaining data locality and resource availability for efficient processing.']}], 'duration': 1912.873, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g24412420.jpg', 'highlights': ["Spark's performance in iterative computing is approximately 100 times faster than MapReduce, with lower latency due to data caching.", 'Spark supports real-time computing through Spark streaming and can run on top of Hadoop YARN, leveraging HDFS for processing.', 'RDD lineage is crucial for maintaining the history of data transformations and ensuring data integrity and fault tolerance.', "Spark's lazy evaluation calculates only the necessary steps for achieving the final result, leading to optimal execution.", 'The chapter explains the major components of the Spark ecosystem, including Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR, along with their respective functionalities and use cases.']}, {'end': 28104.871, 'segs': [{'end': 26465.499, 'src': 'embed', 'start': 26431.92, 'weight': 3, 'content': [{'end': 26433.62, 'text': 'Sparse means thinly distributed.', 'start': 26431.92, 'duration': 1.7}, {'end': 26442.863, 'text': 'So to represent the huge amount of data with the position and saying this particular position is having a zero value.', 'start': 26434.961, 'duration': 7.902}, {'end': 26446.104, 'text': 'we can mention that with a key and value.', 'start': 26442.863, 'duration': 3.241}, {'end': 26448.085, 'text': 'so which position has what value?', 'start': 26446.104, 'duration': 1.981}, {'end': 26455.511, 'text': 'Rather than storing all the zeros, I can store only the non-zeros: their positions and the corresponding values.', 'start': 26448.726, 'duration': 6.785}, {'end': 26458.433, 'text': 'That means all of this is going to be a zero value.', 'start': 26455.731, 'duration': 2.702}, {'end': 26465.499, 'text': 'So we can mention this particular sparse vector, mentioning it 
to represent the non-zero entries.', 'start': 26458.954, 'duration': 6.545}], 'summary': 'Sparse vectors store non-zero entries to represent huge data efficiently.', 'duration': 33.579, 'max_score': 26431.92, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g26431920.jpg'}, {'end': 26511.852, 'src': 'embed', 'start': 26483.051, 'weight': 4, 'content': [{'end': 26489.194, 'text': "Spark Streaming is used for processing real-time streaming data; to be precise, it's micro-batch processing.", 'start': 26483.051, 'duration': 6.143}, {'end': 26493.636, 'text': 'So data will be collected at every small interval, say maybe every', 'start': 26489.654, 'duration': 3.982}, {'end': 26496.017, 'text': "0.5 seconds or every second, and it'll get processed.", 'start': 26493.636, 'duration': 2.381}, {'end': 26498.318, 'text': "So internally it's going to create micro-batches.", 'start': 26496.417, 'duration': 1.901}, {'end': 26502.46, 'text': 'The data created out of those micro-batches is what we call a DStream.', 'start': 26499.158, 'duration': 3.302}, {'end': 26511.852, 'text': 'A DStream is like an RDD, so whatever transformations and actions I do with an RDD, I can do with a DStream as well.', 'start': 26503.069, 'duration': 8.783}], 'summary': 'Spark streaming processes real-time data in micro-batches, creating dstreams similar to rdds for transformations and actions.', 'duration': 28.801, 'max_score': 26483.051, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g26483051.jpg'}, {'end': 27075.893, 'src': 'embed', 'start': 27051.598, 'weight': 0, 'content': [{'end': 27059.74, 'text': "it's a subset of the data science world where we have different types of algorithms, different categories of algorithms, like clustering, regression,", 'start': 27051.598, 'duration': 8.142}, {'end': 27061.6, 'text': 'dimensionality reduction, and so on.', 'start': 
27059.74, 'duration': 1.86}, {'end': 27066.641, 'text': 'And all these algorithms, or most of the algorithms, have been implemented in Spark.', 'start': 27062.52, 'duration': 4.121}, {'end': 27075.893, 'text': 'And Spark is the preferred framework or preferred application component to do the machine learning algorithm nowadays or machine learning processing.', 'start': 27067.506, 'duration': 8.387}], 'summary': 'Subset of data science with various algorithms implemented in spark.', 'duration': 24.295, 'max_score': 27051.598, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g27051598.jpg'}, {'end': 27580.104, 'src': 'embed', 'start': 27551.99, 'weight': 1, 'content': [{'end': 27555.113, 'text': 'Of course, yes, Spark SQL can work only with the structured data.', 'start': 27551.99, 'duration': 3.123}, {'end': 27558.714, 'text': 'It can be used to load varieties of structured data.', 'start': 27555.632, 'duration': 3.082}, {'end': 27563.196, 'text': 'And you can use SQL-like statements to query against the program.', 'start': 27559.214, 'duration': 3.982}, {'end': 27567.558, 'text': 'And it can be used with external tools to connect to the Spark as well.', 'start': 27563.856, 'duration': 3.702}, {'end': 27569.939, 'text': 'It gives a very good integration with the SQL.', 'start': 27567.918, 'duration': 2.021}, {'end': 27577.963, 'text': 'And using Python, Java, or Scala code, we can create an RDD from the structured data available directly using the Spark SQL.', 'start': 27570.399, 'duration': 7.564}, {'end': 27580.104, 'text': 'I can generate the RDD.', 'start': 27578.383, 'duration': 1.721}], 'summary': 'Spark sql works with structured data, offers sql-like querying, and integrates well with external tools.', 'duration': 28.114, 'max_score': 27551.99, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g27551990.jpg'}, {'end': 27664.208, 'src': 'embed', 
'start': 27634.926, 'weight': 2, 'content': [{'end': 27641.272, 'text': 'Can you use Spark to access and analyze data stored in Cassandra database? Yes, it is possible.', 'start': 27634.926, 'duration': 6.346}, {'end': 27649.299, 'text': 'Okay, not only Cassandra, any of the NoSQL database, it can very well do the processing, and Cassandra also works in a distributed architecture.', 'start': 27641.834, 'duration': 7.465}, {'end': 27653.541, 'text': "It's a NoSQL database, so it can leverage the data locality.", 'start': 27649.559, 'duration': 3.982}, {'end': 27657.724, 'text': 'The query can be executed locally where the Cassandra nodes are available.', 'start': 27653.962, 'duration': 3.762}, {'end': 27664.208, 'text': "It's going to make the query execution faster and reduce the network load and Spark Executors.", 'start': 27658.144, 'duration': 6.064}], 'summary': 'Spark can access & analyze cassandra & other nosql data, leveraging data locality for faster query execution.', 'duration': 29.282, 'max_score': 27634.926, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g27634926.jpg'}], 'start': 26326.243, 'title': 'Spark and its components', 'summary': 'Covers sparse vectors, spark streaming with micro-batch processing, dstream, fault tolerance, sliding window operation, machine learning, mllib components, spark sql functions, and using spark with nosql databases like cassandra, emphasizing minimizing data transfers.', 'chapters': [{'end': 26566.704, 'start': 26326.243, 'title': 'Sparse vectors and spark streaming', 'summary': 'Discusses the concept of sparse vectors and their application in representing data, along with an overview of spark streaming, including its micro-batch processing, dstream, fault tolerance, and sliding window operation.', 'duration': 240.461, 'highlights': ['The chapter explains sparse vectors as a way to represent data thinly distributed, with an example of using a vector to represent 
ratings of products, where only non-zero values are stored, reducing storage requirements. (relevance: 5)', 'It describes Spark Streaming as a micro-batch processing technique, creating DStreams similar to RDDs, capable of reading data from various sources, offering high throughput, fault tolerance, and utilizing sliding window operation for processing real-time streaming data. (relevance: 4)']}, {'end': 27220.745, 'start': 26567.245, 'title': 'Spark streaming and machine learning', 'summary': 'Discusses the significance of sliding window operation in spark streaming, the representation of data using dstream, caching in spark streaming, implementation of graphs in spark, including the pagerank algorithm, lineage graph, and checkpointing, as well as the implementation of machine learning in spark, covering mllib components and categories of machine learning.', 'duration': 653.5, 'highlights': ['The sliding window operation in Spark streaming allows for the normalization of drastic changes or spikes in data patterns by considering the average or sum of data within a specific time frame, ensuring the accuracy of trending hashtags, and can handle prior data and window size automatically.', 'DStream, or discretized stream, is an abstract representation of data in Spark streaming that allows for the application of streaming functions and transformations against micro-batches at defined intervals, enabling the processing of series of RDDs.', 'Caching in Spark streaming involves persisting data in memory to ensure its availability for further processing, with the option to define whether it should be in memory only or in memory and hard disk, optimizing memory usage and enhancing processing efficiency.', 'Spark GraphX provides an API for implementing graphs and offers functionalities to represent graphs through the creation of edge and vertex RDDs, allowing for distributed graph processing and the calculation of PageRank for specific nodes within the graph.', 'The 
lineage graph in Spark records the complete history and transformations of RDDs, enabling the regeneration of lost partitions or complete RDDs based on need, effectively saving time and memory usage by triggering recalculation only on demand.', 'Apache Spark provides checkpointing to store and regenerate data in case of system failure, ensuring the restoration of the previous state and enabling the recovery of lost data within the Spark streaming environment.', 'Machine learning algorithms, including clustering, regression, and dimensionality reduction, are implemented in Spark through MLlib, offering features such as featurization, persistence, and pipeline processing, making Spark a preferred framework for iterative processing and enhancing algorithm performance.', 'Categories of machine learning include supervised, unsupervised, and reinforcement learning, with supervised learning involving known categories and unsupervised learning identifying patterns and grouping data into categories based on the available information.']}, {'end': 27634.446, 'start': 27221.685, 'title': 'Spark mllib and spark sql overview', 'summary': 'Provides an overview of spark mllib, highlighting its features, popular algorithms, and utilities like regression, classification, recommendation systems, clustering, and dimensionality reduction, as well as an overview of spark sql and its functions, including lazy evaluation and parquet file usage.', 'duration': 412.761, 'highlights': ['Spark MLlib offers a variety of algorithms including regression, classification, recommendation systems, clustering, and dimensionality reduction, with the ability to process huge amounts of data and reduce dimensions without losing features.', 'Spark SQL facilitates the loading of structured data, supports SQL-like statements for querying, and enables integration with external tools, providing a seamless experience for database professionals.', 'Lazy evaluation in Spark ensures that processing occurs only when 
the final result is required, optimizing data processing and minimizing unnecessary transformations.']}, {'end': 28104.871, 'start': 27634.926, 'title': 'Using spark with nosql database & data transfers', 'summary': 'Discusses using spark with nosql databases, focusing on cassandra, and emphasizes minimizing data transfers in spark through broadcast variables and accumulators, as well as explaining the need for broadcast variables, automatic cleanups, levels of persistence, schema rdd, and spark streaming use cases.', 'duration': 469.945, 'highlights': ['Using Spark with NoSQL databases, such as Cassandra, enables faster query execution and reduced network load, leveraging data locality.', 'Minimizing data transfers in Spark can be achieved through broadcast variables and accumulators, which help in distributing data and consolidating values across multiple workers.', 'Broadcast variables in Spark are read-only, cached in memory, and eliminate the need to move data from a centralized location to all executors within the cluster.', 'Automatic cleanups in Spark, triggered by setting TTL, handle accumulated metadata by writing data results into the disk and cleaning unnecessary RDDs.', 'Various levels of persistence in Apache Spark include storing data in memory only, memory and disk, or disk only, and can be stored in a serialized form for future reuse.', 'Schema RDD in Spark SQL contains meta information and schema, making it easier to handle the data as it stores the data and its structure together.', 'Spark streaming is utilized for scenarios like sentiment analysis of tweets, where data is streamed using tools like Flume, and structured data is processed for further analysis and machine learning using MLlib.', 'Spark offers different offerings such as MLlib, Streaming, and Core, each solving different problems and working together to provide comprehensive solutions.', 'The chapter concludes with an invitation to explore more about Edureka for further learning 
opportunities.']}], 'duration': 1778.628, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/F8pyaR4uQ2g/pics/F8pyaR4uQ2g26326243.jpg', 'highlights': ['Spark MLlib offers regression, classification, recommendation systems, clustering, and dimensionality reduction. (relevance: 5)', 'Spark SQL supports SQL-like statements for querying and enables integration with external tools. (relevance: 4)', 'Using Spark with NoSQL databases like Cassandra reduces network load and enables faster query execution. (relevance: 3)', 'Sparse vectors represent thinly distributed data, reducing storage requirements. (relevance: 2)', 'Spark Streaming utilizes micro-batch processing, fault tolerance, and sliding window operation for real-time data. (relevance: 1)']}], 'highlights': ['Japan utilized Apache Spark for real-time earthquake detection, saving millions of lives in 60 seconds.', 'Leading companies like Facebook, Apple, Netflix, and Uber have deployed Spark at massive scale for processing petabytes of data.', 'Apache Spark is 100 times faster than MapReduce, with in-memory execution and high-level APIs.', 'Spark SQL provides faster processing speed compared to Hive, reducing query execution time from 10 minutes to less than one minute.', 'Spark enables innovations such as detecting fraudulent behavior and delivering personalized experiences in real time, transforming multiple industries.', 'Spark supports real-time computing through Spark streaming and can run on top of Hadoop YARN, leveraging HDFS for processing.', 'Spark MLlib offers regression, classification, recommendation systems, clustering, and dimensionality reduction.', 'The crash course covers 12 modules, including Spark introduction, components, data frames, SQL queries, streaming, and machine learning algorithms.', 'Spark streaming can scale to even multiple nodes, running till hundreds of nodes, ensuring quick streaming and processing of data, and fault tolerance to prevent data loss.', 
'Spark SQL integrates relational processing and supports querying data via SQL or HQL, serving as an easy transition from traditional relational data processing tools.']}