title
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Training | Edureka
description
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka video on the PySpark Tutorial will provide you with a detailed and comprehensive understanding of PySpark, how it works, and why Python pairs so well with Apache Spark. You will also learn about RDDs, DataFrames and MLlib.
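As a quick taste of the topics listed above, here is a minimal, self-contained PySpark sketch touching an RDD, a DataFrame and MLlib; the app name, sample values and column names are illustrative assumptions, not taken from the video.

# Minimal PySpark sketch (illustrative only; values and names are hypothetical)
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()
sc = spark.sparkContext

# RDD: parallelize a small collection, apply a transformation, trigger an action
rdd = sc.parallelize(range(1, 11))
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, ..., 100]

# DataFrame: build one from local data and run a simple filter
df = spark.createDataFrame([(1, "spark"), (2, "python")], ["id", "name"])
df.filter(df.id > 1).show()

# MLlib: fit a tiny logistic regression on two labeled points
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0])), (1.0, Vectors.dense([1.0]))],
    ["label", "features"],
)
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)

spark.stop()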
Subscribe to our channel to get video updates. Hit the subscribe button above.
Edureka PySpark Playlist: https://goo.gl/pCym9F
--------------------------------------------
About the Course
Edureka’s PySpark Certification Training is designed to provide you with the knowledge and skills required to become a successful Spark Developer using Python, and to prepare you for the Cloudera Hadoop and Spark Developer Certification Exam (CCA175). Throughout the PySpark Training, you will gain in-depth knowledge of Apache Spark and the Spark ecosystem, which includes Spark RDD, Spark SQL, Spark MLlib and Spark Streaming. You will also get comprehensive knowledge of the Python programming language, HDFS, Sqoop, Flume, Spark GraphX and messaging systems such as Kafka.
----------------------------------------------
Spark Certification Training is designed by industry experts to make you a Certified Spark Developer. The PySpark Course offers:
Overview of Big Data & Hadoop, including HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator)
Comprehensive knowledge of the various tools that fall within the Spark ecosystem, such as Spark SQL, Spark MLlib, Sqoop, Kafka, Flume and Spark Streaming
The capability to ingest data into HDFS using Sqoop & Flume, and to analyze large datasets stored in HDFS
The power of handling real-time data feeds through a publish-subscribe messaging system like Kafka
Exposure to many real-life, industry-based projects, which will be executed using Edureka’s CloudLab
Projects which are diverse in nature, covering the banking, telecommunication, social media, and government domains
Rigorous involvement of an SME (Subject Matter Expert) throughout the Spark Training to learn industry standards and best practices
---------------------------------------------------
Who should go for this course?
The market for Big Data Analytics is growing tremendously across the world, and this strong growth, coupled with market demand, is a great opportunity for all IT professionals. Here are a few professional IT groups who are continuously enjoying the benefits and perks of moving into the Big Data domain.
Developers and Architects
BI /ETL/DW Professionals
Senior IT Professionals
Mainframe Professionals
Freshers
Big Data Architects, Engineers and Developers
Data Scientists and Analytics Professionals
-------------------------------------------------------
There are no prerequisites for Edureka’s PySpark Training Course. However, prior knowledge of Python programming and SQL will be helpful, though it is not mandatory.
--------------------------------------------------------
For more information, please write to us at sales@edureka.co or call us at IND: 9606058406 / US: 18338555775 (toll-free).
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
detail
{'title': 'Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Training | Edureka', 'heatmap': [{'end': 388.865, 'start': 329.823, 'weight': 0.899}, {'end': 629.2, 'start': 583.511, 'weight': 0.732}, {'end': 717.616, 'start': 636.003, 'weight': 0.725}, {'end': 886.584, 'start': 819.21, 'weight': 0.923}, {'end': 1081.958, 'start': 1043.632, 'weight': 0.733}, {'end': 1137.106, 'start': 1111.761, 'weight': 0.714}, {'end': 1394.618, 'start': 1374.376, 'weight': 0.707}], 'summary': 'This pyspark tutorial introduces fundamental concepts like spark context, dataframes, mllib, and rdds, with a real-life use case demo. it also covers spark ecosystem components, pyspark advantages, installation, rdd operations, managing frameworks, spark configuration, data frames, mllib, data preprocessing, classification, and decision tree modeling in pyspark.', 'chapters': [{'end': 63.384, 'segs': [{'end': 63.384, 'src': 'embed', 'start': 7.099, 'weight': 0, 'content': [{'end': 13.688, 'text': 'Apache Spark is a powerful framework, which is being heavily used in the industry for real-time analytics and machine learning purposes.', 'start': 7.099, 'duration': 6.589}, {'end': 18.755, 'text': 'So I Kisly on behalf of Edureka welcome you all to this session on PySpark tutorial.', 'start': 14.149, 'duration': 4.606}, {'end': 23.782, 'text': "So before I proceed with this session, let's have a quick look at the topics which will be covering today.", 'start': 19.176, 'duration': 4.606}, {'end': 27.846, 'text': 'So starting off by explaining what exactly is PySpark and how it works.', 'start': 24.284, 'duration': 3.562}, {'end': 31.648, 'text': "Moving ahead, we'll find out the various advantages provided by PySpark.", 'start': 28.286, 'duration': 3.362}, {'end': 35.23, 'text': "Then I'll be showing you how to install PySpark in your systems.", 'start': 32.107, 'duration': 3.123}, {'end': 43.554, 'text': 'Once we are done with the installation, I will talk about the fundamental concepts of PySpark, like the Spark context, DataFrames, MLLab,', 'start': 35.77, 'duration': 7.784}, {'end': 44.835, 'text': 'RDDs and much more.', 'start': 43.554, 'duration': 1.281}, {'end': 51.799, 'text': "And finally, I'll close off this session with a demo in which I'll show you how to implement PySpark to solve real-life use cases.", 'start': 45.676, 'duration': 6.123}, {'end': 58.542, 'text': "So without any further ado, let's quickly embark on our journey to PI spark now before I start off with PI spark.", 'start': 52.478, 'duration': 6.064}, {'end': 63.384, 'text': 'Let me first brief you about the PI spark ecosystem as you can see from the diagram.', 'start': 58.582, 'duration': 4.802}], 'summary': 'Apache spark is heavily used for real-time analytics and machine learning. 
the tutorial covers pyspark installation, fundamental concepts, and a demo of real-life use cases.', 'duration': 56.285, 'max_score': 7.099, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg7099.jpg'}], 'start': 7.099, 'title': 'Pyspark tutorial', 'summary': 'Introduces a pyspark tutorial covering its definition, advantages, installation process, fundamental concepts like spark context, dataframes, mllab, rdds, and concludes with a demo to implement pyspark for real-life use cases.', 'chapters': [{'end': 63.384, 'start': 7.099, 'title': 'Pyspark tutorial overview', 'summary': 'Introduces a pyspark tutorial covering its definition, advantages, installation process, fundamental concepts like spark context, dataframes, mllab, rdds, and concludes with a demo to implement pyspark for real-life use cases.', 'duration': 56.285, 'highlights': ['The session covers the fundamental concepts of PySpark, including Spark context, DataFrames, MLLab, RDDs, and more. The session will cover fundamental concepts of PySpark such as Spark context, DataFrames, MLLab, RDDs, and more.', 'The tutorial includes a demo showcasing how to implement PySpark for real-life use cases. The session concludes with a demo on implementing PySpark for real-life use cases.', 'PySpark is heavily used in the industry for real-time analytics and machine learning purposes. Apache Spark is a powerful framework heavily used in the industry for real-time analytics and machine learning purposes.', 'The session also covers the advantages provided by PySpark. The session will cover the various advantages provided by PySpark.', 'The tutorial will guide the audience through the installation process of PySpark in their systems. The tutorial will guide the audience through the installation process of PySpark in their systems.']}], 'duration': 56.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg7099.jpg', 'highlights': ['The tutorial includes a demo showcasing how to implement PySpark for real-life use cases.', 'PySpark is heavily used in the industry for real-time analytics and machine learning purposes.', 'The session covers the fundamental concepts of PySpark, including Spark context, DataFrames, MLLab, RDDs, and more.', 'The session also covers the advantages provided by PySpark.', 'The tutorial will guide the audience through the installation process of PySpark in their systems.']}, {'end': 239.398, 'segs': [{'end': 217.683, 'src': 'embed', 'start': 164.337, 'weight': 0, 'content': [{'end': 172.764, 'text': 'which is a python API for spark that lets you harness the simplicity of python and the power of Apache spark in order to tame pit data.', 'start': 164.337, 'duration': 8.427}, {'end': 178.469, 'text': 'a pi spark also lets you use the RDDs and come with the default integration of pi4j library.', 'start': 172.764, 'duration': 5.705}, {'end': 181.387, 'text': "We'll learn about RDDs later in this video.", 'start': 179.105, 'duration': 2.282}, {'end': 186.251, 'text': "Now that you know what is PySpark, let's now see the advantages of using Spark with Python.", 'start': 181.767, 'duration': 4.484}, {'end': 189.493, 'text': 'As we all know, Python itself is very simple and easy.', 'start': 186.851, 'duration': 2.642}, {'end': 194.157, 'text': 'So when Spark is written in Python, it makes Apache Spark quite easy to learn and use.', 'start': 190.054, 'duration': 4.103}, {'end': 199.942, 'text': "Moreover, it's a dynamically 
typed language which means RDDs can hold objects of multiple data types.", 'start': 194.738, 'duration': 5.204}, {'end': 203.825, 'text': 'Not only this, it also makes the API simple and comprehensive.', 'start': 200.562, 'duration': 3.263}, {'end': 209.017, 'text': 'and talking about the readability of code, maintenance and familiarity with the python API,', 'start': 204.434, 'duration': 4.583}, {'end': 211.939, 'text': 'for Apache spark is far better than other programming languages.', 'start': 209.017, 'duration': 2.922}, {'end': 217.683, 'text': 'Python also provides various options for visualization, which is not possible using Scala or Java.', 'start': 212.62, 'duration': 5.063}], 'summary': 'Pyspark combines simplicity of python with power of apache spark for data processing.', 'duration': 53.346, 'max_score': 164.337, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg164337.jpg'}], 'start': 63.545, 'title': 'Spark ecosystem and pyspark in python', 'summary': 'Discusses the components of the spark ecosystem, with a focus on pyspark in python, highlighting its advantages such as simplicity, flexibility, and wide range of libraries, making it easy to learn and use.', 'chapters': [{'end': 239.398, 'start': 63.545, 'title': 'Spark ecosystem and pyspark in python', 'summary': 'Discusses the components of the spark ecosystem, with a focus on pyspark in python, highlighting its advantages such as simplicity, flexibility, and wide range of libraries, making it easy to learn and use.', 'duration': 175.853, 'highlights': ["PySpark integrates the simplicity of Python and the power of Apache Spark, making it easy to tame big data and providing default integration of Py4J library. PySpark integrates Python's simplicity with Apache Spark's power, providing default integration of Py4J library for big data processing.", "Python's dynamically typed nature allows RDDs to hold objects of multiple data types, making the API simple and comprehensive. Python's dynamically typed nature enables RDDs to hold objects of multiple data types, simplifying the API.", "Python's readability, maintenance, familiarity, and visualization options make it advantageous for using Spark, especially in comparison to other programming languages like Scala or Java. 
Python's readability, maintenance, familiarity, and visualization options make it advantageous for using Spark, especially compared to other languages."]}], 'duration': 175.853, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg63545.jpg', 'highlights': ["PySpark integrates Python's simplicity with Apache Spark's power, providing default integration of Py4J library for big data processing.", "Python's dynamically typed nature enables RDDs to hold objects of multiple data types, simplifying the API.", "Python's readability, maintenance, familiarity, and visualization options make it advantageous for using Spark, especially compared to other languages."]}, {'end': 559.307, 'segs': [{'end': 286.245, 'src': 'embed', 'start': 258.603, 'weight': 0, 'content': [{'end': 262.991, 'text': 'So in order to install PySpark first make sure that you have Hadoop installed in your system.', 'start': 258.603, 'duration': 4.388}, {'end': 268.88, 'text': 'So, if you want to know more about how to install Hadoop, please check out our Hadoop playlist on YouTube,', 'start': 263.311, 'duration': 5.569}, {'end': 271.123, 'text': 'or you can check out our blog on Edureka website.', 'start': 268.88, 'duration': 2.243}, {'end': 278.701, 'text': 'First of all, you need to go to the Apache spark official website, which is spark.apache.org and a download section.', 'start': 272.077, 'duration': 6.624}, {'end': 286.245, 'text': 'You can download the latest version of spark release, which supports the latest version of Hadoop or Hadoop version 2.7 or above, now,', 'start': 278.721, 'duration': 7.524}], 'summary': 'To install pyspark, ensure hadoop is installed. download latest spark supporting hadoop 2.7+.', 'duration': 27.642, 'max_score': 258.603, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg258603.jpg'}, {'end': 388.865, 'src': 'heatmap', 'start': 329.823, 'weight': 0.899, 'content': [{'end': 335.586, 'text': "Let's now dive deeper into PI spark and learn few of its fundamentals which you need to know in order to work with by spark.", 'start': 329.823, 'duration': 5.763}, {'end': 340.288, 'text': 'Now this timeline shows the various topics which we will be covering under the PI spark fundamentals.', 'start': 336.066, 'duration': 4.222}, {'end': 343.349, 'text': "So let's start off with the very first topic in our list.", 'start': 341.088, 'duration': 2.261}, {'end': 344.65, 'text': 'That is the spark context.', 'start': 343.389, 'duration': 1.261}, {'end': 348.052, 'text': 'The spark context is the heart of any spark application.', 'start': 345.43, 'duration': 2.622}, {'end': 353.214, 'text': 'It sets up internal services and establishes a connection to a spark execution environment.', 'start': 348.392, 'duration': 4.822}, {'end': 361.951, 'text': 'Through a Spark context object, you can create RDDs, accumulators, and broadcast variable, access Spark services, run jobs, and much more.', 'start': 353.649, 'duration': 8.302}, {'end': 370.114, 'text': 'The Spark context allows the Spark driver application to access the cluster through a resource manager, which can be yarn or Sparks cluster manager.', 'start': 362.431, 'duration': 7.683}, {'end': 378.636, 'text': 'The driver program then runs the operations inside the executors on the worker nodes and spark context uses the pi4j to launch a JVM,', 'start': 370.614, 'duration': 8.022}, {'end': 381.078, 'text': 'which in turn creates a Java spark context.', 
'start': 378.636, 'duration': 2.442}, {'end': 388.865, 'text': 'Now there are various parameters which can be used with a spark context object, like the master app name, spark home, the pi files,', 'start': 381.599, 'duration': 7.266}], 'summary': 'Pi spark fundamentals cover topics such as spark context, which sets up internal services and establishes connections to a spark execution environment.', 'duration': 59.042, 'max_score': 329.823, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg329823.jpg'}, {'end': 378.636, 'src': 'embed', 'start': 353.649, 'weight': 2, 'content': [{'end': 361.951, 'text': 'Through a Spark context object, you can create RDDs, accumulators, and broadcast variable, access Spark services, run jobs, and much more.', 'start': 353.649, 'duration': 8.302}, {'end': 370.114, 'text': 'The Spark context allows the Spark driver application to access the cluster through a resource manager, which can be yarn or Sparks cluster manager.', 'start': 362.431, 'duration': 7.683}, {'end': 378.636, 'text': 'The driver program then runs the operations inside the executors on the worker nodes and spark context uses the pi4j to launch a JVM,', 'start': 370.614, 'duration': 8.022}], 'summary': 'Using spark context, you can create rdds, accumulators, access spark services, and run jobs to process data on worker nodes.', 'duration': 24.987, 'max_score': 353.649, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg353649.jpg'}, {'end': 470.753, 'src': 'embed', 'start': 442.162, 'weight': 1, 'content': [{'end': 445.225, 'text': 'It is considered to be the building block of any spark application.', 'start': 442.162, 'duration': 3.063}, {'end': 453.144, 'text': 'The reason behind this is these elements run and operate on multiple nodes to do parallel processing on a cluster, and once you create a RDD,', 'start': 445.54, 'duration': 7.604}, {'end': 455.205, 'text': 'it becomes immutable, and by immutable.', 'start': 453.144, 'duration': 2.061}, {'end': 463.149, 'text': 'I mean that it is an object whose state cannot be modified after it is created, but we can transform its values by applying certain transformation.', 'start': 455.305, 'duration': 7.844}, {'end': 468.332, 'text': 'They have good fault tolerance ability and can automatically recover for almost any failures.', 'start': 463.589, 'duration': 4.743}, {'end': 470.753, 'text': 'This adds an added advantage.', 'start': 468.972, 'duration': 1.781}], 'summary': 'Rdd is a key building block in spark, offering parallel processing on clusters, immutability, fault tolerance, and automatic recovery.', 'duration': 28.591, 'max_score': 442.162, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg442162.jpg'}, {'end': 518.316, 'src': 'embed', 'start': 486.188, 'weight': 3, 'content': [{'end': 489.232, 'text': 'now these transformation work on the principle of lazy evaluation.', 'start': 486.188, 'duration': 3.044}, {'end': 497.312, 'text': 'and transformation are lazy in nature, meaning when we call some operation in RDD, it does not execute immediately.', 'start': 490.066, 'duration': 7.246}, {'end': 499.674, 'text': 'spark maintains the record of the operations.', 'start': 497.312, 'duration': 2.362}, {'end': 507.581, 'text': 'It is being called through with the help of dietic acyclic grass, which is also known as the edges and since the transformations are lazy in nature.', 'start': 
499.935, 'duration': 7.646}, {'end': 511.805, 'text': 'So, when we execute operation anytime by calling an action on the data,', 'start': 508.062, 'duration': 3.743}, {'end': 518.316, 'text': "The lazy evaluation data is not loaded until it's necessary and the moment we call out the action,", 'start': 512.395, 'duration': 5.921}], 'summary': 'Spark transformations use lazy evaluation, delaying execution until necessary.', 'duration': 32.128, 'max_score': 486.188, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg486188.jpg'}, {'end': 559.307, 'src': 'embed', 'start': 530.639, 'weight': 4, 'content': [{'end': 537.341, 'text': 'are the operations which are applied on a RDD to instruct a party spark to apply computation and pass the result back to the driver.', 'start': 530.639, 'duration': 6.702}, {'end': 541.902, 'text': 'few of these actions include collect, the collect as map reduce, take first.', 'start': 537.341, 'duration': 4.561}, {'end': 545.284, 'text': 'Now let me implement few of these for your better understanding.', 'start': 542.503, 'duration': 2.781}, {'end': 550.665, 'text': 'So first of all, let me show you the bash RC file, which I was talking about.', 'start': 547.024, 'duration': 3.641}, {'end': 559.307, 'text': 'So here you can see in the bash RC file.', 'start': 557.446, 'duration': 1.861}], 'summary': 'Rdd operations instruct spark to apply computation and return results. actions include collect, map reduce, and take first.', 'duration': 28.668, 'max_score': 530.639, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg530639.jpg'}], 'start': 239.959, 'title': 'Pyspark installation and rdd operations', 'summary': "Provides guidance on installing pyspark on a linux system with hadoop, and explains spark context's role. 
it also covers the spark program life cycle, rdd operations, lazy transformations, caching, and important rdd operations such as map, flatmap, filter, distinct, reducebykey, mappartition, sort, collect, map, reduce, and take, emphasizing lazy evaluation and parallel computation.", 'chapters': [{'end': 406.141, 'start': 239.959, 'title': 'Installing pyspark and understanding spark context', 'summary': 'Provides a guide on installing pyspark on a linux system, emphasizing the importance of having hadoop installed, and then delves into the fundamentals of spark context, detailing its role in setting up internal services, establishing connections, and launching a jvm.', 'duration': 166.182, 'highlights': ['The chapter provides a guide on installing PySpark on a Linux system It demonstrates the steps to install PySpark on a Red Hat Linux-based system, with the same steps applicable to other Linux systems.', 'Emphasizing the importance of having Hadoop installed It stresses the need to have Hadoop installed before installing PySpark, directing users to resources for installing Hadoop.', 'Detailing the fundamentals of Spark context It explains the role of Spark context as the heart of any Spark application, setting up internal services, establishing connections, and enabling the creation of RDDs, accumulators, and broadcast variables.']}, {'end': 559.307, 'start': 406.141, 'title': 'Spark program life cycle and rdd operations', 'summary': 'Covers the life cycle of a spark program, including creating rdds, lazy transformations, caching, and performing actions, as well as the concept of rdds, their immutability, fault tolerance, and the categorization of transformations and actions. it also highlights important rdd operations such as map, flatmap, filter, distinct, reducebykey, mappartition, sort, collect, map, reduce, and take, emphasizing lazy evaluation and parallel computation.', 'duration': 153.166, 'highlights': ['RDDs are the building blocks of any spark application, operating on multiple nodes for parallel processing, and once created, they become immutable and have good fault tolerance ability. Highlights the significance of RDDs in spark applications, emphasizing their immutability, fault tolerance, and parallel processing capabilities.', 'The chapter explains the categorization of transformations and actions, with transformations being lazy in nature and actions instructing Spark to apply computation and return the result to the driver. Explains the distinction between transformations and actions, emphasizing the lazy nature of transformations and the computation instructions given by actions.', 'Important RDD operations such as map, flatMap, filter, distinct, reduceByKey, mapPartition, sort, collect, map, reduce, and take are highlighted, showcasing their role in instructing Spark for computation and result retrieval. 
Emphasizes the significance of key RDD operations in instructing Spark for computation and result retrieval, showcasing a variety of operations such as map, flatMap, filter, distinct, reduceByKey, mapPartition, sort, collect, map, reduce, and take.']}], 'duration': 319.348, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg239959.jpg', 'highlights': ['The chapter provides a guide on installing PySpark on a Linux system, emphasizing the importance of having Hadoop installed.', 'RDDs are the building blocks of any spark application, operating on multiple nodes for parallel processing, and once created, they become immutable and have good fault tolerance ability.', 'Detailing the fundamentals of Spark context, explaining the role of Spark context as the heart of any Spark application, setting up internal services, establishing connections, and enabling the creation of RDDs, accumulators, and broadcast variables.', 'The chapter explains the categorization of transformations and actions, with transformations being lazy in nature and actions instructing Spark to apply computation and return the result to the driver.', 'Important RDD operations such as map, flatMap, filter, distinct, reduceByKey, mapPartition, sort, collect, map, reduce, and take are highlighted, showcasing their role in instructing Spark for computation and result retrieval.']}, {'end': 953.179, 'segs': [{'end': 629.2, 'src': 'heatmap', 'start': 583.511, 'weight': 0.732, 'content': [{'end': 590.358, 'text': "I'll highlight this one for you the pi spark driver python, which is the Jupiter and we have given it as a notebook.", 'start': 583.511, 'duration': 6.847}, {'end': 592.498, 'text': 'the option available as notebook.', 'start': 591.138, 'duration': 1.36}, {'end': 597.879, 'text': "What it'll do is at the moment I start Spark it will automatically redirect me to the Jupiter notebook.", 'start': 592.538, 'duration': 5.341}, {'end': 606.461, 'text': 'So let me just rename this notebook as RDD tutorial.', 'start': 602.44, 'duration': 4.021}, {'end': 608.821, 'text': "So let's get started.", 'start': 607.401, 'duration': 1.42}, {'end': 617.223, 'text': "So here to load any file into an RDD suppose I'm loading a text file you need to use the essay which is a Spark context.", 'start': 610.242, 'duration': 6.981}, {'end': 622.197, 'text': 'sc.txt file and you need to provide the path of the data which you are going to load.', 'start': 618.035, 'duration': 4.162}, {'end': 629.2, 'text': 'So one thing to keep in mind is that the default path which the RDD takes or the Jupyter Notebook takes is the HDFS path.', 'start': 623.037, 'duration': 6.163}], 'summary': 'Using pi spark driver python in jupyter notebook to load text file into rdd using spark context with default hdfs path.', 'duration': 45.689, 'max_score': 583.511, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg583511.jpg'}, {'end': 622.197, 'src': 'embed', 'start': 592.538, 'weight': 0, 'content': [{'end': 597.879, 'text': "What it'll do is at the moment I start Spark it will automatically redirect me to the Jupiter notebook.", 'start': 592.538, 'duration': 5.341}, {'end': 606.461, 'text': 'So let me just rename this notebook as RDD tutorial.', 'start': 602.44, 'duration': 4.021}, {'end': 608.821, 'text': "So let's get started.", 'start': 607.401, 'duration': 1.42}, {'end': 617.223, 'text': "So here to load any file into an RDD suppose I'm loading a text file you 
need to use the essay which is a Spark context.", 'start': 610.242, 'duration': 6.981}, {'end': 622.197, 'text': 'sc.txt file and you need to provide the path of the data which you are going to load.', 'start': 618.035, 'duration': 4.162}], 'summary': 'Using spark, automatically redirected to jupiter notebook; loading text file into rdd using spark context.', 'duration': 29.659, 'max_score': 592.538, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg592538.jpg'}, {'end': 717.616, 'src': 'heatmap', 'start': 636.003, 'weight': 0.725, 'content': [{'end': 643.626, 'text': 'Now once our sample data is inside the RDD, now to have a look at it, we need to invoke using it the action.', 'start': 636.003, 'duration': 7.623}, {'end': 651.09, 'text': "So let's go ahead and take a look at the first five objects or rather say the first five elements of this particular RDD.", 'start': 644.346, 'duration': 6.744}, {'end': 654.953, 'text': 'The sample data I have taken here is about blockchain.', 'start': 652.011, 'duration': 2.942}, {'end': 659.516, 'text': 'As you can see, we have one, two, three, four and five elements here.', 'start': 655.093, 'duration': 4.423}, {'end': 666.251, 'text': 'Suppose I need to convert all the data into a low case and split it according to word by word.', 'start': 660.888, 'duration': 5.363}, {'end': 671.853, 'text': "So for that, I'll create a function and in that function, I'll pass on this RDD.", 'start': 666.771, 'duration': 5.082}, {'end': 679.877, 'text': "So I'm creating, as you can see here, I'm creating RDD1, that is, a new RDD, and using the map function, or rather say the transformation,", 'start': 672.354, 'duration': 7.523}, {'end': 683.759, 'text': 'and passing on the function which I just created to lower and to split it.', 'start': 679.877, 'duration': 3.882}, {'end': 687.341, 'text': 'So, if we have a look at the output of RDD1,', 'start': 684.74, 'duration': 2.601}, {'end': 695.963, 'text': 'So you can see here all the words are in the lower case and all of them are separated with the help of a space bar.', 'start': 690.179, 'duration': 5.784}, {'end': 702.567, 'text': "Now there's another transformation, which is known as the flat map, to give you a flat and output,", 'start': 697.143, 'duration': 5.424}, {'end': 705.348, 'text': "and I'm passing the same function which I created earlier.", 'start': 702.567, 'duration': 2.781}, {'end': 708.51, 'text': "So let's go ahead and have a look at the output for this one.", 'start': 705.689, 'duration': 2.821}, {'end': 717.616, 'text': 'So as you can see here, we got the first five elements which are the same one as we got here the contrast transactions and and the records.', 'start': 708.951, 'duration': 8.665}], 'summary': 'Using rdd in spark, data transformed to lowercase and split into words, displaying first five elements.', 'duration': 81.613, 'max_score': 636.003, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg636003.jpg'}, {'end': 717.616, 'src': 'embed', 'start': 690.179, 'weight': 1, 'content': [{'end': 695.963, 'text': 'So you can see here all the words are in the lower case and all of them are separated with the help of a space bar.', 'start': 690.179, 'duration': 5.784}, {'end': 702.567, 'text': "Now there's another transformation, which is known as the flat map, to give you a flat and output,", 'start': 697.143, 'duration': 5.424}, {'end': 705.348, 'text': "and I'm passing the same 
function which I created earlier.", 'start': 702.567, 'duration': 2.781}, {'end': 708.51, 'text': "So let's go ahead and have a look at the output for this one.", 'start': 705.689, 'duration': 2.821}, {'end': 717.616, 'text': 'So as you can see here, we got the first five elements which are the same one as we got here the contrast transactions and and the records.', 'start': 708.951, 'duration': 8.665}], 'summary': 'Demonstrating lower case transformation and flat map to produce output.', 'duration': 27.437, 'max_score': 690.179, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg690179.jpg'}, {'end': 886.584, 'src': 'heatmap', 'start': 819.21, 'weight': 0.923, 'content': [{'end': 825.315, 'text': 'and the range we have given is 1 to 10, 000 and will use the reduce action here to see the output.', 'start': 819.21, 'duration': 6.105}, {'end': 830.636, 'text': 'As you can see here, we have the sum of the numbers ranging from 1 to 10, 000.', 'start': 826.035, 'duration': 4.601}, {'end': 832.917, 'text': 'Now, this was all about RDD.', 'start': 830.636, 'duration': 2.281}, {'end': 836.718, 'text': 'The next topic that we have on a list is broadcast and accumulators.', 'start': 832.937, 'duration': 3.781}, {'end': 841.58, 'text': 'Now in Spark, we perform parallel processing through the help of shared variables, or,', 'start': 837.279, 'duration': 4.301}, {'end': 848.742, 'text': 'when the driver sends any task to the executor present on the cluster, a copy of the shared variable is also sent to each node of the cluster,', 'start': 841.58, 'duration': 7.162}, {'end': 851.143, 'text': 'thus maintaining high availability and fault tolerance.', 'start': 848.742, 'duration': 2.401}, {'end': 857.063, 'text': 'Now, this is done in order to accomplish the task and Apache spark suppose two type of shared variables.', 'start': 851.843, 'duration': 5.22}, {'end': 865.485, 'text': 'One of them is broadcast and the other one is the accumulator now broadcast variables are used to save the copy of data on all the notes in a cluster.', 'start': 857.703, 'duration': 7.782}, {'end': 870.326, 'text': 'Whereas the accumulator is the variable that is used for aggregating the incoming information.', 'start': 865.705, 'duration': 4.621}, {'end': 873.827, 'text': 'We are different associative and commutative operations.', 'start': 870.926, 'duration': 2.901}, {'end': 877.779, 'text': 'Now moving on to our next topic, which is a spark configuration.', 'start': 874.617, 'duration': 3.162}, {'end': 886.584, 'text': 'now, spark configuration class provides a set of configurations and parameters that are needed to execute a spark application on the local system or any cluster.', 'start': 877.779, 'duration': 8.805}], 'summary': 'The transcript covers rdd, broadcast, accumulators, and spark configuration in apache spark, emphasizing shared variables for parallel processing and spark application parameters.', 'duration': 67.374, 'max_score': 819.21, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg819210.jpg'}, {'end': 886.584, 'src': 'embed', 'start': 857.703, 'weight': 2, 'content': [{'end': 865.485, 'text': 'One of them is broadcast and the other one is the accumulator now broadcast variables are used to save the copy of data on all the notes in a cluster.', 'start': 857.703, 'duration': 7.782}, {'end': 870.326, 'text': 'Whereas the accumulator is the variable that is used for aggregating the incoming 
information.', 'start': 865.705, 'duration': 4.621}, {'end': 873.827, 'text': 'We are different associative and commutative operations.', 'start': 870.926, 'duration': 2.901}, {'end': 877.779, 'text': 'Now moving on to our next topic, which is a spark configuration.', 'start': 874.617, 'duration': 3.162}, {'end': 886.584, 'text': 'now, spark configuration class provides a set of configurations and parameters that are needed to execute a spark application on the local system or any cluster.', 'start': 877.779, 'duration': 8.805}], 'summary': 'Broadcast variables save data on all nodes, accumulators aggregate information. spark configuration class provides parameters for application execution.', 'duration': 28.881, 'max_score': 857.703, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg857703.jpg'}], 'start': 559.327, 'title': 'Managing frameworks, spark configuration, rdd transformations, and apache spark features', 'summary': 'Discusses managing installed frameworks, highlighting hadoop and spark configuration, redirection to jupyter notebook, loading a text file into an rdd, and covers rdd transformations and actions such as map, flatmap, filter, take, groupby. it also explains using rdd for reduce action, broadcast, accumulators, and spark configuration in apache spark.', 'chapters': [{'end': 643.626, 'start': 559.327, 'title': 'Managing frameworks and spark configuration', 'summary': 'Discusses the process of managing installed frameworks, highlighting the configuration of hadoop and spark, and the redirection to jupyter notebook upon starting spark, along with the method of loading a text file into an rdd and the need to specify the file path for local file system usage.', 'duration': 84.299, 'highlights': ['The process of managing installed frameworks, including Hadoop and Spark, and shifting them to a specific location.', 'Configuration of Spark to automatically redirect to the Jupyter notebook upon starting, with the option to rename the notebook for operations.', 'The method of loading a text file into an RDD using the Spark context, specifying the path of the data.', 'The need to mention the file path for local file system usage when working with the default path taken by the RDD or Jupyter Notebook.']}, {'end': 811.364, 'start': 644.346, 'title': 'Rdd transformations and actions', 'summary': 'Covers the use of rdd transformations such as map, flatmap, and filter, as well as actions like take and groupby, to manipulate and analyze data from an rdd representing blockchain, demonstrating the process of converting data to lowercase, removing stop words, and grouping by the first three and two characters of elements.', 'duration': 167.018, 'highlights': ['Demonstrating the process of converting data to lowercase and splitting it word by word using map transformation. The speaker creates a new RDD using the map transformation to convert all the data into lowercase and split it word by word.', 'Illustrating the use of flatMap transformation to flatten the output and remove stop words using filter transformation. The speaker demonstrates the use of flatMap transformation to give a flat output and then uses the filter transformation with a Lambda function to remove stop words from the data.', 'Showcasing the use of groupBy to group data according to the first three characters of any element. 
The speaker uses the groupBy transformation with a Lambda function to group the data according to the first three characters of any element.', 'Explaining the process of finding the sum of the first 10,000 numbers using an action. The speaker discusses the process of finding the sum of the first 10,000 numbers, showcasing an action to perform this computation.']}, {'end': 953.179, 'start': 811.844, 'title': 'Apache spark: rdd, broadcast, accumulators, and configuration', 'summary': 'Covers initializing and using rdd to perform reduce action on a range of numbers, followed by explanations of broadcast, accumulators, spark configuration, and spark files in apache spark.', 'duration': 141.335, 'highlights': ['In Spark, initializing another RDD using SC.parallelize with a range of 1 to 10,000 and using reduce action to obtain the sum of the numbers. RDD initialization with SC.parallelize, range of 1 to 10,000, usage of reduce action', 'Explanation of broadcast variables used to save data on all nodes in a cluster and accumulators for aggregating incoming information with associative and commutative operations. Broadcast variables for saving data, accumulators for aggregation, associative and commutative operations', 'Overview of Spark configuration class providing configurations and parameters needed to execute a Spark application on a local system or any cluster. Spark configuration class, configurations and parameters, priority over system properties', 'Description of Spark files class methods for resolving file paths added using the spark context add file method. Spark files class methods, resolving file paths, added through spark context add file']}], 'duration': 393.852, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg559327.jpg', 'highlights': ['Configuration of Spark to automatically redirect to the Jupyter notebook upon starting, with the option to rename the notebook for operations.', 'Demonstrating the process of converting data to lowercase and splitting it word by word using map transformation.', 'Explanation of broadcast variables used to save data on all nodes in a cluster and accumulators for aggregating incoming information with associative and commutative operations.']}, {'end': 1356.289, 'segs': [{'end': 1008.138, 'src': 'embed', 'start': 953.179, 'weight': 0, 'content': [{'end': 957.761, 'text': "these are small topics and the next topic that we'll covering in our list are the data frames.", 'start': 953.179, 'duration': 4.582}, {'end': 962.871, 'text': 'Now, data frames in a purchase Park is a distributed collection of rows under named columns,', 'start': 958.427, 'duration': 4.444}, {'end': 966.614, 'text': 'which is similar to the relation database tables or Excel sheets.', 'start': 962.871, 'duration': 3.743}, {'end': 969.857, 'text': 'It also shares common attributes with the RDDs.', 'start': 967.154, 'duration': 2.703}, {'end': 973.466, 'text': 'Few characteristics of data frames are immutable in nature.', 'start': 970.464, 'duration': 3.002}, {'end': 976.568, 'text': 'That is the same as you can create a data frame, but you cannot change it.', 'start': 973.626, 'duration': 2.942}, {'end': 978.429, 'text': 'It allows lazy evaluation.', 'start': 977.028, 'duration': 1.401}, {'end': 985.253, 'text': 'That is the task not executed unless and until an action is triggered and, moreover, data frames are distributed in nature,', 'start': 978.469, 'duration': 6.784}, {'end': 989.075, 'text': 'which are designed for 
processing large collection of structure or semi-structured data.', 'start': 985.253, 'duration': 3.822}, {'end': 998.037, 'text': 'It can be created using different data formats like loading the data from source files such as JSON or CSV or you can load it from an existing RDD.', 'start': 989.575, 'duration': 8.462}, {'end': 1000.937, 'text': 'You can use databases like hive Cassandra.', 'start': 998.097, 'duration': 2.84}, {'end': 1002.257, 'text': 'You can use pocket files.', 'start': 1000.977, 'duration': 1.28}, {'end': 1004.838, 'text': 'You can use CSV XML files.', 'start': 1002.597, 'duration': 2.241}, {'end': 1008.138, 'text': 'There are many sources through which you can create a particular RDD.', 'start': 1005.138, 'duration': 3}], 'summary': 'Data frames in apache spark are distributed, immutable, and designed for processing large collections of structured or semi-structured data from various sources like json, csv, and databases.', 'duration': 54.959, 'max_score': 953.179, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg953179.jpg'}, {'end': 1085.62, 'src': 'heatmap', 'start': 1027.103, 'weight': 3, 'content': [{'end': 1033.867, 'text': 'We are using the spark dot read dot CSV method and you to provide the path which is the local path by default.', 'start': 1027.103, 'duration': 6.764}, {'end': 1041.07, 'text': "It takes the SDFS same as RDD, and one thing to note down here is that I've provided two parameters extra here,", 'start': 1033.926, 'duration': 7.144}, {'end': 1043.632, 'text': 'which is the info schema and the header.', 'start': 1041.07, 'duration': 2.562}, {'end': 1046.733, 'text': 'if we do not provide this as true or we skip it,', 'start': 1043.632, 'duration': 3.101}, {'end': 1054.619, 'text': 'What will happen is that if your data set contains the name of the columns on the first row, it will take those as data as well.', 'start': 1047.116, 'duration': 7.503}, {'end': 1056.259, 'text': 'It will not infer the schema.', 'start': 1055.079, 'duration': 1.18}, {'end': 1063.682, 'text': 'Now once we have loaded the data in our data frame, we need to use the show action to have a look at the output.', 'start': 1057.3, 'duration': 6.382}, {'end': 1070.305, 'text': 'So as you can see here, we have the output which is exactly it gives us the top 20 rows or the particular data set.', 'start': 1064.262, 'duration': 6.043}, {'end': 1071.645, 'text': 'We have the year month.', 'start': 1070.545, 'duration': 1.1}, {'end': 1081.958, 'text': 'They departure time departure delay arrival time arrival delay and so many more attributes not to print the schema of the particular data frame.', 'start': 1071.785, 'duration': 10.173}, {'end': 1085.62, 'text': 'You need the transformation or as say the action of print schema.', 'start': 1082.018, 'duration': 3.602}], 'summary': 'Using spark dot read dot csv method to load data and view schema in data frame.', 'duration': 58.517, 'max_score': 1027.103, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1027103.jpg'}, {'end': 1137.106, 'src': 'heatmap', 'start': 1093.166, 'weight': 4, 'content': [{'end': 1099.491, 'text': 'We have the carrier as string the tail number as string the origin string destination string and so on.', 'start': 1093.166, 'duration': 6.325}, {'end': 1108.378, 'text': "Let's suppose I want to know how many records are there in my database or the data frame I'd rather say So you need the count function for 
this one.", 'start': 1100.071, 'duration': 8.307}, {'end': 1110.32, 'text': 'It will provide you with the results.', 'start': 1109.059, 'duration': 1.261}, {'end': 1118.186, 'text': 'So as you can see here, we have 3.3 million records here.', 'start': 1111.761, 'duration': 6.425}, {'end': 1119.487, 'text': '3, 036, 776 to be exact.', 'start': 1118.466, 'duration': 1.021}, {'end': 1127.093, 'text': 'Now suppose I want to have a look at the flight name, the origin, and the destination of just these three columns for the particular data frame.', 'start': 1120.227, 'duration': 6.866}, {'end': 1129.495, 'text': 'We need to use the select option.', 'start': 1127.593, 'duration': 1.902}, {'end': 1132.037, 'text': 'So as you can see here, we have the top 20 rows.', 'start': 1130.635, 'duration': 1.402}, {'end': 1137.106, 'text': 'Now, what we saw was the select query on this particular data frame.', 'start': 1133.223, 'duration': 3.883}], 'summary': 'The data frame contains 3.3 million records, with 3,036,776 to be exact. the select query was used to display the top 20 rows.', 'duration': 26.321, 'max_score': 1093.166, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1093166.jpg'}, {'end': 1188.066, 'src': 'embed', 'start': 1142.971, 'weight': 6, 'content': [{'end': 1150.337, 'text': 'suppose I want to check the what is the lowest count or the highest count in the particular distance column.', 'start': 1142.971, 'duration': 7.366}, {'end': 1152.238, 'text': 'I need to use the describe function here.', 'start': 1150.577, 'duration': 1.661}, {'end': 1154.74, 'text': "So I'll show you what the summary looks like.", 'start': 1152.859, 'duration': 1.881}, {'end': 1159.624, 'text': 'So the distance the count is the number of rows total number of rows.', 'start': 1155.881, 'duration': 3.743}, {'end': 1161.726, 'text': 'We have the mean the standard deviation.', 'start': 1160.305, 'duration': 1.421}, {'end': 1167.251, 'text': 'We have the minimum value, which is 17, and the maximum value, which is 4983.', 'start': 1161.766, 'duration': 5.485}, {'end': 1171.695, 'text': 'now, this gives you a summary of the particular column, if you want to know that.', 'start': 1167.251, 'duration': 4.444}, {'end': 1173.977, 'text': 'we know that the minimum distance is 17..', 'start': 1171.695, 'duration': 2.282}, {'end': 1178.861, 'text': "Let's go ahead and filter out our data using the filter function in which the distance is 17.", 'start': 1173.977, 'duration': 4.884}, {'end': 1181.883, 'text': 'So you can see here.', 'start': 1178.861, 'duration': 3.022}, {'end': 1188.066, 'text': 'We have one data in which in the 2013 year the minimum distance here is 17.', 'start': 1181.923, 'duration': 6.143}], 'summary': 'Using describe function, the minimum distance is 17, and the maximum distance is 4983. 
filtered data shows one entry with the minimum distance of 17 in the 2013 year.', 'duration': 45.095, 'max_score': 1142.971, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1142971.jpg'}, {'end': 1290.186, 'src': 'embed', 'start': 1266.299, 'weight': 8, 'content': [{'end': 1274.425, 'text': 'If someone is not good or is not acquainted to all these transformation and action and would rather use sequel queries on the data.', 'start': 1266.299, 'duration': 8.126}, {'end': 1279.723, 'text': 'They can use this register dot temp table to create a table for their particular data frame.', 'start': 1274.805, 'duration': 4.918}, {'end': 1287.045, 'text': "What I'll do is convert the NYC flights underscore D of data frame into NYC underscore flight table, which can be used later,", 'start': 1280.083, 'duration': 6.962}, {'end': 1290.186, 'text': 'and sequel queries can be performed on this particular table.', 'start': 1287.045, 'duration': 3.141}], 'summary': 'Using register.temptable, nyc flights data frame is converted to nyc_flight table for performing sequel queries.', 'duration': 23.887, 'max_score': 1266.299, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1266299.jpg'}, {'end': 1343.026, 'src': 'embed', 'start': 1311.991, 'weight': 9, 'content': [{'end': 1315.753, 'text': 'We pass all the sequel query in the sequel contest or sequel function.', 'start': 1311.991, 'duration': 3.762}, {'end': 1316.734, 'text': 'So you can see here.', 'start': 1316.093, 'duration': 0.641}, {'end': 1318.995, 'text': 'We have the minimum airtime as 20.', 'start': 1316.754, 'duration': 2.241}, {'end': 1323.997, 'text': 'now to have a look at the records in which the airtime is minimum 20.', 'start': 1318.995, 'duration': 5.002}, {'end': 1326.578, 'text': 'now we can also use nested sequel queries.', 'start': 1323.997, 'duration': 2.581}, {'end': 1331.341, 'text': 'or suppose, if I want to check which all flights have the minimum airtime as 20.', 'start': 1326.578, 'duration': 4.763}, {'end': 1334.262, 'text': 'now that cannot be done in a simple sequel query.', 'start': 1331.341, 'duration': 2.921}, {'end': 1336.123, 'text': 'We need nested query for that one.', 'start': 1334.722, 'duration': 1.401}, {'end': 1343.026, 'text': 'So selecting Asterix from New York flights where the airtime is in and inside that we have another query,', 'start': 1336.485, 'duration': 6.541}], 'summary': 'Using nested sql queries to find flights with minimum airtime of 20.', 'duration': 31.035, 'max_score': 1311.991, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1311991.jpg'}], 'start': 953.179, 'title': 'Data frames in pyspark', 'summary': 'Introduces data frames in pyspark, a distributed collection of rows under named columns, designed for processing large collections of structured or semi-structured data. 
it covers operations such as loading data, basic transformations, creating temporary tables, executing sql queries, with a sample dataset of 3.3 million records and examples of using count, describe, and nested sql queries.', 'chapters': [{'end': 1046.733, 'start': 953.179, 'title': 'Introduction to data frames in pyspark', 'summary': 'Introduces data frames in pyspark, highlighting that they are a distributed collection of rows under named columns, similar to relation database tables, and are immutable, allowing lazy evaluation and designed for processing large collections of structured or semi-structured data.', 'duration': 93.554, 'highlights': ['Data frames in PySpark are a distributed collection of rows under named columns, similar to relation database tables or Excel sheets, and they share common attributes with RDDs.', 'Data frames are immutable in nature, allowing lazy evaluation and designed for processing large collections of structured or semi-structured data.', 'Data frames can be created using different data formats like JSON or CSV, loading from source files or existing RDDs, and various databases and file types.', 'Creating a data frame in PySpark involves using the spark.read.CSV method and providing parameters like info schema and header for data loading.']}, {'end': 1356.289, 'start': 1047.116, 'title': 'Data frame operations in pyspark', 'summary': 'Introduces data frame operations in pyspark, including loading data, displaying the schema, performing basic transformations such as selecting specific columns, using filter and where clauses, creating temporary tables, and executing sql queries, with a sample dataset of 3.3 million records and examples of using count, describe, and nested sql queries.', 'duration': 309.173, 'highlights': ['The dataset contains 3.3 million records. The dataset loaded into the data frame comprises 3.3 million records.', 'The count function returns 3,036,776 records. The count function reveals that there are 3,036,776 records in the data frame.', "The describe function provides the summary of the 'distance' column with a minimum value of 17 and a maximum value of 4983. The describe function offers a summary of the 'distance' column, indicating a minimum value of 17 and a maximum value of 4983.", 'Using the filter function with the distance as 17 results in one record in the year 2013. Applying the filter function with the distance as 17 yields one record from the year 2013.', 'Creating a temporary table for SQL queries with the register.tempTable function. The register.tempTable function allows the creation of a temporary table for executing SQL queries on the data frame.', 'Executing a nested SQL query to find flights with the minimum airtime of 20. 
A nested SQL query is used to identify flights with the minimum airtime of 20, demonstrating the capability of executing nested SQL queries in PySpark.']}], 'duration': 403.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg953179.jpg', 'highlights': ['Data frames in PySpark are a distributed collection of rows under named columns, similar to relation database tables or Excel sheets, and they share common attributes with RDDs.', 'Data frames are immutable in nature, allowing lazy evaluation and designed for processing large collections of structured or semi-structured data.', 'Data frames can be created using different data formats like JSON or CSV, loading from source files or existing RDDs, and various databases and file types.', 'Creating a data frame in PySpark involves using the spark.read.CSV method and providing parameters like info schema and header for data loading.', 'The dataset contains 3.3 million records. The dataset loaded into the data frame comprises 3.3 million records.', 'The count function returns 3,036,776 records. The count function reveals that there are 3,036,776 records in the data frame.', "The describe function provides the summary of the 'distance' column with a minimum value of 17 and a maximum value of 4983.", 'Using the filter function with the distance as 17 results in one record in the year 2013.', 'Creating a temporary table for SQL queries with the register.tempTable function.', 'Executing a nested SQL query to find flights with the minimum airtime of 20.']}, {'end': 1609.123, 'segs': [{'end': 1380.82, 'src': 'embed', 'start': 1356.729, 'weight': 1, 'content': [{'end': 1363.23, 'text': "So let's get back to our presentation and have a look at the list which we were following we completed data frames next we have storage levels.", 'start': 1356.729, 'duration': 6.501}, {'end': 1369.274, 'text': 'Now, storage level in PySpark is a class which helps in deciding how the RDD should be stored.', 'start': 1363.891, 'duration': 5.383}, {'end': 1374.376, 'text': 'now, based on this, RDDs are either stored in disk or in memory, or in both.', 'start': 1369.274, 'duration': 5.102}, {'end': 1380.82, 'text': 'the class storage level also decides whether the RDD should be serialized or replicate its partition.', 'start': 1374.376, 'duration': 6.444}], 'summary': "Pyspark's storage level class decides rdd storage, serialization, and replication.", 'duration': 24.091, 'max_score': 1356.729, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1356729.jpg'}, {'end': 1398.962, 'src': 'heatmap', 'start': 1374.376, 'weight': 0.707, 'content': [{'end': 1380.82, 'text': 'the class storage level also decides whether the RDD should be serialized or replicate its partition.', 'start': 1374.376, 'duration': 6.444}, {'end': 1385.342, 'text': "for the final and the last topic for all, the today's list is the ML lib.", 'start': 1380.82, 'duration': 4.522}, {'end': 1390.585, 'text': 'now, ML lib is the machine learning API which is provided by Spark, which is also present in Python,', 'start': 1385.342, 'duration': 5.243}, {'end': 1394.618, 'text': 'and this library is heavily used in python for machine learning.', 'start': 1391.095, 'duration': 3.523}, {'end': 1398.962, 'text': 'as well as real-time streaming analytics, a various algorithm supported by this.', 'start': 1394.618, 'duration': 4.344}], 'summary': "Spark's mllib is a heavily used machine learning api in python, 
supporting various algorithms and real-time streaming analytics.", 'duration': 24.586, 'max_score': 1374.376, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1374376.jpg'}, {'end': 1422.529, 'src': 'embed', 'start': 1394.618, 'weight': 0, 'content': [{'end': 1398.962, 'text': 'as well as real-time streaming analytics, a various algorithm supported by this.', 'start': 1394.618, 'duration': 4.344}, {'end': 1403.166, 'text': 'libraries are first of all, we have the spark.ml live now.', 'start': 1398.962, 'duration': 4.204}, {'end': 1409.472, 'text': 'recently, the spy spark mlips supports model-based collaborative filtering by a small set of latent factors,', 'start': 1403.166, 'duration': 6.306}, {'end': 1414.937, 'text': 'and here all the users and the products are described which we can use to predict the missing entries.', 'start': 1409.472, 'duration': 5.465}, {'end': 1422.529, 'text': 'However to learn these latent factors Park dot mlb uses the alternating least square, which is the ALS algorithm.', 'start': 1415.387, 'duration': 7.142}], 'summary': 'Real-time streaming analytics with spark.ml supports model-based collaborative filtering using als algorithm.', 'duration': 27.911, 'max_score': 1394.618, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1394618.jpg'}, {'end': 1480.494, 'src': 'embed', 'start': 1438.012, 'weight': 4, 'content': [{'end': 1442.459, 'text': 'now, frequent pattern matching is mining frequent items, item set,', 'start': 1438.012, 'duration': 4.447}, {'end': 1448.842, 'text': 'sub sequences or other sub structures that are usually among the first steps to analyze a large-scale data set.', 'start': 1442.459, 'duration': 6.383}, {'end': 1452.223, 'text': 'This has been an active research topic in data mining for years.', 'start': 1449.202, 'duration': 3.021}, {'end': 1454.864, 'text': 'We have the linear algebra.', 'start': 1452.844, 'duration': 2.02}, {'end': 1459.306, 'text': 'Now this algorithm supports by Spark MLlib utilities for linear algebra.', 'start': 1455.425, 'duration': 3.881}, {'end': 1461.047, 'text': 'We have collaborative filtering.', 'start': 1459.647, 'duration': 1.4}, {'end': 1464.529, 'text': 'We have classification for binary classification.', 'start': 1461.788, 'duration': 2.741}, {'end': 1469.291, 'text': 'Variant methods are available in spark.mlib packets such as multi-class classification.', 'start': 1464.569, 'duration': 4.722}, {'end': 1472.891, 'text': 'as well as regression analysis in classification.', 'start': 1469.89, 'duration': 3.001}, {'end': 1480.494, 'text': 'Some of the most popular algorithms used are naive bias random forest decision tree and so much and finally we have the linear regression.', 'start': 1472.931, 'duration': 7.563}], 'summary': 'Frequent pattern mining is crucial in data analysis, supported by spark mllib for various algorithms.', 'duration': 42.482, 'max_score': 1438.012, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1438012.jpg'}, {'end': 1535.281, 'src': 'embed', 'start': 1499.634, 'weight': 3, 'content': [{'end': 1505.543, 'text': 'Now here we are going to use a heart disease prediction model and we are going to predict it using the decision tree,', 'start': 1499.634, 'duration': 5.909}, {'end': 1508.287, 'text': 'with the help of classification as well as regression.', 'start': 1505.543, 'duration': 2.744}, {'end': 1511.672, 'text': 
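The ALS-based collaborative filtering mentioned above can be sketched in a few lines with pyspark.mllib. The tiny set of (user, product, rating) triples is invented purely for illustration, and sc is assumed to be an existing SparkContext.

from pyspark.mllib.recommendation import ALS, Rating

ratings = sc.parallelize([
    Rating(0, 0, 4.0), Rating(0, 1, 2.0),
    Rating(1, 1, 3.0), Rating(1, 2, 4.0),
    Rating(2, 0, 5.0), Rating(2, 2, 1.0),
])

# Learn a small set of latent factors with alternating least squares.
model = ALS.train(ratings, rank=2, iterations=10)

# Predict a missing entry: how user 0 might rate product 2.
print(model.predict(0, 2))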
'Now, these are all are part of the ML lib library here.', 'start': 1508.647, 'duration': 3.025}, {'end': 1514.917, 'text': "Let's see how we can perform these types of functions and queries.", 'start': 1511.992, 'duration': 2.925}, {'end': 1535.281, 'text': 'The first of all what we need to do is initialize the spark context.', 'start': 1532.179, 'duration': 3.102}], 'summary': 'Using decision tree model for heart disease prediction with ml lib in spark.', 'duration': 35.647, 'max_score': 1499.634, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1499634.jpg'}], 'start': 1356.729, 'title': 'Pyspark dataframes and mllib', 'summary': 'Covers pyspark data frames, storage levels, and mllib, including storage level classes for deciding rdd storage, serialization, replication, and machine learning api used in python. it also discusses various algorithms supported by spark mllib, such as model-based collaborative filtering, clustering, frequent pattern matching, linear algebra, classification, regression, and heart disease prediction using decision tree.', 'chapters': [{'end': 1394.618, 'start': 1356.729, 'title': 'Pyspark dataframes, storage levels, and mllib', 'summary': 'Discusses pyspark data frames, storage levels, and mllib, covering topics such as storage level classes for deciding rdd storage, serialization, and replication, along with the machine learning api provided by spark, heavily used in python.', 'duration': 37.889, 'highlights': ['ML lib is the machine learning API provided by Spark, heavily used in Python for machine learning.', 'Storage level in PySpark is a class which helps in deciding how the RDD should be stored, determining whether it should be stored in disk, in memory, or in both, and whether the RDD should be serialized or replicate its partition.', 'PySpark data frames were completed before discussing storage levels.']}, {'end': 1609.123, 'start': 1394.618, 'title': 'Spark mllib for heart disease prediction', 'summary': 'Covers the various algorithms supported by spark mllib, including model-based collaborative filtering, clustering, frequent pattern matching, linear algebra, collaborative filtering, classification, regression, and heart disease prediction using decision tree with classification and regression.', 'duration': 214.505, 'highlights': ['Spark MLlib supports model-based collaborative filtering by using the ALS algorithm to learn latent factors, enabling prediction of missing entries. The ALS algorithm in Spark MLlib is used to learn latent factors for model-based collaborative filtering, allowing prediction of missing entries.', 'The chapter explores clustering, frequent pattern matching, and linear algebra supported by Spark MLlib. The chapter covers clustering, frequent pattern matching, and linear algebra as supported by Spark MLlib.', 'Classification and regression are addressed with variant methods available in Spark MLlib packets, including popular algorithms such as naive bias, random forest, and decision tree. The Spark MLlib packets offer variant methods for classification and regression, including popular algorithms like naive bias, random forest, and decision tree.', 'A heart disease prediction model is implemented using decision tree with the help of classification and regression in the MLlib library. 
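Based on the heart-disease demo described above, here is a minimal, hedged sketch of the classification step with pyspark.mllib: parse the cleaned file into LabeledPoint records, split the data 70/30, train a decision tree classifier, and measure the test error. The file name "heart_disease_clean.txt", its layout (comma-separated features with a 0/1 label in the last column), and the tree parameters are assumptions, not the demo's exact values.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

def parse_line(line):
    # Last column is the label, everything before it is a feature.
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[-1], values[:-1])

data = sc.textFile("heart_disease_clean.txt").map(parse_line)

# Standard 70/30 split into training and test sets.
training_data, test_data = data.randomSplit([0.7, 0.3], seed=42)

model = DecisionTree.trainClassifier(
    training_data,
    numClasses=2,
    categoricalFeaturesInfo={},  # the demo flags some features as categorical
    impurity="gini",
    maxDepth=3,
)

# Predict on the test features and compare with the true labels.
predictions = model.predict(test_data.map(lambda lp: lp.features))
labels_and_preds = test_data.map(lambda lp: lp.label).zip(predictions)
test_error = (labels_and_preds.filter(lambda pair: pair[0] != pair[1]).count()
              / float(test_data.count()))
print("Test error:", test_error)
print(model.toDebugString())  # the learned tree and its split rules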
The chapter demonstrates the implementation of a heart disease prediction model using decision tree, classification, and regression in the MLlib library.']}], 'duration': 252.394, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1356729.jpg', 'highlights': ['ML lib is the machine learning API provided by Spark, heavily used in Python for machine learning.', 'Storage level in PySpark is a class which helps in deciding how the RDD should be stored, determining whether it should be stored in disk, in memory, or in both, and whether the RDD should be serialized or replicate its partition.', 'PySpark data frames were completed before discussing storage levels.', 'The chapter demonstrates the implementation of a heart disease prediction model using decision tree, classification, and regression in the MLlib library.', 'The Spark MLlib packets offer variant methods for classification and regression, including popular algorithms like naive bias, random forest, and decision tree.', 'The chapter covers clustering, frequent pattern matching, and linear algebra as supported by Spark MLlib.', 'The ALS algorithm in Spark MLlib is used to learn latent factors for model-based collaborative filtering, allowing prediction of missing entries.']}, {'end': 1832.402, 'segs': [{'end': 1633.583, 'src': 'embed', 'start': 1611.064, 'weight': 0, 'content': [{'end': 1619.22, 'text': 'Now to get a look at the data set here, now you can see here we have zero at many places instead of the question mark, which we had earlier.', 'start': 1611.064, 'duration': 8.156}, {'end': 1623.001, 'text': 'And now we are saving it to a txt file.', 'start': 1620.98, 'duration': 2.021}, {'end': 1627.342, 'text': 'And you can see here after dropping the rules with any empty values.', 'start': 1624.441, 'duration': 2.901}, {'end': 1629.702, 'text': 'We have 297 rows and 14 columns.', 'start': 1627.362, 'duration': 2.34}, {'end': 1633.583, 'text': 'Now, this is what the nuclear data set looks like now.', 'start': 1630.623, 'duration': 2.96}], 'summary': 'Data set has 297 rows and 14 columns after dropping empty values.', 'duration': 22.519, 'max_score': 1611.064, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1611064.jpg'}, {'end': 1709.914, 'src': 'embed', 'start': 1678.089, 'weight': 1, 'content': [{'end': 1681.492, 'text': 'So for that we need to import the PI spark dot mlib dot tree.', 'start': 1678.089, 'duration': 3.403}, {'end': 1688.979, 'text': 'So next what we have to do is split the data into the training and testing data and we split here the data into 70s to 30.', 'start': 1681.512, 'duration': 7.467}, {'end': 1694.944, 'text': 'This is a standard ratio 70 being the training data set and the 30% being the testing data set.', 'start': 1688.979, 'duration': 5.965}, {'end': 1700.469, 'text': 'The next what we do is that we train the model which we are created here using the training set.', 'start': 1695.665, 'duration': 4.804}, {'end': 1704.432, 'text': 'We have created a training model, decision tree dot train classifier.', 'start': 1701.271, 'duration': 3.161}, {'end': 1709.914, 'text': 'We have used the training data, number of classes filed, the categorical feature which we have given.', 'start': 1704.452, 'duration': 5.462}], 'summary': 'Import pi spark mllib tree, split data 70:30, train model using decision tree classifier.', 'duration': 31.825, 'max_score': 1678.089, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1678089.jpg'}, {'end': 1745.466, 'src': 'embed', 'start': 1721.719, 'weight': 2, 'content': [{'end': 1728.622, 'text': 'So here we are creating predictions and we are using the test data to get the predictions through the model which we created here.', 'start': 1721.719, 'duration': 6.903}, {'end': 1731.083, 'text': 'And we are also going to find the test errors here.', 'start': 1729.222, 'duration': 1.861}, {'end': 1736.761, 'text': 'So as you can see here, the test error is 0.2297.', 'start': 1732.079, 'duration': 4.682}, {'end': 1745.466, 'text': 'We have created a classification decision tree model in which the feature less than 12 is three, the value of the features less than zero is 54.', 'start': 1736.761, 'duration': 8.705}], 'summary': 'Created decision tree model with 0.2297 test error.', 'duration': 23.747, 'max_score': 1721.719, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1721719.jpg'}, {'end': 1797.935, 'src': 'embed', 'start': 1766.929, 'weight': 3, 'content': [{'end': 1768.03, 'text': 'Now we are using regression.', 'start': 1766.929, 'duration': 1.101}, {'end': 1777.318, 'text': 'Similarly, we are going to evaluate our model using our test data set and find the test errors, which is the mean squared error here for regression.', 'start': 1769.231, 'duration': 8.087}, {'end': 1779.94, 'text': "So let's have a look at the mean square error here.", 'start': 1777.998, 'duration': 1.942}, {'end': 1783.142, 'text': 'The mean square is 0.168.', 'start': 1780.42, 'duration': 2.722}, {'end': 1783.743, 'text': 'That is good.', 'start': 1783.142, 'duration': 0.601}, {'end': 1793.671, 'text': 'Finally, if we have a look at the learned regression tree model, So you can see we have created the regression tree model to the depth of three,', 'start': 1784.604, 'duration': 9.067}, {'end': 1797.935, 'text': 'with 15 nodes, and here we have all the features and classification of the tree.', 'start': 1793.671, 'duration': 4.264}], 'summary': "Regression model's mean square error is 0.168, with tree depth of 3 and 15 nodes.", 'duration': 31.006, 'max_score': 1766.929, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1766929.jpg'}], 'start': 1611.064, 'title': 'Data preprocessing, classification, and decision tree modeling in pyspark', 'summary': 'Covers data preprocessing resulting in 297 rows and 14 columns, classification using decision tree model with a test error of 0.2297, and regression with a mean squared error of 0.168. 
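The regression half of the demo (trainRegressor, the mean squared error, and printing the learned tree) follows the same pattern. The sketch below is self-contained under the same assumed file layout as the classification sketch earlier, with sc an existing SparkContext; it is an illustration, not the demo's exact code.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

def parse_line(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[-1], values[:-1])

data = sc.textFile("heart_disease_clean.txt").map(parse_line)
training_data, test_data = data.randomSplit([0.7, 0.3], seed=42)

# Train a regression tree; "variance" is the impurity measure used for regression.
reg_model = DecisionTree.trainRegressor(
    training_data,
    categoricalFeaturesInfo={},
    impurity="variance",
    maxDepth=3,
)

predictions = reg_model.predict(test_data.map(lambda lp: lp.features))
labels_and_preds = test_data.map(lambda lp: lp.label).zip(predictions)

# Mean squared error over the test set (the demo reports roughly 0.168).
mse = labels_and_preds.map(lambda pair: (pair[0] - pair[1]) ** 2).mean()
print("Mean squared error:", mse)
print(reg_model.toDebugString())  # depth, node count, and split rules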
it also includes details on building and evaluating decision tree models, splitting data into training and testing sets, and creating regression tree model to the depth of three with 15 nodes.', 'chapters': [{'end': 1677.548, 'start': 1611.064, 'title': 'Mlib data preprocessing and classification', 'summary': 'Presents the process of preprocessing a dataset, including dropping empty values and saving it to a txt file, resulting in 297 rows and 14 columns, and then importing the mllib library for regression and performing classification using the decision tree.', 'duration': 66.484, 'highlights': ['The chapter presents the process of preprocessing a dataset, including dropping empty values and saving it to a txt file, resulting in 297 rows and 14 columns.', 'The chapter demonstrates the import of the MLlib library for regression and the creation of a label point associated with a response, and the conversion of minus one labels to zero.', 'The chapter also covers the process of performing classification using the decision tree.']}, {'end': 1832.402, 'start': 1678.089, 'title': 'Pyspark decision tree modeling', 'summary': 'Details the process of building and evaluating decision tree models in pyspark, including splitting the data into training and testing sets (70% and 30% respectively), training a classification decision tree model with a test error of 0.2297, performing regression using decision tree with a mean squared error of 0.168, and creating a regression tree model to the depth of three with 15 nodes.', 'duration': 154.313, 'highlights': ['The chapter details the process of building and evaluating decision tree models in PySpark, including splitting the data into training and testing sets (70% and 30% respectively), training a classification decision tree model with a test error of 0.2297, performing regression using decision tree with a mean squared error of 0.168, and creating a regression tree model to the depth of three with 15 nodes.', "The test error for the classification decision tree model is 0.2297, indicating the model's performance on the test data set.", "The mean squared error for the regression model is 0.168, demonstrating the accuracy of the regression model's predictions.", 'The regression tree model is created to the depth of three, with 15 nodes, showcasing the complexity and structure of the model.']}], 'duration': 221.338, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PRzSWWsyHZg/pics/PRzSWWsyHZg1611064.jpg', 'highlights': ['The chapter presents the process of preprocessing a dataset, resulting in 297 rows and 14 columns.', 'The chapter details the process of building and evaluating decision tree models in PySpark, including splitting the data into training and testing sets (70% and 30% respectively).', "The test error for the classification decision tree model is 0.2297, indicating the model's performance on the test data set.", "The mean squared error for the regression model is 0.168, demonstrating the accuracy of the regression model's predictions."]}], 'highlights': ['The tutorial includes a demo showcasing how to implement PySpark for real-life use cases.', 'PySpark is heavily used in the industry for real-time analytics and machine learning purposes.', 'The session covers the fundamental concepts of PySpark, including Spark context, DataFrames, MLLab, RDDs, and more.', 'The session also covers the advantages provided by PySpark.', 'The tutorial will guide the audience through the installation process of PySpark in their 
systems.', "PySpark integrates Python's simplicity with Apache Spark's power, providing default integration of Py4J library for big data processing.", "Python's dynamically typed nature enables RDDs to hold objects of multiple data types, simplifying the API.", "Python's readability, maintenance, familiarity, and visualization options make it advantageous for using Spark, especially compared to other languages.", 'The chapter provides a guide on installing PySpark on a Linux system, emphasizing the importance of having Hadoop installed.', 'RDDs are the building blocks of any spark application, operating on multiple nodes for parallel processing, and once created, they become immutable and have good fault tolerance ability.', 'Detailing the fundamentals of Spark context, explaining the role of Spark context as the heart of any Spark application, setting up internal services, establishing connections, and enabling the creation of RDDs, accumulators, and broadcast variables.', 'The chapter explains the categorization of transformations and actions, with transformations being lazy in nature and actions instructing Spark to apply computation and return the result to the driver.', 'Important RDD operations such as map, flatMap, filter, distinct, reduceByKey, mapPartition, sort, collect, map, reduce, and take are highlighted, showcasing their role in instructing Spark for computation and result retrieval.', 'Configuration of Spark to automatically redirect to the Jupyter notebook upon starting, with the option to rename the notebook for operations.', 'Demonstrating the process of converting data to lowercase and splitting it word by word using map transformation.', 'Explanation of broadcast variables used to save data on all nodes in a cluster and accumulators for aggregating incoming information with associative and commutative operations.', 'Data frames in PySpark are a distributed collection of rows under named columns, similar to relation database tables or Excel sheets, and they share common attributes with RDDs.', 'Data frames are immutable in nature, allowing lazy evaluation and designed for processing large collections of structured or semi-structured data.', 'Data frames can be created using different data formats like JSON or CSV, loading from source files or existing RDDs, and various databases and file types.', 'Creating a data frame in PySpark involves using the spark.read.CSV method and providing parameters like info schema and header for data loading.', 'The dataset contains 3.3 million records. The dataset loaded into the data frame comprises 3.3 million records.', 'The count function returns 3,036,776 records. 
The count function reveals that there are 3,036,776 records in the data frame.', "The describe function provides the summary of the 'distance' column with a minimum value of 17 and a maximum value of 4983.", 'Using the filter function with the distance as 17 results in one record in the year 2013.', 'Creating a temporary table for SQL queries with the register.tempTable function.', 'Executing a nested SQL query to find flights with the minimum airtime of 20.', 'ML lib is the machine learning API provided by Spark, heavily used in Python for machine learning.', 'Storage level in PySpark is a class which helps in deciding how the RDD should be stored, determining whether it should be stored in disk, in memory, or in both, and whether the RDD should be serialized or replicate its partition.', 'PySpark data frames were completed before discussing storage levels.', 'The chapter demonstrates the implementation of a heart disease prediction model using decision tree, classification, and regression in the MLlib library.', 'The Spark MLlib packets offer variant methods for classification and regression, including popular algorithms like naive bias, random forest, and decision tree.', 'The chapter covers clustering, frequent pattern matching, and linear algebra as supported by Spark MLlib.', 'The ALS algorithm in Spark MLlib is used to learn latent factors for model-based collaborative filtering, allowing prediction of missing entries.', 'The chapter presents the process of preprocessing a dataset, resulting in 297 rows and 14 columns.', 'The chapter details the process of building and evaluating decision tree models in PySpark, including splitting the data into training and testing sets (70% and 30% respectively).', "The test error for the classification decision tree model is 0.2297, indicating the model's performance on the test data set.", "The mean squared error for the regression model is 0.168, demonstrating the accuracy of the regression model's predictions."]}
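To round off the highlights above, the broadcast variables and accumulators mentioned earlier can be illustrated with a minimal sketch; the lookup table and word list are invented, and sc is assumed to be an existing SparkContext.

words = sc.parallelize(["a", "b", "x", "c"])

# Broadcast: ship a read-only lookup table to every worker once.
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})
total = words.map(lambda w: lookup.value.get(w, 0)).sum()

# Accumulator: count words missing from the table, aggregated on the driver.
unknown = sc.accumulator(0)
words.foreach(lambda w: unknown.add(1) if w not in lookup.value else None)

print(total, unknown.value)  # 6 and 1 for this toy data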