title
Tutorial 1-PySpark With Python-PySpark Introduction and Installation

description
Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool, PySpark. Using PySpark, you can work with RDDs in the Python programming language as well. It is because of a library called Py4J that they are able to achieve this.
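A minimal sketch of what that looks like in practice (the app name and sample data below are illustrative, not from the video):

```python
from pyspark.sql import SparkSession

# Start a local Spark session. Under the hood, Py4J forwards these
# Python calls to the JVM, where the actual Spark engine runs.
spark = SparkSession.builder.appName("PySparkIntro").getOrCreate()

# Distribute a plain Python list as an RDD and apply a transformation.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]

spark.stop()
```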

detail
{'title': 'Tutorial 1-PySpark With Python-PySpark Introduction and Installation', 'heatmap': [{'end': 653.316, 'start': 540.76, 'weight': 0.711}], 'summary': 'Tutorial series introduces apache spark with pyspark, emphasizing its usage in machine learning, data pre-processing, and application in cloud platforms, highlighting its 100 times faster workload processing compared to mapreduce, ease of use in various programming languages, and ability to run on different cloud platforms, while covering pyspark basics including csv data handling and operations.', 'chapters': [{'end': 128.6, 'segs': [{'end': 128.6, 'src': 'embed', 'start': 60.336, 'weight': 0, 'content': [{'end': 65.677, 'text': 'how we can actually pre-process our data set, how we can use the PySpark data frames.', 'start': 60.336, 'duration': 5.341}, {'end': 75.959, 'text': "We'll also try to see how we can implement or how we can use PySpark in cloud platforms like Databricks, Amazon AWS, you know.", 'start': 65.697, 'duration': 10.262}, {'end': 78.36, 'text': "So all these kinds of clouds we'll try to cover.", 'start': 76.36, 'duration': 2}, {'end': 81.942, 'text': 'And remember, Apache Spark is quite handy.', 'start': 79.04, 'duration': 2.902}, {'end': 86.985, 'text': 'Let me tell you, just let me just give you some of the reasons why Apache Spark is pretty much good.', 'start': 82.102, 'duration': 4.883}, {'end': 96.291, 'text': "Because understand, suppose, if you have a huge amount of data, okay, suppose, if I say that I'm having 64 GB data, 128 GB data,", 'start': 88.266, 'duration': 8.025}, {'end': 99.553, 'text': 'you know we may have some kind of systems or standalone systems, you know.', 'start': 96.291, 'duration': 3.262}, {'end': 103.678, 'text': 'where we can have 32 GB of RAM, probably 64 GB of RAM.', 'start': 99.953, 'duration': 3.725}, {'end': 107.022, 'text': "Right now in the workstation that I'm working in, it has 64 GB RAM.", 'start': 103.718, 'duration': 3.304}, {'end': 108.984, 'text': 'So, max to max,', 'start': 107.462, 'duration': 1.522}, {'end': 112.228, 'text': 'it can directly upload a data set of 32 GB,', 'start': 108.984, 'duration': 3.244}, {'end': 113.289, 'text': '48 GB, right?', 'start': 112.228, 'duration': 1.061}, {'end': 116.453, 'text': 'But what if we have a data set of 128 GB?', 'start': 113.729, 'duration': 2.724}, {'end': 119.895, 'text': 'you know, that is the time, guys.', 'start': 117.834, 'duration': 2.061}, {'end': 122.096, 'text': "we don't just depend on a local system.", 'start': 119.895, 'duration': 2.201}, {'end': 128.6, 'text': "we'll try to pre-process that particular data or perform any kind of operation in distributed systems.", 'start': 122.096, 'duration': 6.504}], 'summary': 'Using pyspark for pre-processing and analysis of large datasets on cloud platforms, due to limitations of local systems.', 'duration': 68.264, 'max_score': 60.336, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc60336.jpg'}], 'start': 13.953, 'title': 'Apache spark with pyspark', 'summary': 'Introduces apache spark series, emphasizing the usage of spark with python, mllib for machine learning, data pre-processing, and its application in cloud platforms, stressing the need for distributed systems for processing large datasets.', 'chapters': [{'end': 128.6, 'start': 13.953, 'title': 'Apache spark with pyspark: introduction and applications', 'summary': 'Introduces apache spark series, focusing on using spark with python, covering pyspark library,
mllib for machine learning, data pre-processing, and usage in cloud platforms, highlighting the need for distributed systems for processing large datasets.', 'duration': 114.647, 'highlights': ['Apache Spark is required for processing large datasets, such as 128 GB, which cannot be handled by standalone systems with limited RAM.', 'The chapter covers the usage of the PySpark library, MLlib for machine learning, and data pre-processing, along with implementation in cloud platforms like Databricks and Amazon AWS.', 'The need for distributed systems is emphasized for handling huge amounts of data, such as 128 GB, which exceeds the capacity of local systems like standalone workstations with limited RAM.']}], 'duration': 114.647, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc13953.jpg', 'highlights': ['The need for distributed systems is emphasized for handling huge amounts of data, such as 128 GB, which exceeds the capacity of local systems like standalone workstations with limited RAM.', 'The chapter covers the usage of the PySpark library, MLlib for machine learning, and data pre-processing, along with implementation in cloud platforms like Databricks and Amazon AWS.', 'Apache Spark is required for processing large datasets, such as 128 GB, which cannot be handled by standalone systems with limited RAM.']}, {'end': 644.547, 'segs': [{'end': 156.299, 'src': 'embed', 'start': 128.6, 'weight': 6, 'content': [{'end': 133.202, 'text': 'right, distributed system basically means that there will be multiple systems.', 'start': 128.6, 'duration': 4.602}, {'end': 139.566, 'text': 'you know, where we can actually run these kinds of jobs or processes or try to do any kind of activities that we really want,', 'start': 133.202, 'duration': 6.364}, {'end': 146.751, 'text': 'and definitely apache spark will actually help us to do that, and this has been pretty much amazing.', 'start': 140.146, 'duration': 6.605}, {'end': 150.134, 'text': 'and yes, people wanted these kinds of videos a lot.', 'start': 146.751, 'duration': 3.383}, {'end': 156.299, 'text': "so how we are going to go through this specific playlist is that we'll try to first of all start with the installation.", 'start': 150.134, 'duration': 6.165}], 'summary': 'Apache spark enables running jobs on multiple systems, meeting high demand for related videos.', 'duration': 27.699, 'max_score': 128.6, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc128600.jpg'}, {'end': 193.506, 'src': 'embed', 'start': 168.048, 'weight': 3, 'content': [{'end': 173.532, 'text': 'And yes, we can also use Spark with other programming languages like Java, Scala, R and all, right?', 'start': 168.048, 'duration': 5.484}, {'end': 176.534, 'text': "And we'll try to understand from basics.", 'start': 174.312, 'duration': 2.222}, {'end': 179.016, 'text': 'you know, from basics, how do we read a data set?', 'start': 176.534, 'duration': 2.482}, {'end': 181.578, 'text': 'How do we connect to a data source, probably?', 'start': 179.056, 'duration': 2.522}, {'end': 183.839, 'text': 'How do we play with the data frames?', 'start': 181.958, 'duration': 1.881}, {'end': 190.784, 'text': 'You know, in this Apache Spark, that is your PySpark also, they provide you data structures like data frames, uh,', 'start': 183.859, 'duration': 6.925}, {'end': 193.506, 'text': 'which is pretty much similar to the pandas data frame.', 'start': 190.784, 'duration': 2.722}], 'summary':
'Introduction to using spark with java, scala, and r, covering basic data manipulation and data frame usage.', 'duration': 25.458, 'max_score': 168.048, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc168048.jpg'}, {'end': 235.36, 'src': 'embed', 'start': 211.018, 'weight': 5, 'content': [{'end': 218.964, 'text': "where we'll be able to perform some machine learning algorithm tasks, where we'll be able to do regression, classification, clustering. And finally,", 'start': 211.018, 'duration': 7.946}, {'end': 227.112, 'text': "we'll try to see how we can actually do the same operation in the cloud, where I'll try to show you some examples where we'll be having a huge data set.", 'start': 218.964, 'duration': 8.148}, {'end': 232.938, 'text': 'We will try to do the operation in the clusters of systems, you know, in a distributed system.', 'start': 227.472, 'duration': 5.466}, {'end': 235.36, 'text': "And we'll try to see how we can use Spark in that.", 'start': 233.338, 'duration': 2.022}], 'summary': 'Learn machine learning tasks and use spark for big data analysis in the cloud.', 'duration': 24.342, 'max_score': 211.018, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc211018.jpg'}, {'end': 303.096, 'src': 'embed', 'start': 255.796, 'weight': 0, 'content': [{'end': 259.738, 'text': 'And there, if you have heard of this terminology called MapReduce, right?', 'start': 255.796, 'duration': 3.942}, {'end': 266.622, 'text': 'So trust me, Apache Spark is much faster, 100 times faster than MapReduce also.', 'start': 261, 'duration': 5.622}, {'end': 270.963, 'text': 'And it has some more advantages, that is, ease of use.', 'start': 267.222, 'duration': 3.741}, {'end': 274.224, 'text': 'You can write applications quickly in Java, Scala, Python or R.', 'start': 270.983, 'duration': 3.241}, {'end': 279.245, 'text': "As I said, we'll be focusing on Python where we'll be using a library called PySpark.", 'start': 274.224, 'duration': 5.021}, {'end': 284.087, 'text': 'Then you can also combine SQL, streaming and complex analytics.', 'start': 279.745, 'duration': 4.342}, {'end': 285.648, 'text': 'When I talk about complex analytics,', 'start': 284.107, 'duration': 1.541}, {'end': 292.431, 'text': "I'm basically talking about this MLlib machine learning library that will work definitely well with Apache Spark.", 'start': 285.648, 'duration': 6.783}, {'end': 300.575, 'text': 'And Apache Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone or in the clouds, different types of clouds.', 'start': 293.091, 'duration': 7.484}, {'end': 303.096, 'text': 'guys, when I talk about AWS, Databricks, all these things.', 'start': 300.575, 'duration': 2.521}], 'summary': 'Apache spark is 100 times faster than mapreduce, offers ease of use, and supports multiple programming languages and complex analytics, running on various platforms including aws and databricks.', 'duration': 47.3, 'max_score': 255.796, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc255796.jpg'}], 'start': 128.6, 'title': 'Apache spark: advantages and installation', 'summary': 'Introduces apache spark, covering its 100 times faster workload processing compared to mapreduce, ease of use in various programming languages, and its ability to run on different cloud platforms.
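In code, the installation and verification the chapter describes reduce to a few commands. A rough sketch, assuming a conda environment (the environment name and Python version are arbitrary choices, not from the video):

```python
# Shell steps, run once ("pyspark_env" is an arbitrary name):
#   conda create -n pyspark_env python=3.9
#   conda activate pyspark_env
#   pip install pyspark
# Note: PySpark also needs a Java runtime (JDK) available on the machine.

# Verify the installation from Python:
import pyspark
print(pyspark.__version__)
```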
It also outlines the learning journey from installation to utilizing machine learning algorithms and working with distributed systems.', 'chapters': [{'end': 232.938, 'start': 128.6, 'title': 'Introduction to apache spark', 'summary': 'Introduces apache spark, covering its benefits and uses, and outlines the learning journey from installation to utilizing machine learning algorithms and working with distributed systems.', 'duration': 104.338, 'highlights': ['Apache Spark enables the use of multiple systems for running jobs and processes, providing significant support for various activities. People have shown a high demand for learning about this technology.', 'The learning journey begins with installation and utilization of PySpark, followed by exploring data handling and manipulation using data frames in Apache Spark, which provides data structures similar to Pandas data frames but with different supported operations.', 'The chapter also covers the utilization of Spark with other programming languages like Java, Scala, and R, and delves into performing machine learning tasks using Spark MLlib, including regression, classification, and clustering.', 'The learning journey concludes with a demonstration of performing operations in a distributed system using cloud computing, showcasing examples of working with huge datasets in cluster environments.']}, {'end': 644.547, 'start': 233.338, 'title': 'Apache spark: advantages and installation', 'summary': 'Explores the advantages of apache spark, including its 100 times faster workload processing compared to mapreduce, ease of use in various programming languages, and its ability to run on different cloud platforms. It also covers the installation process of pyspark, including creating a new environment, installing the library, and checking the installation.', 'duration': 411.209, 'highlights': ["Apache Spark's 100 times faster workload processing compared to MapReduce: Apache Spark runs workloads 100 times faster than MapReduce, making it highly efficient for processing large datasets.", 'Ease of use in various programming languages: Apache Spark allows writing applications quickly in Java, Scala, Python, or R, providing versatility and ease of use for developers.', 'Ability to run on different cloud platforms: Apache Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or different cloud platforms such as AWS and Databricks, offering flexibility and scalability for deployment.', 'Installation process of PySpark, including creating a new environment and library installation: The chapter covers the process of creating a new environment, installing the PySpark library using pip, and checking the installation to ensure a smooth setup.']}], 'duration': 515.947, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc128600.jpg', 'highlights': ['Apache Spark runs workloads 100 times faster than MapReduce, making it highly efficient for processing large datasets.', 'Apache Spark allows writing applications quickly in Java, Scala, Python, or R, providing versatility and ease of use for developers.', 'Apache Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or different cloud platforms such as AWS and Databricks, offering flexibility and scalability for deployment.', 'The learning journey begins with installation and utilization of PySpark, followed by exploring data handling and manipulation using data frames in Apache Spark, which provides data structures similar to Pandas data frames but
with different supported operations.', 'The chapter also covers the utilization of Spark with other programming languages like Java, Scala, and R, and delves into performing machine learning tasks using Spark MLlib, including regression, classification, and clustering.', 'The learning journey concludes with a demonstration of performing operations in a distributed system using cloud computing, showcasing examples of working with huge datasets in cluster environments.', 'Apache Spark enables the use of multiple systems for running jobs and processes, providing significant support for various activities.', 'People have shown a high demand for learning about this technology.']}, {'end': 987.31, 'segs': [{'end': 712.245, 'src': 'embed', 'start': 671.21, 'weight': 0, 'content': [{'end': 676.694, 'text': "And here I'm just going to write tips1.csv, right?", 'start': 671.21, 'duration': 5.484}, {'end': 684.158, 'text': "And if I just try to execute it here, I'm getting some error saying that this particular file does not exist.", 'start': 677.354, 'duration': 6.804}, {'end': 685.259, 'text': 'Let me see.', 'start': 684.258, 'duration': 1.001}, {'end': 688.441, 'text': 'I think this file is present.', 'start': 685.279, 'duration': 3.162}, {'end': 698.114, 'text': 'Mm-mm. Just let me see, guys.', 'start': 690.583, 'duration': 7.531}, {'end': 700.376, 'text': 'Why this is not getting executed.', 'start': 698.274, 'duration': 2.102}, {'end': 705.219, 'text': 'Tips 1.', 'start': 700.516, 'duration': 4.703}, {'end': 707.761, 'text': 'DF file open.', 'start': 705.219, 'duration': 2.542}, {'end': 709.703, 'text': 'Here I can see test1.csv.', 'start': 708.362, 'duration': 1.341}, {'end': 712.245, 'text': 'Okay, sorry, I did not write the csv file, I guess.', 'start': 709.823, 'duration': 2.422}], 'summary': 'Encountered error while attempting to execute tips1.csv, but found test1.csv instead.', 'duration': 41.035, 'max_score': 671.21, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc671210.jpg'}, {'end': 810.688, 'src': 'embed', 'start': 777.378, 'weight': 1, 'content': [{'end': 781.661, 'text': "I guess there'll be something like key value that you will be providing as an option.", 'start': 777.378, 'duration': 4.283}, {'end': 784.142, 'text': 'So what you can do, you can just write header,', 'start': 782.001, 'duration': 2.141}, {'end': 790.691, 'text': 'comma, true, so whatever value, the first column, first row value will be there.', 'start': 785.183, 'duration': 5.508}, {'end': 792.914, 'text': 'that will be considered as your header.', 'start': 790.691, 'duration': 2.223}, {'end': 802.886, 'text': "and if I write csv with respect to test one now, I'm just going to read this test one data set Test1.csv.", 'start': 792.914, 'duration': 9.972}, {'end': 810.688, 'text': "Now once I execute this, here you'll be able to see that I'm able to get now name string, age string, okay? 
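The reading pattern being demonstrated looks roughly like this (the file name test1.csv comes from the video; the rest is an illustrative sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadCSV").getOrCreate()

# Without options, Spark assigns default column names (_c0, _c1, ...)
# and treats the header row like any other data row.
df = spark.read.csv("test1.csv")
df.show()

# With the header option set to true, the first row of the file
# is used as the column names instead.
df = spark.read.option("header", "true").csv("test1.csv")
df.show()
```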
But let's see our complete data set.", 'start': 803.326, 'duration': 7.362}], 'summary': 'Demonstrating how to use a key value option when reading a csv file to set headers and access data sets.', 'duration': 33.31, 'max_score': 777.378, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc777378.jpg'}, {'end': 967.476, 'src': 'embed', 'start': 938.627, 'weight': 2, 'content': [{'end': 940.048, 'text': 'uh, where we will.', 'start': 938.627, 'duration': 1.421}, {'end': 943.591, 'text': "uh, probably, okay, PySpark is already there.", 'start': 940.048, 'duration': 3.543}, {'end': 954.448, 'text': "okay, basic introduction, fine, so we will try to do this, uh, and we'll try to cover this entire thing as we go ahead in the next session.", 'start': 943.591, 'duration': 10.857}, {'end': 955.488, 'text': 'remember, guys, again,', 'start': 954.448, 'duration': 1.04}, {'end': 964.634, 'text': "our main aim is basically to make you understand how probably we'll be working in clouds and before that we really need to know all the basic stuff.", 'start': 955.488, 'duration': 9.146}, {'end': 967.476, 'text': 'uh, that we need to understand regarding the PySpark library.', 'start': 964.634, 'duration': 2.842}], 'summary': 'The main aim is to teach about working in clouds and understanding the pyspark library.', 'duration': 28.849, 'max_score': 938.627, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc938627.jpg'}], 'start': 644.987, 'title': 'Pyspark basics and csv data handling', 'summary': 'Covers reading csv data in pyspark, resolving errors, and demonstrating basic operations in pyspark, highlighting its potential for data manipulation and preprocessing in machine learning, and underscoring the importance of installing pyspark.', 'chapters': [{'end': 802.886, 'start': 644.987, 'title': 'Reading and handling csv data in pyspark', 'summary': "Explains how to read a csv data set in pyspark, encountering errors, and resolving issues related to column names and headers, with a demonstration of using the 'show' function to display the data set.", 'duration': 157.899, 'highlights': ["Resolving error in reading CSV file: The speaker encounters an error while reading the CSV file 'tips1.csv' in PySpark, and upon realizing the mistake in the file name, successfully reads the file as 'test1.csv'.", "Displaying columns using the 'show' function: The speaker uses the 'show' function in PySpark to display the columns of the data set, demonstrating the presence of default columns named '_c0' and '_c1', and then mentions the desire to rename the columns.", "Setting header option while reading CSV file: The speaker explains the technique of setting the 'header' option as 'true' when reading a CSV file in PySpark to consider the first row as the header, and demonstrates applying this technique to the data set 'Test1.csv'."]}, {'end': 987.31, 'start': 803.326, 'title': 'Introduction to pyspark basics', 'summary': 'Introduces the basics of pyspark, demonstrating operations like displaying data, checking data types, and understanding the potential of pyspark for data manipulation and preprocessing in the context of machine learning, emphasizing the importance of installing pyspark for upcoming sessions.', 'duration': 183.984, 'highlights': ['The chapter emphasizes the importance of installing PySpark to understand operations such as changing data types, working with data frames, handling null values, and performing data preprocessing, all
crucial for machine learning, reinforcing the significance of PySpark for cloud-based work.', 'The demonstration includes operations like displaying the entire data set with specific columns, checking the data type of the dataframe, utilizing functionalities like head and printSchema to view rows and column information, and understanding the similarities and differences between Pandas and PySpark data frames.']}], 'duration': 342.323, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/WyZmM6K7ubc/pics/WyZmM6K7ubc644987.jpg', 'highlights': ["Resolving error in reading CSV file: The speaker encounters an error while reading the CSV file 'tips1.csv' in PySpark, and upon realizing the mistake in the file name, successfully reads the file as 'test1.csv'.", "Setting header option while reading CSV file: The speaker explains the technique of setting the 'header' option as 'true' when reading a CSV file in PySpark to consider the first row as the header, and demonstrates applying this technique to the data set 'Test1.csv'.", 'The chapter emphasizes the importance of installing PySpark to understand operations such as changing data types, working with data frames, handling null values, and performing data preprocessing, all crucial for machine learning, reinforcing the significance of PySpark for cloud-based work.']}], 'highlights': ['Apache Spark runs workloads 100 times faster than MapReduce, making it highly efficient for processing large datasets.', 'Apache Spark allows writing applications quickly in Java, Scala, Python, or R, providing versatility and ease of use for developers.', 'Apache Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or different cloud platforms such as AWS and Databricks, offering flexibility and scalability for deployment.', 'The chapter covers the usage of the PySpark library, MLlib for machine learning, and data pre-processing, along with implementation in cloud platforms like Databricks and Amazon AWS.', 'The learning journey begins with installation and utilization of PySpark, followed by exploring data handling and manipulation using data frames in Apache Spark, which provides data structures similar to Pandas data frames but with different supported operations.', 'The chapter also covers the utilization of Spark with other programming languages like Java, Scala, and R, and delves into performing machine learning tasks using Spark MLlib, including regression, classification, and clustering.', 'The need for distributed systems is emphasized for handling huge amounts of data, such as 128 GB, which exceeds the capacity of local systems like standalone workstations with limited RAM.', 'The chapter emphasizes the importance of installing PySpark to understand operations such as changing data types, working with data frames, handling null values, and performing data preprocessing, all crucial for machine learning, reinforcing the significance of PySpark for cloud-based work.']}
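The basic DataFrame inspection operations summarized above, as a short sketch (again assuming the test1.csv file from the video; output depends on your data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InspectDF").getOrCreate()
df = spark.read.option("header", "true").csv("test1.csv")

df.show()          # display the data set
df.printSchema()   # column names and types; all strings unless inferSchema is set
print(type(df))    # pyspark.sql.dataframe.DataFrame -- not a Pandas DataFrame
print(df.head(3))  # first rows come back as a list of Row objects,
                   # unlike Pandas, where head() returns a DataFrame
print(df.dtypes)   # (column, type) pairs, similar in spirit to Pandas dtypes
```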