title
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka
description
( Apache Spark Training - https://www.edureka.co/apache-spark-scala-certification-training )
This Edureka Spark SQL Tutorial (Spark SQL Blog: https://goo.gl/DMFzga) will help you understand how Apache Spark offers SQL power in real time. This tutorial also demonstrates a use case on Stock Market Analysis using Spark SQL. Below are the topics covered in this tutorial:
00:53 Limitations of Apache Hive
02:33 Spark SQL Advantages Over Hive
07:17 Spark SQL Success Story
08:49 Spark SQL Features
12:36 Architecture of Spark SQL
15:06 Spark SQL Libraries
17:40 Querying Using Spark SQL
33:13 Demo: Stock Market Analysis With Spark SQL
Check our complete Apache Spark and Scala playlist here: https://goo.gl/ViRJ2K
Introducing Edureka Elevate, a one-of-its-kind software development program where you pay the program fees only once you get a top tech job. If you are a 4th-year engineering student or a fresh graduate, this program is open to you! Learn more: http://bit.ly/2vQKVu6
How does it work?
1. This is a 4-week instructor-led online course with 32 hours of assignments and 20 hours of project work.
2. We offer 24x7 one-on-one live technical support to help you with any problems you might face or any clarifications you may require during the course.
3. At the end of the training you will work on a project, based on which we will provide you with a grade and a verifiable certificate!
- - - - - - - - - - - - - -
About the Course
This Spark training will enable learners to understand how Spark executes in-memory data processing and runs much faster than Hadoop MapReduce. Learners will master Scala programming and get trained on the different APIs that Spark offers, such as Spark Streaming, Spark SQL, Spark RDD, Spark MLlib and Spark GraphX. This Edureka course is an integral part of a Big Data developer's learning path.
After completing the Apache Spark and Scala training, you will be able to:
1) Understand Scala and its implementation
2) Master the concepts of Traits and OOPS in Scala programming
3) Install Spark and implement Spark operations on Spark Shell
4) Understand the role of Spark RDD
5) Implement Spark applications on YARN (Hadoop)
6) Learn Spark Streaming API
7) Implement machine learning algorithms in Spark MLlib API
8) Analyze Hive and Spark SQL architecture
9) Understand Spark GraphX API and implement graph algorithms
10) Implement broadcast variables and accumulators for performance tuning
11) Work on real-time Spark projects
- - - - - - - - - - - - - -
Who should go for this Course?
This course is a must for anyone who aspires to enter the field of big data and keep abreast of the latest developments in fast and efficient processing of ever-growing data using Spark and related projects. The course is ideal for:
1. Big Data enthusiasts
2. Software Architects, Engineers and Developers
3. Data Scientists and Analytics professionals
- - - - - - - - - - - - - -
Why learn Apache Spark?
In this era of ever-growing data, the need to analyze it for meaningful business insights is paramount. There are different big data processing alternatives like Hadoop, Spark, Storm and many more. Spark, however, is unique in providing batch as well as streaming capabilities, making it a preferred choice for lightning-fast big data analysis platforms.
The following Edureka blogs will help you understand the significance of Spark training:
5 Reasons to Learn Spark: https://goo.gl/7nMcS0
Apache Spark with Hadoop, Why it matters: https://goo.gl/I2MCeP
For more information, please write to us at sales@edureka.co or call us at IND: 9606058406 / US: 18338555775 (toll free).
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Customer Review:
Michael Harkins, System Architect, Hortonworks says: “The courses are top rate. The best part is live instruction, with playback. But my favorite feature is viewing a previous class. Also, they are always there to answer questions, and prompt when you open an issue if you are having any trouble. Added bonus ~ you get lifetime access to the course you took!!! Edureka lets you go back later, when your boss says "I want this ASAP!" ~ This is the killer education app... I've taken two courses, and I'm taking two more.”
detail
{'title': 'Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka', 'heatmap': [{'end': 585.71, 'start': 518.058, 'weight': 1}, {'end': 684.928, 'start': 643.042, 'weight': 0.816}, {'end': 1302.605, 'start': 743.042, 'weight': 0.715}], 'summary': 'This tutorial series on spark sql provides an introduction to spark sql, discusses its advantages over hive, its application in real-time data processing for stock market analysis and banking fraud detection, adding schema to rdd, data analysis, and stock class creation with code examples and practical demonstrations.', 'chapters': [{'end': 49.445, 'segs': [{'end': 49.445, 'src': 'embed', 'start': 0.169, 'weight': 0, 'content': [{'end': 5.734, 'text': 'Hello everyone, welcome to this interesting session of Spark SQL tutorial from Edureka.', 'start': 0.169, 'duration': 5.565}, {'end': 11.339, 'text': "So in today's session, we are going to learn about how we will be working with Spark SQL.", 'start': 5.894, 'duration': 5.445}, {'end': 16.264, 'text': 'Now what all you can expect from this course, from this particular session?', 'start': 11.86, 'duration': 4.404}, {'end': 20.607, 'text': 'so you can expect that we will be first learning what is Spark SQL.', 'start': 16.264, 'duration': 4.343}, {'end': 21.708, 'text': 'why is Spark SQL?', 'start': 20.607, 'duration': 1.101}, {'end': 25.971, 'text': 'What are the libraries which are present in Spark SQL?', 'start': 22.509, 'duration': 3.462}, {'end': 29.193, 'text': 'What are the important features of Spark SQL?', 'start': 26.611, 'duration': 2.582}, {'end': 35.437, 'text': 'We will also be doing some hands-on example and in the end we will see some interesting use.', 'start': 29.593, 'duration': 5.844}, {'end': 38.098, 'text': 'case of stock market analysis.', 'start': 35.437, 'duration': 2.661}, {'end': 40.46, 'text': 'Now, why Spark SQL?', 'start': 38.639, 'duration': 1.821}, {'end': 43.441, 'text': 'Is it like, why we are learning it?', 
'start': 40.66, 'duration': 2.781}, {'end': 47.884, 'text': 'Why it is really important for us to know about this Spark SQL site?', 'start': 43.842, 'duration': 4.042}, {'end': 49.445, 'text': 'Is it like really hot in market?', 'start': 47.964, 'duration': 1.481}], 'summary': "Edureka's spark sql tutorial covers basics, libraries, features, hands-on examples, and stock market analysis.", 'duration': 49.276, 'max_score': 0.169, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY169.jpg'}], 'start': 0.169, 'title': 'Spark sql introduction', 'summary': 'Provides an introduction to spark sql, highlighting its importance and key features, and includes a hands-on example. it also discusses the relevance of spark sql in the market.', 'chapters': [{'end': 49.445, 'start': 0.169, 'title': 'Introduction to spark sql', 'summary': 'Provides an introduction to spark sql, covering its importance, key features, and a hands-on example, and ends with a discussion on the relevance of spark sql in the market.', 'duration': 49.276, 'highlights': ['The importance of Spark SQL is emphasized, highlighting its relevance in the market and the need to understand its significance.', 'The session covers the key features and libraries present in Spark SQL, preparing the audience for a comprehensive understanding of the topic.', 'Hands-on examples are included in the session, providing practical exposure to working with Spark SQL, enhancing the learning experience.', 'The chapter concludes with a discussion on the relevance of Spark SQL in the market, particularly in the context of stock market analysis, providing real-world use cases to illustrate its significance.']}], 'duration': 49.276, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY169.jpg', 'highlights': ['The chapter concludes with a discussion on the relevance of Spark SQL in the market, particularly in the context of stock 
market analysis, providing real-world use cases to illustrate its significance.', 'The session covers the key features and libraries present in Spark SQL, preparing the audience for a comprehensive understanding of the topic.', 'Hands-on examples are included in the session, providing practical exposure to working with Spark SQL, enhancing the learning experience.', 'The importance of Spark SQL is emphasized, highlighting its relevance in the market and the need to understand its significance.']}, {'end': 484.116, 'segs': [{'end': 98.601, 'src': 'embed', 'start': 76.067, 'weight': 4, 'content': [{'end': 83.812, 'text': 'and since MapReduce is going to be slower in nature, then definitely your overall Hive query is going to be slower in nature.', 'start': 76.067, 'duration': 7.745}, {'end': 85.133, 'text': 'So that was one challenge.', 'start': 83.932, 'duration': 1.201}, {'end': 93.378, 'text': "So if you have, let's say, less than 200 GB of data, or if you have a smaller set of data this was actually a big challenge that in Hive,", 'start': 85.173, 'duration': 8.205}, {'end': 96.02, 'text': 'your performance was not that great.', 'start': 93.378, 'duration': 2.642}, {'end': 98.601, 'text': 'It also do not have any resuming capability.', 'start': 96.42, 'duration': 2.181}], 'summary': "Mapreduce's slowness affects hive query speed for data under 200gb, lacking resuming capability.", 'duration': 22.534, 'max_score': 76.067, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY76067.jpg'}, {'end': 186.843, 'src': 'embed', 'start': 162.843, 'weight': 2, 'content': [{'end': 170.993, 'text': "So a Hive query which is let's say taking around 10 minutes, in Spark SQL you can finish that same query in less than one minute.", 'start': 162.843, 'duration': 8.15}, {'end': 173.535, 'text': "Don't you think it's an awesome capability of Spark SQL?", 'start': 171.273, 'duration': 2.262}, {'end': 175.557, 'text': 'Definitely 
yes, right?', 'start': 174.176, 'duration': 1.381}, {'end': 179.719, 'text': "Now, second thing is when, if, let's say, you're writing something in Hive?", 'start': 175.917, 'duration': 3.802}, {'end': 186.843, 'text': "Now you can take an example of let's say a company who is let's say developing Hive queries from last 10 years.", 'start': 180.239, 'duration': 6.604}], 'summary': 'Spark sql can process a query 10x faster than hive queries, offering a significant performance improvement.', 'duration': 24, 'max_score': 162.843, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY162843.jpg'}, {'end': 288.785, 'src': 'embed', 'start': 258.345, 'weight': 3, 'content': [{'end': 260.827, 'text': 'Now Spark SQL came up with a smart solution.', 'start': 258.345, 'duration': 2.482}, {'end': 267.851, 'text': "What they said is, even if you're writing the query with Hive, you can execute that Hive query directly through Spark SQL.", 'start': 261.127, 'duration': 6.724}, {'end': 272.054, 'text': "Don't you think it's a kind of very important and awesome facility?", 'start': 268.271, 'duration': 3.783}, {'end': 278.315, 'text': "right?. 
Because even now, if you're a good Hive developer, you need not worry about that.", 'start': 272.054, 'duration': 6.261}, {'end': 281.518, 'text': 'how you will be now migrating to Spark SQL.', 'start': 278.315, 'duration': 3.203}, {'end': 288.785, 'text': 'You can still keep on writing to Hive query and your query will automatically be getting converted to Spark SQL.', 'start': 281.799, 'duration': 6.986}], 'summary': 'Spark sql allows direct execution of hive queries, simplifying migration to spark sql.', 'duration': 30.44, 'max_score': 258.345, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY258345.jpg'}, {'end': 329.282, 'src': 'embed', 'start': 303.018, 'weight': 0, 'content': [{'end': 307.803, 'text': 'Now this sort of facility is you can take leverage even in your Spark SQL.', 'start': 303.018, 'duration': 4.785}, {'end': 312.667, 'text': "So let's say you can do a real time processing and at the same time you can also perform your SQL query.", 'start': 307.823, 'duration': 4.844}, {'end': 314.669, 'text': 'Now with Hive that was a problem.', 'start': 313.087, 'duration': 1.582}, {'end': 321.315, 'text': "You cannot do that because when we talk about Hive, now in Hive it's all about, Hadoop is all about batch processing.", 'start': 314.789, 'duration': 6.526}, {'end': 325.599, 'text': 'Batch processing where you keep historical data and then later you process it.', 'start': 321.735, 'duration': 3.864}, {'end': 329.282, 'text': 'So it definitely Hive also follow the same approach.', 'start': 326.379, 'duration': 2.903}], 'summary': "Spark sql allows real-time processing and sql queries, unlike hive's batch processing.", 'duration': 26.264, 'max_score': 303.018, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY303018.jpg'}, {'end': 400.629, 'src': 'embed', 'start': 375.747, 'weight': 1, 'content': [{'end': 383.313, 'text': 'Now whatever metastore 
you have created with respect to Hive, same metastore you can also use it for your Spark SQL.', 'start': 375.747, 'duration': 7.566}, {'end': 386.856, 'text': 'And that is something which is really awesome about this Spark SQL.', 'start': 383.634, 'duration': 3.222}, {'end': 392.181, 'text': 'that you did not create a new meta store, you need not worry about a new storage space and all.', 'start': 387.717, 'duration': 4.464}, {'end': 397.246, 'text': 'Everything what you have done with respect to your hive, a same meta store you can use it.', 'start': 392.602, 'duration': 4.644}, {'end': 400.629, 'text': "Now you can ask me then how it is faster if they're using same meta store.", 'start': 397.546, 'duration': 3.083}], 'summary': 'Spark sql allows reusing hive metastore, eliminating need for new storage space.', 'duration': 24.882, 'max_score': 375.747, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY375747.jpg'}], 'start': 49.865, 'title': 'Hive performance challenges and spark sql advantages', 'summary': 'Discusses challenges with apache hive performance, focusing on slower mapreduce, impact on query performance for smaller datasets, and lack of resuming capability. 
it also explores advantages of spark sql, emphasizing faster processing speed, seamless execution of hive queries, real-time processing capability, and shared meta store with hive, illustrated with examples like twitter sentiment analysis.', 'chapters': [{'end': 98.601, 'start': 49.865, 'title': 'Challenges with apache hive performance', 'summary': 'Discusses the challenges with apache hive performance, highlighting the slower nature of mapreduce and the impact on overall query performance, particularly for datasets smaller than 200 gb and the lack of resuming capability.', 'duration': 48.736, 'highlights': ['MapReduce is slower in nature, impacting the overall performance of Hive queries, especially for datasets smaller than 200 GB.', 'Hive lacks the capability to resume tasks, posing a challenge for managing large queries or tasks.']}, {'end': 484.116, 'start': 98.621, 'title': 'Advantages of spark sql over hive', 'summary': 'Discusses the advantages of spark sql over hive, highlighting its faster processing speed, seamless execution of hive queries, real-time processing capability, and utilization of the same meta store as hive, along with examples like twitter sentiment analysis.', 'duration': 385.495, 'highlights': ['Spark SQL is faster than Hive, with queries taking less than one minute compared to around 10 minutes in Hive. A Hive query taking around 10 minutes can be finished in less than one minute using Spark SQL.', 'Spark SQL allows seamless execution of Hive queries, enabling easy migration for companies with existing Hive queries. Even if a company has been using Hive for 10 years, they can execute their Hive queries directly through Spark SQL, easing the migration process.', 'Spark SQL offers real-time processing capability, unlike Hive which is focused on batch processing. 
Unlike Hive, Spark SQL enables real-time processing in addition to SQL queries, providing a more versatile processing approach.', 'Spark SQL utilizes the same meta store as Hive, eliminating the need for a new meta store and storage space for Spark SQL. Spark SQL uses the same meta store created for Hive, eliminating the need for a new storage space and meta store for Spark SQL.', 'Example of Twitter sentiment analysis showcases the advantages of Spark SQL in data analysis and processing. The example of Twitter sentiment analysis demonstrates the advantages of using Spark SQL for data analysis and processing, as seen in the session discussing Twitter sentiment analysis.']}], 'duration': 434.251, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY49865.jpg', 'highlights': ['Spark SQL offers real-time processing capability, unlike Hive which is focused on batch processing.', 'Spark SQL utilizes the same meta store as Hive, eliminating the need for a new meta store and storage space for Spark SQL.', 'A Hive query taking around 10 minutes can be finished in less than one minute using Spark SQL.', 'Even if a company has been using Hive for 10 years, they can execute their Hive queries directly through Spark SQL, easing the migration process.', 'MapReduce is slower in nature, impacting the overall performance of Hive queries, especially for datasets smaller than 200 GB.', 'Hive lacks the capability to resume tasks, posing a challenge for managing large queries or tasks.']}, {'end': 1245.686, 'segs': [{'end': 518.058, 'src': 'embed', 'start': 484.457, 'weight': 0, 'content': [{'end': 490.6, 'text': 'So, in this session, as you are noticing what we are doing, we just want to kind of show that, once you are streaming the data in the real time,', 'start': 484.457, 'duration': 6.143}, {'end': 492.921, 'text': 'you can also do a processing using Spark SQL.', 'start': 490.6, 'duration': 2.321}, {'end': 496.143, 'text': 'Thus 
you are doing all the processing at the real time.', 'start': 493.321, 'duration': 2.822}, {'end': 500.746, 'text': 'Similarly, in the stock market analysis, you can use Spark SQL, a lot of queries you can adopt there.', 'start': 496.303, 'duration': 4.443}, {'end': 503.889, 'text': 'In the banking fraud case transactions and all, you can use that.', 'start': 500.946, 'duration': 2.943}, {'end': 510.854, 'text': "So let's say, your credit card currently is getting swiped in India and in next 10 minutes, if your credit card is getting swiped in, let's say,", 'start': 504.189, 'duration': 6.665}, {'end': 511.394, 'text': 'in US.', 'start': 510.854, 'duration': 0.54}, {'end': 512.975, 'text': 'definitely that is not possible, right?', 'start': 511.394, 'duration': 1.581}, {'end': 515.517, 'text': "So let's say, you are doing all that processing real time.", 'start': 513.335, 'duration': 2.182}, {'end': 518.058, 'text': "you're detecting everything with respect to Spark Streaming.", 'start': 515.517, 'duration': 2.541}], 'summary': 'Real-time data processing with spark sql for stock market and banking fraud analysis.', 'duration': 33.601, 'max_score': 484.457, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY484457.jpg'}, {'end': 585.71, 'src': 'heatmap', 'start': 518.058, 'weight': 1, 'content': [{'end': 523.663, 'text': "then you are, let's say, applying your Spark SQL to verify that, whether it's a user trend or not, right?", 'start': 518.058, 'duration': 5.605}, {'end': 526.926, 'text': 'So all those things you want to match up with Spark SQL so you can do that.', 'start': 523.962, 'duration': 2.964}, {'end': 529.228, 'text': 'Similarly, the medical domain, you can use that.', 'start': 527.266, 'duration': 1.962}, {'end': 531.469, 'text': "Let's talk about some Spark SQL features.", 'start': 529.468, 'duration': 2.001}, {'end': 533.491, 'text': 'So there will be some features related to it.', 'start': 531.509, 
'duration': 1.982}, {'end': 541.458, 'text': 'Now, you can use what happens when the SQL got combined with the Spark, we started calling it as Spark SQL.', 'start': 534.192, 'duration': 7.266}, {'end': 547.302, 'text': 'Now, when definitely we are talking about SQL, we are talking about either a structured data or a semi-structured data.', 'start': 541.578, 'duration': 5.724}, {'end': 550.145, 'text': 'Now, SQL queries cannot deal with the unstructured data.', 'start': 547.623, 'duration': 2.522}, {'end': 552.667, 'text': 'So that is definitely one of the things you need to keep in mind.', 'start': 550.165, 'duration': 2.502}, {'end': 556.77, 'text': 'Now your Spark SQL also support various data formats.', 'start': 553.007, 'duration': 3.763}, {'end': 558.391, 'text': 'You can get the data from Parquet.', 'start': 556.97, 'duration': 1.421}, {'end': 566.457, 'text': 'You must have heard about Parquet that it is a columnar-based storage and it is kind of very much a compressed format of the data, what you have,', 'start': 558.731, 'duration': 7.726}, {'end': 567.758, 'text': "but it's not human readable.", 'start': 566.457, 'duration': 1.301}, {'end': 575.424, 'text': 'Similarly, you must have heard about JSON, Avro, where we keep the value as a key value pair, Hive, Cassandra, right? 
These are NoSQL DBs.', 'start': 568.058, 'duration': 7.366}, {'end': 577.846, 'text': 'So you can get all the data from these sources.', 'start': 575.684, 'duration': 2.162}, {'end': 582.329, 'text': 'Now you can also convert your SQL queries to your RDD way.', 'start': 578.166, 'duration': 4.163}, {'end': 585.71, 'text': 'So you will be able to perform all the transformation steps.', 'start': 582.449, 'duration': 3.261}], 'summary': 'Spark sql enables processing structured and semi-structured data, supporting various formats and integration with nosql databases.', 'duration': 67.652, 'max_score': 518.058, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY518058.jpg'}, {'end': 666.633, 'src': 'embed', 'start': 643.042, 'weight': 4, 'content': [{'end': 650.525, 'text': 'If it is not available, you can create a UDF, means user defined function, and you can directly execute that user defined function.', 'start': 643.042, 'duration': 7.483}, {'end': 651.886, 'text': 'and get your desired set.', 'start': 650.825, 'duration': 1.061}, {'end': 659.609, 'text': "So this is one example where we have shown that you can convert let's say, if you don't have an uppercase API present in Spark SQL,", 'start': 652.226, 'duration': 7.383}, {'end': 663.311, 'text': 'how you can create a simple UDF for it and can execute it.', 'start': 659.609, 'duration': 3.702}, {'end': 666.633, 'text': "So if you notice here what we are doing, let's say this is my data.", 'start': 663.631, 'duration': 3.002}], 'summary': 'Create udf to execute a user-defined function to get desired data.', 'duration': 23.591, 'max_score': 643.042, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY643042.jpg'}, {'end': 684.928, 'src': 'heatmap', 'start': 643.042, 'weight': 0.816, 'content': [{'end': 650.525, 'text': 'If it is not available, you can create a UDF, means user defined function, and you can 
directly execute that user defined function.', 'start': 643.042, 'duration': 7.483}, {'end': 651.886, 'text': 'and get your desired set.', 'start': 650.825, 'duration': 1.061}, {'end': 659.609, 'text': "So this is one example where we have shown that you can convert let's say, if you don't have an uppercase API present in Spark SQL,", 'start': 652.226, 'duration': 7.383}, {'end': 663.311, 'text': 'how you can create a simple UDF for it and can execute it.', 'start': 659.609, 'duration': 3.702}, {'end': 666.633, 'text': "So if you notice here what we are doing, let's say this is my data.", 'start': 663.631, 'duration': 3.002}, {'end': 670.315, 'text': 'So, if you notice, in this case this is data set.', 'start': 666.933, 'duration': 3.382}, {'end': 671.495, 'text': 'is my data part right?', 'start': 670.315, 'duration': 1.18}, {'end': 673.696, 'text': "So this is I'm generating as a sequence.", 'start': 671.795, 'duration': 1.901}, {'end': 675.777, 'text': "I'm creating it as a data frame.", 'start': 673.696, 'duration': 2.081}, {'end': 677.158, 'text': 'see this toDF part here.', 'start': 675.777, 'duration': 1.381}, {'end': 684.928, 'text': 'Now after that, we are creating an upper UDF here and notice we are converting any value which is coming to my uppercase.', 'start': 677.658, 'duration': 7.27}], 'summary': 'You can create a udf to execute a user defined function and convert values to uppercase.', 'duration': 41.886, 'max_score': 643.042, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY643042.jpg'}, {'end': 919.26, 'src': 'embed', 'start': 887.605, 'weight': 5, 'content': [{'end': 890.067, 'text': 'So this thing is called as DataFrames.', 'start': 887.605, 'duration': 2.462}, {'end': 892.269, 'text': 'So that is called your DataFrame.', 'start': 891.028, 'duration': 1.241}, {'end': 893.85, 'text': 'So that is what we are going to do.', 'start': 892.609, 'duration': 1.241}, {'end': 896.473, 'text': 'So we 
are going to convert it to do a DataFrame API.', 'start': 893.87, 'duration': 2.603}, {'end': 903.839, 'text': 'then, using the DataFrame DSLs or by using Spark SQL or SQL, you will be processing the results and giving the output.', 'start': 896.473, 'duration': 7.366}, {'end': 905.661, 'text': 'We will learn about all these things in detail.', 'start': 903.879, 'duration': 1.782}, {'end': 908.409, 'text': "So let's see this Spark SQL libraries.", 'start': 906.587, 'duration': 1.822}, {'end': 919.26, 'text': 'Now there are multiple APIs available to us, like we have data source API, we have data frame API, We have interpreter and optimizer and SQL service.', 'start': 909.07, 'duration': 10.19}], 'summary': 'Learning about dataframes, dataframe api, and spark sql for data processing.', 'duration': 31.655, 'max_score': 887.605, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY887605.jpg'}, {'end': 965.325, 'src': 'embed', 'start': 934.743, 'weight': 3, 'content': [{'end': 939.164, 'text': 'So as you can notice, in Spark SQL, we can fetch the data using multiple sources.', 'start': 934.743, 'duration': 4.421}, {'end': 946.009, 'text': 'You can get it from Hive, Pig, Cassandra, CSV, Apache HBase, DBase, Oracle DB, so many formats are available.', 'start': 939.484, 'duration': 6.525}, {'end': 951.894, 'text': 'So this API is going to help you to get all the data, to read all the data, store it wherever you want to use it.', 'start': 946.09, 'duration': 5.804}, {'end': 959.1, 'text': 'Now after that, your DataFrame API is going to help you to convert that into a named column and row.', 'start': 952.275, 'duration': 6.825}, {'end': 965.325, 'text': "Remember I just explained you that how you store the data in that because here you're not keeping it like an RDD.", 'start': 959.18, 'duration': 6.145}], 'summary': 'Spark sql fetches data from multiple sources like hive, pig, cassandra, csv, apache hbase, dbase, oracle db, 
and enables storing and reading data with dataframe api.', 'duration': 30.582, 'max_score': 934.743, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY934743.jpg'}], 'start': 484.457, 'title': 'Spark sql for real-time data processing', 'summary': 'Introduces spark sql for real-time data processing, highlighting its application in stock market analysis and banking fraud detection. it covers its features, architecture, performance benefits over hadoop, support for data formats, udfs, and dataframe apis, and sql service.', 'chapters': [{'end': 518.058, 'start': 484.457, 'title': 'Real-time data processing with spark sql', 'summary': 'Highlights the use of spark sql for real-time data processing in various scenarios like stock market analysis and banking fraud detection, emphasizing the real-time processing aspect and the ability to detect fraudulent transactions.', 'duration': 33.601, 'highlights': ['Using Spark SQL for real-time data processing in various scenarios like stock market analysis and banking fraud detection', "Emphasizing the ability to detect fraudulent transactions in real time by using Spark Streaming", 'Demonstrating the capability of processing streaming data in real time using Spark SQL']}, {'end': 1245.686, 'start': 518.058, 'title': 'Introduction to spark sql', 'summary': 'Covers the features and architecture of spark sql, including its performance in comparison to hadoop, support for various data formats, the creation of udfs, and the data source and dataframe apis, and sql service, with a focus on the performance benefits of dataframes over rdds.', 'duration': 727.628, 'highlights': ['Spark SQL performance comparison with Hadoop Spark SQL outperforms Hadoop, as demonstrated by the red color graph showing better performance in comparison to the blue color graph.', 'Support for various data formats Spark SQL supports data formats including Parquet, JSON, Avro, Hive, and Cassandra, enabling data 
retrieval from diverse sources.', 'Creation of User Defined Functions (UDFs) Users can create UDFs to perform custom functions, such as converting data to uppercase, when standard functions are not available.', 'Introduction of DataFrames and their performance benefits DataFrames provide named columns and rows, offering better performance compared to RDDs, particularly in terms of deserialization for faster data retrieval.']}], 'duration': 761.229, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY484457.jpg', 'highlights': ['Using Spark SQL for real-time data processing in various scenarios like stock market analysis and banking fraud detection', 'Demonstrating the capability of processing streaming data in real time using Spark SQL', "Emphasizing the ability to detect fraudulent transactions in real time by using Spark Streaming", 'Support for various data formats including Parquet, JSON, Avro, Hive, and Cassandra', 'Creation of User Defined Functions (UDFs) for custom functions', 'Introduction of DataFrames and their performance benefits over RDDs']}, {'end': 1709.344, 'segs': [{'end': 1314.259, 'src': 'embed', 'start': 1279.159, 'weight': 0, 'content': [{'end': 1282.88, 'text': 'Now, in order to add the schema to RDD, what we are going to do.', 'start': 1279.159, 'duration': 3.721}, {'end': 1285.781, 'text': 'so in this case also, you can look at.', 'start': 1282.88, 'duration': 2.901}, {'end': 1289.742, 'text': 'we are importing all the values, like we are importing all the libraries, whatever are required.', 'start': 1285.781, 'duration': 3.961}, {'end': 1296.863, 'text': 'Then, after that, we are using this Spark context text file, reading the data, splitting it with respect to comma,', 'start': 1290.202, 'duration': 6.661}, {'end': 1302.605, 'text': "then mapping the attributes to an employee case that's what we have done and converting this values to integer.", 'start': 1296.863, 'duration': 5.742}, 
{'end': 1307.695, 'text': 'So in the end, we are converting it with toDF, right? After that, we are going to create a temporary view or table.', 'start': 1302.932, 'duration': 4.763}, {'end': 1309.956, 'text': "So let's create this temporary view, employee.", 'start': 1307.995, 'duration': 1.961}, {'end': 1314.259, 'text': 'Then we are going to use spark.sql and pass in our SQL query.', 'start': 1310.196, 'duration': 4.063}], 'summary': 'Adding schema to rdd, reading data, mapping attributes, converting with toDF, creating a temporary view, and using spark.sql for sql query.', 'duration': 35.1, 'max_score': 1279.159, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY1279159.jpg'}, {'end': 1370.279, 'src': 'embed', 'start': 1336.392, 'weight': 1, 'content': [{'end': 1337.593, 'text': 'This is not even supported.', 'start': 1336.392, 'duration': 1.201}, {'end': 1341.256, 'text': 'So you cannot do select * from your data frame.', 'start': 1337.994, 'duration': 3.262}, {'end': 1346.359, 'text': 'So instead of that, what we need to do is we need to create a temporary table or a temporary view.', 'start': 1341.536, 'duration': 4.823}, {'end': 1349.622, 'text': 'So you can notice here we are using this create or replace temp view.', 'start': 1346.659, 'duration': 2.963}, {'end': 1352.944, 'text': 'Why replace? 
Because if it is already existing overwrite on top of it.', 'start': 1349.662, 'duration': 3.282}, {'end': 1358.209, 'text': 'So now we are creating a temporary table which will be exactly similar to my test data frame.', 'start': 1353.004, 'duration': 5.205}, {'end': 1363.573, 'text': 'Now you can just directly execute all the query on your temporary view or temporary table.', 'start': 1358.729, 'duration': 4.844}, {'end': 1370.279, 'text': "So you can notice here instead of using employee DF which was our data frame, I'm using here temporary view.", 'start': 1363.593, 'duration': 6.686}], 'summary': 'Data frame queries executed on temporary view for support.', 'duration': 33.887, 'max_score': 1336.392, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY1336392.jpg'}, {'end': 1439.132, 'src': 'embed', 'start': 1409.694, 'weight': 2, 'content': [{'end': 1415.864, 'text': 'Okay, so this is what we are going to do because remember in the employee case class, we have the name and the age column that we want to map now.', 'start': 1409.694, 'duration': 6.17}, {'end': 1419.029, 'text': 'Now in this case, we are mapping the names to the ages.', 'start': 1416.084, 'duration': 2.945}, {'end': 1424.718, 'text': 'So you can notice that we are doing for ages of our youngest Df data frame that what we have created earlier.', 'start': 1419.309, 'duration': 5.409}, {'end': 1426.58, 'text': 'and the result is an array.', 'start': 1425.138, 'duration': 1.442}, {'end': 1431.545, 'text': "So the result what you're going to get will be an array with the name mapped to your respective ages.", 'start': 1426.72, 'duration': 4.825}, {'end': 1432.706, 'text': 'You can see this output here.', 'start': 1431.585, 'duration': 1.121}, {'end': 1439.132, 'text': 'So you can see that this is getting mapped, right? 
So we are seeing this output, like name is John, age is 28.', 'start': 1433.026, 'duration': 6.106}], 'summary': 'Mapping names to ages in employee case class, resulting in an array output.', 'duration': 29.438, 'max_score': 1409.694, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY1409694.jpg'}, {'end': 1683.476, 'src': 'embed', 'start': 1651.442, 'weight': 4, 'content': [{'end': 1653.682, 'text': "Now, let's talk about JSON data.", 'start': 1651.442, 'duration': 2.24}, {'end': 1659.104, 'text': "Now, when we talk about JSON data, let's talk about how we can load our files and work on this.", 'start': 1654.142, 'duration': 4.962}, {'end': 1662.805, 'text': "So in this case, we will be first, let's say, importing our libraries.", 'start': 1659.484, 'duration': 3.321}, {'end': 1669.868, 'text': 'Once we are done with that, we can just call read.json to bring in our employee.json here.', 'start': 1663.185, 'duration': 6.683}, {'end': 1671.869, 'text': 'See, this is the execution of this part.', 'start': 1670.148, 'duration': 1.721}, {'end': 1676.792, 'text': 'Now similarly, we can also write back to Parquet, or we can also read the values from Parquet.', 'start': 1672.11, 'duration': 4.682}, {'end': 1683.476, 'text': "You can notice this, if you want to write, let's say, this employeeDF data frame to Parquet.", 'start': 1677.132, 'duration': 6.344}], 'summary': 'Discussion on loading json data and working with file formats in spark using libraries and examples.', 'duration': 32.034, 'max_score': 1651.442, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY1651442.jpg'}], 'start': 1246.007, 'title': 'Adding and working with spark schema', 'summary': 'Covers adding schema to rdd using spark sql, creating a temporary view for executing sql queries, mapping names to ages in a data frame, and working with json 
data, including loading and writing to parquet format, demonstrated with code examples and explanations.', 'chapters': [{'end': 1443.517, 'start': 1246.007, 'title': 'Adding schema to rdd and spark sql', 'summary': 'Covers adding schema to rdd using spark sql, creating a temporary view for executing sql queries, and mapping names to ages in a data frame, demonstrating the process with code examples and explanations.', 'duration': 197.51, 'highlights': ['Creating a temporary view for executing SQL queries by using create or replace temp view to overcome the limitation of executing SQL queries directly on a data frame.', 'Mapping names to ages in a data frame using map encoder from the implicit class, resulting in an array with the name mapped to respective ages.', 'Adding schema to RDD using Spark SQL, reading and splitting the data, mapping the attributes to an employee case, and converting values to integers before converting to a data frame.']}, {'end': 1709.344, 'start': 1443.937, 'title': 'Working with spark schema and json data', 'summary': 'Covers the process of importing and mapping rdd schema, creating a data frame, and executing sql queries using spark, as well as working with json data, including loading and writing to parquet format.', 'duration': 265.407, 'highlights': ['The chapter covers the process of importing and mapping RDD schema, creating a data frame, and executing SQL queries using Spark Explains the process of importing type and row classes, creating RDD, defining schema, mapping values, and creating a data frame for executing SQL queries.', 'Working with JSON data, including loading and writing to parquet format Describes the process of importing libraries, reading JSON data, and writing to parquet format for non-human-readable data.']}], 'duration': 463.337, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY1246007.jpg', 'highlights': ['Adding schema to RDD using Spark SQL, reading and 
splitting the data, mapping the attributes to an employee case, and converting values to integers before converting to a data frame.', 'Creating a temporary view for executing SQL queries by using create or replace temp view to overcome the limitation of executing SQL queries directly on a data frame.', 'Mapping names to ages in a data frame using map encoder from the implicit class, resulting in an array with the name mapped to respective ages.', 'The chapter covers the process of importing and mapping RDD schema, creating a data frame, and executing SQL queries using Spark.', 'Working with JSON data, including loading and writing to parquet format Describes the process of importing libraries, reading JSON data, and writing to parquet format for non-human-readable data.']}, {'end': 2731.81, 'segs': [{'end': 1759.143, 'src': 'embed', 'start': 1730.366, 'weight': 4, 'content': [{'end': 1732.429, 'text': 'This is how we can execute all these things up.', 'start': 1730.366, 'duration': 2.063}, {'end': 1736.554, 'text': "Now once we have done all this, let's see how we can create our data frames.", 'start': 1732.729, 'duration': 3.825}, {'end': 1739.017, 'text': "So let's create this file path.", 'start': 1736.974, 'duration': 2.043}, {'end': 1741.8, 'text': "So let's say we have created this file, employed.json.", 'start': 1739.057, 'duration': 2.743}, {'end': 1745.325, 'text': 'After that, we can create a data frame from our JSON path.', 'start': 1742.381, 'duration': 2.944}, {'end': 1748.068, 'text': 'So we are creating this by using read.json.', 'start': 1745.845, 'duration': 2.223}, {'end': 1749.85, 'text': 'then we can print the schema.', 'start': 1748.348, 'duration': 1.502}, {'end': 1750.551, 'text': 'What does it do?', 'start': 1750.01, 'duration': 0.541}, {'end': 1755.278, 'text': 'This is going to print the schema of my employee data frame.', 'start': 1750.892, 'duration': 4.386}, {'end': 1759.143, 'text': 'okay?. 
So we are going to use this print schema to print up all the values.', 'start': 1755.278, 'duration': 3.865}], 'summary': 'Demonstrates creating data frames from json file and printing schema.', 'duration': 28.777, 'max_score': 1730.366, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY1730366.jpg'}, {'end': 1795.326, 'src': 'embed', 'start': 1770.567, 'weight': 6, 'content': [{'end': 1775.99, 'text': "So let's say we are executing our SQL query from employee where age is between 18 and 30.", 'start': 1770.567, 'duration': 5.423}, {'end': 1778.692, 'text': "So this kind of SQL query, let's say we want to do, we can get that.", 'start': 1775.99, 'duration': 2.702}, {'end': 1780.614, 'text': 'And then we can see the output also.', 'start': 1778.992, 'duration': 1.622}, {'end': 1781.634, 'text': "Let's see this execution.", 'start': 1780.654, 'duration': 0.98}, {'end': 1788.179, 'text': "So you can see that all the employees whose age are let's say between 18 and 30, that is showing up in the output.", 'start': 1782.015, 'duration': 6.164}, {'end': 1790.882, 'text': "Now let's see this RDD operation way.", 'start': 1788.599, 'duration': 2.283}, {'end': 1795.326, 'text': 'Now what you can do, so we are going to create this RDD, other employee RDD.', 'start': 1791.122, 'duration': 4.204}], 'summary': 'Executing sql query for employees aged 18-30 and creating employee rdd.', 'duration': 24.759, 'max_score': 1770.567, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY1770567.jpg'}, {'end': 1953.023, 'src': 'embed', 'start': 1925.125, 'weight': 5, 'content': [{'end': 1929.748, 'text': 'So we can say select count from SRC to select the number of keys in the SRC tables.', 'start': 1925.125, 'duration': 4.623}, {'end': 1931.55, 'text': 'Now select all the records.', 'start': 1930.089, 'duration': 1.461}, {'end': 1934.371, 'text': 'So we can say that key select key 
comma value.', 'start': 1931.97, 'duration': 2.401}, {'end': 1938.154, 'text': 'So you can see that we can perform all of our hive operations here on this.', 'start': 1934.592, 'duration': 3.562}, {'end': 1941.856, 'text': 'Similarly, we can create our data set string DS from Spark DF.', 'start': 1938.794, 'duration': 3.062}, {'end': 1945.439, 'text': 'So you can see this also by using SQL DF what we already have.', 'start': 1941.876, 'duration': 3.563}, {'end': 1949.962, 'text': 'We can just say map and then provide the case class and can map this key comma value pair.', 'start': 1945.499, 'duration': 4.463}, {'end': 1953.023, 'text': 'and then in the end we can show up all this value.', 'start': 1950.522, 'duration': 2.501}], 'summary': 'Perform hive operations, create dataset from spark df, and use sql df.', 'duration': 27.898, 'max_score': 1925.125, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY1925125.jpg'}, {'end': 2054.473, 'src': 'embed', 'start': 2019.476, 'weight': 1, 'content': [{'end': 2025.822, 'text': "So now let's say a company has collected a lot of data for different 10 companies and they want to do some computation.", 'start': 2019.476, 'duration': 6.346}, {'end': 2028.484, 'text': "Let's say they want to compute the average closing price.", 'start': 2025.962, 'duration': 2.522}, {'end': 2031.687, 'text': 'They want to list the companies with the highest closing prices.', 'start': 2028.784, 'duration': 2.903}, {'end': 2035.05, 'text': 'They want to compute the average closing price per month.', 'start': 2032.147, 'duration': 2.903}, {'end': 2041.037, 'text': 'They want to list the number of big price rises and fall and compute some statistical correlation.', 'start': 2035.47, 'duration': 5.567}, {'end': 2044.621, 'text': 'So these things we are going to do with the help of our Spark SQL statement.', 'start': 2041.317, 'duration': 3.304}, {'end': 2045.843, 'text': 'So this is our 
requirement.', 'start': 2044.862, 'duration': 0.981}, {'end': 2047.705, 'text': 'We want to process huge data.', 'start': 2046.143, 'duration': 1.562}, {'end': 2050.688, 'text': 'We want to handle input from multiple sources.', 'start': 2047.985, 'duration': 2.703}, {'end': 2054.473, 'text': 'We want to process the data in real time and it should be easy to use.', 'start': 2050.929, 'duration': 3.544}], 'summary': 'Using spark sql to process data for 10 companies, computing average closing price, identifying companies with highest closing prices, and analyzing statistical correlation.', 'duration': 34.997, 'max_score': 2019.476, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY2019476.jpg'}, {'end': 2120.885, 'src': 'embed', 'start': 2093.132, 'weight': 0, 'content': [{'end': 2097.474, 'text': "Now, so let's see how we can implement a stock analysis using Spark SQL.", 'start': 2093.132, 'duration': 4.342}, {'end': 2101.635, 'text': 'So what we have to do for that, so this is how my data flow diagram will look.', 'start': 2097.874, 'duration': 3.761}, {'end': 2108.017, 'text': 'So we are going to initially have a huge amount of real-time stock data that we are going to process with Spark SQL,', 'start': 2101.975, 'duration': 6.042}, {'end': 2110.298, 'text': 'so we are going to convert it into named columns.', 'start': 2108.017, 'duration': 2.281}, {'end': 2113.96, 'text': "Then we are going to create an RDD for functional programming, so let's do that.", 'start': 2110.538, 'duration': 3.422}, {'end': 2120.885, 'text': 'Then we are going to use our Spark SQL, which will calculate the average closing price per year and the company with the highest closing price per year.', 'start': 2114.321, 'duration': 6.564}], 'summary': 'Implement stock analysis using spark sql to process real-time stock data, calculate average closing price per year, and identify company with highest closing per year.', 
'duration': 27.753, 'max_score': 2093.132, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY2093132.jpg'}, {'end': 2655.046, 'src': 'embed', 'start': 2625.794, 'weight': 3, 'content': [{'end': 2627.776, 'text': 'So how will I be getting this VM and all?', 'start': 2625.794, 'duration': 1.982}, {'end': 2632.642, 'text': 'So once you enroll for the courses, you will be getting this VM from the Edureka site.', 'start': 2628.117, 'duration': 4.525}, {'end': 2636.304, 'text': "So even if I'm working on the Mac operating system, my VM will work.", 'start': 2632.942, 'duration': 3.362}, {'end': 2638.865, 'text': 'Yes, every operating system will be supported.', 'start': 2636.384, 'duration': 2.481}, {'end': 2644.028, 'text': 'So no trouble, you can just use the VM on any operating system to do that.', 'start': 2639.266, 'duration': 4.762}, {'end': 2649.491, 'text': "So what Edureka does is they just don't want you to be troubled by any sort of setup here.", 'start': 2644.368, 'duration': 5.123}, {'end': 2655.046, 'text': 'So what they do is they kind of ensure that whatever is required for the practicals, they take care of it.', 'start': 2649.861, 'duration': 5.185}], 'summary': 'Edureka provides vm for courses, supporting all operating systems, ensuring no trouble for practicals.', 'duration': 29.252, 'max_score': 2625.794, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY2625794.jpg'}], 'start': 1709.664, 'title': 'Apache spark data analysis', 'summary': "Covers spark data manipulation, stock market analysis with spark sql, and apache spark data analysis, including creating data frames, executing sql queries, processing data for 10 companies, computing average closing prices, and support through edureka's vm and technical team.", 'chapters': [{'end': 1981.236, 'start': 1709.664, 'title': 'Spark data manipulation and 
analysis', 'summary': 'Explains how to read and manipulate data using spark, including creating data frames and executing sql queries, as well as performing rdd operations, hive table operations, and creating data sets, with examples and outputs provided.', 'duration': 271.572, 'highlights': ['Executing SQL queries on data frames The chapter demonstrates executing SQL queries on data frames, including filtering employees based on age, with an example of retrieving employees aged between 18 and 30 and viewing the output.', 'Creating data frames from JSON files The process of creating a data frame from a JSON file is explained, including printing the schema and creating a temporary view for the data frame to enable SQL queries.', 'Performing hive table operations The chapter illustrates performing hive table operations in Spark SQL, including creating a table, loading data into the table, and performing SQL operations such as counting records and selecting specific columns.']}, {'end': 2374.063, 'start': 1981.256, 'title': 'Stock market analysis with spark sql', 'summary': 'Discusses analyzing stock market data using spark sql, including processing data for 10 companies, computing average closing prices, identifying companies with highest closing prices, and performing statistical correlation, all while handling huge data and input from multiple sources.', 'duration': 392.807, 'highlights': ['Analyzing stock market data using Spark SQL The chapter covers the use of Spark SQL to analyze stock market data, including processing data for 10 companies, computing average closing prices, identifying companies with highest closing prices, and performing statistical correlation.', 'Processing data for 10 companies The use case involves processing data for 10 different companies to analyze stock market activities, such as computing average closing prices and identifying companies with the highest closing prices.', 'Computing average closing prices and statistical correlation 
The requirement includes computing the average closing price, listing companies with the highest closing prices, and performing statistical correlation using Spark SQL.', 'Handling huge data and input from multiple sources Spark SQL is utilized to handle huge data and input from multiple sources for stock market analysis, ensuring ease of use and real-time data processing.']}, {'end': 2731.81, 'start': 2374.083, 'title': 'Apache spark data analysis', 'summary': "Covers the creation of new tables, execution of sql queries to find average closing prices, transformation of data, computation of best performing companies, correlation analysis between securities, and the provision of support and assistance through edureka's vm and technical team.", 'duration': 357.727, 'highlights': ['Creation of new tables, execution of SQL queries to find average closing prices, and transformation of data. The process involves creating new tables, executing SQL queries to find average closing prices of specific companies, and transforming the data for further analysis.', 'Computation of best performing companies and correlation analysis between securities. The chapter involves the computation of best performing companies based on average closing prices and correlation analysis between securities to measure the degree of their movement relation.', "Provision of support and assistance through Edureka's VM and technical team. 
Edureka provides a VM for practicals and 24/7 technical support to assist learners with any project-related issues."]}], 'duration': 1022.146, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY1709664.jpg', 'highlights': ['Analyzing stock market data using Spark SQL The chapter covers the use of Spark SQL to analyze stock market data, including processing data for 10 companies, computing average closing prices, identifying companies with highest closing prices, and performing statistical correlation.', 'Handling huge data and input from multiple sources Spark SQL is utilized to handle huge data and input from multiple sources for stock market analysis, ensuring ease of use and real-time data processing.', 'Computation of best performing companies and correlation analysis between securities. The chapter involves the computation of best performing companies based on average closing prices and correlation analysis between securities to measure the degree of their movement relation.', "Provision of support and assistance through Edureka's VM and technical team. 
Edureka provides a VM for practicals and 24/7 technical support to assist learners with any project-related issues.", 'Creating data frames from JSON files The process of creating a data frame from a JSON file is explained, including printing the schema and creating a temporary view for the data frame to enable SQL queries.', 'Performing hive table operations The chapter illustrates performing hive table operations in Spark SQL, including creating a table, loading data into the table, and performing SQL operations such as counting records and selecting specific columns.', 'Executing SQL queries on data frames The chapter demonstrates executing SQL queries on data frames, including filtering employees based on age, with an example of retrieving employees aged between 18 and 30 and viewing the output.', 'Processing data for 10 companies The use case involves processing data for 10 different companies to analyze stock market activities, such as computing average closing prices and identifying companies with the highest closing prices.']}, {'end': 3236.777, 'segs': [{'end': 2787.173, 'src': 'embed', 'start': 2732.33, 'weight': 3, 'content': [{'end': 2737.071, 'text': 'Now what we want to do, if you notice, this is the same code which I have just shown you earlier also.', 'start': 2732.33, 'duration': 4.741}, {'end': 2739.472, 'text': 'Now let us just execute this code.', 'start': 2737.531, 'duration': 1.941}, {'end': 2743.792, 'text': 'So in order to execute this, what we can do, we can connect to my Spark Shell.', 'start': 2739.752, 'duration': 4.04}, {'end': 2745.713, 'text': "So let's get connected to Spark Shell.", 'start': 2744.152, 'duration': 1.561}, {'end': 2750.994, 'text': "So once we'll be connected to Spark Shell, we will go step by step.", 'start': 2747.673, 'duration': 3.321}, {'end': 2752.974, 'text': 'So first we will be importing our package.', 'start': 2751.014, 'duration': 1.96}, {'end': 2760.133, 'text': 'This takes some time, let it just get 
connected.', 'start': 2757.412, 'duration': 2.721}, {'end': 2768.356, 'text': "Once this is connected, now you can notice that I'm just importing all the important libraries.", 'start': 2762.394, 'duration': 5.962}, {'end': 2769.656, 'text': 'We have already learned about that.', 'start': 2768.396, 'duration': 1.26}, {'end': 2774.858, 'text': 'After that, you will be initializing your Spark session.', 'start': 2771.877, 'duration': 2.981}, {'end': 2775.678, 'text': "So let's do that.", 'start': 2774.958, 'duration': 0.72}, {'end': 2778.139, 'text': 'Again, the same steps what we have done before.', 'start': 2776.058, 'duration': 2.081}, {'end': 2787.173, 'text': 'Once we will be done, we will be creating a stock class.', 'start': 2784.651, 'duration': 2.522}], 'summary': 'Demonstration of executing code in spark shell, importing libraries, and initializing spark session', 'duration': 54.843, 'max_score': 2732.33, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY2732330.jpg'}, {'end': 2913.396, 'src': 'embed', 'start': 2841.803, 'weight': 1, 'content': [{'end': 2844.744, 'text': 'So this is just giving you the similar way all the outputs will be shown up here.', 'start': 2841.803, 'duration': 2.941}, {'end': 2850.366, 'text': 'Company four, company five, all these companies you can see this in execution.', 'start': 2845.804, 'duration': 4.562}, {'end': 2858.969, 'text': 'After that we will be creating our temporary view so that we can execute our SQL queries.', 'start': 2854.047, 'duration': 4.922}, {'end': 2863.962, 'text': "So let's do it for company 10 also.", 'start': 2862.421, 'duration': 1.541}, {'end': 2868.406, 'text': 'Then after that, we can just create all our temporary table for it.', 'start': 2864.423, 'duration': 3.983}, {'end': 2871.388, 'text': 'Once we are done, now we can do our queries.', 'start': 2868.946, 'duration': 2.442}, {'end': 2875.291, 'text': "Let's say we can display the average of 
adjusting closing price for each one.", 'start': 2871.448, 'duration': 3.843}, {'end': 2876.772, 'text': 'So we can hit this query.', 'start': 2875.671, 'duration': 1.101}, {'end': 2882.857, 'text': 'So all these queries will happen on your temporary view.', 'start': 2880.635, 'duration': 2.222}, {'end': 2887.26, 'text': 'Because we cannot anyway do all these queries on our data frames or not.', 'start': 2883.557, 'duration': 3.703}, {'end': 2889.422, 'text': 'So you can see this, this is getting executed.', 'start': 2887.28, 'duration': 2.142}, {'end': 2892.583, 'text': 'showing it output also.', 'start': 2891.502, 'duration': 1.081}, {'end': 2896.585, 'text': "Now because we have done dot show, that's the reason you're getting this output.", 'start': 2893.443, 'duration': 3.142}, {'end': 2905.311, 'text': "Similarly if we want to let's say list the closing price for MSFT which went up more than two dollars, right? So that query also we can execute now.", 'start': 2896.926, 'duration': 8.385}, {'end': 2907.873, 'text': 'We have already understood this query in detail.', 'start': 2906.031, 'duration': 1.842}, {'end': 2913.396, 'text': "We're just seeing this execution part now so that you can appreciate whatever you have learned.", 'start': 2909.093, 'duration': 4.303}], 'summary': 'Executing queries for various companies using temporary views and displaying average adjusting closing prices.', 'duration': 71.593, 'max_score': 2841.803, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY2841803.jpg'}, {'end': 3013.245, 'src': 'embed', 'start': 2956.264, 'weight': 0, 'content': [{'end': 2960.727, 'text': 'Now once we are done with this step also, then what? 
So we have done it till step six.', 'start': 2956.264, 'duration': 4.463}, {'end': 2968.031, 'text': "Now we want to perform let's say our transformation on new table corresponding to the three companies so that we can compare.", 'start': 2961.387, 'duration': 6.644}, {'end': 2973.254, 'text': 'We want to create the best company containing the best average closing price for all these three companies.', 'start': 2968.591, 'duration': 4.663}, {'end': 2976.956, 'text': 'We want to find the companies with the best closing price average per year.', 'start': 2973.714, 'duration': 3.242}, {'end': 2978.597, 'text': "So let's do all that as well.", 'start': 2977.196, 'duration': 1.401}, {'end': 2990.308, 'text': 'So you can see best company of the year, right? Now here also the same stuff we are doing.', 'start': 2984.744, 'duration': 5.564}, {'end': 2991.849, 'text': 'So we are registering our temp table.', 'start': 2990.348, 'duration': 1.501}, {'end': 3001.456, 'text': "Okay, so there's a mistake here.", 'start': 2999.895, 'duration': 1.561}, {'end': 3007.581, 'text': "So if you notice here, it is one, but here we are doing a show of all, right? 
So there's a mistake here.", 'start': 3001.536, 'duration': 6.045}, {'end': 3008.842, 'text': "I'm just correcting it.", 'start': 3008.021, 'duration': 0.821}, {'end': 3013.245, 'text': 'So here also it should be one.', 'start': 3010.962, 'duration': 2.283}], 'summary': 'Performing transformation on new table for 3 companies to find best average closing price.', 'duration': 56.981, 'max_score': 2956.264, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY2956264.jpg'}, {'end': 3156.708, 'src': 'embed', 'start': 3129.393, 'weight': 6, 'content': [{'end': 3132.655, 'text': "It means definitely they're impacting each other's stock price.", 'start': 3129.393, 'duration': 3.262}, {'end': 3135.476, 'text': 'So this is all about the project part.', 'start': 3133.395, 'duration': 2.081}, {'end': 3141.441, 'text': 'Now I hope all of you have enjoyed this session of Spark SQL.', 'start': 3136.638, 'duration': 4.803}, {'end': 3145.242, 'text': 'So this is all which we wanted to go through.', 'start': 3142.041, 'duration': 3.201}, {'end': 3150.425, 'text': 'Again as I said, this was just a high level overview of Spark SQL.', 'start': 3145.443, 'duration': 4.982}, {'end': 3156.708, 'text': 'Once you learn all these things from Edureka, when you kind of get enrolled from in detail,', 'start': 3151.025, 'duration': 5.683}], 'summary': 'Spark sql impacts stock prices. 
high-level overview from edureka.', 'duration': 27.315, 'max_score': 3129.393, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY3129393.jpg'}, {'end': 3209.715, 'src': 'embed', 'start': 3179.581, 'weight': 9, 'content': [{'end': 3180.722, 'text': 'Any questions from anyone?', 'start': 3179.581, 'duration': 1.141}, {'end': 3185.843, 'text': 'Great, so how I can enroll with Edureka this course?', 'start': 3182.742, 'duration': 3.101}, {'end': 3188.585, 'text': 'Yes, Mohit, so what you need to do in this case?', 'start': 3186.604, 'duration': 1.981}, {'end': 3190.886, 'text': 'you just need to kind of talk with the support team.', 'start': 3188.585, 'duration': 2.301}, {'end': 3192.506, 'text': 'They will be telling you all the classes.', 'start': 3190.906, 'duration': 1.6}, {'end': 3200.51, 'text': 'From the Edureka courses site also, you can get to know all these things that where exactly or that will be mentioned, about the batches,', 'start': 3192.967, 'duration': 7.543}, {'end': 3201.11, 'text': 'timings and all.', 'start': 3200.51, 'duration': 0.6}, {'end': 3207.774, 'text': 'So regularly the batches happens and you just need to go ahead and kind of enroll for it or you can take the help of support teams.', 'start': 3201.13, 'duration': 6.644}, {'end': 3209.715, 'text': 'The contact number will be mentioned in the page.', 'start': 3207.814, 'duration': 1.901}], 'summary': 'Enroll in edureka courses by contacting support team or checking website for class info and batch timings.', 'duration': 30.134, 'max_score': 3179.581, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY3179581.jpg'}], 'start': 2732.33, 'title': 'Spark sql and stock class creation', 'summary': 'Covers executing spark code, creating a stock class, performing operations and sql queries for 10 companies, creating temporary views, and manipulating data. 
it also includes creating a new table for three companies, finding the best company with the highest average closing price, creating correlation between stock prices, and information on enrolling for a detailed course on spark sql.', 'chapters': [{'end': 2787.173, 'start': 2732.33, 'title': 'Executing spark code and creating stock class', 'summary': 'The chapter demonstrates executing spark code and creating a stock class by connecting to spark shell, importing libraries, initializing spark session, and creating a stock class.', 'duration': 54.843, 'highlights': ['Connecting to Spark Shell and importing important libraries', 'Initializing Spark session and creating a stock class', 'Step by step execution for connecting to Spark Shell and importing libraries']}, {'end': 2955.744, 'start': 2792.998, 'title': 'Executing spark operations and queries', 'summary': 'The chapter demonstrates executing spark operations and sql queries for 10 companies, creating temporary views, and performing various data manipulation tasks, facilitating better understanding of the concepts.', 'duration': 162.746, 'highlights': ['The chapter demonstrates executing Spark operations and SQL queries for 10 companies, facilitating better understanding of the concepts.', 'The process involves creating temporary views and performing various data manipulation tasks, showcasing practical application of learned concepts.', 'The queries include displaying the average of adjusted closing price for each company, listing the closing price for MSFT which went up more than two dollars, and joining and saving data in different formats.']}, {'end': 3236.777, 'start': 2956.264, 'title': 'Spark sql project overview', 'summary': 'The chapter covers the process of creating a new table for three companies, finding the best company with the highest average closing price, correcting mistakes in the code, creating correlation between stock prices, and concluding with information on enrolling for a detailed course on spark sql.', 'duration': 280.513, 
'highlights': ['Creating a new table for three companies to compare their average closing prices to find the best company.', 'Identifying and correcting mistakes in the code related to displaying data and ensuring consistency.', 'Explanation of the process of creating correlation between stock prices to assess their impact on each other.', 'Information on enrolling for a detailed course on Spark SQL from Edureka.']}], 'duration': 504.447, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Mxw6QZk1CMY/pics/Mxw6QZk1CMY2732330.jpg', 'highlights': ['Creating a new table for three companies to compare their average closing prices to find the best company.', 'The chapter demonstrates executing Spark operations and SQL queries for 10 companies, facilitating better understanding of the concepts.', 'The process involves creating temporary views and performing various data manipulation tasks, showcasing practical application of learned concepts.', 'Connecting to Spark Shell and importing important libraries', 'Initializing Spark session and creating a stock class', 'The queries include displaying the average of adjusted closing price for each company, listing the closing price for MSFT which went up more than two dollars, and joining and saving data in different formats.', 'Explanation of the process of creating correlation between stock prices to assess their impact on each other.', 'Identifying and correcting mistakes in the code related to displaying data and ensuring consistency.', 'Step by step execution for connecting to Spark Shell and importing libraries', 'Information on enrolling for a detailed course on Spark SQL from Edureka. 
']}], 'highlights': ['Spark SQL offers real-time processing capability, unlike Hive which is focused on batch processing.', 'Using Spark SQL for real-time data processing in various scenarios like stock market analysis and banking fraud detection', 'The chapter concludes with a discussion on the relevance of Spark SQL in the market, particularly in the context of stock market analysis, providing real-world use cases to illustrate its significance.', 'The session covers the key features and libraries present in Spark SQL, preparing the audience for a comprehensive understanding of the topic.', 'Analyzing stock market data using Spark SQL: the chapter covers the use of Spark SQL to analyze stock market data, including processing data for 10 companies, computing average closing prices, identifying companies with highest closing prices, and performing statistical correlation.', 'Adding schema to RDD using Spark SQL, reading and splitting the data, mapping the attributes to an employee case, and converting values to integers before converting to a data frame.', 'Creating a new table for three companies to compare their average closing prices to find the best company.', 'The importance of Spark SQL is emphasized, highlighting its relevance in the market and the need to understand its significance.', 'The process involves creating temporary views and performing various data manipulation tasks, showcasing practical application of learned concepts.', 'Hands-on examples are included in the session, providing practical exposure to working with Spark SQL, enhancing the learning experience.']}