title
Big Data & Hadoop Full Course | Hadoop Training | Big Data Tutorial | Intellipaat

description
🔥Intellipaat Big Data & Hadoop Full Course Training: https://intellipaat.com/big-data-hadoop-training/
In this Hadoop training, you will learn the Big Data & Hadoop full course right from the basics to all the advanced concepts, with hands-on demos. There is a project at the end as well so that you can master this technology. This Hadoop training will help you learn Hadoop completely in 12 hours.
#bigdatahadoop #hadooptraining #bigdatahadoopfullcourse #learnhadoop #learnbigdata #bigdatatutorialforbeginners #bigdatatutorial #intellipaat
🔵Following topics are covered in this video:
00:00 - Big Data & Hadoop Full Course
11:32 - Skill Needed: Hadoop
23:47 - Big Data Tools: Salaries
03:35:16 - What is Pig?
05:28:13 - Introduction to Hive
07:26:43 - Spark Components
07:28:22 - Spark SQL
09:23:15 - Architecture of Kafka Cluster
09:47:56 - Kafka Workflow
📌 Do subscribe to the Intellipaat channel & get regular updates on videos: http://bit.ly/Intellipaat
📕 Read the complete Big Data Hadoop tutorial here: https://intellipaat.com/blog/tutorial/hadoop-tutorial/
🔗 Watch Big Data Hadoop video tutorials here: https://goo.gl/9ZjpBh
⭐ Get the Hive cheat sheet here: https://intellipaat.com/blog/tutorial/hadoop-tutorial/hive-cheat-sheet/
⭐ Get the Pig basics cheat sheet here: https://intellipaat.com/blog/tutorial/hadoop-tutorial/pig-basics-cheat-sheet/
⭐ Get the Pig built-in functions cheat sheet here: https://intellipaat.com/blog/tutorial/hadoop-tutorial/pig-built-functions-cheat-sheet/
📰 Interested in learning still more about Big Data Hadoop? Please check similar Hadoop blogs here: https://intellipaat.com/blog/what-is-hadoop/
Are you looking for something more? Enroll in our Big Data Hadoop training and become a certified Big Data Hadoop professional (https://intellipaat.com/big-data-hadoop-training/). It is a 60-hour instructor-led Intellipaat Hadoop training that is completely aligned with industry standards and certification bodies.
If you've enjoyed this big data tutorial for beginners, like us and subscribe to our channel for more similar Hadoop videos and free tutorials. Got any questions about big data training? Ask us in the comment section below.
---------------------------
Intellipaat Edge
1. 24*7 lifetime access & support
2. Flexible class schedule
3. Job assistance
4. Mentors with 14+ years of experience
5. Industry-oriented courseware
6. Lifetime free course upgrades
------------------------------
Why should you watch this Big Data & Hadoop full course video?
You can learn Big Data Hadoop faster than most other technologies, and this Big Data Hadoop tutorial for beginners helps you do just that. This Intellipaat tutorial will familiarize you with the various Big Data Hadoop concepts. Hadoop is one of the most widely adopted big data technologies and is finding increased application across a lot of industry domains. Our Big Data Hadoop training course has been created with extensive inputs from industry experts so that you can learn Big Data Hadoop and apply it to real-world scenarios.
Who should watch this Hadoop tutorial for beginners video?
If you want to learn what Hadoop is and get a Hadoop introduction on your way to becoming a Big Data Hadoop expert, then this Intellipaat Big Data Hadoop tutorial for beginners is for you. It is your first step to learning Big Data Hadoop. We also cover the Cloudera Spark and Hadoop Developer certification (CCA175) exam with hands-on projects.
This Hadoop tutorial video can be taken by anybody, so if you are a beginner in technology, you can enroll for the big data training to take your skills to the next level.
Why is Big Data Hadoop important?
Huge volumes of data are being generated in each and every industry domain, and to process and distribute that data effectively, Hadoop is being deployed everywhere and in every industry. Taking the Intellipaat Big Data Hadoop training can help professionals build a solid career in a rising technology domain and get the best jobs in top organizations.
Why should you opt for a Big Data Hadoop career?
If you want to fast-track your career, you should strongly consider Big Data Hadoop. It is one of the fastest-growing technologies, there is a huge demand for Big Data Hadoop professionals, the salaries are fantastic, and there is great growth opportunity in this domain as well. Hence, this Intellipaat Hadoop training is your stepping stone to a successful career!
------------------------------
For more information:
Please write to us at sales@intellipaat.com, or call us at +91-7847955955
Website: https://intellipaat.com/big-data-hadoop-training/
Facebook: https://www.facebook.com/intellipaatonline
LinkedIn: https://www.linkedin.com/in/intellipaat/
Twitter: https://twitter.com/Intellipaat
Telegram: https://t.me/s/Learn_with_Intellipaat
Instagram: https://www.instagram.com/intellipaat

detail
{'title': 'Big Data & Hadoop Full Course | Hadoop Training | Big Data Tutorial | Intellipaat', 'heatmap': [{'end': 2574.949, 'start': 1281.116, 'weight': 0.921}, {'end': 5134.719, 'start': 2991.559, 'weight': 0.852}, {'end': 26097.843, 'start': 25665.387, 'weight': 0.706}, {'end': 42774.685, 'start': 42341.2, 'weight': 0.911}], 'summary': 'Covers big data technologies such as hadoop, challenges of handling massive data, essential hadoop developer skills, job opportunities, mapreduce, hadoop basics, hadoop ecosystem, pig, hive, oltp, rdbms, data warehousing, cloudera, hive partitioning, spark, spark sql, hive integration, kafka, zookeeper, real-time data analytics, aws setup, and multi-node cluster configuration, providing comprehensive insights and practical use cases.', 'chapters': [{'end': 660.153, 'segs': [{'end': 184.168, 'src': 'embed', 'start': 114.331, 'weight': 0, 'content': [{'end': 120.315, 'text': 'So data is this raw, unruly entity when we talk about it in the world of big data?', 'start': 114.331, 'duration': 5.984}, {'end': 121.796, 'text': "And you're going to find out why.", 'start': 120.515, 'duration': 1.281}, {'end': 127.901, 'text': 'But then coming to the examples of data around us, think of a lot of types of data that are existent basically.', 'start': 121.936, 'duration': 5.965}, {'end': 130.241, 'text': 'In fact, age is a type of data.', 'start': 128.261, 'duration': 1.98}, {'end': 137.044, 'text': "you know, it's a numeric data from 0 to 100, 0 to 1000, whatever it is, depending on whose age we're talking about.", 'start': 130.241, 'duration': 6.803}, {'end': 142.666, 'text': "when you come to video stuff and then when you talk about data, it's there it's a lot of images moving right?", 'start': 137.044, 'duration': 5.622}, {'end': 144.126, 'text': "So it's 24 frames per second.", 'start': 142.686, 'duration': 1.44}, {'end': 146.987, 'text': "it's 24 images moving every second, sort of something like that.", 'start': 144.126, 'duration': 2.861}, {'end': 149.609, 'text': 'this creates videos for us.', 'start': 147.687, 'duration': 1.922}, {'end': 156.436, 'text': 'and again, coming to the data generated by social media oh my god, this is a world apart.', 'start': 149.609, 'duration': 6.827}, {'end': 162.182, 'text': "and why I say that just stay with me for a couple of slides and you're gonna find out why.", 'start': 156.436, 'duration': 5.746}, {'end': 167.243, 'text': 'and then coming to quickly analyze what big data is.', 'start': 162.182, 'duration': 5.061}, {'end': 175.985, 'text': 'well, guys, big data is basically, as the name literally says, huge amounts of data and then these huge amounts of data,', 'start': 167.243, 'duration': 8.742}, {'end': 184.168, 'text': 'what is present is pretty much, you know, gone through analytics and done various computations on which, at the end of the day,', 'start': 175.985, 'duration': 8.183}], 'summary': 'Data in various forms is abundant in big data, including age and video data, as well as social media-generated data.', 'duration': 69.837, 'max_score': 114.331, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx40114331.jpg'}, {'end': 301.03, 'src': 'embed', 'start': 259.406, 'weight': 2, 'content': [{'end': 266.813, 'text': "But then, when you're talking specifically about analytics, analytics is a little different when you talk about, when you talk about using data,", 'start': 259.406, 'duration': 7.407}, {'end': 271.617, 'text': "because here you'll be collaborating 
with past data, present data and much, much more.", 'start': 266.813, 'duration': 4.804}, {'end': 274.5, 'text': 'In fact, data from a million different sources at the same time.', 'start': 271.677, 'duration': 2.823}, {'end': 276.241, 'text': 'And you will do all of this.', 'start': 274.9, 'duration': 1.341}, {'end': 284.027, 'text': 'Why? Well, to actually obtain future results, future insights, future trends, predictions, whatever you can think of it guys.', 'start': 276.561, 'duration': 7.466}, {'end': 288.069, 'text': 'So, taking past data, bringing it to the present is analysis.', 'start': 284.407, 'duration': 3.662}, {'end': 295.154, 'text': 'Using data to obtain future insights and details regarding trends is basically analytics guys.', 'start': 288.77, 'duration': 6.384}, {'end': 301.03, 'text': 'So, On that note, pretty much, you know, you need to understand data better.', 'start': 295.575, 'duration': 5.455}], 'summary': 'Analytics involves using data from various sources to obtain future insights and trends for better understanding of data.', 'duration': 41.624, 'max_score': 259.406, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx40259406.jpg'}], 'start': 2.912, 'title': 'Big data and its challenges', 'summary': 'Covers the significance of big data technologies such as hadoop, hdfs, mapreduce, hive, pig, spark, kafka, and spark streaming in uncovering insights from large data sets, as well as the challenges of handling massive data for revenue generation, marketing effectiveness, and operational efficiency.', 'chapters': [{'end': 401.569, 'start': 2.912, 'title': 'Understanding big data and its technologies', 'summary': 'Discusses the importance of big data in generating insights from large amounts of data, covering various technologies including hadoop, hdfs, mapreduce, hive, pig, spark, kafka, and spark streaming, and emphasizes the significance of big data analytics in uncovering patterns, market trends, and customer preferences.', 'duration': 398.657, 'highlights': ['Big data analytics uncovers the information, reveals patterns, and assesses trends, market trends, and customer preferences, and involves future insights and predictions. Big data analytics uncovers information, reveals patterns, and assesses trends and market preferences, involving future insights and predictions.', 'The chapter covers various technologies including Hadoop, HDFS, MapReduce, Hive, PIG, Spark, Kafka, and Spark streaming, providing a comprehensive understanding of big data technologies. The chapter covers various technologies including Hadoop, HDFS, MapReduce, Hive, PIG, Spark, Kafka, and Spark streaming, providing a comprehensive understanding of big data technologies.', 'The importance of big data in generating insights from large amounts of data is emphasized, highlighting the need for computational and cost-effective processing of data. The importance of big data in generating insights from large amounts of data is emphasized, highlighting the need for computational and cost-effective processing of data.', 'Storing and maintaining data is expensive due to its various flavors such as unstructured, semi-structured, and structured data, leading to bottlenecks in data processing. 
Storing and maintaining data is expensive due to its various flavors such as unstructured, semi-structured, and structured data, leading to bottlenecks in data processing.']}, {'end': 660.153, 'start': 401.569, 'title': 'Challenges of big data handling', 'summary': 'Discusses the challenges of handling big data, including the massive amount of data generated by companies like facebook, instagram, and twitter, and the importance of efficient data handling for revenue generation, marketing effectiveness, customer service, and operational efficiency.', 'duration': 258.584, 'highlights': ['Companies like Facebook generate 4 petabytes of data every day from 2.3 billion monthly active users and 250 billion photos uploaded Facebook generates 4 petabytes of data every day from 2.3 billion monthly active users and 250 billion photos uploaded, equating to 350 million photos every day.', 'Instagram shares 40 billion photos and 500 million videos daily, leading to petabytes of data being used and moved around each day Instagram shares 40 billion photos and 500 million videos daily, resulting in petabytes of data being used and moved around each day.', 'Twitter experiences 500 million tweets and 6,000 tweets every second, causing terabytes of data to be uploaded daily Twitter experiences 500 million tweets and 6,000 tweets every second, leading to terabytes of data being uploaded daily.', 'Efficient data handling is crucial for revenue generation, marketing effectiveness, customer service, and operational efficiency Efficient data handling is essential for revenue generation, marketing effectiveness, customer service, and operational efficiency, as it can drive more sales, increase profits, improve marketing, and enhance customer service.', 'Competitive advantage is achieved by companies handling big data well Companies gain a competitive advantage by handling big data effectively, leading to overall success and profitability.']}], 'duration': 657.241, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx402912.jpg', 'highlights': ['Big data analytics uncovers information, reveals patterns, and assesses trends and market preferences, involving future insights and predictions.', 'Companies like Facebook generate 4 petabytes of data every day from 2.3 billion monthly active users and 250 billion photos uploaded, equating to 350 million photos every day.', 'The importance of big data in generating insights from large amounts of data is emphasized, highlighting the need for computational and cost-effective processing of data.', 'Efficient data handling is essential for revenue generation, marketing effectiveness, customer service, and operational efficiency, as it can drive more sales, increase profits, improve marketing, and enhance customer service.', 'The chapter covers various technologies including Hadoop, HDFS, MapReduce, Hive, PIG, Spark, Kafka, and Spark streaming, providing a comprehensive understanding of big data technologies.']}, {'end': 1515.326, 'segs': [{'end': 686.962, 'src': 'embed', 'start': 660.153, 'weight': 0, 'content': [{'end': 667.036, 'text': 'uh, we have apache, hadoop, we have storm, we have cassandra, we have hive, spark, mongodb and much, much more, guys.', 'start': 660.153, 'duration': 6.883}, {'end': 674.158, 'text': 'so each of these tools has its own niche, it has its own role that it plays in the world where data is is the problem.', 'start': 667.036, 'duration': 7.122}, {'end': 679.44, 'text': 'but then we are trying to 
make uh data our friendlier entity by making use of these tools.', 'start': 674.158, 'duration': 5.282}, {'end': 681.46, 'text': 'guys, just a quick info, guys.', 'start': 679.44, 'duration': 2.02}, {'end': 686.962, 'text': 'if you also want to become a certified big data, hadoop, professional intellipaat offers a complete course on the same,', 'start': 681.46, 'duration': 5.502}], 'summary': 'Various tools like apache, hadoop, storm, cassandra, hive, spark, and mongodb are used in the world of big data to address data challenges. intellipaat offers a complete course for becoming a certified big data hadoop professional.', 'duration': 26.809, 'max_score': 660.153, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx40660153.jpg'}, {'end': 776.121, 'src': 'embed', 'start': 756.475, 'weight': 1, 'content': [{'end': 776.121, 'text': "So, another thing about Linux that you guys should know is that having certain knowledge when it comes to the commands that you use in Linux and working with editors in the command line is something that pretty much will make your life very easy as a person who's learning Hadoop and pretty much how you can go on to installing Hadoop,", 'start': 756.475, 'duration': 19.646}], 'summary': 'Knowing linux commands and editors makes learning hadoop easier', 'duration': 19.646, 'max_score': 756.475, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx40756475.jpg'}, {'end': 968.091, 'src': 'embed', 'start': 937.128, 'weight': 2, 'content': [{'end': 941.672, 'text': "and coming to the step-by-step guide where you'll be learning Hadoop.", 'start': 937.128, 'duration': 4.544}, {'end': 946.436, 'text': 'So, how can you go about learning Hadoop, guys? 
Well, just a quick info, guys.', 'start': 941.872, 'duration': 4.564}, {'end': 949.838, 'text': 'If you also want to become a certified Big Data Hadoop professional.', 'start': 946.736, 'duration': 3.102}, {'end': 954.322, 'text': 'Intellipaat offers a complete course on the same, the links for which are given in the description box below.', 'start': 949.838, 'duration': 4.484}, {'end': 955.783, 'text': "Now, let's continue with the session.", 'start': 954.342, 'duration': 1.441}, {'end': 958.445, 'text': 'Let me give you a step-by-step guide.', 'start': 956.204, 'duration': 2.241}, {'end': 960.267, 'text': "Here's the first step of it.", 'start': 958.926, 'duration': 1.341}, {'end': 968.091, 'text': 'The first thing you will do to understand Hadoop is by basically starting out by understanding how its file system works.', 'start': 960.687, 'duration': 7.404}], 'summary': "Learn hadoop with intellipaat's complete course for becoming a certified big data hadoop professional.", 'duration': 30.963, 'max_score': 937.128, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx40937128.jpg'}], 'start': 660.153, 'title': 'Big data tools and hadoop developer skills', 'summary': "Discusses the significance of big data tools like apache, hadoop, storm, cassandra, hive, spark, and mongodb in addressing data challenges, emphasizing hadoop's role and outlining essential skills for becoming a hadoop developer, including understanding linux, programming languages, sql knowledge, and potential salaries for big data professionals.", 'chapters': [{'end': 724.462, 'start': 660.153, 'title': 'Importance of big data tools', 'summary': "Discusses the significance of various big data tools such as apache, hadoop, storm, cassandra, hive, spark, and mongodb in addressing data challenges, emphasizing hadoop's role in handling big data problems and highlighting the skills required to understand and master hadoop.", 'duration': 64.309, 'highlights': ["Hadoop is an open source software by the people of Apache Software Foundation, and they are doing an amazing job by handling the world's big data problems.", 'Various big data tools such as Apache, Hadoop, Storm, Cassandra, Hive, Spark, and MongoDB are utilized to address data challenges and make data more manageable.', 'There are three extremely important skills required to understand and master Hadoop, emphasizing the significance of acquiring specific skills for Hadoop proficiency.']}, {'end': 1515.326, 'start': 725.342, 'title': 'Key skills for hadoop developer', 'summary': 'Outlines the essential skills for becoming a hadoop developer, including understanding linux, programming languages (java, scala, python), and sql knowledge, as well as a step-by-step guide for learning hadoop, emphasizing key components such as hdfs, mapreduce, data ingestion tools, and the potential salaries for big data professionals.', 'duration': 789.984, 'highlights': ['The chapter outlines the essential skills for becoming a Hadoop developer, including understanding Linux, programming languages (Java, Scala, Python), and SQL knowledge. It is crucial for aspiring Hadoop developers to have a strong grasp of Linux, various programming languages, and SQL knowledge to excel in the field.', 'Provides a step-by-step guide for learning Hadoop, emphasizing key components such as HDFS, MapReduce, data ingestion tools, and the potential salaries for big data professionals. 
The step-by-step guide offers insights into crucial components of Hadoop, such as HDFS, MapReduce, data ingestion tools, and also provides information about potential salaries for big data professionals.', 'The potential salaries for big data professionals are highlighted, showcasing the lucrative earning potential in the field, including six-digit salaries in US dollars and the subcontinent of India. The chapter highlights the lucrative earning potential for big data professionals, with six-digit salaries in US dollars and the subcontinent of India, emphasizing the attractive job opportunities in the field.']}], 'duration': 855.173, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx40660153.jpg', 'highlights': ['Various big data tools such as Apache, Hadoop, Storm, Cassandra, Hive, Spark, and MongoDB are utilized to address data challenges and make data more manageable.', 'The potential salaries for big data professionals are highlighted, showcasing the lucrative earning potential in the field, including six-digit salaries in US dollars and the subcontinent of India.', 'The chapter outlines the essential skills for becoming a Hadoop developer, including understanding Linux, programming languages (Java, Scala, Python), and SQL knowledge.']}, {'end': 2939.083, 'segs': [{'end': 1584.369, 'src': 'embed', 'start': 1555.974, 'weight': 7, 'content': [{'end': 1559.916, 'text': 'There are over billion active devices every single hour of every single day.', 'start': 1555.974, 'duration': 3.942}, {'end': 1564.459, 'text': "So again, generating all of these data, handling it, maintaining it, guys, it's a task.", 'start': 1560.356, 'duration': 4.103}, {'end': 1568.603, 'text': 'So, regardless of where you are, which part of the world you are, people,', 'start': 1564.579, 'duration': 4.024}, {'end': 1573.947, 'text': "big companies are ready to hire you if you've got the skills to land a job in one of these guys.", 'start': 1568.603, 'duration': 5.344}, {'end': 1579.772, 'text': "And make sure to stick for a small while around because I'm going to guide you on how you can actually get these jobs.", 'start': 1574.167, 'duration': 5.605}, {'end': 1584.369, 'text': "How does MapReduce really work? Okay guys, so it's a very simple concept.", 'start': 1579.972, 'duration': 4.397}], 'summary': 'Over a billion active devices generate massive data, creating job opportunities worldwide. 
stay tuned for guidance on securing these jobs.', 'duration': 28.395, 'max_score': 1555.974, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx401555974.jpg'}, {'end': 1707.184, 'src': 'embed', 'start': 1664.11, 'weight': 3, 'content': [{'end': 1677.239, 'text': 'We have four guys Ajay, Abhay, Amit and Abul.', 'start': 1664.11, 'duration': 13.129}, {'end': 1681.001, 'text': 'So these are four friends who work for an organization.', 'start': 1678.119, 'duration': 2.882}, {'end': 1682.862, 'text': 'Ajay Abhay, Amit and Abul.', 'start': 1681.361, 'duration': 1.501}, {'end': 1697.398, 'text': 'four guys are there who work for this organization and their manager has told them that for next one year they are supposed to work on every Saturday and submit a report.', 'start': 1682.862, 'duration': 14.536}, {'end': 1701.24, 'text': 'Oh man, this is bad for them, right?', 'start': 1699.579, 'duration': 1.661}, {'end': 1707.184, 'text': 'So four guys Ajay Abhay, Amit and Abul, four guys.', 'start': 1702.301, 'duration': 4.883}], 'summary': 'Four friends working for an organization must work on saturdays and submit a report for one year.', 'duration': 43.074, 'max_score': 1664.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx401664110.jpg'}, {'end': 1814.45, 'src': 'embed', 'start': 1729.595, 'weight': 0, 'content': [{'end': 1737.599, 'text': "what they do is they say that let's have a small deal, that let's come at 8 o'clock in the morning.", 'start': 1729.595, 'duration': 8.004}, {'end': 1744.743, 'text': "let's work till 10 o'clock, let's work from 8 o'clock to 10 o'clock and 10 o'clock one guy will stay back and consolidate the whole thing.", 'start': 1737.599, 'duration': 7.144}, {'end': 1748.502, 'text': 'one guy will stay back and consolidate the whole thing.', 'start': 1746.3, 'duration': 2.202}, {'end': 1751.004, 'text': 'So that not everybody is wasting the whole weekend right.', 'start': 1748.842, 'duration': 2.162}, {'end': 1755.668, 'text': 'I mean we will come for two hours and then every week we will take turns like week one.', 'start': 1751.444, 'duration': 4.224}, {'end': 1756.609, 'text': 'Ajay may stay back.', 'start': 1755.668, 'duration': 0.941}, {'end': 1756.949, 'text': 'week two.', 'start': 1756.609, 'duration': 0.34}, {'end': 1757.93, 'text': 'Abhay will stay back.', 'start': 1756.949, 'duration': 0.981}, {'end': 1758.31, 'text': 'week three.', 'start': 1757.93, 'duration': 0.38}, {'end': 1759.111, 'text': 'Amit will stay back.', 'start': 1758.31, 'duration': 0.801}, {'end': 1759.712, 'text': 'week four.', 'start': 1759.111, 'duration': 0.601}, {'end': 1760.492, 'text': 'Abul will stay back.', 'start': 1759.712, 'duration': 0.78}, {'end': 1761.093, 'text': "So that's fine.", 'start': 1760.532, 'duration': 0.561}, {'end': 1761.954, 'text': "Let's do that.", 'start': 1761.513, 'duration': 0.441}, {'end': 1771.762, 'text': "So it is this Saturday early morning 8 o'clock Ajay, Abhay, Amit and Abul all four are in office and from 8 o'clock they start working.", 'start': 1763.054, 'duration': 8.708}, {'end': 1774.734, 'text': '8, 9, 10.', 'start': 1771.782, 'duration': 2.952}, {'end': 1777.957, 'text': "everybody is done with their work 8, 9, 10, 10 o'clock.", 'start': 1774.736, 'duration': 3.221}, {'end': 1784.64, 'text': 'everybody is done with his work and this week Amit is supposed to stay back.', 'start': 1777.957, 'duration': 6.683}, {'end': 1786.541, 'text': 'this 
week Amit is supposed to stay back.', 'start': 1784.64, 'duration': 1.901}, {'end': 1807.923, 'text': 'So all of them will hand over work to Amit, Amit will consolidate and give the output, my friends This is map reduce, nothing big.', 'start': 1790.282, 'duration': 17.641}, {'end': 1814.45, 'text': 'Map reduce is so very near real to what our life looks like, how consolidation looks like.', 'start': 1808.264, 'duration': 6.186}], 'summary': 'Team implements a small deal to work 8-10 am on saturdays, taking turns to consolidate work, reducing weekend time wastage.', 'duration': 84.855, 'max_score': 1729.595, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx401729595.jpg'}, {'end': 2554.304, 'src': 'embed', 'start': 2514.552, 'weight': 2, 'content': [{'end': 2526.8, 'text': 'So the point is by default the number of reducers in MapReduce program is one and you can change it, but you can change it.', 'start': 2514.552, 'duration': 12.248}, {'end': 2546.639, 'text': 'So if I wanted to have a load balancing I could go for should go for two reducers but then if there are two reducers my whole thing goes in a mess.', 'start': 2527.901, 'duration': 18.738}, {'end': 2549.741, 'text': "Why it's a mess?", 'start': 2549.081, 'duration': 0.66}, {'end': 2552.143, 'text': 'How will I know what should go where?', 'start': 2550.662, 'duration': 1.481}, {'end': 2554.304, 'text': 'How will I know what should go where?', 'start': 2552.863, 'duration': 1.441}], 'summary': 'Mapreduce program by default has one reducer, can be changed for load balancing, but may lead to confusion.', 'duration': 39.752, 'max_score': 2514.552, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx402514552.jpg'}], 'start': 1515.326, 'title': 'Big data job opportunities and mapreduce', 'summary': 'Discusses the increasing demand for professionals with big data expertise, highlighting job prospects and explains mapreduce in data processing, including process, comparison with rdbms, basics, and limitations.', 'chapters': [{'end': 1573.947, 'start': 1515.326, 'title': 'Big data job opportunities', 'summary': 'Discusses the increasing demand for professionals with big data expertise, highlighting the exponential growth of data generation and the lucrative job prospects for individuals with relevant certifications and hands-on experience.', 'duration': 58.621, 'highlights': ['The exponential growth of data generation is highlighted, with over a billion active devices every single hour of every single day.', 'The demand for professionals with big data expertise is emphasized, with an emphasis on certifications, project experience, and hands-on experience.', 'The significant job prospects for individuals with big data expertise are mentioned, positioning it as one of the top jobs for the next decade.']}, {'end': 1897.349, 'start': 1574.167, 'title': 'Understanding mapreduce in data processing', 'summary': 'Explains the concept of mapreduce in data processing, using a simple example of four individuals working on reports, and highlights the divide and conquer approach of mapreduce.', 'duration': 323.182, 'highlights': ['MapReduce is a simple concept, where data is divided and consolidated for processing, demonstrated through the example of four individuals working on reports. 
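To make the "four friends" consolidation analogy concrete, here is a minimal plain-Java sketch of the same map -> shuffle -> reduce idea outside Hadoop. The "country,value" records and class name are illustrative assumptions, not data from the video.

import java.util.*;

public class MapReduceByHand {
    public static void main(String[] args) {
        List<String> records = Arrays.asList("India,11.4", "China,22.8", "India,11.4", "Malaysia,6.4");

        // "Map" phase: each record is handled on its own, like each friend
        // working on his own report.
        List<Map.Entry<String, Double>> mapped = new ArrayList<>();
        for (String record : records) {
            String[] parts = record.split(",");
            mapped.add(new AbstractMap.SimpleEntry<>(parts[0], Double.parseDouble(parts[1])));
        }

        // "Shuffle" phase: group the mapped values by key, one bucket per country.
        Map<String, List<Double>> groups = new TreeMap<>();
        for (Map.Entry<String, Double> kv : mapped) {
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // "Reduce" phase: one consolidation step per group, like the one friend
        // who stays back and merges everyone's work into a single report.
        for (Map.Entry<String, List<Double>> group : groups.entrySet()) {
            double total = group.getValue().stream().mapToDouble(Double::doubleValue).sum();
            System.out.println(group.getKey() + "\t" + total);
        }
    }
}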
Simple concept of MapReduce demonstrated | Data is divided and consolidated for processing | Example of four individuals working on reports', "The concept of MapReduce is likened to real-life experiences, with the 'map' phase representing individual work and the 'reduce' phase representing consolidation. Comparison of MapReduce to real-life experiences | 'Map' phase as individual work and 'reduce' phase as consolidation", "MapReduce is explained as a 'divide and conquer' approach, where 'map' signifies division and 'reduce' signifies conquering, simplifying its understanding. Explanation of 'divide and conquer' approach in MapReduce | 'Map' signifies division and 'reduce' signifies conquering"]}, {'end': 2404.996, 'start': 1935.216, 'title': 'Map reduce process and rdbms comparison', 'summary': 'Explains the map reduce process, indicating that with 189 mb file size, there will be two hdfs blocks, resulting in two mappers, followed by the creation of five groups during the shuffle phase, and finally, the reducer will process the groups, with a comparison to rdbms query process.', 'duration': 469.78, 'highlights': ['With 189 MB file size, there will be two HDFS blocks. The file size of 189 MB will result in two HDFS blocks, with one block of 128 MB and the other of 61 MB.', 'Creation of five groups during the shuffle phase. During the shuffle phase, five groups will be created, with countries such as China, India, Malaysia, Singapore, and Thailand being part of these groups.', 'Comparison to RDBMS query process. The explanation provides a comparison to the RDBMS query process, detailing the loop through each record, creation of groups, and the insertion of records into the groups, with a specific example of finding and inserting records for India and China.']}, {'end': 2939.083, 'start': 2404.996, 'title': 'Mapreduce basics and limitations', 'summary': 'Discusses the basics of mapreduce, including map phase, shuffle and sort, and the role of partitioner in load balancing. it also highlights the limitation of default single reducer and the potential solution through partitioning. the chapter concludes with a brief overview of the combiner.', 'duration': 534.087, 'highlights': ['The role of partitioner in load balancing is crucial in MapReduce, where it decides which data should go to which reducer, ensuring an efficient distribution of data and workload. Partitioner ensures load balancing by deciding the distribution of data to different reducers, optimizing the workload and data processing.', 'The limitation of default single reducer in MapReduce is addressed through the concept of partitioning, where the number of reducers can be changed to improve load balancing and prevent potential bottlenecks. The default single reducer limitation is mitigated by allowing the adjustment of the number of reducers, enabling better load balancing and addressing potential storage bottlenecks.', 'The concept of combiner is briefly introduced as a means to consolidate and optimize the data output from the mapper, potentially reducing the data volume and improving efficiency. 
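Since the combiner described above consolidates each mapper's local output before the partitioner ships it to reducers, a sum-style reducer can usually double as the combiner (addition is associative). A minimal sketch, modelled on the stock Hadoop word-count example; the class name is the stock example's, used here for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all values for a key; usable both as the combiner (map-side,
// pre-aggregating local output) and as the final reducer.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

// In the driver this is wired with both job.setCombinerClass(IntSumReducer.class)
// and job.setReducerClass(IntSumReducer.class), so far less data crosses the network.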
The combiner is presented as a method to consolidate and optimize mapper output, potentially reducing data volume and enhancing processing efficiency.']}], 'duration': 1423.757, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx401515326.jpg', 'highlights': ['The significant job prospects for individuals with big data expertise are mentioned, positioning it as one of the top jobs for the next decade.', 'The demand for professionals with big data expertise is emphasized, with an emphasis on certifications, project experience, and hands-on experience.', 'The exponential growth of data generation is highlighted, with over a billion active devices every single hour of every single day.', "MapReduce is explained as a 'divide and conquer' approach, where 'map' signifies division and 'reduce' signifies conquering, simplifying its understanding.", 'MapReduce is a simple concept, where data is divided and consolidated for processing, demonstrated through the example of four individuals working on reports.', "The concept of MapReduce is likened to real-life experiences, with the 'map' phase representing individual work and the 'reduce' phase representing consolidation.", 'With 189 MB file size, there will be two HDFS blocks. The file size of 189 MB will result in two HDFS blocks, with one block of 128 MB and the other of 61 MB.', 'Creation of five groups during the shuffle phase. During the shuffle phase, five groups will be created, with countries such as China, India, Malaysia, Singapore, and Thailand being part of these groups.', 'The role of partitioner in load balancing is crucial in MapReduce, where it decides which data should go to which reducer, ensuring an efficient distribution of data and workload.', 'The limitation of default single reducer in MapReduce is addressed through the concept of partitioning, where the number of reducers can be changed to improve load balancing and prevent potential bottlenecks.', 'The concept of combiner is briefly introduced as a means to consolidate and optimize the data output from the mapper, potentially reducing the data volume and improving efficiency.', 'Comparison to RDBMS query process. 
The explanation provides a comparison to the RDBMS query process, detailing the loop through each record, creation of groups, and the insertion of records into the groups, with a specific example of finding and inserting records for India and China.']}, {'end': 4725.151, 'segs': [{'end': 3124.636, 'src': 'embed', 'start': 2939.083, 'weight': 0, 'content': [{'end': 2976.672, 'text': 'Now, instead of China and India like this with combiner, we will get China 22.8, India 22.8 and here we will get Malaysia to be 6.4, sorry, 6.14,', 'start': 2939.083, 'duration': 37.589}, {'end': 2990.898, 'text': '6.4 and we get the consolidation.', 'start': 2976.672, 'duration': 14.226}, {'end': 2999.503, 'text': 'This is done by one guy sitting before the partitioner called as combiner.', 'start': 2991.559, 'duration': 7.944}, {'end': 3009.368, 'text': 'So that is combiner and partitioner.', 'start': 3007.947, 'duration': 1.421}, {'end': 3023.29, 'text': 'So that is combiner and partitioner.', 'start': 3017.566, 'duration': 5.724}, {'end': 3029.034, 'text': 'So now this kind of consolidates the entire map reduce for us.', 'start': 3023.91, 'duration': 5.124}, {'end': 3033.296, 'text': 'We understand map, that is one recorded time.', 'start': 3029.634, 'duration': 3.662}, {'end': 3038.56, 'text': 'then we understand partitioner, which is load balancer, but before load balancing, if you want to combine something,', 'start': 3033.296, 'duration': 5.264}, {'end': 3042.242, 'text': 'we can run combiner and then we do the reduce part.', 'start': 3038.56, 'duration': 3.682}, {'end': 3043.343, 'text': 'then we do the reduce part.', 'start': 3042.242, 'duration': 1.101}, {'end': 3046.305, 'text': "So let's understand this carefully.", 'start': 3044.684, 'duration': 1.621}, {'end': 3056.485, 'text': 'as a programmer.', 'start': 3055.424, 'duration': 1.061}, {'end': 3060.007, 'text': 'how do I predict my number of users now?', 'start': 3056.485, 'duration': 3.522}, {'end': 3063.95, 'text': 'actually, there are no royal ways, but there will be couple of tips out here.', 'start': 3060.007, 'duration': 3.943}, {'end': 3114.093, 'text': "so let's understand And in 30 seconds I'll put the answer and check if your answer is confirming.", 'start': 3063.95, 'duration': 50.143}, {'end': 3120.917, 'text': 'Answer should be two reducers.', 'start': 3119.736, 'duration': 1.181}, {'end': 3123.215, 'text': 'Why you should go for 12 reducers?', 'start': 3121.954, 'duration': 1.261}, {'end': 3124.636, 'text': 'because here use case.', 'start': 3123.215, 'duration': 1.421}], 'summary': 'Mapreduce process explained with china 22.8, india 22.8, malaysia 6.4 data and recommendation of two reducers for use case.', 'duration': 185.553, 'max_score': 2939.083, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx402939083.jpg'}, {'end': 3939.261, 'src': 'embed', 'start': 3910.606, 'weight': 9, 'content': [{'end': 3924.357, 'text': 'sudo user, java, latest bin JPS and by the way, this command is not married to Hadoop.', 'start': 3910.606, 'duration': 13.751}, {'end': 3926.758, 'text': 'this command is married to Java.', 'start': 3924.357, 'duration': 2.401}, {'end': 3929.679, 'text': 'that means on your server, if any other Java process is running.', 'start': 3926.758, 'duration': 2.921}, {'end': 3930.919, 'text': 'I mean take an example.', 'start': 3929.679, 'duration': 1.24}, {'end': 3932.299, 'text': "let's give an example.", 'start': 3930.919, 'duration': 1.38}, {'end': 3936.241, 'text': 
"let's say on your left, during in that cluster, let's say Lotus Notes is also installed.", 'start': 3932.299, 'duration': 3.942}, {'end': 3939.261, 'text': 'Lotus Notes is an IBM product which runs on Java.', 'start': 3936.241, 'duration': 3.02}], 'summary': "Use 'sudo user, java, latest bin jps' to check running java processes, not necessarily related to hadoop.", 'duration': 28.655, 'max_score': 3910.606, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx403910606.jpg'}, {'end': 4130.397, 'src': 'embed', 'start': 4100.002, 'weight': 3, 'content': [{'end': 4106.185, 'text': 'So most Linux commands will work like that, hadoop fs hyphen and put the Linux command.', 'start': 4100.002, 'duration': 6.183}, {'end': 4113.368, 'text': 'So if hadoop fs is not there this is Linux command, if hadoop fs is there this is HDFS command.', 'start': 4108.286, 'duration': 5.082}, {'end': 4124.615, 'text': 'Now there is one exception to this rule and that is that we cannot do vi that is we cannot edit a file in hadoop.', 'start': 4114.75, 'duration': 9.865}, {'end': 4130.397, 'text': 'In Hadoop file system, once you write the data, the file cannot be updated.', 'start': 4125.734, 'duration': 4.663}], 'summary': 'Most linux commands work in hadoop, except for file editing and updates in hdfs.', 'duration': 30.395, 'max_score': 4100.002, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx404100002.jpg'}, {'end': 4194.046, 'src': 'embed', 'start': 4161.218, 'weight': 1, 'content': [{'end': 4168.46, 'text': 'The second variation to this is in Linux, there is an amazing command called as cd.', 'start': 4161.218, 'duration': 7.242}, {'end': 4174.743, 'text': "So I can say my pwd, and I can say let's say cd workspace.", 'start': 4169.56, 'duration': 5.183}, {'end': 4179.064, 'text': 'And when I do my pwd, my present working directory changes.', 'start': 4175.723, 'duration': 3.341}, {'end': 4186.263, 'text': 'But, my friends, in Hadoop there is no cd command.', 'start': 4181.665, 'duration': 4.598}, {'end': 4189.505, 'text': "that means let's look at a simple example.", 'start': 4186.263, 'duration': 3.242}, {'end': 4194.046, 'text': 'I say hadoop fs, ls temp.', 'start': 4189.505, 'duration': 4.541}], 'summary': "In linux, 'cd' command changes directory; in hadoop, 'cd' is not available. example: 'hadoop fs, ls temp'.", 'duration': 32.828, 'max_score': 4161.218, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx404161218.jpg'}], 'start': 2939.083, 'title': 'Mapreduce and hadoop basics', 'summary': 'Covers the role of combiner and partitioner in map reduce process, the need for 12 reducers, and quantifiable data of china 22.8, india 22.8, malaysia 6.4. 
it also discusses mapreduce workflow, hadoop setup on windows, mac, and centos, and hdfs basics including unique features.', 'chapters': [{'end': 3147.731, 'start': 2939.083, 'title': 'Map reduce combiner and partitioner', 'summary': 'Discusses the role of combiner and partitioner in the map reduce process, emphasizing the need for 12 reducers based on a specific use case and the quantifiable data of china 22.8, india 22.8, and malaysia 6.4.', 'duration': 208.648, 'highlights': ["The use case specifies the requirement for 12 output files, each containing one month's data, indicating the need for 12 reducers.", 'The quantifiable data of China 22.8, India 22.8, and Malaysia 6.4 showcases the role of combiner and partitioner in consolidating the map reduce process.', 'The explanation of combiner and partitioner in the map reduce process, highlighting their functions in combining and load balancing.']}, {'end': 3508.856, 'start': 3148.952, 'title': 'Understanding mapreduce workflow', 'summary': 'Explains the trial and error process of determining the number of mappers and reducers based on the nature of the data, emphasizing the need for understanding the data and the absence of a universal approach. it also outlines the mapreduce workflow, including the allocation of mappers and reducers, the shuffle phase, and the generation of output files.', 'duration': 359.904, 'highlights': ['The process of determining the number of mappers and reducers involves trial and error, requiring a deep understanding of the data, as there is no universal approach for this.', 'The chapter outlines the MapReduce workflow, including the allocation of mappers and reducers, the shuffle phase, and the generation of output files.', 'The nature of the data plays a crucial role in determining the number of mappers and reducers, and any changes in the data require revisiting the trial and error process to settle on an appropriate number.']}, {'end': 3819.531, 'start': 3511.107, 'title': 'Setup and practice for hadoop', 'summary': 'Explains the setup process for hadoop on windows, mac, and centos, including the services running on the virtual machine and the practical exercises for hdfs and mapreduce commands, with a focus on the differences between unix and hdfs file systems, and the functionalities of name node, data node, secondary name node, resource manager, node manager, and job history server.', 'duration': 308.424, 'highlights': ['The virtual machine for Hadoop runs on top of Windows operating system and CentOS, with six services already running, including name node, data node, secondary name node, resource manager, node manager, and job history server. The virtual machine for Hadoop runs on Windows and CentOS, with six services already running, such as name node, data node, secondary name node, resource manager, node manager, and job history server.', 'The practical exercises cover HDFS and MapReduce commands, focusing on the differences between Unix and HDFS file systems, and the functionalities of name node, data node, secondary name node, resource manager, node manager, and job history server. The practical exercises cover HDFS and MapReduce commands, emphasizing the differences between Unix and HDFS file systems, and the functionalities of name node, data node, secondary name node, resource manager, node manager, and job history server.', 'Resource manager in Hadoop handles multiple frameworks like MapReduce and Spark but does not archive job status, which is managed by the job history server. 
Resource manager in Hadoop handles multiple frameworks like MapReduce and Spark but does not archive job status, which is managed by the job history server.']}, {'end': 4725.151, 'start': 3819.571, 'title': 'Hadoop verification and hdfs basics', 'summary': "Explains the use of 'jps' command to verify java processes running in a cluster, and demonstrates the difference between linux and hadoop file systems using 'ls' and 'hadoop fs ls' commands, also highlighting the unique features of hdfs such as 'write once, read many' and the absence of 'cd' command.", 'duration': 905.58, 'highlights': ["The 'JPS' command reveals all the Java processes running in the cluster, including Hadoop processes such as OAS name node, data node, secondary name node, resource manager, node manager, and job history server. The 'JPS' command provides a concise list of all Java processes running in the cluster, including specific Hadoop processes, thereby allowing for easy verification of Hadoop installation.", "The difference between Linux and Hadoop file systems is demonstrated by comparing the output of 'ls' and 'hadoop fs ls' commands, indicating the clear distinction in the content displayed. The comparison of 'ls' and 'hadoop fs ls' commands clearly illustrates the differences in content displayed, confirming the uniqueness of the Linux and Hadoop file systems.", "HDFS commands closely resemble Linux commands, with the exception of the absence of a 'cd' command and the requirement to use full paths for all operations. The explanation emphasizes the similarity between HDFS and Linux commands, while highlighting the absence of 'cd' command in HDFS and the necessity to use full paths for all operations."]}], 'duration': 1786.068, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx402939083.jpg', 'highlights': ["The use case specifies the requirement for 12 output files, each containing one month's data, indicating the need for 12 reducers.", 'The quantifiable data of China 22.8, India 22.8, and Malaysia 6.4 showcases the role of combiner and partitioner in consolidating the map reduce process.', 'The explanation of combiner and partitioner in the map reduce process, highlighting their functions in combining and load balancing.', 'The process of determining the number of mappers and reducers involves trial and error, requiring a deep understanding of the data, as there is no universal approach for this.', 'The chapter outlines the MapReduce workflow, including the allocation of mappers and reducers, the shuffle phase, and the generation of output files.', 'The virtual machine for Hadoop runs on top of Windows operating system and CentOS, with six services already running, including name node, data node, secondary name node, resource manager, node manager, and job history server.', 'The practical exercises cover HDFS and MapReduce commands, focusing on the differences between Unix and HDFS file systems, and the functionalities of name node, data node, secondary name node, resource manager, node manager, and job history server.', 'Resource manager in Hadoop handles multiple frameworks like MapReduce and Spark but does not archive job status, which is managed by the job history server.', "The 'JPS' command reveals all the Java processes running in the cluster, including Hadoop processes such as OAS name node, data node, secondary name node, resource manager, node manager, and job history server.", "The difference between Linux and Hadoop file systems is demonstrated 
by comparing the output of 'ls' and 'hadoop fs ls' commands, indicating the clear distinction in the content displayed.", "HDFS commands closely resemble Linux commands, with the exception of the absence of a 'cd' command and the requirement to use full paths for all operations."]}, {'end': 6824.693, 'segs': [{'end': 4821.127, 'src': 'embed', 'start': 4790.12, 'weight': 3, 'content': [{'end': 4795.702, 'text': "So this is some ready made code wordcount.jar, I have written it for you, it's ready, just leave it, it's fine.", 'start': 4790.12, 'duration': 5.582}, {'end': 4798.044, 'text': 'Right now we are not interested in how to add code.', 'start': 4796.223, 'duration': 1.821}, {'end': 4803.506, 'text': 'Hadoop jar wordcount.jar, the input folder I created is input.', 'start': 4798.644, 'duration': 4.862}, {'end': 4806.747, 'text': 'And the output that I will get is output.', 'start': 4804.646, 'duration': 2.101}, {'end': 4811.389, 'text': 'But right now my perspective is how does it get deployed on cluster.', 'start': 4807.568, 'duration': 3.821}, {'end': 4813.29, 'text': 'So we will see both of it.', 'start': 4812.15, 'duration': 1.14}, {'end': 4821.127, 'text': 'First, I will go to localhost, and I will go to 8088.', 'start': 4813.67, 'duration': 7.457}], 'summary': 'Ready-made code wordcount.jar for hadoop deployment on cluster, accessing via localhost on port 8088.', 'duration': 31.007, 'max_score': 4790.12, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx404790120.jpg'}, {'end': 5218.902, 'src': 'embed', 'start': 5193.576, 'weight': 1, 'content': [{'end': 5203.098, 'text': 'so, for example, the students example what should be done with each student and here what should be done with China comma.', 'start': 5193.576, 'duration': 9.522}, {'end': 5206.839, 'text': "let's say 11111 the whole packet.", 'start': 5203.098, 'duration': 3.741}, {'end': 5211.48, 'text': 'so the driver you initialize driver will call the mapper.', 'start': 5206.839, 'duration': 4.641}, {'end': 5216.181, 'text': 'mapper will write its output automatically, framework will take care of shuffle and sort.', 'start': 5211.48, 'duration': 4.701}, {'end': 5218.902, 'text': 'then the driver will call the reducer and you will get your output.', 'start': 5216.181, 'duration': 2.721}], 'summary': 'Process flow: initialize driver, mapper outputs, shuffle/sort, reducer, output', 'duration': 25.326, 'max_score': 5193.576, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx405193576.jpg'}, {'end': 6621.58, 'src': 'embed', 'start': 6583.651, 'weight': 4, 'content': [{'end': 6590.572, 'text': 'We are saying that both mapper and reducer their output is text comma integer.', 'start': 6583.651, 'duration': 6.921}, {'end': 6592.693, 'text': "Let's carefully understand, let's carefully understand.", 'start': 6590.652, 'duration': 2.041}, {'end': 6608.924, 'text': 'So we were looking at the output folder right, hdfs dfs get output slash part r 0 0 0 0 0.', 'start': 6595.896, 'duration': 13.028}, {'end': 6614.253, 'text': 'So this is text, this is text comma integer.', 'start': 6608.928, 'duration': 5.325}, {'end': 6621.58, 'text': 'So we are saying that our mapper output is word comma 1 and our reducer output is also word comma integer.', 'start': 6615.314, 'duration': 6.266}], 'summary': 'Mapper and reducer outputs are text, integer. mapper: word, 1. 
reducer: word, integer.', 'duration': 37.929, 'max_score': 6583.651, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx406583651.jpg'}, {'end': 6757.681, 'src': 'embed', 'start': 6702.042, 'weight': 0, 'content': [{'end': 6706.845, 'text': 'Am I missing something? Ls input.', 'start': 6702.042, 'duration': 4.803}, {'end': 6711.308, 'text': "Okay, it's got the spelling wrong.", 'start': 6708.826, 'duration': 2.482}, {'end': 6735.499, 'text': 'so one record means Intellipaat is one of the very first companies who got into big data.', 'start': 6718.826, 'duration': 16.673}, {'end': 6741.224, 'text': 'so this map will take one record at a time.', 'start': 6735.499, 'duration': 5.725}, {'end': 6750.575, 'text': 'that means this is my record and in the notepad file or text file this starts with a byte offset zero.', 'start': 6741.224, 'duration': 9.351}, {'end': 6757.681, 'text': 'so it says this is an object which is key and the value is this entire string.', 'start': 6750.575, 'duration': 7.106}], 'summary': 'Intellipaat is one of the very first companies in big data with one record at a time.', 'duration': 55.639, 'max_score': 6702.042, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx406702042.jpg'}], 'start': 4725.151, 'title': 'Hadoop data transfer, job deployment, mapreduce basics, and java setup', 'summary': 'Covers transferring data from linux to hadoop, job deployment and monitoring, mapreduce programming basics, setting up eclipse for a java project, and building and running mapreduce programs, including resolving 48 errors, and understanding java code execution in hadoop.', 'chapters': [{'end': 4789.139, 'start': 4725.151, 'title': 'Transfer data from linux to hadoop', 'summary': 'Details the process of transferring data from a linux box to hadoop, involving creating a folder, putting a file into the hadoop directory, and running a code.', 'duration': 63.988, 'highlights': ["Transferring data involves creating a folder in Hadoop using the command 'hdfs dfs mkdir' and then putting a file into the created directory using 'hdfs dfs put', demonstrated by creating a folder called 'input' and putting a file named 'intellipad.txt'.", "The process involves using Hadoop commands such as 'hdfs dfs mkdir' and 'hdfs dfs put' to transfer data, with the example showcasing the creation of a folder and putting a file into the Hadoop directory.", "The transcript discusses the process of transferring data from a Linux box to Hadoop, including creating a folder named 'input' and putting a file called 'intellipad.txt' into the Hadoop directory."]}, {'end': 5024.628, 'start': 4790.12, 'title': 'Job deployment and monitoring in hadoop', 'summary': "Outlines the process of deploying a job on a hadoop cluster, monitoring its progress through the resource manager ui, and understanding the significance of output files such as 'success' and 'part-r-00'. it also emphasizes the role of tracking urls and the job history server in tracking job status and completion.", 'duration': 234.508, 'highlights': ["The job is submitted using the command 'Hadoop jar wordcount.jar', resulting in successful submission and subsequent execution, as indicated by the appearance of the job in the history server. 
Successful submission and execution of the job, tracked through the history server.", "The 'success' file serves as a flag for job success, while the 'part-r-00' file contains the output of the job, with the number of reducers directly influencing the number of output files. Significance of 'success' and 'part-r-00' files in indicating job success and providing output, influenced by the number of reducers.", 'Tracking URLs and the job history server are pivotal in monitoring job progress and completion, with the tracking URL leading to the job history server upon job execution. Importance of tracking URLs and job history server in monitoring job progress and completion.']}, {'end': 5427.07, 'start': 5024.628, 'title': 'Mapreduce programming basics', 'summary': 'Discusses the basics of mapreduce programming, including the structure of a mapreduce program with three main classes - driver, mapper, and reducer, and outlines the agenda to cover four code samples: word count, aggregation, and two types of joins in mapper and reducer.', 'duration': 402.442, 'highlights': ['The chapter discusses the basics of MapReduce programming, including the structure of a MapReduce program with three main classes - driver, mapper, and reducer. The MapReduce program consists of three main Java classes: driver, mapper, and reducer, with the driver serving as the ignition point, the mapper handling each record, and the reducer managing each group or bucket.', 'The chapter outlines the agenda to cover four code samples: word count, aggregation, and two types of joins in mapper and reducer. The agenda includes covering four code samples: word count, aggregation, and two types of joins in mapper and reducer, demonstrating the practical application of the MapReduce programming concepts.', 'The chapter emphasizes the importance of maintaining simplicity and good practice in MapReduce programming by structuring workflows with multiple MapReduce programs and chaining them for maintainability. 
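The "keep each MapReduce program simple and chain them" practice mentioned above might look like the following sketch: run the first job to completion, then feed its output directory to the next job. Job names, paths, and the class name are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStagePipeline {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job first = Job.getInstance(conf, "stage 1: cleanup");
        first.setJarByClass(TwoStagePipeline.class);
        // first.setMapperClass(...); first.setReducerClass(...); etc.
        FileInputFormat.addInputPath(first, new Path("raw_input"));
        FileOutputFormat.setOutputPath(first, new Path("stage1_out"));
        if (!first.waitForCompletion(true)) {
            System.exit(1);                          // stop the pipeline if stage 1 fails
        }

        Job second = Job.getInstance(conf, "stage 2: aggregate");
        second.setJarByClass(TwoStagePipeline.class);
        // second.setMapperClass(...); second.setReducerClass(...); etc.
        FileInputFormat.addInputPath(second, new Path("stage1_out"));   // chained on stage 1's output
        FileOutputFormat.setOutputPath(second, new Path("final_out"));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}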
The chapter stresses the importance of maintaining simplicity and good practice in MapReduce programming, suggesting structuring workflows with multiple MapReduce programs and chaining them for maintainability, while also allowing for more complex workflows if necessary.']}, {'end': 5964.465, 'start': 5429.059, 'title': 'Setting up eclipse for java project', 'summary': 'Explains how to set up a java project in eclipse, including importing code, adding libraries, exporting a jar file, and running the project with detailed step-by-step instructions, including resolving errors and running the job, with a total of 48 errors resolved during the process.', 'duration': 535.406, 'highlights': ['The chapter explains how to set up a Java project in Eclipse, including importing code, adding libraries, exporting a jar file, and running the project with detailed step-by-step instructions, including resolving 48 errors during the process.', 'The process involves importing code, adding libraries such as user libhadoop, Hadoop, MapReduce, exporting the code as a jar file, and running the jar file to complete the setup.', 'The chapter also highlights the process of resolving errors during the setup, with specific steps such as adding external jars, including Hadoop and MapReduce libraries, and resolving issues related to generic options parser.', 'The chapter provides detailed instructions on running the job, including clearing the output folder, submitting the job, and documenting the steps for future reference, emphasizing a step-by-step approach for creating the code and running the project.']}, {'end': 6447.939, 'start': 5966.044, 'title': 'Building and running mapreduce program', 'summary': 'Explains building and running a mapreduce program from the command line using ant, including the steps for installing ant, copying the code, building the jar file, running the code, and verifying the output. it also emphasizes the importance of documentation and understanding the logic of the program.', 'duration': 481.895, 'highlights': ['Installing ant is the first step to use as a build tool in the VM. To use a build tool, the first step is to install ant in the VM.', 'Copying the word count folder to a new location, such as America, and deleting the existing jar file are necessary steps before running the ant command to build the jar file. Copying the word count folder to a new location and deleting the existing jar file are necessary before running the ant command to build the jar file.', 'The chapter emphasizes the importance of proper documentation, including the existence of a readme file in the word count code files folder, which provides detailed deployment instructions. The chapter emphasizes the importance of proper documentation, including the existence of a readme file in the word count code files folder, which provides detailed deployment instructions.']}, {'end': 6824.693, 'start': 6448.98, 'title': 'Understanding java code execution in hadoop', 'summary': 'Explains the execution flow of a java code in hadoop, including reading configuration, specifying input and output path, setting mapper and reducer classes, and tokenizing input records in the mapper class.', 'duration': 375.713, 'highlights': ['The chapter explains the execution flow of a Java code in Hadoop, including reading configuration, specifying input and output path, setting mapper and reducer classes, and tokenizing input records in the mapper class. 
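The command-line build flow described above could look roughly like this, assuming the word count project ships a build.xml with a default jar target; every path and name here is hypothetical.

    # install ant on the VM (the package manager may differ on your system)
    sudo yum install ant
    # copy the project to a new location, as in the walkthrough
    cp -r wordcount ~/america
    cd ~/america/wordcount
    # delete the previously built jar before rebuilding
    rm -f wordcount.jar
    # run the default ant target to build the jar
    ant
    # submit the freshly built jar (class name and paths are assumptions)
    hadoop jar wordcount.jar WordCount input output2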
The summary provides an overview of the entire transcript, including key points such as reading configuration, specifying input and output paths, setting mapper and reducer classes, and tokenizing input records in the mapper class.', 'The chapter describes the process of reading the configuration and checking for user-provided arguments, such as input and output paths. The transcript details the process of reading the configuration and checking for user-provided arguments, such as input and output paths, to ensure proper job execution.', 'The transcript explains the importance of specifying the input and output paths and the expected behavior if the user does not provide them. It emphasizes the importance of specifying the input and output paths and the expected behavior if the user does not provide them, ensuring job failure in case of inadequate input.', 'The chapter elaborates on specifying the classes for mapper, combiner, and reducer, and setting the main class for the jar file. It elaborates on specifying the classes for mapper, combiner, and reducer, and setting the main class for the jar file, ensuring the correct execution flow within the Hadoop framework.', 'The transcript details the output format specifications, including the output key and value classes, and the submission of the job. It details the output format specifications, including the output key and value classes, and the submission of the job, ensuring the proper configuration of the output format before job submission.', 'The chapter outlines the role of the driver class in setting expectations and calling the mapper and reducer classes, without incorporating business logic. It outlines the role of the driver class in setting expectations and calling the mapper and reducer classes, without incorporating business logic, emphasizing its preparatory nature for the actual business logic implementation.', 'The transcript explains the role of the mapper class in processing one record at a time, tokenizing input records, and handling individual data entries. 
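The driver/mapper/reducer walkthrough above maps onto code along these lines. This is a condensed sketch that mirrors the stock Hadoop word count rather than the exact course files; the class names are assumptions.

    // Driver wires the job together; Mapper tokenizes one record at a time;
    // Reducer sums one bucket (word) at a time.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {                                     // the "driver": no business logic,
      public static void main(String[] args) throws Exception {  // it only sets expectations for the job
        Configuration conf = new Configuration();
        if (args.length != 2) {                                  // fail fast if input/output paths are missing
          System.err.println("Usage: WordCount <input> <output>");
          System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);               // combiner reuses the reducer logic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);                       // reducer emits word -> integer
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
      }

      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
          // key is the byte offset of the line, value is the whole line of text
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            ctx.write(word, ONE);                                // emit (word, 1)
          }
        }
      }

      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;                                           // one bucket per word
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));                  // e.g. ("big", 4)
        }
      }
    }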
It explains the role of the mapper class in processing one record at a time, tokenizing input records, and handling individual data entries, providing insights into the specific operations performed by the mapper.']}], 'duration': 2099.542, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx404725151.jpg', 'highlights': ["The process involves using Hadoop commands such as 'hdfs dfs mkdir' and 'hdfs dfs put' to transfer data, with the example showcasing the creation of a folder and putting a file into the Hadoop directory.", "The job is submitted using the command 'Hadoop jar wordcount.jar', resulting in successful submission and subsequent execution, as indicated by the appearance of the job in the history server.", 'The chapter discusses the basics of MapReduce programming, including the structure of a MapReduce program with three main classes - driver, mapper, and reducer.', 'The chapter explains how to set up a Java project in Eclipse, including importing code, adding libraries, exporting a jar file, and running the project with detailed step-by-step instructions, including resolving 48 errors during the process.', 'The chapter emphasizes the importance of proper documentation, including the existence of a readme file in the word count code files folder, which provides detailed deployment instructions.', 'The chapter explains the execution flow of a Java code in Hadoop, including reading configuration, specifying input and output path, setting mapper and reducer classes, and tokenizing input records in the mapper class.']}, {'end': 12729.62, 'segs': [{'end': 7111.843, 'src': 'embed', 'start': 7036.092, 'weight': 4, 'content': [{'end': 7039.695, 'text': 'So I say workspace, life is good.', 'start': 7036.092, 'duration': 3.603}, {'end': 7045.58, 'text': 'Say programming, Eclipse.', 'start': 7043.378, 'duration': 2.202}, {'end': 7055.068, 'text': 'So we close word count.', 'start': 7054.007, 'duration': 1.061}, {'end': 7058.571, 'text': "I'll start a new project, new Java project.", 'start': 7055.869, 'duration': 2.702}, {'end': 7062.094, 'text': 'Time diff, finish.', 'start': 7060.273, 'duration': 1.821}, {'end': 7069.221, 'text': "And now you know how to get rid of errors and all, but I'll just do it just to steer clear of it.", 'start': 7063.297, 'duration': 5.924}, {'end': 7072.463, 'text': 'Libraries, add external jars.', 'start': 7070.161, 'duration': 2.302}, {'end': 7082.789, 'text': "Same process which you've already seen.", 'start': 7081.188, 'duration': 1.601}, {'end': 7111.843, 'text': 'ok. 
so now what we are trying to do here?', 'start': 7105.382, 'duration': 6.461}], 'summary': 'Using eclipse, creating a new java project and managing errors.', 'duration': 75.751, 'max_score': 7036.092, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx407036092.jpg'}, {'end': 7409.822, 'src': 'embed', 'start': 7346.986, 'weight': 3, 'content': [{'end': 7366.67, 'text': 'So, I say cp-r Hadoop core files multiple input formats multiple to workspace.', 'start': 7346.986, 'duration': 19.684}, {'end': 7371.503, 'text': "So it's multiple input, equal to Eclipse.", 'start': 7369.161, 'duration': 2.342}, {'end': 7397.265, 'text': "This is slightly strange and slightly interesting because, so we'll ignore the errors, you know how to fix them.", 'start': 7390.479, 'duration': 6.786}, {'end': 7402.658, 'text': 'Here what we are doing is there are multiple mappers.', 'start': 7398.655, 'duration': 4.003}, {'end': 7409.822, 'text': 'There is one mapper that reads customer detail and there is another one which is reading delivery status.', 'start': 7402.978, 'duration': 6.844}], 'summary': 'Using multiple input formats in hadoop, with multiple mappers for reading customer details and delivery status.', 'duration': 62.836, 'max_score': 7346.986, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx407346986.jpg'}, {'end': 7541.105, 'src': 'embed', 'start': 7484.704, 'weight': 5, 'content': [{'end': 7505.114, 'text': 'so this mapper it reads customer data and this mapper reads delivery data.', 'start': 7484.704, 'duration': 20.41}, {'end': 7509.476, 'text': 'So, this is complicated in terms of you know you can always have multiple mappers like that.', 'start': 7505.334, 'duration': 4.142}, {'end': 7516.54, 'text': "Now, this guy emits let's say customer id.", 'start': 7511.097, 'duration': 5.443}, {'end': 7524.777, 'text': 'comma some details and this guy also emits CID comma some details.', 'start': 7517.674, 'duration': 7.103}, {'end': 7541.105, 'text': 'What we get here is CID comma details one comma details two and here we can emit details one comma details two.', 'start': 7526.438, 'duration': 14.667}], 'summary': 'The mapper reads customer and delivery data, emitting customer ids and details for complex processing.', 'duration': 56.401, 'max_score': 7484.704, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx407484704.jpg'}, {'end': 9629.669, 'src': 'embed', 'start': 9582.749, 'weight': 6, 'content': [{'end': 9584.851, 'text': 'How will Hive do that.', 'start': 9582.749, 'duration': 2.102}, {'end': 9591.398, 'text': 'And how is internally this whole thing is working, how this whole setup is there, next week is all for that.', 'start': 9585.412, 'duration': 5.986}, {'end': 9593.4, 'text': 'We will go very deep into Hive.', 'start': 9591.718, 'duration': 1.682}, {'end': 9605.418, 'text': "So, this is like the setup part that you'll be doing on your VM.", 'start': 9602.136, 'duration': 3.282}, {'end': 9618.964, 'text': 'Now, all we need to do is, we need to go to Hive.', 'start': 9608.039, 'duration': 10.925}, {'end': 9622.465, 'text': "So, I'm on Hive prompt, and I say show tables.", 'start': 9619.184, 'duration': 3.281}, {'end': 9628.408, 'text': 'Oops, there is only one dummy table.', 'start': 9626.828, 'duration': 1.58}, {'end': 9629.669, 'text': "Let's drop that.", 'start': 9629.089, 'duration': 0.58}], 'summary': 'Next week will focus on deep 
dive into hive setup and usage.', 'duration': 46.92, 'max_score': 9582.749, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx409582749.jpg'}, {'end': 10751.176, 'src': 'embed', 'start': 10712.532, 'weight': 7, 'content': [{'end': 10715.475, 'text': 'I take this, say capture, and this is E.', 'start': 10712.532, 'duration': 2.943}, {'end': 10731.346, 'text': 'This is T and this is L.', 'start': 10723.661, 'duration': 7.685}, {'end': 10737.809, 'text': 'Now, having said that, does PEG allow me to connect to Hive tables? Yes, absolutely possible.', 'start': 10731.346, 'duration': 6.463}, {'end': 10741.571, 'text': 'But in real world you will not normally see people doing that.', 'start': 10738.469, 'duration': 3.102}, {'end': 10742.872, 'text': 'So, answer is yes.', 'start': 10741.931, 'duration': 0.941}, {'end': 10751.176, 'text': "But when somebody is doing or you want to do it, ask yourself couple of times in production Do I really need to do that? That's the point I'm making.", 'start': 10743.452, 'duration': 7.724}], 'summary': 'Peg allows connecting to hive tables, but not common in real world. consider necessity in production.', 'duration': 38.644, 'max_score': 10712.532, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4010712532.jpg'}, {'end': 10983.773, 'src': 'embed', 'start': 10940.182, 'weight': 1, 'content': [{'end': 10962.486, 'text': "what they're doing is using you're loading that HD effect in 5 and writing.", 'start': 10940.182, 'duration': 22.304}, {'end': 10965.927, 'text': 'is that clear??', 'start': 10962.486, 'duration': 3.441}, {'end': 10970.161, 'text': 'this straightforward exercise completely self, explicitly.', 'start': 10967.484, 'duration': 2.677}, {'end': 10978.229, 'text': 'So, if you look at exercise 5x, you have MySQL where there is some data.', 'start': 10972.226, 'duration': 6.003}, {'end': 10978.79, 'text': "It's fine.", 'start': 10978.409, 'duration': 0.381}, {'end': 10983.773, 'text': 'You are getting that data out using Scoop and dumping it to Hadoop file system.', 'start': 10979.49, 'duration': 4.283}], 'summary': 'Exercise 5x involves loading hd data into mysql and transferring it to hadoop using scoop.', 'duration': 43.591, 'max_score': 10940.182, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4010940182.jpg'}, {'end': 11410.627, 'src': 'embed', 'start': 11376.923, 'weight': 0, 'content': [{'end': 11387.132, 'text': 'you can do this, that instead of giving from the command line, you can basically what you call, give it from hyphen P, hyphen password,', 'start': 11376.923, 'duration': 10.209}, {'end': 11391.016, 'text': 'and it would ask you prompt you the password so that it does not appear in your command.', 'start': 11387.132, 'duration': 3.884}, {'end': 11392.777, 'text': "that's it, nothing else.", 'start': 11391.016, 'duration': 1.761}, {'end': 11395.8, 'text': 'everything else is good.', 'start': 11392.777, 'duration': 3.023}, {'end': 11397.121, 'text': "yeah, it's perfect.", 'start': 11395.8, 'duration': 1.321}, {'end': 11407.546, 'text': "Yeah, that's the same problem, Tang.", 'start': 11405.245, 'duration': 2.301}, {'end': 11410.627, 'text': "See, you've not started MySQL.", 'start': 11408.226, 'duration': 2.401}], 'summary': 'You can provide the password using hyphen p instead of the command line, making it more secure.', 'duration': 33.704, 'max_score': 11376.923, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4011376923.jpg'}, {'end': 12340.187, 'src': 'embed', 'start': 12303.555, 'weight': 2, 'content': [{'end': 12308.416, 'text': 'So we will copy these files to workspace and then open Eclipse.', 'start': 12303.555, 'duration': 4.861}, {'end': 12314.653, 'text': 'So now you start running the commands in the step 2.', 'start': 12308.616, 'duration': 6.037}, {'end': 12317.715, 'text': 'So you can just clear your screen and run command from step number two.', 'start': 12314.653, 'duration': 3.062}, {'end': 12320.416, 'text': 'Yeah We just say cp.', 'start': 12318.635, 'duration': 1.781}, {'end': 12323.298, 'text': 'Yeah, cp.', 'start': 12322.597, 'duration': 0.701}, {'end': 12331.562, 'text': "So now we are copying all the commands that I've had.", 'start': 12326.9, 'duration': 4.662}, {'end': 12340.187, 'text': 'So we are taking the code and dumping it into home training workspace word count.', 'start': 12333.503, 'duration': 6.684}], 'summary': 'Copying files to workspace, running commands, dumping code into workspace.', 'duration': 36.632, 'max_score': 12303.555, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4012303555.jpg'}], 'start': 6825.133, 'title': 'Hadoop ecosystem and mapreduce operations', 'summary': 'Covers mapreduce code and project setup for word count, mapreduce join operations, introduction to the hadoop ecosystem, migrating data from mysql to hadoop using scoop, hive and pig functions, and hadoop and scoop integration.', 'chapters': [{'end': 7111.843, 'start': 6825.133, 'title': 'Mapreduce code and project setup', 'summary': 'Explains the process of coding a mapreduce job for word count, demonstrating the iterative process and the final output of big comma four, and then discusses the setup and coding of a new project named time diff in eclipse.', 'duration': 286.71, 'highlights': ["The process of iterating through the tokens in the MapReduce job for word count, emitting the word and its count, results in the final output of big comma four by summing the occurrences of the word 'big' in the input data.", 'The reducer takes one bucket at a time, summing the occurrences to produce the final output, demonstrating the coding process for word count in MapReduce.', 'The setup and coding of a new project named time diff in Eclipse, including the creation of a new Java project, adding external jars, and resolving potential errors.', 'The explanation of the iterative process in coding a MapReduce job for word count, showcasing the definition and usage of the iterator and the integer 1 in the emitted output.']}, {'end': 7900.68, 'start': 7111.843, 'title': 'Mapreduce join operations', 'summary': 'Explains the concept of mapreduce join operations and provides a detailed explanation of the process, including the use of mappers, reducers, and the join operation with multiple examples and insights into key considerations for handling separate data sets and achieving the desired outcome.', 'duration': 788.837, 'highlights': ['The process of MapReduce join operations is thoroughly explained, covering the use of mappers, reducers, and the join operation with multiple examples provided. 
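A compact sketch of the reduce-side join described above: one mapper per data set, both emitting the customer id as the key, and a reducer that stitches the two sides together. The field layout, value tags and class names are assumptions for illustration, not the course's actual code.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CustomerDeliveryJoin {
      // reads customer details: "cid,details..." -> (cid, "CUST|details...")
      public static class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
          String[] parts = value.toString().split(",", 2);
          if (parts.length < 2) return;                          // skip malformed records
          ctx.write(new Text(parts[0]), new Text("CUST|" + parts[1]));
        }
      }
      // reads delivery status: "cid,status..." -> (cid, "DLVR|status...")
      public static class DeliveryMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
          String[] parts = value.toString().split(",", 2);
          if (parts.length < 2) return;
          ctx.write(new Text(parts[0]), new Text("DLVR|" + parts[1]));
        }
      }
      // the reducer sees every tagged value for one customer id and joins them
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text cid, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
          String cust = "", dlvr = "";
          for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("CUST|")) cust = s.substring(5); else dlvr = s.substring(5);
          }
          ctx.write(cid, new Text(cust + "," + dlvr));           // cid, details one, details two
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(CustomerDeliveryJoin.class);
        // one mapper per input path: customer file and delivery file
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CustomerMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, DeliveryMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }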
The explanation of MapReduce join operations covers the use of mappers, reducers, and the join operation with multiple examples provided.', 'Insights are provided into key considerations for handling separate data sets and achieving the desired outcome, including the need to find a way to get a key for separate data sets and the comparison of use cases to ETL processes. The transcript provides insights into key considerations for handling separate data sets and achieving the desired outcome, including the need to find a way to get a key for separate data sets and the comparison of use cases to ETL processes.', 'The chapter emphasizes the importance of having the same key for performing join operations and discusses the potential need for developers to handle the process of making sure that data has the same key, similar to traditional BI processes. The chapter emphasizes the importance of having the same key for performing join operations and discusses the potential need for developers to handle the process of making sure that data has the same key, similar to traditional BI processes.']}, {'end': 9055.44, 'start': 7901.861, 'title': 'Introduction to hadoop ecosystem', 'summary': 'Introduces the concept of hadoop ecosystem, emphasizing the importance of various ecosystem projects and their specific use cases. it discusses the significance of core hadoop components like hdfs, yarn, and mapreduce, along with the installation and demonstration of ecosystem projects like scoop, flume, uzi, pig, and hive.', 'duration': 1153.579, 'highlights': ['Hadoop ecosystem projects have specific use cases just like mobile apps, with popular ecosystem projects like Scoop for RDBMS to Hadoop, Flume for real-time logs to Hadoop, Uzi for workflows, Pig for PL SQL, and Hive for SQL on Hadoop. The chapter emphasizes the specific use cases of Hadoop ecosystem projects, such as Scoop for RDBMS to Hadoop, Flume for real-time logs to Hadoop, Uzi for workflows, Pig for PL SQL, and Hive for SQL on Hadoop.', 'The installation and demonstration of ecosystem projects like Scoop, Flume, Uzi, Pig, and Hive are discussed, with a focus on getting data from RDBMS to Hadoop using Scoop and processing data with Hive and Pig. The chapter covers the installation and demonstration of ecosystem projects like Scoop, Flume, Uzi, Pig, and Hive, with a focus on getting data from RDBMS to Hadoop using Scoop and processing data with Hive and Pig.', 'The importance of core Hadoop components like HDFS, YARN, and MapReduce is highlighted, along with the concept of supporting ecosystem projects and the need for their separate installation. 
The chapter emphasizes the importance of core Hadoop components like HDFS, YARN, and MapReduce, while also highlighting the concept of supporting ecosystem projects and the need for their separate installation.']}, {'end': 10684.481, 'start': 9058.416, 'title': 'Migrate data from mysql to hadoop using scoop and perform analysis using hive and pig', 'summary': 'Demonstrates using scoop to migrate data from mysql to hadoop, importing 50,000 records, verifying data import, writing sql queries in hive, and using pig to filter data by careers and assigning an additional task of writing a pig program.', 'duration': 1626.065, 'highlights': ['Using Scoop to migrate data from MySQL to Hadoop and importing 50,000 records Scoop is used to migrate data from MySQL to Hadoop, importing 50,000 records.', 'Verifying data import and writing SQL queries in Hive Data import is verified, and SQL queries are written in Hive to analyze the imported data.', 'Using Pig to filter data by careers and assigning an additional task of writing a Pig program Pig is utilized to filter data by careers, and learners are assigned the task of writing an additional Pig program.']}, {'end': 11584.544, 'start': 10684.481, 'title': 'Hive and pig functions', 'summary': 'Discusses the functions and capabilities of hive and pig, including their use in data processing and manipulation, the availability of user-defined functions, and the process of importing data from mysql to hadoop using scoop and querying it with hive, emphasizing the importance of starting the mysql service and addressing potential errors.', 'duration': 900.063, 'highlights': ['Hive and Pig functions and user-defined functions Hive and Pig have their own libraries with a multitude of functions, and user-defined functions can be created in Java and Python, expanding the capabilities for data processing and manipulation.', 'Importing data from MySQL to Hadoop using Scoop The process involves using Scoop to extract data from MySQL and load it into the Hadoop file system, followed by querying the data using Hive SQL.', 'Starting MySQL service and addressing potential errors Emphasizing the importance of starting the MySQL service before importing data to Hadoop, and addressing potential errors such as communication link failure and syntax errors in commands.']}, {'end': 12729.62, 'start': 11584.544, 'title': 'Hadoop and spook integration', 'summary': 'Covers the integration of hadoop and spook, including the process of bundling java code as a jar file, importing data to hadoop, setting up hadoop libraries in eclipse, and running a mapreduce program, resulting in a word count output.', 'duration': 1145.076, 'highlights': ['The process of bundling Java code as a jar file and submitting it to the Hadoop cluster is explained, including the creation of an input folder in HDFS and dumping data into it. Explaining the process of bundling Java code as a jar file and submitting it to the Hadoop cluster', 'The detailed steps for setting up Hadoop libraries in Eclipse to resolve errors in the project are provided, including importing the necessary jar files. Providing detailed steps for setting up Hadoop libraries in Eclipse', 'The process of running a MapReduce program to obtain the word count output is described, including checking the output using Hadoop fs -ls and Hadoop fs -cat commands. 
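A hedged sketch of the import-and-verify flow described above (the transcript writes "Scoop"; the tool is Apache Sqoop). The host, database, table and target directory are placeholders.

    # MySQL must be running before the import (service name may differ on your VM)
    sudo service mysqld start

    # -P prompts for the password so it does not appear in the command line
    sqoop import \
      --connect jdbc:mysql://localhost/retail_db \
      --username training -P \
      --table customers \
      --target-dir /user/training/customers

    # verify the imported part files in HDFS
    hadoop fs -ls /user/training/customers
    hadoop fs -cat /user/training/customers/part-m-00000 | head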
Describing the process of running a MapReduce program to obtain the word count output']}], 'duration': 5904.487, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx406825133.jpg', 'highlights': ["The process of iterating through the tokens in the MapReduce job for word count, emitting the word and its count, results in the final output of big comma four by summing the occurrences of the word 'big' in the input data.", 'The reducer takes one bucket at a time, summing the occurrences to produce the final output, demonstrating the coding process for word count in MapReduce.', 'The process of MapReduce join operations is thoroughly explained, covering the use of mappers, reducers, and the join operation with multiple examples provided.', 'Insights are provided into key considerations for handling separate data sets and achieving the desired outcome, including the need to find a way to get a key for separate data sets and the comparison of use cases to ETL processes.', 'Hadoop ecosystem projects have specific use cases just like mobile apps, with popular ecosystem projects like Scoop for RDBMS to Hadoop, Flume for real-time logs to Hadoop, Uzi for workflows, Pig for PL SQL, and Hive for SQL on Hadoop.', 'Using Scoop to migrate data from MySQL to Hadoop and importing 50,000 records Scoop is used to migrate data from MySQL to Hadoop, importing 50,000 records.', 'Hive and Pig functions and user-defined functions Hive and Pig have their own libraries with a multitude of functions, and user-defined functions can be created in Java and Python, expanding the capabilities for data processing and manipulation.', 'The process of bundling Java code as a jar file and submitting it to the Hadoop cluster is explained, including the creation of an input folder in HDFS and dumping data into it.']}, {'end': 14032.548, 'segs': [{'end': 12759.028, 'src': 'embed', 'start': 12730.161, 'weight': 1, 'content': [{'end': 12740.286, 'text': 'So the question is like what is peg and is it something similar to hive? 
I have heard a lot about peg, so where it is used, where it is not used.', 'start': 12730.161, 'duration': 10.125}, {'end': 12746.369, 'text': "The first thing I'm gonna tell about peg is that so this might be a little bit confusing,", 'start': 12740.446, 'duration': 5.923}, {'end': 12755.223, 'text': 'but still peg and hive were invented almost at the same time to solve the same problem,', 'start': 12746.369, 'duration': 8.854}, {'end': 12759.028, 'text': 'meaning both these tools were invented almost at the same time.', 'start': 12755.223, 'duration': 3.805}], 'summary': 'Peg and hive were invented at the same time to solve the same problem.', 'duration': 28.867, 'max_score': 12730.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4012730161.jpg'}, {'end': 12834.152, 'src': 'embed', 'start': 12806.287, 'weight': 5, 'content': [{'end': 12808.708, 'text': 'so they practically invented something called Hive.', 'start': 12806.287, 'duration': 2.421}, {'end': 12813.97, 'text': 'At the same time, Yahoo was facing the same problem, but in a different way.', 'start': 12808.888, 'duration': 5.082}, {'end': 12817.754, 'text': 'So you see, Yahoo is the company who practically kind of invented Hadoop.', 'start': 12814.07, 'duration': 3.684}, {'end': 12819.096, 'text': 'They did not invent Hadoop.', 'start': 12817.895, 'duration': 1.201}, {'end': 12826.285, 'text': 'They basically acquired the Nuts project actually, but the first stable release of Hadoop actually spun out of Yahoo, right?', 'start': 12819.296, 'duration': 6.989}, {'end': 12829.749, 'text': 'So when Yahoo had Hadoop, okay, the same problem was there.', 'start': 12826.305, 'duration': 3.444}, {'end': 12829.989, 'text': 'all right?', 'start': 12829.749, 'duration': 0.24}, {'end': 12834.152, 'text': 'I want to work with the data in Hadoop, but the only way is MapReduce.', 'start': 12830.289, 'duration': 3.863}], 'summary': 'Yahoo practically invented hadoop, leading to the creation of mapreduce for data processing.', 'duration': 27.865, 'max_score': 12806.287, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4012806287.jpg'}, {'end': 12955.615, 'src': 'embed', 'start': 12910.921, 'weight': 0, 'content': [{'end': 12913.702, 'text': "So they didn't know that they was trying to solve the same problem.", 'start': 12910.921, 'duration': 2.781}, {'end': 12915.543, 'text': 'and ended up inventing two tools.', 'start': 12913.942, 'duration': 1.601}, {'end': 12922.686, 'text': "What is Pig? 
So it's basically a scripting language used for exploring large data sets.", 'start': 12915.843, 'duration': 6.843}, {'end': 12927.748, 'text': 'Now it is a new language which means you will get some time to use to it.', 'start': 12922.906, 'duration': 4.842}, {'end': 12931.87, 'text': 'So the learning curve of Pig is higher than that of Hive.', 'start': 12927.888, 'duration': 3.982}, {'end': 12940.133, 'text': 'To put it in another way, if somebody has to learn Hive, It will be much easier since Hive follows SQL and most of us know SQL.', 'start': 12932.07, 'duration': 8.063}, {'end': 12943.933, 'text': 'You can just get a Hive CLI and start exploring right away.', 'start': 12940.233, 'duration': 3.7}, {'end': 12948.734, 'text': 'But when you start with Pig, Pig has its own language called Pig Latin.', 'start': 12943.973, 'duration': 4.761}, {'end': 12950.935, 'text': "So Pig's language is called Pig Latin.", 'start': 12948.954, 'duration': 1.981}, {'end': 12952.775, 'text': 'And Pig Latin is a new language.', 'start': 12951.175, 'duration': 1.6}, {'end': 12955.615, 'text': 'So the syntax, how you use it, all are new.', 'start': 12952.815, 'duration': 2.8}], 'summary': 'Inventors created pig and hive for big data, with pig being a new and complex language.', 'duration': 44.694, 'max_score': 12910.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4012910921.jpg'}, {'end': 13021.813, 'src': 'embed', 'start': 12996.501, 'weight': 2, 'content': [{'end': 13003.265, 'text': 'You can turn on and off the optimization but by default code optimization is turned on by PIG.', 'start': 12996.501, 'duration': 6.764}, {'end': 13010.529, 'text': 'So that means PIG will create the best optimized MapReduce code for you and it insulates the users from Hadoop interface,', 'start': 13003.365, 'duration': 7.164}, {'end': 13014.331, 'text': "so you don't have to practically learn MapReduce or Hadoop or anything.", 'start': 13010.529, 'duration': 3.802}, {'end': 13021.813, 'text': "So if you're supposed to write some 200 lines of Java code, you can write the same thing in 10 lines of Big Latin.", 'start': 13014.531, 'duration': 7.282}], 'summary': 'Pig optimizes mapreduce code, reducing java code from 200 to 10 lines.', 'duration': 25.312, 'max_score': 12996.501, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4012996501.jpg'}], 'start': 12730.161, 'title': 'Pig and hive comparison', 'summary': "Discusses the similarities, differences, evolution, advantages, and key use cases of pig and hive, including pig's 70-80% ad hoc job usage in the world of hadoop around 2013-2014 and yahoo running 40% of its jobs in pig.", 'chapters': [{'end': 12782.312, 'start': 12730.161, 'title': 'Comparison: pig vs. 
hive', 'summary': 'Discusses the similarities and differences between pig and hive, both invented almost at the same time to solve similar problems, with most use cases overlapping, raising the question of why two tools were invented.', 'duration': 52.151, 'highlights': ['Pig and Hive were invented almost at the same time to solve the same problem, allowing you to solve almost the same problem, with most use cases overlapping.', 'Both tools were invented to address similar problems, leading to the question of why two tools were created if they are almost the same.', 'There are some situations where Pig and Hive cannot be used, but mostly their use cases overlap.']}, {'end': 13014.331, 'start': 12782.512, 'title': 'Evolution of hive and pig', 'summary': 'Discusses the evolution of hive and pig, highlighting the problem-solving and invention process by facebook and yahoo, the popularity of pig in the world of hadoop, and the advantages of pig in optimizing and simplifying mapreduce code, with pig accounting for around 70% to 80% of ad hoc jobs in the world of hadoop around 2013 and 2014.', 'duration': 231.819, 'highlights': ["Pig accounted for around 70% to 80% of ad hoc jobs in the world of Hadoop around 2013 and 2014. Pig's popularity in the world of Hadoop, with around 70% to 80% of ad hoc jobs being achieved using Pig during 2013 and 2014.", "Advantages of Pig in optimizing and simplifying MapReduce code, with Pig creating the best optimized MapReduce code by default and insulating users from needing to learn Hadoop or MapReduce. Pig's advantages in optimizing and simplifying MapReduce code, creating the best optimized MapReduce code by default, and insulating users from the need to learn Hadoop or MapReduce.", "Facebook's problem-solving process leading to the invention of Hive to work with massive structured data and the transition to Hadoop. The problem-solving process at Facebook leading to the invention of Hive to work with massive structured data and the transition to Hadoop.", "Yahoo's problem-solving process leading to the invention of Pig as a scripting tool to convert scripts into MapReduce programs, contributing it to Apache, and becoming one of the most popular tools in the world of Hadoop. Yahoo's problem-solving process leading to the invention of Pig as a scripting tool to convert scripts into MapReduce programs, contributing it to Apache, and becoming one of the most popular tools in the world of Hadoop."]}, {'end': 13335.918, 'start': 13014.531, 'title': 'Advantages of pig latin', 'summary': 'Discusses the advantages of using pig latin, including its ability to drastically reduce code length, support non-java programmers, and its widespread usage by companies like yahoo and twitter, with yahoo running 40% of its jobs in pig.', 'duration': 321.387, 'highlights': ['Pig Latin can reduce code length from 200 lines of Java code to 10 lines of Pig Latin, making it significantly more efficient for development. It can reduce code length from 200 lines of Java code to 10 lines of Pig Latin.', 'Pig Latin allows non-Java programmers to easily work with the system, enabling them to write MapReduce code in 15 minutes instead of 4 hours. Non-Java programmers can write MapReduce code in 15 minutes instead of 4 hours using Pig Latin.', 'Yahoo runs 40% of its jobs in Pig, showcasing its widespread usage and effectiveness in large-scale data processing. 
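To make the "a few lines of Pig Latin instead of a couple of hundred lines of Java" claim concrete, a minimal word count might look like this; the input path is a placeholder.

    -- word count in a handful of Pig Latin statements; the equivalent hand-written
    -- Java MapReduce program easily runs past a hundred lines
    lines  = LOAD 'input/mary.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    DUMP counts;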
Yahoo runs 40% of its jobs in Pig, indicating its widespread usage and effectiveness in large-scale data processing.', 'Twitter is a well-known user of Pig, further emphasizing its popularity and relevance in the industry. Twitter is a well-known user of Pig, emphasizing its popularity and relevance in the industry.']}, {'end': 13668.476, 'start': 13336.71, 'title': 'Pig vs hive: key differences and use cases', 'summary': 'Explores the key differences between pig and hive, highlighting that pig is a client-side application without a server, lacks jdbc/odbc connectivity, does not require schema, and is mainly used by developers, while hive is a data warehouse with optional servers and web interface, supports connectivity, requires a schema, and is used by analysts. pig is used as a data factory operator for cleaning and transforming raw data before it is accessed by hive for reporting purposes.', 'duration': 331.766, 'highlights': ['PIG is a client-side application without a server, lacking JDBC/ODBC connectivity, not requiring a schema, and mainly used by developers.', 'Hive is a data warehouse with optional servers and web interface, supports connectivity, requires a schema, and is mainly used by analysts.', 'PIG is used as a data factory operator for cleaning and transforming raw data before it is accessed by Hive for reporting purposes.']}, {'end': 14032.548, 'start': 13668.476, 'title': 'Apache pig: data transformation & storage', 'summary': 'Discusses how apache pig is used for data transformation, cleaning, and storage in hive, its limitations with unstructured data, and the philosophy behind its name. it also explains complex data types supported by pig and the importance of understanding data before working with it.', 'duration': 364.072, 'highlights': ['Apache Pig is used for data transformation, cleaning, and storage in Hive. 
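A sketch of that data-factory pattern: Pig loads raw input, cleans and transforms it, and stores the result where a Hive table can read it. The paths, field names and delimiter here are assumptions.

    -- clean raw delimited data with Pig, then hand it to Hive for reporting
    raw     = LOAD '/data/raw/orders' USING PigStorage(',')
                  AS (order_id:int, customer:chararray, amount:float, status:chararray);
    cleaned = FILTER raw BY order_id IS NOT NULL AND amount > 0;          -- drop junk records
    trimmed = FOREACH cleaned GENERATE order_id, TRIM(customer) AS customer, amount, status;
    STORE trimmed INTO '/warehouse/orders_clean' USING PigStorage(',');
    -- a Hive external table created over /warehouse/orders_clean can now query the cleaned data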
Pig is used to transform and clean data before storing it in a Hive table, serving as an ETL tool on top of Hadoop.', 'Limitations of Apache Pig with unstructured data Pig is not suitable for pure unstructured data such as audio, video, and images, making it best suited for structured and semi-structured data.', "Philosophy behind the name 'Pig' Apache Pig is named after the animal pig, symbolizing its ability to handle any type of data and work with different platforms, as well as its support for complex data types.", 'Explanation of complex data types supported by Pig Pig supports regular data types as well as complex data types including tuple (ordered set of fields), bag (collection of tuples or other bags), and map (key-value pair).', 'Importance of understanding data before working with it It is crucial to understand the data before working with it in any big data scenario, emphasizing the need for familiarity with the dataset.']}], 'duration': 1302.387, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4012730161.jpg', 'highlights': ['Pig and Hive were invented almost at the same time to solve the same problem, allowing you to solve almost the same problem, with most use cases overlapping.', 'Pig accounted for around 70% to 80% of ad hoc jobs in the world of Hadoop around 2013 and 2014.', 'Pig Latin can reduce code length from 200 lines of Java code to 10 lines of Pig Latin, making it significantly more efficient for development.', 'Yahoo runs 40% of its jobs in Pig, showcasing its widespread usage and effectiveness in large-scale data processing.', 'PIG is a client-side application without a server, lacking JDBC/ODBC connectivity, not requiring a schema, and mainly used by developers.', 'Apache Pig is used for data transformation, cleaning, and storage in Hive.', 'Limitations of Apache Pig with unstructured data Pig is not suitable for pure unstructured data such as audio, video, and images, making it best suited for structured and semi-structured data.']}, {'end': 15152.681, 'segs': [{'end': 14116.08, 'src': 'embed', 'start': 14033.268, 'weight': 0, 'content': [{'end': 14036.249, 'text': 'It is a similar data set but with only four columns.', 'start': 14033.268, 'duration': 2.981}, {'end': 14046.512, 'text': 'So this has first the exchange name NYSE, second the symbol or ticker, third the date, and the last column is dividend.', 'start': 14036.809, 'duration': 9.703}, {'end': 14048.733, 'text': 'So there are four columns.', 'start': 14047.733, 'duration': 1}, {'end': 14054.255, 'text': 'So these are the two data sets we will be using initially to understand PIG.', 'start': 14049.133, 'duration': 5.122}, {'end': 14060.716, 'text': 'Now if I go to my desktop, Both these data sets are available in my desktop.', 'start': 14054.695, 'duration': 6.021}, {'end': 14061.456, 'text': 'You can see here.', 'start': 14060.756, 'duration': 0.7}, {'end': 14066.079, 'text': 'NYSE daily and NYSE dividends.', 'start': 14062.217, 'duration': 3.862}, {'end': 14069.922, 'text': 'So both the data sets are available on my desktop.', 'start': 14066.82, 'duration': 3.102}, {'end': 14080.69, 'text': 'The first thing you need to understand about PIG is that PIG can run in two modes.', 'start': 14070.262, 'duration': 10.428}, {'end': 14085.173, 'text': 'One, local mode.', 'start': 14082.571, 'duration': 2.602}, {'end': 14091.071, 'text': 'to MapReduceMod.', 'start': 14086.909, 'duration': 4.162}, {'end': 14116.08, 'text': 'In this mod, PIC will read the 
data from local file system, transform it, and store it back to local file system.', 'start': 14092.411, 'duration': 23.669}], 'summary': 'Data sets include nyse daily and nyse dividends with four columns each. pig can run in local or mapreduce mode for data transformation.', 'duration': 82.812, 'max_score': 14033.268, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4014033268.jpg'}, {'end': 14513.861, 'src': 'embed', 'start': 14485.45, 'weight': 1, 'content': [{'end': 14493.751, 'text': 'Now your wife comes to you and say that you know, today evening you have to go out and get something.', 'start': 14485.45, 'duration': 8.301}, {'end': 14498.674, 'text': 'And she creates a list of things that you have to do.', 'start': 14495.672, 'duration': 3.002}, {'end': 14502.956, 'text': 'For example, you have to go to the supermarket and buy something.', 'start': 14499.274, 'duration': 3.682}, {'end': 14508.098, 'text': 'And you have to go to the telephone exchange and pay the bill, maybe.', 'start': 14503.976, 'duration': 4.122}, {'end': 14510.94, 'text': 'You have to go to the milkman and get the milk.', 'start': 14508.659, 'duration': 2.281}, {'end': 14513.861, 'text': 'But will you be doing that now? No.', 'start': 14511.56, 'duration': 2.301}], 'summary': 'Wife assigns errands: supermarket, telephone bill, milkman. unlikely to do now.', 'duration': 28.411, 'max_score': 14485.45, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4014485450.jpg'}], 'start': 14033.268, 'title': 'Pig scripting basics', 'summary': 'Covers the basics of pig scripting, including creating relations, performing grouping, and calculating averages using pig commands. it also demonstrates running pig in local mode and script file execution for efficiency.', 'chapters': [{'end': 14244.686, 'start': 14033.268, 'title': 'Understanding pig modes and data sets', 'summary': 'Introduces pig, explaining its two modes - local and mapreduce - and the use of two specific data sets: nyse daily and nyse dividends.', 'duration': 211.418, 'highlights': ['PIG can run in two modes: local mode and MapReduce mode. PIG can run in two modes: local mode and MapReduce mode, with local mode used for testing and MapReduce mode for production.', "Explanation of PIG's local mode. Local mode is used for testing PIG scripts by reading data from the local file system, transforming it, and storing it back onto the local file system, without interacting with Hadoop.", 'Overview of data sets NYSE daily and NYSE dividends. 
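The two launch modes described above, as they are typically invoked (the script name is a placeholder):

    pig -x local              # local mode: read from and write to the local file system, handy for testing
    pig                       # MapReduce mode (the default): read from and write to HDFS, runs MapReduce jobs
    pig -x local average.pig  # run a saved script file instead of typing commands at the grunt shell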
Introduction of the two data sets, NYSE daily and NYSE dividends, with the former containing exchange, symbol, date, and dividend columns.']}, {'end': 14484.709, 'start': 14245.306, 'title': 'Introduction to pig scripting', 'summary': 'Introduces pig scripting by explaining the pig prompt and the process of declaring a relation, using the example of loading data and creating a relation in pig.', 'duration': 239.403, 'highlights': ['The chapter introduces the concept of the PIG prompt and demonstrates how to write a PIG command.', 'It explains the process of declaring a relation in PIG, emphasizing the use of the equal to symbol to signify a relation.', "The chapter details the process of loading data in PIG, highlighting the creation of a pointer to the file using the 'load' operator.", 'It emphasizes the significance of starting with a relation in PIG scripting and the necessity of loading data before working with PIG.']}, {'end': 14906.119, 'start': 14485.45, 'title': 'Pig scripting basics', 'summary': 'Explains the basics of pig scripting, where it illustrates how to create relations, perform grouping and calculate averages using pig commands. it also demonstrates the option of running pig in local mode or running a script file for more efficient execution.', 'duration': 420.669, 'highlights': ["Pig scripting involves creating relations, performing grouping, and calculating averages using Pig commands. The chapter explains how to create a relation called 'dividends', perform grouping on it, calculate the average, and then display the result using the 'dump' command.", 'The option of running Pig in local mode or running a script file for efficient execution is demonstrated. The chapter demonstrates the option of working interactively with Pig by launching the shell and typing commands line by line, or creating a Pig script file to run multiple commands at once for more efficient execution.', "Pig can work in the local mode, allowing users to run scripts from their local machines. The chapter explains that Pig can work in the local mode, meaning the file will be read from the user's desktop, and demonstrates running Pig in the local mode using the command 'pig -x local'."]}, {'end': 15152.681, 'start': 14908.04, 'title': 'Storing data with pig in local and mapreduce mode', 'summary': 'Explains how to store results in a local folder or in hadoop using pig in both local and mapreduce modes, providing examples of commands and file creation.', 'duration': 244.641, 'highlights': ["PIG allows storing results in a folder on the local machine by using 'store' command followed by the folder name, enabling easy access and retrieval of results. By executing the 'store' command with a specified folder name, such as 'April 8', users can store the result in a local directory, allowing easy access and retrieval of the data.", 'Demonstrates how to run Pig in MapReduce mode, showcasing the process of reading data from HDFS, performing transformations, and storing data back in Hadoop, with the use of appropriate commands and Hadoop paths. The transcript demonstrates the process of running Pig in MapReduce mode, including steps such as reading data from HDFS, performing transformations, and storing data back in Hadoop by specifying Hadoop paths and using appropriate commands.', 'Illustrates the creation of a MapReduce job for storing files in Hadoop, providing insights into the process by displaying the creation of jar files, initiation of MapReduce job, and its completion status. 
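A minimal version of the dividend-averaging flow described above; the file name and the tab-separated, four-column layout follow the transcript's description.

    -- dividends data: exchange, symbol, date, dividend
    dividends = LOAD 'NYSE_dividends'
                    AS (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
    grouped   = GROUP dividends BY symbol;
    avg_div   = FOREACH grouped GENERATE group AS symbol, AVG(dividends.dividend) AS avg_dividend;
    DUMP avg_div;                               -- print to the screen, or instead:
    -- STORE avg_div INTO 'average_dividend';   -- write the result to a folder (local or HDFS, depending on the mode)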
The transcript provides insights into the creation of a MapReduce job for storing files in Hadoop, displaying the process through the creation of jar files, initiation of MapReduce jobs, and the display of job completion status.']}], 'duration': 1119.413, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4014033268.jpg', 'highlights': ['PIG can run in two modes: local mode and MapReduce mode, with local mode used for testing and MapReduce mode for production.', 'Pig scripting involves creating relations, performing grouping, and calculating averages using Pig commands.', 'The option of running Pig in local mode or running a script file for efficient execution is demonstrated.', "PIG allows storing results in a folder on the local machine by using 'store' command followed by the folder name, enabling easy access and retrieval of results.", 'Demonstrates how to run Pig in MapReduce mode, showcasing the process of reading data from HDFS, performing transformations, and storing data back in Hadoop, with the use of appropriate commands and Hadoop paths.']}, {'end': 18488.914, 'segs': [{'end': 15553.088, 'src': 'embed', 'start': 15524.155, 'weight': 6, 'content': [{'end': 15532.631, 'text': "but basically what I'm asking, Pig, is that hey Pig, look at this thing called daily, and it has eight columns.", 'start': 15524.155, 'duration': 8.476}, {'end': 15535.934, 'text': "You have not given any column name, that's fine, but it has eight column.", 'start': 15532.671, 'duration': 3.263}, {'end': 15544.381, 'text': 'So obviously when you load the data, if the data is tab separated, PIG can read.', 'start': 15536.774, 'duration': 7.607}, {'end': 15549.986, 'text': 'So PIG expects the data which is tab separated.', 'start': 15545.462, 'duration': 4.524}, {'end': 15553.088, 'text': 'So every column should be separated by tab.', 'start': 15550.486, 'duration': 2.602}], 'summary': 'Pig can read data with eight tab-separated columns.', 'duration': 28.933, 'max_score': 15524.155, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4015524155.jpg'}, {'end': 15995.521, 'src': 'embed', 'start': 15961.328, 'weight': 3, 'content': [{'end': 15965.55, 'text': 'So either you can use the store operator or the dump operator.', 'start': 15961.328, 'duration': 4.222}, {'end': 15968.451, 'text': 'Dump operator will show on the screen which is not so good.', 'start': 15965.59, 'duration': 2.861}, {'end': 15974.053, 'text': "If you're having like one million lines, it's gonna throw all the one million line on your screen.", 'start': 15968.891, 'duration': 5.162}, {'end': 15977.215, 'text': "Rather you say store, it's gonna store someplace.", 'start': 15974.674, 'duration': 2.541}, {'end': 15980.796, 'text': 'In Hadoop you can give any folder name which you have access to.', 'start': 15977.955, 'duration': 2.841}, {'end': 15995.521, 'text': 'relational operators, okay? We have something called relational operators, right? 
So now let me show you a couple of examples for this.', 'start': 15983.132, 'duration': 12.389}], 'summary': 'Choose store operator over dump for large data; hadoop allows folder naming; use relational operators.', 'duration': 34.193, 'max_score': 15961.328, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4015961328.jpg'}, {'end': 17277.335, 'src': 'embed', 'start': 17247.27, 'weight': 0, 'content': [{'end': 17260.946, 'text': 'This is the script which has this word count dot peg, and all I need to give is the file name, and the file which will be analyzed is this file.', 'start': 17247.27, 'duration': 13.676}, {'end': 17263.287, 'text': 'This is the file which will be analyzed.', 'start': 17261.486, 'duration': 1.801}, {'end': 17270.351, 'text': 'Mary had a little lamb, its fleece was white as the snow, and everywhere that Mary went, lamb was sure to go.', 'start': 17263.747, 'duration': 6.604}, {'end': 17277.335, 'text': 'This is a nursery rhyme, right? So we will do the word count on Mary had a little lamb.', 'start': 17271.011, 'duration': 6.324}], 'summary': "Analyzing a file with word count tool, focusing on 'mary had a little lamb.'", 'duration': 30.065, 'max_score': 17247.27, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4017247270.jpg'}, {'end': 17963.863, 'src': 'embed', 'start': 17930.547, 'weight': 2, 'content': [{'end': 17946.131, 'text': 'So trim is an operator which you can use in PIG basically to trim and then you can say substring from this column called data six to 14, 38 to 45,', 'start': 17930.547, 'duration': 15.584}, {'end': 17948.013, 'text': '46 to 53, so what does this mean?', 'start': 17946.131, 'duration': 1.882}, {'end': 17953.516, 'text': 'So, basically, I want to extract character position six to 14..', 'start': 17948.453, 'duration': 5.063}, {'end': 17957.579, 'text': 'So where is that? 
Six to 14.', 'start': 17953.516, 'duration': 4.063}, {'end': 17961.101, 'text': 'Zero, one, two, three, four, five, six.', 'start': 17957.579, 'duration': 3.522}, {'end': 17963.863, 'text': 'Six to 14 will be this data.', 'start': 17961.722, 'duration': 2.141}], 'summary': "Using the pig operator 'trim' to extract character positions 6 to 14 from a column.", 'duration': 33.316, 'max_score': 17930.547, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4017930547.jpg'}], 'start': 15155.982, 'title': 'Pig data processing', 'summary': 'Explains running pig in local and mapreduce mode, data processing basics, operators and data transformation, word count analysis, and weather data analysis using pig, covering key concepts and examples.', 'chapters': [{'end': 15222.91, 'start': 15155.982, 'title': 'Running pig in local and mapreduce mode', 'summary': 'Explains the process of running pig in both local and mapreduce mode, highlighting the key difference in input and output path specifications for each mode.', 'duration': 66.928, 'highlights': ['Running Pig commands in local mode and MapReduce mode is the same, with the only difference being the specification of input and output paths, which are on Linux in local mode and on Hadoop in MapReduce mode.', 'In local mode, the input and output paths are specified from Linux, while in MapReduce mode, the input and output paths are specified from Hadoop.']}, {'end': 15980.796, 'start': 15223.33, 'title': 'Pig data processing basics', 'summary': 'Covers the basic operations of loading data into pig, including explicit data type declaration and the load operator, and how pig can transform data even without a declared schema, with examples demonstrating loading tab-separated and comma-separated files.', 'duration': 757.466, 'highlights': ["Pig can transform data without a declared schema The speaker demonstrates loading data without mentioning any schema and then transforming it using Pig, showcasing the tool's capability to work with data even without schema declaration.", 'Explanation of load operator and schema declaration in Pig The explanation of the load operator in Pig is provided, including loading tab-separated and comma-separated data, and the importance of schema declaration for Pig to work efficiently.', "Demonstration of explicit data type declaration in Pig The speaker showcases explicit data type declaration, demonstrating how to inform Pig about the data type of each column, and the use of the 'describe' command to view the data type of a relation."]}, {'end': 17189.743, 'start': 15983.132, 'title': 'Pig operators and data transformation', 'summary': 'Introduces pig operators like foreach, filter, grouping, order by, join, and limit, showcasing their functionalities in data transformation and analysis, including examples of applying operations and explaining the results.', 'duration': 1206.611, 'highlights': ['The chapter introduces PIG operators like foreach, filter, grouping, order by, join, and limit, showcasing their functionalities in data transformation and analysis. The transcript covers the introduction of important PIG operators for data transformation and analysis, including examples of applying operations and explaining the results.', 'The foreach operator is demonstrated for iterating through records and applying operations, such as subtracting columns and extracting specific columns. 
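A few of the relational operators covered above, applied to the NYSE daily set; the column names, types and tab delimiter here follow the transcript's description and are assumptions.

    daily    = LOAD 'NYSE_daily' USING PigStorage('\t')
                   AS (exchange:chararray, symbol:chararray, date:chararray,
                       open:float, high:float, low:float, close:float, volume:int);
    DESCRIBE daily;                                          -- show the declared schema
    spread   = FOREACH daily GENERATE symbol, date, high - low AS day_range;   -- derive a column
    big_days = FILTER spread BY day_range > 5.0;             -- keep only large intraday moves
    by_range = ORDER big_days BY day_range DESC;             -- sort descending
    top10    = LIMIT by_range 10;                            -- keep the first ten records
    DUMP top10;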
The foreach operator is explained for iterating through records, applying operations like subtracting columns and extracting specific columns, simplifying data transformation.', 'The filter operator is explained for filtering records based on conditions like greater than, less than, or matching with regular expressions. The filter operator is elaborated for filtering records based on conditions like greater than, less than, or matching with regular expressions, facilitating data analysis.', 'The grouping operator is demonstrated for grouping records by specific columns and generating key-value pairs, enabling aggregation and analysis. The grouping operator is showcased for grouping records by specific columns, generating key-value pairs, and enabling aggregation and analysis, providing insights into data sets.', 'The order by operator is explained for sorting data based on specific columns in ascending or descending order, including multi-column ordering. The order by operator is detailed for sorting data based on specific columns in ascending or descending order, with the option for multi-column ordering, aiding in data arrangement.', 'The join operator is demonstrated for joining data sets based on single or multi-column joins, showcasing different types of joins like left outer, right outer, and full outer. The join operator is demonstrated for joining data sets based on single or multi-column joins, and showcasing different types of joins like left outer, right outer, and full outer, demonstrating versatile data merging capabilities.', 'The limit operator is explained for limiting the number of records to be displayed, providing a concise view of the data set. The limit operator is explained for limiting the number of records to be displayed, providing a concise view of the data set, simplifying data analysis.']}, {'end': 17780.099, 'start': 17189.743, 'title': 'Pig script for word count analysis', 'summary': 'Introduces the concept of using pig to perform word count analysis, demonstrating how to load and transform data, tokenize and flatten text, group words, and count word occurrences, resulting in a simplified and efficient approach compared to traditional methods like java mapreduce.', 'duration': 590.356, 'highlights': ['The chapter introduces the concept of using Pig to perform word count analysis Demonstrating the process of using Pig to perform word count analysis, showcasing the advantages of utilizing Pig for data analysis tasks.', 'demonstrating how to load and transform data, tokenize and flatten text Explaining the process of loading and transforming data, tokenizing and flattening text to extract individual words for analysis, showcasing the initial steps for word count analysis.', 'group words, and count word occurrences Highlighting the process of grouping words and counting word occurrences, demonstrating the essential steps for word count analysis and data aggregation.', 'resulting in a simplified and efficient approach compared to traditional methods like Java MapReduce Emphasizing the efficiency and simplicity of using Pig for word count analysis compared to traditional complex methods such as Java MapReduce, showcasing the benefits of using Pig for data analysis tasks.']}, {'end': 18488.914, 'start': 17780.239, 'title': 'Weather data analysis using pig', 'summary': 'Discusses using pig to analyze unstructured weather data, including extracting temperature readings, filtering for hot and cold days, and finding the hottest and coldest day based on specific 
temperature conditions.', 'duration': 708.675, 'highlights': ['Using PIG to extract temperature readings and structure unstructured weather data.', 'Filtering for hot and cold days based on specific temperature conditions.', 'Finding the hottest and coldest day by grouping and filtering temperature data.']}], 'duration': 3332.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4015155982.jpg', 'highlights': ['Running Pig commands in local mode and MapReduce mode is the same, with the only difference being the specification of input and output paths, which are on Linux in local mode and on Hadoop in MapReduce mode.', 'In local mode, the input and output paths are specified from Linux, while in MapReduce mode, the input and output paths are specified from Hadoop.', "Pig can transform data without a declared schema The speaker demonstrates loading data without mentioning any schema and then transforming it using Pig, showcasing the tool's capability to work with data even without schema declaration.", 'The chapter introduces PIG operators like foreach, filter, grouping, order by, join, and limit, showcasing their functionalities in data transformation and analysis.', 'The chapter introduces the concept of using Pig to perform word count analysis Demonstrating the process of using Pig to perform word count analysis, showcasing the advantages of utilizing Pig for data analysis tasks.', 'Using PIG to extract temperature readings and structure unstructured weather data.', 'Filtering for hot and cold days based on specific temperature conditions.', 'Finding the hottest and coldest day by grouping and filtering temperature data.']}, {'end': 19375.98, 'segs': [{'end': 19026.509, 'src': 'embed', 'start': 18976.164, 'weight': 0, 'content': [{'end': 18984.826, 'text': 'There is something new in the market called Hadoop, and if you implement the Hadoop cluster, you can store unlimited data practically,', 'start': 18976.164, 'duration': 8.662}, {'end': 18987.267, 'text': 'because Hadoop is naturally the solution for big data right?', 'start': 18984.826, 'duration': 2.441}, {'end': 18992.929, 'text': 'So Facebook got interested in this idea and they immediately implemented a Hadoop cluster.', 'start': 18987.647, 'duration': 5.282}, {'end': 18999.01, 'text': 'So practically speaking, the entire data that Facebook was storing and analyzing, they moved into Hadoop.', 'start': 18993.169, 'duration': 5.841}, {'end': 19007.675, 'text': 'But what was the problem? Back in 2006, we were at Hadoop version one, or the old Hadoop, or the original Hadoop.', 'start': 18999.791, 'duration': 7.884}, {'end': 19013.239, 'text': 'And you know what? The original Hadoop had only MapReduce.', 'start': 19008.155, 'duration': 5.084}, {'end': 19022.606, 'text': 'That means if you want to interact with the data in a Hadoop cluster, the only way was that you must write a MapReduce program.', 'start': 19013.739, 'duration': 8.867}, {'end': 19026.509, 'text': 'Right? 
You have to write a MapReduce program.', 'start': 19024.227, 'duration': 2.282}], 'summary': 'Hadoop allowed facebook to store and analyze unlimited data, but faced limitations with only mapreduce in version one.', 'duration': 50.345, 'max_score': 18976.164, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4018976164.jpg'}, {'end': 19112.007, 'src': 'embed', 'start': 19086.458, 'weight': 3, 'content': [{'end': 19091.643, 'text': 'you have to write your program, you have to compile it, create a jar file, deploy it.', 'start': 19086.458, 'duration': 5.185}, {'end': 19093.193, 'text': "It's not easy.", 'start': 19092.552, 'duration': 0.641}, {'end': 19099.237, 'text': "If you're a SQL developer, trust me, learning Java is not going to be so fun for you right?", 'start': 19093.713, 'duration': 5.524}, {'end': 19104.961, 'text': 'With SQL developers, we tend to be more towards the SQL side of the spectrum.', 'start': 19099.758, 'duration': 5.203}, {'end': 19112.007, 'text': "We don't really are not programmers, right? So all of a sudden, all the employees in Facebook started complaining.", 'start': 19105.062, 'duration': 6.945}], 'summary': 'Learning java can be challenging for sql developers, as seen at facebook.', 'duration': 25.549, 'max_score': 19086.458, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4019086458.jpg'}, {'end': 19157.168, 'src': 'embed', 'start': 19136.141, 'weight': 1, 'content': [{'end': 19146.647, 'text': "The alternative was to write a MapReduce program for a SQL query, and that's practically madness, because for a simple select, count star query,", 'start': 19136.141, 'duration': 10.506}, {'end': 19149.889, 'text': 'you have to write 100 lines of code in Java and compile it.', 'start': 19146.647, 'duration': 3.242}, {'end': 19150.609, 'text': 'and imagine 70,000 queries.', 'start': 19149.889, 'duration': 0.72}, {'end': 19157.168, 'text': 'So what Facebook did is that they thought, all right, we need a solution for this.', 'start': 19152.125, 'duration': 5.043}], 'summary': 'Writing a mapreduce program for a sql query required 100 lines of code in java, and facebook needed a solution for 70,000 queries.', 'duration': 21.027, 'max_score': 19136.141, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4019136141.jpg'}], 'start': 18489.354, 'title': 'Oltp, rdbms, data warehousing, and hadoop integration', 'summary': 'Covers the usage of multiple databases in organizations, highlighting mysql, mssql, and oracle, and discusses integrating data warehousing with hadoop, the challenges faced by facebook in handling big data, and the emergence of hive as a cost-effective alternative to traditional data warehousing solutions.', 'chapters': [{'end': 18563.027, 'start': 18489.354, 'title': 'Understanding oltp systems and rdbms', 'summary': 'Explains the concept of oltp systems and rdbms, highlighting the usage of multiple databases in organizations and the purpose of each, with a focus on mysql, mssql, and oracle.', 'duration': 73.673, 'highlights': ['OLTP systems, also known as RDBMS, are used by organizations for online transaction processing, with multiple databases like MySQL, MSSQL, and Oracle serving specific purposes such as transactional data, product catalogs, and other functionalities.', 'Organizations may utilize multiple OLTP systems deployed across the world, each serving specific purposes, leading to the usage 
of multiple databases like Oracle, MySQL, and Microsoft SQL.']}, {'end': 19375.98, 'start': 18563.348, 'title': 'Data warehousing and hadoop integration', 'summary': 'Discusses integrating data warehousing with hadoop, the challenges faced by facebook in handling big data, and the emergence of hive as a sql interface on hadoop, providing a cost-effective alternative to traditional data warehousing solutions.', 'duration': 812.632, 'highlights': ['Hadoop cluster as a cost-effective solution for big data storage and analysis Facebook implemented a Hadoop cluster to store and analyze unlimited data, providing a cost-effective solution for big data storage and analysis compared to traditional data warehousing solutions.', 'Challenges faced by Facebook in accessing and querying data in Hadoop Facebook faced challenges in accessing and querying data in Hadoop, as it required writing MapReduce programs in Java, leading to the development of Hive as a SQL interface to access structured data in Hadoop.', 'Introduction of Hive as a SQL interface on Hadoop, providing a cost-effective alternative to traditional data warehousing solutions Hive was created to provide a SQL interface on Hadoop, offering a cost-effective alternative to traditional data warehousing solutions by allowing users to write SQL queries and access structured data without the need for Java or MapReduce programming.']}], 'duration': 886.626, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4018489354.jpg', 'highlights': ['Organizations use multiple OLTP systems like MySQL, MSSQL, and Oracle for specific purposes.', 'Hadoop cluster is a cost-effective solution for big data storage and analysis.', 'Facebook faced challenges in accessing and querying data in Hadoop, leading to the development of Hive as a SQL interface.', 'Hive provides a cost-effective alternative to traditional data warehousing solutions by allowing users to write SQL queries and access structured data.']}, {'end': 22014.534, 'segs': [{'end': 19636.833, 'src': 'embed', 'start': 19609.964, 'weight': 7, 'content': [{'end': 19617.487, 'text': "and you're a SQL developer, and you know that there is structured data in Hadoop, you want to create a table out of the data and do the queries.", 'start': 19609.964, 'duration': 7.523}, {'end': 19623.569, 'text': 'So you install Hive on your laptop, say create a table, load the data, write your query, hit enter.', 'start': 19617.547, 'duration': 6.022}, {'end': 19628.61, 'text': 'Hive will automatically write the equivalent MapReduce program.', 'start': 19623.569, 'duration': 5.041}, {'end': 19630.851, 'text': 'create a jar file, send it to the cluster.', 'start': 19628.61, 'duration': 2.241}, {'end': 19636.833, 'text': "So from the point of view of the Hadoop cluster, it is a regular MapReduce program, it's not Hive.", 'start': 19631.171, 'duration': 5.662}], 'summary': 'Sql developer uses hive to process structured data in hadoop, automatically generating mapreduce program.', 'duration': 26.869, 'max_score': 19609.964, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4019609964.jpg'}, {'end': 20118.505, 'src': 'embed', 'start': 20087.964, 'weight': 0, 'content': [{'end': 20088.864, 'text': 'That is how it works.', 'start': 20087.964, 'duration': 0.9}, {'end': 20093.125, 'text': 'So the next point is that, now talking about the real world.', 'start': 20090.304, 'duration': 2.821}, {'end': 20096.766, 'text': "Okay, so 
let's suppose you learn Hive.", 'start': 20094.125, 'duration': 2.641}, {'end': 20101.218, 'text': 'in our IntelliPath course, and you master Hive.', 'start': 20098.057, 'duration': 3.161}, {'end': 20105.64, 'text': 'And what did I teach you? I taught you that Hive is a SQL interface to Hadoop.', 'start': 20102.019, 'duration': 3.621}, {'end': 20108.161, 'text': 'Fine, you understand that right?', 'start': 20106.22, 'duration': 1.941}, {'end': 20111.122, 'text': 'And then you go to a real world project, okay?', 'start': 20108.521, 'duration': 2.601}, {'end': 20114.563, 'text': 'And then you go to the project and say that you know what.', 'start': 20111.602, 'duration': 2.961}, {'end': 20115.624, 'text': "I'm an expert in Hive.", 'start': 20114.563, 'duration': 1.061}, {'end': 20118.505, 'text': 'An expert in Hive.', 'start': 20117.544, 'duration': 0.961}], 'summary': 'Learning hive in intellipath course prepares you for real-world projects as an expert in hive, a sql interface to hadoop.', 'duration': 30.541, 'max_score': 20087.964, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4020087964.jpg'}, {'end': 20327.931, 'src': 'embed', 'start': 20300.248, 'weight': 2, 'content': [{'end': 20304.251, 'text': 'Well, Taze is created to make MapReduce faster.', 'start': 20300.248, 'duration': 4.003}, {'end': 20307.313, 'text': "So I don't want to get into details of Taze.", 'start': 20304.691, 'duration': 2.622}, {'end': 20309.615, 'text': "I'm just giving you some extra information.", 'start': 20307.653, 'duration': 1.962}, {'end': 20317.721, 'text': 'For the time being, understand that Taze is a framework which is built to overcome the problems in MapReduce.', 'start': 20309.995, 'duration': 7.726}, {'end': 20319.382, 'text': 'MapReduce is ideally slow.', 'start': 20317.761, 'duration': 1.621}, {'end': 20321.604, 'text': 'MapReduce is normally slow, right?', 'start': 20319.862, 'duration': 1.742}, {'end': 20327.931, 'text': 'So some guys build something called Tez and Tez also uses mappers and reducers and all.', 'start': 20321.884, 'duration': 6.047}], 'summary': 'Taze speeds up mapreduce, addressing its slowness with tez framework.', 'duration': 27.683, 'max_score': 20300.248, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4020300248.jpg'}, {'end': 20390.301, 'src': 'embed', 'start': 20365.516, 'weight': 6, 'content': [{'end': 20373.058, 'text': 'Even though Tez is Apache open source, Hortonworks says they promote Tez and they say that their queries are faster.', 'start': 20365.516, 'duration': 7.542}, {'end': 20378.679, 'text': 'So your first Avenger original Hive is damn slow.', 'start': 20373.598, 'duration': 5.081}, {'end': 20385.7, 'text': 'Now Hive plus Tez, this guy is interactive.', 'start': 20381.138, 'duration': 4.562}, {'end': 20390.301, 'text': 'Interactive query means it is faster but not real time.', 'start': 20386.6, 'duration': 3.701}], 'summary': 'Hortonworks promotes tez for faster queries, making hive plus tez interactive and faster but not real time.', 'duration': 24.785, 'max_score': 20365.516, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4020365516.jpg'}, {'end': 20885.956, 'src': 'embed', 'start': 20852.974, 'weight': 5, 'content': [{'end': 20857.796, 'text': 'The basic difference is that it also allows JDBC ODBC but it provides concurrency.', 'start': 20852.974, 'duration': 4.822}, {'end': 20860.618, 
'text': 'That means multiple JDBC ODBC connections are possible.', 'start': 20857.816, 'duration': 2.802}, {'end': 20868.421, 'text': 'You can work with Hive but they have redesigned the CLI of Hive and the new CLI is called Beeline.', 'start': 20860.638, 'duration': 7.783}, {'end': 20873.163, 'text': 'Still it supports the old CLI and the new CLI is called Beeline.', 'start': 20868.901, 'duration': 4.262}, {'end': 20875.944, 'text': 'Beeline is actually a client which you can install.', 'start': 20873.523, 'duration': 2.421}, {'end': 20877.668, 'text': 'like SQL client.', 'start': 20876.747, 'duration': 0.921}, {'end': 20885.956, 'text': 'So I can go to a computer, install Beeline client, from there I can make a connection request to the Hive server too and start working in the Hive.', 'start': 20878.108, 'duration': 7.848}], 'summary': 'Beeline allows multiple jdbc odbc connections and provides hive cli redesign with a new cli called beeline.', 'duration': 32.982, 'max_score': 20852.974, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4020852974.jpg'}, {'end': 21487.194, 'src': 'embed', 'start': 21457.019, 'weight': 1, 'content': [{'end': 21459.841, 'text': 'Entire Hive package always on your machine.', 'start': 21457.019, 'duration': 2.822}, {'end': 21464.745, 'text': 'Hive server, Hive CLI, whatever you call it, anything starting with Hive is on your computer.', 'start': 21460.342, 'duration': 4.403}, {'end': 21467.106, 'text': 'Nothing is there in the cluster.', 'start': 21465.585, 'duration': 1.521}, {'end': 21474.685, 'text': 'Are you getting my point? Why, why? Because either you or only your friend will be accessing this.', 'start': 21468.307, 'duration': 6.378}, {'end': 21479.929, 'text': "And you don't need a separate client server and all, right? 
Because it's a very small cluster.", 'start': 21475.306, 'duration': 4.623}, {'end': 21487.194, 'text': 'So you install all the package on your computer and then you can also install SQL client on your computer and create a JDBC connection.', 'start': 21480.189, 'duration': 7.005}], 'summary': 'Entire hive package is on your machine, no need for separate client server in a small cluster.', 'duration': 30.175, 'max_score': 21457.019, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4021457019.jpg'}], 'start': 19376.943, 'title': 'Hive in hadoop ecosystem', 'summary': 'Covers the role of hive in hadoop ecosystem, its limitations, use cases, evolution, impact on real world projects, functionalities of hive server 1 and hive server 2, and its usage in small and large hadoop clusters, including practical scenarios.', 'chapters': [{'end': 19755.321, 'start': 19376.943, 'title': 'Understanding hive in hadoop ecosystem', 'summary': 'Explains the role of hive as a tool in the hadoop ecosystem, highlighting its installation on top of hadoop, its role as a client-side application, and its function in providing a projection of structured data using sql-like language called hiveql, which can be used for querying and analysis.', 'duration': 378.378, 'highlights': ['Hive Installation and Function Hive is installed on top of Hadoop, does not have its own storage, and provides a projection of structured data using SQL-like language called HiveQL, which can be used for querying and analysis.', 'Hive as a Client-side Application Hive is a client-side application, installed on the laptop, and it automatically writes the equivalent MapReduce program, creating a jar file and sending it to the Hadoop cluster.', 'Advantages of Using Hive Hive can be used as an ETL tool, provides capability of querying and analysis, and is used by the analyst community to handle large datasets and perform SQL queries on top of map and reduce.']}, {'end': 20087.704, 'start': 19755.841, 'title': 'Hive: pros, cons, and use cases', 'summary': "Discusses the limitations and use cases of hive, highlighting its slow performance due to its mapreduce nature, unsuitability for small data, the need for structured data, and the inability to replace rdbms systems. it also emphasizes hive's use case for handling huge structured data sets, its similarity to sql, and the necessity of understanding the hadoop framework to write mapreduce.", 'duration': 331.863, 'highlights': ['Hive should not be used if the data does not cross gigabytes, as it is expected to be slow due to its MapReduce nature. Hive is slow and unsuitable for small data as queries can take hours, emphasizing the need for big data to justify its use.', 'Hive is specifically for huge amount of structured data, such as a table size of three terabytes, where it can be useful for querying the table despite taking time. Hive is suitable for handling huge structured data sets, providing a use case for querying large tables despite the slow performance.', 'Hive works with structured data like comma separated values, space separated values, column separated value, JSON files, XML files, and semi-structured data, but cannot handle raw free form text data without structure. 
Hive requires structured data such as CSV, JSON, XML, or semi-structured data and is not suitable for raw free form text data without structure.', 'Hive is not a replacement for RDBMS systems, as it cannot provide real-time solutions and should not be used if RDBMS can solve most problems. Hive is not a substitute for RDBMS systems and should not be used if RDBMS can solve most problems, emphasizing its limitations in real-time solutions.', 'Understanding the Hadoop framework is necessary to write MapReduce, while SQL engineers can quickly write Hive scripts without the need to learn Java, Python, Ruby, or C Sharp for MapReduce. Knowledge of the Hadoop framework is essential for writing MapReduce, while SQL engineers can easily write Hive scripts without learning additional programming languages.']}, {'end': 20668.657, 'start': 20087.964, 'title': 'Evolution of hive and its impact on real world projects', 'summary': 'Discusses the evolution of hive, including its initial success, challenges with slow queries, and the introduction of faster alternatives such as hive plus tez, impala, spark sql, and phoenix, impacting real world hadoop projects.', 'duration': 580.693, 'highlights': ["Hive plus Tez is faster than original Hive as it uses Tez framework to overcome MapReduce's slowness, enhancing query performance. Hive plus Tez is faster than original Hive as it uses Tez framework to overcome MapReduce's slowness, enhancing query performance.", 'Impala, developed by Cloudera, provides an interactive SQL interface and is promoted as a faster alternative to Hive plus Tez. Impala, developed by Cloudera, provides an interactive SQL interface and is promoted as a faster alternative to Hive plus Tez.', 'Spark SQL offers a real-time SQL interface on top of the Spark framework, providing faster query execution compared to Hive plus Tez and Impala. Spark SQL offers a real-time SQL interface on top of the Spark framework, providing faster query execution compared to Hive plus Tez and Impala.', 'Phoenix serves as a SQL interface for NoSQL databases, specifically HBase, facilitating SQL queries on HBase data. Phoenix serves as a SQL interface for NoSQL databases, specifically HBase, facilitating SQL queries on HBase data.', 'The original Hive, known as pure Hive, functions as a client-side application with an interactive shell, allowing developers to execute commands and queries without a server. The original Hive, known as pure Hive, functions as a client-side application with an interactive shell, allowing developers to execute commands and queries without a server.']}, {'end': 21382.613, 'start': 20669.378, 'title': 'Understanding hive server 1 and hive server 2', 'summary': 'Explores the functionalities of hive server 1 and hive server 2, highlighting that hive server 1 allows jdbc or odbc connections but lacks concurrency and consistency, whereas hive server 2 provides concurrency, allows jdbc odbc connections, and introduces a new client-side command line tool called beeline.', 'duration': 713.235, 'highlights': ['Hive Server 2 provides concurrency and allows JDBC ODBC connections. Hive Server 2 allows multiple JDBC ODBC connections, addressing the lack of concurrency and consistency in Hive Server 1.', 'Hive Server 1 allows JDBC or ODBC connections but lacks concurrency and consistency. Hive Server 1 lacks concurrency and consistency, making it impossible to handle multiple user sessions effectively.', 'Introduction of Beeline as a new Client-side command line tool in Hive Server 2. 
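As a hedged sketch of the Beeline workflow described here: Beeline starts as a client and must connect to HiveServer2 over JDBC before any query returns results. The host and port below are the conventional defaults and may differ per cluster.

```sql
-- Start the Beeline client, then connect to HiveServer2 over JDBC.
-- $ beeline
-- beeline> !connect jdbc:hive2://localhost:10000
-- (supply credentials if the server requires them)
SHOW DATABASES;
SHOW TABLES;
```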
Hive Server 2 introduces Beeline, a new Client-side command line tool, providing users with an alternative to the traditional CLI and enabling improved interactions with Hive.', 'Derby database used as the default embedded metastore in Hive, limiting to one connection at a time. The default embedded metastore in Hive uses Derby database, which restricts access to only one connection at a time, leading to the recommendation of configuring a separate database like MySQL for metadata storage in production environments.', "Hive allows updates in the latest version and can refer to the position of the block in the data node. The latest version of Hive supports updates and can refer to the position of the block in the data node, offering improved functionality compared to earlier versions and demonstrating compatibility with Hadoop's write once read many system."]}, {'end': 22014.534, 'start': 21383.073, 'title': 'Hive usage in small and large hadoop clusters', 'summary': 'Explains the usage of hive in small and large hadoop clusters, highlighting the installation and access patterns and also presents a practical scenario of analyzing customer transaction data in hive.', 'duration': 631.461, 'highlights': ["In small Hadoop clusters with a limited number of data nodes, the entire Hive package is installed on the user's computer, eliminating the need for a separate server, making it suitable for proof of concept or non-production workloads. Small Hadoop clusters with 3-4 data nodes are accessed by a few users, leading to installation of the entire Hive package on the user's computer, making a separate server unnecessary.", 'In contrast, large Hadoop clusters with thousands of data nodes utilize a gateway node, acting as an intermediate server housing all client packages, allowing users to connect and work with the cluster through this node. Large Hadoop clusters utilize a gateway node as an intermediate server for connecting and working with the cluster, housing all client packages and prohibiting direct access to the cluster.', 'The chapter also presents a practical scenario of analyzing customer transaction data in Hive, aiming to classify customers into different age groups and analyze their spending patterns. The practical scenario involves analyzing customer transaction data in Hive to classify customers into different age groups and analyze their spending patterns, aiming to understand the spending behavior of different age groups.']}], 'duration': 2637.591, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4019376943.jpg', 'highlights': ['Hive Server 2 provides concurrency and allows JDBC ODBC connections.', "Hive plus Tez is faster than original Hive as it uses Tez framework to overcome MapReduce's slowness, enhancing query performance.", 'Hive Installation and Function Hive is installed on top of Hadoop, does not have its own storage, and provides a projection of structured data using SQL-like language called HiveQL, which can be used for querying and analysis.', 'Hive as a Client-side Application Hive is a client-side application, installed on the laptop, and it automatically writes the equivalent MapReduce program, creating a jar file and sending it to the Hadoop cluster.', 'Hive should not be used if the data does not cross gigabytes, as it is expected to be slow due to its MapReduce nature. 
Hive is slow and unsuitable for small data as queries can take hours, emphasizing the need for big data to justify its use.', "In small Hadoop clusters with a limited number of data nodes, the entire Hive package is installed on the user's computer, eliminating the need for a separate server, making it suitable for proof of concept or non-production workloads.", 'Understanding the Hadoop framework is necessary to write MapReduce, while SQL engineers can quickly write Hive scripts without the need to learn Java, Python, Ruby, or C Sharp for MapReduce.', 'The chapter also presents a practical scenario of analyzing customer transaction data in Hive, aiming to classify customers into different age groups and analyze their spending patterns.']}, {'end': 23351.291, 'segs': [{'end': 22510.37, 'src': 'embed', 'start': 22479.736, 'weight': 2, 'content': [{'end': 22489.557, 'text': 'And if this data was very, very divided into blocks and all and all, just like regular Hadoop file, this is where your data gets stored.', 'start': 22479.736, 'duration': 9.821}, {'end': 22494.999, 'text': 'So I can also show this from command line.', 'start': 22491.818, 'duration': 3.181}, {'end': 22503.183, 'text': 'So if I do warehouse, then del.db, what will I see? I see another folder called transaction records.', 'start': 22495.019, 'duration': 8.164}, {'end': 22510.37, 'text': 'And if I again say dxn records, I will see the file.', 'start': 22505.145, 'duration': 5.225}], 'summary': 'Data is stored in blocks, like hadoop files, in del.db warehouse with transaction records folder and file.', 'duration': 30.634, 'max_score': 22479.736, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4022479736.jpg'}, {'end': 22645.645, 'src': 'embed', 'start': 22615.886, 'weight': 1, 'content': [{'end': 22619.488, 'text': 'The result of the query is 50,000.', 'start': 22615.886, 'duration': 3.602}, {'end': 22629.794, 'text': 'How did I do it? I just wrote my query and hit enter in hive and it automatically launches an equivalent MapReduce job to show me the result.', 'start': 22619.488, 'duration': 10.306}, {'end': 22632.541, 'text': 'and I get the result here.', 'start': 22630.82, 'duration': 1.721}, {'end': 22634.861, 'text': "Now let's analyze this further.", 'start': 22633.081, 'duration': 1.78}, {'end': 22638.382, 'text': "So what I want to do is that I'll be creating one more table.", 'start': 22635.281, 'duration': 3.101}, {'end': 22645.645, 'text': "So I'm gonna create a table called out one.", 'start': 22638.402, 'duration': 7.243}], 'summary': "Query resulted in 50,000; hive launched mapreduce job and created new table 'out one'.", 'duration': 29.759, 'max_score': 22615.886, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4022615886.jpg'}, {'end': 22707.565, 'src': 'embed', 'start': 22675.311, 'weight': 6, 'content': [{'end': 22677.352, 'text': 'So I have created a table already.', 'start': 22675.311, 'duration': 2.041}, {'end': 22679.534, 'text': 'So this table is called out one.', 'start': 22677.392, 'duration': 2.142}, {'end': 22683.876, 'text': 'And how do you do a join operation in Hive? 
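A hedged HiveQL sketch of the join demonstrated next; the out1 table name follows the walkthrough, while the source tables and column names are assumptions.

```sql
-- Join customer details with transaction records and store the result in out1.
INSERT OVERWRITE TABLE out1
SELECT a.custno, a.firstname, a.age, b.amount, b.category
FROM customer a
JOIN txnrecords b
  ON a.custno = b.custno;
```

The syntax mirrors a normal SQL join; Hive compiles it into a MapReduce job behind the scenes.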
Very simple.', 'start': 22680.234, 'duration': 3.642}, {'end': 22688.635, 'text': 'insert overwrite table out one.', 'start': 22686.374, 'duration': 2.261}, {'end': 22691.577, 'text': 'So out one is a new table we have created.', 'start': 22689.096, 'duration': 2.481}, {'end': 22695.419, 'text': 'Select a.customer, a.firstname, a.h.', 'start': 22692.217, 'duration': 3.202}, {'end': 22704.304, 'text': 'So if you look at the syntax of the join operation, it is exactly same as your normal SQL join operation.', 'start': 22695.739, 'duration': 8.565}, {'end': 22707.565, 'text': "If I hit enter, it's gonna launch a MapReduce job.", 'start': 22705.324, 'duration': 2.241}], 'summary': 'Created table out one in hive, performed join operation, launched mapreduce job', 'duration': 32.254, 'max_score': 22675.311, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4022675311.jpg'}, {'end': 23020.438, 'src': 'embed', 'start': 22981.669, 'weight': 0, 'content': [{'end': 22988.415, 'text': 'Now what are these things? By default, any table you create is called a managed table.', 'start': 22981.669, 'duration': 6.746}, {'end': 22994.599, 'text': 'So what is that? How do you know that? If you go to Hive, you can say show tables.', 'start': 22989.338, 'duration': 5.261}, {'end': 22999.56, 'text': 'And we have a table called transaction records.', 'start': 22997.24, 'duration': 2.32}, {'end': 23014.303, 'text': "If I say describe formatted TXN records, if I type this command, describe the table, it says it's a managed table.", 'start': 22999.66, 'duration': 14.643}, {'end': 23020.438, 'text': 'Table type is managed and look at the location of the table.', 'start': 23015.916, 'duration': 4.522}], 'summary': "Default tables in hive are managed tables, such as the 'transaction records' table with a managed type.", 'duration': 38.769, 'max_score': 22981.669, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4022981669.jpg'}, {'end': 23239.6, 'src': 'embed', 'start': 23141.487, 'weight': 3, 'content': [{'end': 23145.968, 'text': 'And if you look at the schema and row formats, they are all same.', 'start': 23141.487, 'duration': 4.481}, {'end': 23150.749, 'text': 'There is no difference in the schema or the format or anything.', 'start': 23146.568, 'duration': 4.181}, {'end': 23161.831, 'text': "Only difference I'm saying location, user, Cloudera, my customer, meaning this is the location in Hadoop.", 'start': 23151.329, 'duration': 10.502}, {'end': 23178.477, 'text': 'How do you know that? So once you create this table, if you go to your Hue, you go to User, Cloudera, there is a table created called My Customer.', 'start': 23163.726, 'duration': 14.751}, {'end': 23184.001, 'text': 'Can you see this? 
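A hedged sketch of the external-table pattern being shown: the LOCATION points at the demo folder (/user/cloudera/my_customer), while the schema itself is an assumption.

```sql
CREATE EXTERNAL TABLE txnrecords2 (
  txnno    INT,
  custno   INT,
  amount   DOUBLE,
  category STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/cloudera/my_customer';

-- The table stays empty until a data file is copied into that folder, e.g.:
--   hdfs dfs -put txns.csv /user/cloudera/my_customer/
SELECT * FROM txnrecords2 LIMIT 20;
```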
This table is just created and there is no data inside this.', 'start': 23178.537, 'duration': 5.464}, {'end': 23191.067, 'text': 'Now what is the advantage is that, so now I created an external table.', 'start': 23185.543, 'duration': 5.524}, {'end': 23200.158, 'text': 'If I do a select star, from this table, PEXN records two, there is no data.', 'start': 23191.187, 'duration': 8.971}, {'end': 23204.502, 'text': "Because I just created a table, I didn't fill it with data.", 'start': 23200.518, 'duration': 3.984}, {'end': 23208.845, 'text': 'If you want to fill it with data, all you need to do is very simple.', 'start': 23204.922, 'duration': 3.923}, {'end': 23216.652, 'text': 'Go to the folder mentioned in the external table, that is my customer, copy the data to that folder.', 'start': 23209.866, 'duration': 6.786}, {'end': 23227.296, 'text': 'Now I just uploaded this data to this my customer folder.', 'start': 23222.915, 'duration': 4.381}, {'end': 23232.658, 'text': 'Can you see this? In the my customer folder in Hadoop, I just uploaded this data.', 'start': 23227.336, 'duration': 5.322}, {'end': 23239.6, 'text': 'Now if I come back here, again I do a select star, do a limit of 20, I have the data.', 'start': 23232.758, 'duration': 6.842}], 'summary': 'Creating external table in hadoop, uploading data, and querying for results.', 'duration': 98.113, 'max_score': 23141.487, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4023141487.jpg'}], 'start': 22014.934, 'title': 'Hive in data warehousing', 'summary': "Explores creating tables in hive, loading data for data warehousing, hive's role in sum calculations, hadoop data storage in hive, hive's data analysis process including loading 50,000 rows, join operations, classifying customers, and summarizing data, as well as the differences and implications of managed vs external tables in hive.", 'chapters': [{'end': 22246.754, 'start': 22014.934, 'title': 'Sum calculation using hive in data warehousing', 'summary': 'Explains the process of creating a table in hive with a specific schema and loading data into it from a local file system for data warehousing, emphasizing the use of hive for sum calculations.', 'duration': 231.82, 'highlights': ['The chapter explains the process of creating a table in Hive with a specific schema and loading data into it from a local file system for data warehousing. Creation of table in Hive with specified schema, loading data from local file system, emphasis on data warehousing', 'Emphasizes the use of Hive for sum calculations. 
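A minimal HiveQL sketch of the create/load/aggregate flow this chapter describes; the schema and the local file path are illustrative assumptions.

```sql
CREATE TABLE txnrecords (
  txnno    INT,
  custno   INT,
  amount   DOUBLE,
  category STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load from the local (Linux) file system into the managed table.
LOAD DATA LOCAL INPATH '/home/cloudera/txns.csv' INTO TABLE txnrecords;

-- Hive compiles each query into an equivalent MapReduce job.
SELECT COUNT(*)    FROM txnrecords;
SELECT SUM(amount) FROM txnrecords;
```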
Use of Hive for sum calculations']}, {'end': 22510.37, 'start': 22248.359, 'title': 'Hadoop data storage in hive', 'summary': 'Explains the behind-the-scenes data storage mechanism in hive, revealing how databases and tables are stored as folders in the warehouse directory in hadoop, with a specific example of how the transaction records table is represented as a folder and file in hdfs.', 'duration': 262.011, 'highlights': ['Whenever you install Hive, it creates a folder called warehouse in Hadoop or HDFS, and all the databases you create are stored inside this warehouse directory.', 'The databases and tables created in Hive are actually represented as folders in Hadoop, with tables being further represented as folders containing the actual data files.', 'The data inserted into a Hive table is copied into a specific location in HDFS, illustrating the underlying process of data storage in Hadoop.']}, {'end': 22944.247, 'start': 22516.157, 'title': 'Hive data analysis process', 'summary': 'Explains the process of loading, querying, joining, classifying, and grouping data in hive, showcasing the use of mapreduce jobs and sql-like syntax to achieve results like loading 50,000 rows, performing join operations, classifying customers, and summarizing data.', 'duration': 428.09, 'highlights': ["The chapter showcases loading 50,000 rows of data into a table in Hive and firing a MapReduce job to get the result. The query 'select count star from transaction records' fires a MapReduce job that loads 50,000 rows of data, showcasing the use of MapReduce jobs in achieving results.", 'It demonstrates performing a join operation in Hive and storing the results in a new table using SQL-like syntax, launching a MapReduce job in the process. The process of performing a join operation in Hive using SQL-like syntax and launching a MapReduce job to store the result in a new table is showcased.', 'The transcript highlights the process of classifying customers based on age using a case statement and firing a MapReduce job to categorize the records. The process of classifying customers based on age using a case statement and firing a MapReduce job to categorize the records is highlighted in the transcript.', 'It illustrates the creation of a final table by performing a group by query on the classified data, showcasing the summarization of data in Hive. 
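A hedged sketch of the classification and summary steps described here; the age brackets, intermediate table names (out2, out3), and columns are illustrative assumptions.

```sql
-- Classify each customer into an age group with a CASE expression.
INSERT OVERWRITE TABLE out2
SELECT custno, firstname, age,
       CASE
         WHEN age < 30              THEN 'young'
         WHEN age BETWEEN 30 AND 50 THEN 'middle-aged'
         ELSE                            'senior'
       END AS age_group,
       amount, category
FROM out1;

-- Summarize spending per age group in a final table.
INSERT OVERWRITE TABLE out3
SELECT age_group, SUM(amount) AS total_spend
FROM out2
GROUP BY age_group;
```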
The creation of a final table by performing a group by query on the classified data, showcasing the summarization of data in Hive, is illustrated in the transcript.']}, {'end': 23351.291, 'start': 22945.007, 'title': 'Managed vs external tables in hive', 'summary': 'Explains the differences between managed and external tables in hive, highlighting how managed tables are stored in the user hive warehouse and data needs to be manually loaded, while external tables specify a specific location for data, making it visible in the table, with practical demonstrations and implications of dropping each type of table.', 'duration': 406.284, 'highlights': ['The difference between managed and external tables in Hive is that managed tables are stored in the user Hive warehouse and require manual data loading, while external tables specify a specific location for data, making it visible in the table.', "When dropping a managed table, the data associated with it is also deleted, as the table's folder is removed from Hadoop, whereas dropping an external table does not result in data deletion.", 'Practical demonstrations are used to showcase the implications of dropping each type of table, with the managed table resulting in data deletion upon dropping, while the external table retains the data.', 'The chapter emphasizes the practical usage scenarios for managed and external tables in Hive, illustrating the consequences of dropping each type of table and how data is handled based on the table type.']}], 'duration': 1336.357, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4022014934.jpg', 'highlights': ['Creation of table in Hive with specified schema, loading data from local file system, emphasis on data warehousing', 'Use of Hive for sum calculations', 'Illustrating the underlying process of data storage in Hadoop', "The query 'select count star from transaction records' fires a MapReduce job that loads 50,000 rows of data, showcasing the use of MapReduce jobs in achieving results", 'The process of performing a join operation in Hive using SQL-like syntax and launching a MapReduce job to store the result in a new table is showcased', 'The process of classifying customers based on age using a case statement and firing a MapReduce job to categorize the records is highlighted in the transcript', 'The creation of a final table by performing a group by query on the classified data, showcasing the summarization of data in Hive, is illustrated in the transcript', 'The difference between managed and external tables in Hive is that managed tables are stored in the user Hive warehouse and require manual data loading, while external tables specify a specific location for data, making it visible in the table', "When dropping a managed table, the data associated with it is also deleted, as the table's folder is removed from Hadoop, whereas dropping an external table does not result in data deletion", 'Practical demonstrations are used to showcase the implications of dropping each type of table, with the managed table resulting in data deletion upon dropping, while the external table retains the data', 'The chapter emphasizes the practical usage scenarios for managed and external tables in Hive, illustrating the consequences of dropping each type of table and how data is handled based on the table type']}, {'end': 24530.69, 'segs': [{'end': 23548.497, 'src': 'embed', 'start': 23520.73, 'weight': 5, 'content': [{'end': 23531.815, 'text': 'Now if I say show tables, 
sorry, show databases, it will list all the databases.', 'start': 23520.73, 'duration': 11.085}, {'end': 23533.776, 'text': 'So this is how you start Beeline.', 'start': 23532.235, 'duration': 1.541}, {'end': 23536.877, 'text': 'So you will say user libhive Beeline.', 'start': 23533.836, 'duration': 3.041}, {'end': 23538.618, 'text': 'So that will start Beeline.', 'start': 23537.237, 'duration': 1.381}, {'end': 23539.998, 'text': 'Beeline is your client.', 'start': 23538.698, 'duration': 1.3}, {'end': 23546.461, 'text': "And when you say show tables, it won't show anything because it need a server connectivity.", 'start': 23540.499, 'duration': 5.962}, {'end': 23548.497, 'text': 'and this is the connection string.', 'start': 23546.897, 'duration': 1.6}], 'summary': 'Beeline client used to list databases; show tables requires server connectivity', 'duration': 27.767, 'max_score': 23520.73, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4023520730.jpg'}, {'end': 23688.464, 'src': 'embed', 'start': 23658.294, 'weight': 2, 'content': [{'end': 23663.818, 'text': 'So if I want to do that, you can of course use the command line, but you can also use the GUI.', 'start': 23658.294, 'duration': 5.524}, {'end': 23664.698, 'text': 'Let me show you that.', 'start': 23663.918, 'duration': 0.78}, {'end': 23668.481, 'text': 'Go to Hue, go to this Data Browsers.', 'start': 23664.979, 'duration': 3.502}, {'end': 23670.662, 'text': 'There is something called Metastore Table.', 'start': 23668.521, 'duration': 2.141}, {'end': 23671.923, 'text': 'Click on this.', 'start': 23671.363, 'duration': 0.56}, {'end': 23680.129, 'text': 'Once you click on Metastore Table, it will allow you to select the database.', 'start': 23674.125, 'duration': 6.004}, {'end': 23681.91, 'text': "So I'm gonna select our database.", 'start': 23680.169, 'duration': 1.741}, {'end': 23685.203, 'text': 'I can select this database.', 'start': 23683.502, 'duration': 1.701}, {'end': 23688.464, 'text': 'I can say create a table from a file.', 'start': 23685.283, 'duration': 3.181}], 'summary': 'Demonstrating how to create a table from a file using hue gui.', 'duration': 30.17, 'max_score': 23658.294, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4023658294.jpg'}, {'end': 23890.749, 'src': 'embed', 'start': 23855.188, 'weight': 0, 'content': [{'end': 23861.672, 'text': 'Create a table for me in your Hive in which I want to upload all my sales data.', 'start': 23855.188, 'duration': 6.484}, {'end': 23867.59, 'text': 'So what I did, I created a table Create table.', 'start': 23862.872, 'duration': 4.718}, {'end': 23873.324, 'text': 'sales data.', 'start': 23872.524, 'duration': 0.8}, {'end': 23880.546, 'text': 'I created a table called sales data because my manager says he has to upload some data related to sales.', 'start': 23873.824, 'duration': 6.722}, {'end': 23886.708, 'text': 'so I created a table called sales data with some schema, some columns and all whatever we have.', 'start': 23880.546, 'duration': 6.162}, {'end': 23890.749, 'text': "Now, imagine you're working on a manage table.", 'start': 23887.208, 'duration': 3.541}], 'summary': "Created a hive table 'sales data' to upload sales data for management.", 'duration': 35.561, 'max_score': 23855.188, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4023855188.jpg'}, {'end': 24004.671, 'src': 'embed', 'start': 23978.773, 
'weight': 6, 'content': [{'end': 23983.635, 'text': 'This contains all the sales transactions in the month of January.', 'start': 23978.773, 'duration': 4.862}, {'end': 23985.215, 'text': 'Very simple.', 'start': 23984.655, 'duration': 0.56}, {'end': 23988.656, 'text': "So you're happy, your manager is also happy.", 'start': 23986.135, 'duration': 2.521}, {'end': 23999.14, 'text': 'So then, what happened is that you keep on working in the project and the next month your manager comes to you and say that hey Raghu,', 'start': 23990.537, 'duration': 8.603}, {'end': 24001.168, 'text': 'We have some more data.', 'start': 23999.767, 'duration': 1.401}, {'end': 24004.671, 'text': 'So this data is from the month of February.', 'start': 24001.949, 'duration': 2.722}], 'summary': 'Sales transactions for january and february were successfully managed, leading to a positive outcome for both the employee and the manager.', 'duration': 25.898, 'max_score': 23978.773, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4023978773.jpg'}, {'end': 24156.999, 'src': 'embed', 'start': 24127.876, 'weight': 4, 'content': [{'end': 24131.578, 'text': 'So what is the query? You say select star from the table where month equals to April.', 'start': 24127.876, 'duration': 3.702}, {'end': 24136.08, 'text': 'Means you want to see all the data from April month.', 'start': 24132.178, 'duration': 3.902}, {'end': 24138.141, 'text': 'Now look at the problem you have.', 'start': 24136.58, 'duration': 1.561}, {'end': 24143.504, 'text': "Hive by default doesn't know where is April month data.", 'start': 24138.861, 'duration': 4.643}, {'end': 24148.246, 'text': 'So what Hive will do, it will come to this folder called sales data.', 'start': 24143.884, 'duration': 4.362}, {'end': 24152.308, 'text': 'It will first scan this entire January.txt.', 'start': 24148.806, 'duration': 3.502}, {'end': 24156.999, 'text': 'then scan this entire February.txt.', 'start': 24153.098, 'duration': 3.901}], 'summary': 'Query requests all data from april month in sales data folder, hive scans january.txt and february.txt.', 'duration': 29.123, 'max_score': 24127.876, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4024127876.jpg'}, {'end': 24401.608, 'src': 'embed', 'start': 24376.108, 'weight': 3, 'content': [{'end': 24385.121, 'text': "what Hive will do is that it's gonna pick the column from the whole data and understand how many values are there.", 'start': 24376.108, 'duration': 9.013}, {'end': 24388.022, 'text': 'In my example, there are six months data.', 'start': 24385.581, 'duration': 2.441}, {'end': 24392.884, 'text': "It's gonna automatically create six folders, Jan, Feb, March, April, May, June.", 'start': 24388.442, 'duration': 4.442}, {'end': 24396.786, 'text': 'All the January month data will be copied into this folder.', 'start': 24393.465, 'duration': 3.321}, {'end': 24401.608, 'text': 'All the February month data will be copied into this folder, et cetera.', 'start': 24397.506, 'duration': 4.102}], 'summary': 'Hive will create six folders for six months of data, automatically organizing data by month.', 'duration': 25.5, 'max_score': 24376.108, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4024376108.jpg'}], 'start': 23351.291, 'title': 'Cloudera, hive, and query performance', 'summary': 'Covers managing tables in cloudera, emphasizing the importance of external 
tables, connecting beeline client to hive server 2 to query healthcare data, and optimizing query performance in hive through data partitioning, demonstrating improved query speed.', 'chapters': [{'end': 23409.776, 'start': 23351.291, 'title': 'Cloudera and managing tables', 'summary': 'Explains the difference between managing and external tables in cloudera, emphasizing that dropping a managed table removes both the table and its data, while dropping an external table only removes the table while the data remains safe. it also highlights the recommendation to use external tables when there is a need to share data with others to prevent accidental data loss.', 'duration': 58.485, 'highlights': ['The difference between managing and external tables in Cloudera is that dropping a managed table removes both the table and its data, while dropping an external table only removes the table while the data remains safe.', 'It is recommended to create an external table when there is a need to share data with others and to prevent accidental data loss, as dropping a managed table can result in the loss of data.']}, {'end': 23657.459, 'start': 23410.948, 'title': 'Connecting beeline client to hive server 2', 'summary': 'Explains how to connect the beeline client to hive server 2, including starting beeline, using the connection string to establish server connectivity, and interacting with databases and tables, ultimately aiming to query healthcare data for specific attributes.', 'duration': 246.511, 'highlights': ['Beeline is a client that needs to connect with Hive Server 2, and the connection string is used to establish server connectivity. ', "Starting Beeline is achieved by entering the command 'user libhive Beeline'. ", "The connection string 'connect JDBC Hive Server 2 localhost' is used to establish the connection, allowing the execution of regular commands like 'show databases' and 'show tables'. ", 'The option of using Beeline client on the desktop and employing Hue for tasks such as uploading data into Hive and querying is also presented. ']}, {'end': 24530.69, 'start': 23658.294, 'title': 'Hive data partitioning: optimizing query performance', 'summary': 'Discusses using hive gui to create a table, loading data, the need for partitions in hive, and how partitioning based on a column can significantly improve query performance by minimizing data scan. 
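A hedged sketch of the month-partitioned layout described here; column names are assumptions. Each partition becomes its own folder under the table directory, so filtering on the partition column lets Hive scan only that folder instead of every month's file.

```sql
CREATE TABLE sales_data (
  txnno  INT,
  custno INT,
  amount DOUBLE
)
PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Reads only .../sales_data/month=April rather than the whole table.
SELECT * FROM sales_data WHERE month = 'April';
```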
this is explained through a practical example and its impact on query speed.', 'duration': 872.396, 'highlights': ['Using Hive GUI to Create and Query Tables Demonstrates using GUI to create a table and query data, providing an alternative to command line interface.', 'Importance of Partitions in Hive Explains the necessity of partitions in Hive to optimize query performance, with the example of data loading and retrieval, showcasing the inefficiency of scanning unpartitioned data.', 'Impact of Partitioning on Query Speed Illustrates the impact of partitioning based on a column in Hive, showcasing how it significantly improves query speed by minimizing data scan, enhancing query performance.']}], 'duration': 1179.399, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4023351291.jpg', 'highlights': ['Dropping an external table only removes the table while the data remains safe', 'Creating an external table prevents accidental data loss', 'Beeline is a client that needs to connect with Hive Server 2', 'The connection string is used to establish server connectivity', 'Using Hive GUI to Create and Query Tables provides an alternative to command line interface', 'Importance of Partitions in Hive to optimize query performance', 'Impact of Partitioning on Query Speed significantly improves query speed']}, {'end': 25588.584, 'segs': [{'end': 25481.619, 'src': 'embed', 'start': 25452.705, 'weight': 3, 'content': [{'end': 25456.689, 'text': "So I created a table called partitioned user, right? Now see what I'm going to do.", 'start': 25452.705, 'duration': 3.984}, {'end': 25466.666, 'text': "I'm gonna say, insert into table partition underscore user one, partition country comma state.", 'start': 25456.829, 'duration': 9.837}, {'end': 25470.811, 'text': "So what I'm telling, Hive, is that hey, Hive,", 'start': 25467.367, 'duration': 3.444}, {'end': 25481.619, 'text': 'I want you to copy all the data from my temporary table and load it into this new table and partition it with country and state.', 'start': 25470.811, 'duration': 10.808}], 'summary': 'Created partitioned user table and loaded data with country and state partitions.', 'duration': 28.914, 'max_score': 25452.705, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4025452705.jpg'}, {'end': 25536.222, 'src': 'embed', 'start': 25506.79, 'weight': 0, 'content': [{'end': 25508.291, 'text': 'Look at your screen very carefully.', 'start': 25506.79, 'duration': 1.501}, {'end': 25512.032, 'text': 'You will be able to see Hive creating these folders.', 'start': 25508.711, 'duration': 3.321}, {'end': 25515.693, 'text': 'So now it is creating the partitions.', 'start': 25513.852, 'duration': 1.841}, {'end': 25517.513, 'text': 'You will see that right now.', 'start': 25516.333, 'duration': 1.18}, {'end': 25520.614, 'text': 'Just wait for a moment for that to pop up.', 'start': 25518.033, 'duration': 2.581}, {'end': 25526.216, 'text': 'Can you see the loading partitions?', 'start': 25524.475, 'duration': 1.741}, {'end': 25527.997, 'text': 'loading partitions, loading partitions.', 'start': 25526.216, 'duration': 1.781}, {'end': 25528.617, 'text': 'can you see that?', 'start': 25527.997, 'duration': 0.62}, {'end': 25530.879, 'text': 'Country country, state, country state.', 'start': 25528.938, 'duration': 1.941}, {'end': 25531.379, 'text': 'can you see it?', 'start': 25530.879, 'duration': 0.5}, {'end': 25536.222, 'text': 'creating these files and 
folders? And done.', 'start': 25531.379, 'duration': 4.843}], 'summary': 'Hive creates folders and partitions, loading data.', 'duration': 29.432, 'max_score': 25506.79, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4025506790.jpg'}, {'end': 25588.584, 'src': 'embed', 'start': 25560.587, 'weight': 4, 'content': [{'end': 25566.67, 'text': 'If you open any of these states, you will have a file that will have only that state data.', 'start': 25560.587, 'duration': 6.083}, {'end': 25577.336, 'text': 'Of course, you are not able to see the file in the proper format because it is storing it as a sequence file, but this has the state related data.', 'start': 25567.571, 'duration': 9.765}, {'end': 25579.177, 'text': 'Only that is matching with here.', 'start': 25577.516, 'duration': 1.661}, {'end': 25588.584, 'text': 'So now you can see that it is creating countries and then inside country it is creating states also.', 'start': 25579.77, 'duration': 8.814}], 'summary': 'Data files contain state-related information, with separate files for each state. the data is organized by countries and states.', 'duration': 27.997, 'max_score': 25560.587, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4025560587.jpg'}], 'start': 24530.71, 'title': 'Hive partitioning', 'summary': 'Discusses static and dynamic partitioning in hive, highlighting the use cases and differences, with a focus on manual creation of partitions based on country in static partitioning and automatic detection of data in dynamic partitioning. it provides a comprehensive guide on creating a partitioned table, loading data, and enabling dynamic partitioning, with a demonstration of the impact on query performance. additionally, it demonstrates creating a partitioned table using dynamic partitioning, resulting in the creation of five country partitions and their respective states, with four countries and their associated states being created.', 'chapters': [{'end': 24773.988, 'start': 24530.71, 'title': 'Static vs dynamic partitioning', 'summary': 'Discusses static and dynamic partitioning in hive, highlighting the use cases and differences, with a focus on the manual creation of partitions based on country in static partitioning and the automatic detection of data in dynamic partitioning.', 'duration': 243.278, 'highlights': ['Static partitioning involves manually creating partitions and loading data based on specific criteria, such as country, when the data does not contain that information, allowing for controlled querying based on the partitioned criteria. Manually create two partitions for India and USA, load data accordingly, and query based on country.', 'Dynamic partitioning in Hive automatically detects data and creates partitions, removing the need for manual intervention in creating partitions and loading data, providing convenience and automation. Hive automatically detects data and creates partitions, eliminating manual intervention.', 'The use case of static partitioning is exemplified in a scenario where the data does not contain information about the country, but the user knows the country distribution, enabling the manual creation of partitions based on the country information. 
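A hedged sketch contrasting the two approaches with the country/state example from the demo. The table and file names are assumptions; the three SET properties are the standard Hive settings for enabling dynamic partitions (the 1000 per-node limit follows the walkthrough).

```sql
-- Static partitioning: partition values are supplied by hand at load time.
CREATE TABLE partitioned_user (
  firstname STRING,
  lastname  STRING,
  city      STRING
)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/us_ca_users.csv'
INTO TABLE partitioned_user
PARTITION (country = 'US', state = 'CA');

-- Dynamic partitioning: Hive derives the partition folders from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;

-- partitioned_user1 is assumed to be created like partitioned_user above;
-- the partition columns must come last in the SELECT list.
INSERT INTO TABLE partitioned_user1
PARTITION (country, state)
SELECT firstname, lastname, city, country, state
FROM temp_user;
```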
Creating static partitions based on country distribution when the information is not present in the data.']}, {'end': 25391.905, 'start': 24775.188, 'title': 'Hive partitioning demonstration', 'summary': 'Provides a comprehensive guide on static and dynamic partitioning in hive, including creating a partitioned table, loading data, and enabling dynamic partitioning, with a demonstration of the impact on query performance.', 'duration': 616.717, 'highlights': ['Demonstrating static partitioning by creating a table partitioned by country and state, loading data, and creating static partitions for US and California, showcasing the resulting folder structure and querying the data for faster access. Creating a partitioned table with country and state columns, loading data and creating static partitions for US and California, resulting in faster data access, as demonstrated by the query taking 85 seconds without partitioning.', 'Enabling dynamic partitioning by setting Hive to execute dynamic partition and configuring maximum dynamic partitions per node, followed by creating a temporary table and loading data to compare query performance with and without partitioning. Enabling dynamic partitioning, setting maximum dynamic partitions per node to 1000, creating a temporary table without partitioning, and comparing query performance with and without partitioning, showcasing the impact of dynamic partitioning.', 'Creating a temporary table, loading data, and running a query to assess the query performance without partitioning, demonstrating the difference in query execution time with and without partitioning. Creating a temporary table, loading data, and running a query to assess the impact of partitioning on query performance, showing a query execution time of 85 seconds without partitioning.']}, {'end': 25588.584, 'start': 25392.345, 'title': 'Dynamic partition creation', 'summary': 'Demonstrates how to create a partitioned table in hive using dynamic partitioning, resulting in the creation of five country partitions and their respective states, with four countries and their associated states being created.', 'duration': 196.239, 'highlights': ['Hive creates five partitions for countries - Australia, Canada, UK, and US, storing them as sequence files.', "The table 'partitioned_user_one' contains four countries - Australia, Canada, UK, and US - each having their respective states stored as sequence files within the folders.", 'The process involves Hive identifying countries and states from the data, creating folders for each, and placing the data accordingly.', 'The demonstration showcases the creation of states within countries through the dynamic partitioning process.']}], 'duration': 1057.874, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4024530710.jpg', 'highlights': ['Static partitioning involves manually creating partitions and loading data based on specific criteria, such as country, when the data does not contain that information, allowing for controlled querying based on the partitioned criteria.', 'Dynamic partitioning in Hive automatically detects data and creates partitions, removing the need for manual intervention in creating partitions and loading data, providing convenience and automation.', 'Demonstrating static partitioning by creating a table partitioned by country and state, loading data, and creating static partitions for US and California, showcasing the resulting folder structure and querying the data for faster 
access.', 'Enabling dynamic partitioning by setting Hive to execute dynamic partition and configuring maximum dynamic partitions per node, followed by creating a temporary table and loading data to compare query performance with and without partitioning.', 'Hive creates five partitions for countries - Australia, Canada, UK, and US, storing them as sequence files.', "The table 'partitioned_user_one' contains four countries - Australia, Canada, UK, and US - each having their respective states stored as sequence files within the folders."]}, {'end': 27696.518, 'segs': [{'end': 26097.843, 'src': 'heatmap', 'start': 25665.387, 'weight': 0.706, 'content': [{'end': 25669.688, 'text': 'Similarly, the data which is generated on day two is processed on day three.', 'start': 25665.387, 'duration': 4.301}, {'end': 25676.709, 'text': 'And on the other hand, real time processing involves a continual input, process and output of data.', 'start': 25670.228, 'duration': 6.481}, {'end': 25685.677, 'text': 'So it provides an organization, the ability to take immediate action in times when acting within seconds or minutes is significant.', 'start': 25677.15, 'duration': 8.527}, {'end': 25689.46, 'text': 'So Spark helps in achieving this real time processing.', 'start': 25686.238, 'duration': 3.222}, {'end': 25692.623, 'text': 'So again, as we can see over here in this figure,', 'start': 25689.961, 'duration': 2.662}, {'end': 25699.409, 'text': 'the data which is generated on day one is processed on day one itself and there is no time lag in processing.', 'start': 25692.623, 'duration': 6.786}, {'end': 25703.012, 'text': "So that's the major difference between Hadoop and Spark.", 'start': 25699.809, 'duration': 3.203}, {'end': 25705.614, 'text': 'So now let us understand about Spark briefly.', 'start': 25703.252, 'duration': 2.362}, {'end': 25710.03, 'text': 'So Spark is a cluster computing framework for real-time processing.', 'start': 25706.107, 'duration': 3.923}, {'end': 25719.358, 'text': 'It was Hadoop sub-project introduced in the UC Berkeley R&D lab in the year 2009, and it became open source in 2010.', 'start': 25710.511, 'duration': 8.847}, {'end': 25723.581, 'text': 'And then 2013, it was donated to Apache Software Foundation.', 'start': 25719.358, 'duration': 4.223}, {'end': 25732.348, 'text': 'So Spark provides an interface for programming multiple clusters with implicit data parallelism and fault tolerance.', 'start': 25724.141, 'duration': 8.207}, {'end': 25737.604, 'text': "So now that we know what is Spark, Let's actually have a look at some of its features.", 'start': 25733.189, 'duration': 4.415}, {'end': 25739.245, 'text': 'So Spark is real-time.', 'start': 25738.045, 'duration': 1.2}, {'end': 25746.229, 'text': 'That is, it provides real-time computation and low latency because of in-memory computation.', 'start': 25739.625, 'duration': 6.604}, {'end': 25751.372, 'text': 'And Spark is also 100 times faster for large-scale data processing.', 'start': 25746.709, 'duration': 4.663}, {'end': 25753.693, 'text': 'And Spark is also polyglot.', 'start': 25751.832, 'duration': 1.861}, {'end': 25760.957, 'text': 'That is, you can write Spark applications in multiple languages such as Java, Scala, Python, R, and SQL.', 'start': 25754.073, 'duration': 6.884}, {'end': 25769.139, 'text': 'Spark also has a simple programming layer which provides powerful caching and disk persistence capabilities.', 'start': 25761.597, 'duration': 7.542}, {'end': 25778.127, 'text': "And Spark also has multiple deployment 
modes, so it can be deployed through Mesos, Hadoop Ion, or Spark's own cluster manager.", 'start': 25769.719, 'duration': 8.408}, {'end': 25781.61, 'text': 'So now let us look at some domain scenarios of Spark.', 'start': 25778.707, 'duration': 2.903}, {'end': 25791.899, 'text': 'So, from small startups to Fortune 500s, almost every single company is adopting Apache Spark to build scale and innovate their big data applications.', 'start': 25781.95, 'duration': 9.949}, {'end': 25796.148, 'text': "So let's see some industries where Spark is used widely these days.", 'start': 25792.386, 'duration': 3.762}, {'end': 25805.335, 'text': 'So industries like healthcare, media, finance, e-commerce and travel are extensively using Spark to add value to their business.', 'start': 25796.728, 'duration': 8.607}, {'end': 25809.117, 'text': "So now let's understand how are Hadoop and Spark used together.", 'start': 25805.615, 'duration': 3.502}, {'end': 25813.439, 'text': 'So Hadoop alone provided limited predictive capabilities,', 'start': 25809.397, 'duration': 4.042}, {'end': 25819.363, 'text': 'as organizations were finding it difficult to predict customer needs and emerging market requirements.', 'start': 25813.439, 'duration': 5.924}, {'end': 25822.885, 'text': 'But with the combination of Hadoop and Spark,', 'start': 25819.863, 'duration': 3.022}, {'end': 25830.369, 'text': 'companies can now process billions of events every day at an analytical speed of 40 milliseconds per event.', 'start': 25822.885, 'duration': 7.484}, {'end': 25841.694, 'text': 'Companies can also enhance their storage and processing capabilities by working together with Hadoop and Spark to reduce costs by almost 40%.', 'start': 25830.97, 'duration': 10.724}, {'end': 25844.735, 'text': 'And by deploying a big data platform in the cloud.', 'start': 25841.694, 'duration': 3.041}, {'end': 25852.038, 'text': 'organizations can avoid duplication by unifying clusters into one which basically supports both Hadoop and Spark.', 'start': 25844.735, 'duration': 7.303}, {'end': 25856.9, 'text': 'So now many organizations are already using Hadoop and Spark together.', 'start': 25852.838, 'duration': 4.062}, {'end': 25863.1, 'text': 'Yahoo, Amazon, NASA, and eBay run Apache Spark inside their Hadoop clusters.', 'start': 25857.4, 'duration': 5.7}, {'end': 25868.545, 'text': 'Hortonworks and Cloudera Hadoop distributions come bundled with Apache Spark.', 'start': 25863.58, 'duration': 4.965}, {'end': 25873.649, 'text': 'AltScale uses Spark on Hadoop to provide big data as a cloud service.', 'start': 25868.845, 'duration': 4.804}, {'end': 25879.695, 'text': 'And Uber also uses Spark and Hadoop together to optimize customer experience.', 'start': 25874.07, 'duration': 5.625}, {'end': 25883.519, 'text': "So now let us have a look at some of Spark's APIs.", 'start': 25880.458, 'duration': 3.061}, {'end': 25891.042, 'text': 'So Apache Spark Quotes can be written in Scala, Java, Python and R and we can also write SQL queries with Spark.', 'start': 25883.999, 'duration': 7.043}, {'end': 25895.103, 'text': 'But the most preferred language along with Spark is Scala.', 'start': 25891.582, 'duration': 3.521}, {'end': 25898.244, 'text': "So let's have a look at some of the features of Scala.", 'start': 25895.483, 'duration': 2.761}, {'end': 25906.367, 'text': 'So Scala provides scalability on Java Virtual Machine and its performance is also really high when compared to others.', 'start': 25898.784, 'duration': 7.583}, {'end': 25911.511, 'text': 'Scala 
also provides excellent built-in concurrency support and libraries.', 'start': 25906.929, 'duration': 4.582}, {'end': 25918.293, 'text': 'And well, a single line of code in Scala can replace 20 to 25 lines of Java code.', 'start': 25912.031, 'duration': 6.262}, {'end': 25922.235, 'text': 'And Spark is also extremely fast and efficient.', 'start': 25918.774, 'duration': 3.461}, {'end': 25925.876, 'text': "So now it's time to have a look at Spark's architecture.", 'start': 25923.015, 'duration': 2.861}, {'end': 25934.98, 'text': 'So Apache Spark has a well-defined layered architecture where all the Spark components and layers are loosely coupled.', 'start': 25926.917, 'duration': 8.063}, {'end': 25945.52, 'text': 'So Spark basically uses a master worker architecture and in the master node you have the driver program which drives your application.', 'start': 25935.538, 'duration': 9.982}, {'end': 25948.501, 'text': 'Now you might be wondering what is a driver program.', 'start': 25946.04, 'duration': 2.461}, {'end': 25956.142, 'text': "So, basically, the code that you're writing behaves as a driver program, or, if you are using the interactive shell,", 'start': 25948.921, 'duration': 7.221}, {'end': 25958.423, 'text': 'the shell will act as a driver program.', 'start': 25956.142, 'duration': 2.281}, {'end': 25967.531, 'text': 'Now the driver program runs the main function of the application and it is the very place where the Spark context is created.', 'start': 25959.023, 'duration': 8.508}, {'end': 25973.155, 'text': 'So assume that the Spark context is a gateway to all the Spark functionalities.', 'start': 25968.051, 'duration': 5.104}, {'end': 25978.84, 'text': 'So Spark driver and Spark context take care of the job execution within the cluster.', 'start': 25973.516, 'duration': 5.324}, {'end': 25985.105, 'text': 'Now. 
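Since the driver program and the Spark context come up repeatedly in this architecture discussion, the fragment below is a minimal sketch of how they are obtained. In the interactive shell the session and context already exist as spark and sc; a standalone driver program builds them explicitly. The application name and master URL here are placeholder values, not part of the original demo.

import org.apache.spark.sql.SparkSession

// A standalone driver program creates its own session; in spark-shell
// `spark` and `sc` are already provided. "local[*]" is only a placeholder
// master -- on a cluster this would point at YARN, Mesos, or standalone.
val spark = SparkSession.builder()
  .appName("driver-program-sketch")
  .master("local[*]")
  .getOrCreate()

val sc = spark.sparkContext          // the gateway to Spark functionality
println(sc.defaultParallelism)       // tasks are split across the available workers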
Spark context also works along with the cluster manager to manage various jobs,', 'start': 25979.38, 'duration': 5.725}, {'end': 25993.469, 'text': 'and cluster manager is responsible for acquiring resources on the Spark cluster and allocating them to a Spark job.', 'start': 25985.105, 'duration': 8.364}, {'end': 25999.011, 'text': 'Now this job is split into multiple tasks which are distributed over the worker node.', 'start': 25993.749, 'duration': 5.262}, {'end': 26007.915, 'text': 'So these worker nodes are the slave nodes whose job is to basically execute the task and return the result to the Spark context.', 'start': 25999.471, 'duration': 8.444}, {'end': 26014.717, 'text': 'Then we have the executor which is a distributed agent responsible for the execution of tasks.', 'start': 26008.455, 'duration': 6.262}, {'end': 26025.049, 'text': 'So every Spark application has its own executor process And these executors usually run for the entire lifetime of a Spark application.', 'start': 26015.198, 'duration': 9.851}, {'end': 26033.776, 'text': 'So in a nutshell, the Spark context takes the job, breaks the job and tasks and distributes them to the worker nodes.', 'start': 26025.709, 'duration': 8.067}, {'end': 26042.343, 'text': 'These tasks work on the partition RDD, perform operations, collect the results and return to the main Spark context.', 'start': 26034.236, 'duration': 8.107}, {'end': 26046.28, 'text': 'Now, if you actually increase the number of worker nodes,', 'start': 26042.978, 'duration': 3.302}, {'end': 26054.704, 'text': 'then you can divide the jobs into more partitions and execute them parallelly over multiple systems, and it will be a lot faster.', 'start': 26046.28, 'duration': 8.424}, {'end': 26061.708, 'text': 'And with the increase in the number of workers, memory size will also increase and you can cache the jobs to execute it faster.', 'start': 26054.945, 'duration': 6.763}, {'end': 26065.53, 'text': 'So now let us see the various ways by which Spark can be deployed.', 'start': 26061.888, 'duration': 3.642}, {'end': 26068.532, 'text': 'The system currently supports these cluster managers.', 'start': 26066.051, 'duration': 2.481}, {'end': 26077.431, 'text': 'So we have Spark Standalone Cluster, which is a simple cluster manager included with Spark that makes it easy to set up a cluster.', 'start': 26069.147, 'duration': 8.284}, {'end': 26084.915, 'text': 'Then we have Apache Mesos, which is a general cluster manager that can also run Hadoop MapReduce and service applications.', 'start': 26077.991, 'duration': 6.924}, {'end': 26094.442, 'text': 'After that we have Hadoop Viya Yarn and we can also deploy it with Kubernetes, which is an open source system for automating deployment,', 'start': 26085.355, 'duration': 9.087}, {'end': 26097.843, 'text': 'scaling and management of containerized applications.', 'start': 26094.442, 'duration': 3.401}], 'summary': 'Spark enables real-time processing, 100x faster, used in various industries, and deployed with hadoop in major organizations.', 'duration': 432.456, 'max_score': 25665.387, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4025665387.jpg'}, {'end': 26152.979, 'src': 'embed', 'start': 26124.817, 'weight': 2, 'content': [{'end': 26129.099, 'text': 'And RDDs help in achieving this in-memory data sharing.', 'start': 26124.817, 'duration': 4.282}, {'end': 26137.042, 'text': 'So RDD stands for resilient distributed data set, and it is the fundamental data structure of 
Apache Spark.', 'start': 26129.559, 'duration': 7.483}, {'end': 26148.307, 'text': 'So by resilient I mean fault tolerant, as it can recompute missing or damaged partitions in case of a node failure with the help of RDD lineage graph.', 'start': 26137.582, 'duration': 10.725}, {'end': 26152.979, 'text': 'And it is distributed since data resides on multiple nodes.', 'start': 26149.096, 'duration': 3.883}], 'summary': 'Rdds are resilient, distributed data sets in apache spark, providing fault tolerance and in-memory data sharing.', 'duration': 28.162, 'max_score': 26124.817, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4026124817.jpg'}, {'end': 26253.614, 'src': 'embed', 'start': 26227.028, 'weight': 0, 'content': [{'end': 26236.37, 'text': "So I'll just type in val and then give a name to the object which would be let's say text RDD and then let me load this external data file.", 'start': 26227.028, 'duration': 9.342}, {'end': 26243.412, 'text': 'So the command would be sc.text file and then I will give in the path right.', 'start': 26236.43, 'duration': 6.982}, {'end': 26246.112, 'text': 'So this was the second way to create our RDD.', 'start': 26243.492, 'duration': 2.62}, {'end': 26251.513, 'text': 'And the third way to create RDD is to transform an existing RDD.', 'start': 26246.772, 'duration': 4.741}, {'end': 26253.614, 'text': "So we'll do that using the filter function.", 'start': 26251.874, 'duration': 1.74}], 'summary': 'Creating rdd using different methods like loading external data file and transforming an existing rdd.', 'duration': 26.586, 'max_score': 26227.028, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4026227028.jpg'}, {'end': 26359.819, 'src': 'embed', 'start': 26330.111, 'weight': 1, 'content': [{'end': 26338.595, 'text': "So what I'll do is I will go ahead and create an RDD with the name X and this RDD is made by parallelizing this collection.", 'start': 26330.111, 'duration': 8.484}, {'end': 26348.256, 'text': 'Now what I want is using the map function I want to add the number one to each of these existing element.', 'start': 26339.135, 'duration': 9.121}, {'end': 26349.997, 'text': 'So this would be the command for that.', 'start': 26348.476, 'duration': 1.521}, {'end': 26359.819, 'text': "So on top of this RDDX I'll apply the map function and this will take in every single element and add one to that element.", 'start': 26350.137, 'duration': 9.682}], 'summary': 'Create rdd x, add 1 to each element using map function.', 'duration': 29.708, 'max_score': 26330.111, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4026330111.jpg'}, {'end': 27574.161, 'src': 'embed', 'start': 27547.213, 'weight': 4, 'content': [{'end': 27553.537, 'text': 'and I will first take in all of the input columns which are nothing but all of the independent variables.', 'start': 27547.213, 'duration': 6.324}, {'end': 27564.223, 'text': 'So these are all of the independent variables and I will be storing them under one new column and name that column to be equal to features.', 'start': 27553.977, 'duration': 10.246}, {'end': 27568.48, 'text': "And I'll store this in object and name that object to be assembler.", 'start': 27564.639, 'duration': 3.841}, {'end': 27574.161, 'text': "So after this, I'll apply this transform function on top of the EECOM DF1 data frame.", 'start': 27568.62, 'duration': 5.541}], 'summary': "Input 
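Pulling together the RDD examples walked through in this chapter, the sketch below shows the three creation routes and a few transformations and actions, runnable in spark-shell. The file path is a placeholder.

// 1) parallelize an existing collection
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
// 2) reference a dataset in external storage
val textRDD = sc.textFile("/user/training/sample.txt")
// 3) transform an existing RDD
val errors = textRDD.filter(line => line.contains("ERROR"))

// Transformations are lazy; actions trigger execution and return non-RDD values.
val plusOne = nums.map(_ + 1)        // adds one to every element: 2,3,4,5,6
println(plusOne.reduce(_ + _))       // 20
println(plusOne.first())             // 2
plusOne.take(3).foreach(println)     // 2, 3, 4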
columns merged into 'features' column, applied to eecom df1.", 'duration': 26.948, 'max_score': 27547.213, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4027547213.jpg'}, {'end': 27672.726, 'src': 'embed', 'start': 27648.396, 'weight': 3, 'content': [{'end': 27656.557, 'text': "So over here I will use the summary function on top of the model which you just built and I'll store it in training summary.", 'start': 27648.396, 'duration': 8.161}, {'end': 27661.3, 'text': 'And I would want to know the residuals of this training summary.', 'start': 27657.117, 'duration': 4.183}, {'end': 27672.726, 'text': 'So these residual values are basically the error in prediction or in other words, the difference between the actual values and the predicted values.', 'start': 27661.88, 'duration': 10.846}], 'summary': 'Using summary function to analyze model performance and identify residuals.', 'duration': 24.33, 'max_score': 27648.396, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4027648396.jpg'}], 'start': 25590.186, 'title': 'Impact of partitioning, batch vs real-time processing, spark rdds, spark ecosystem, and data frame operations', 'summary': 'Showcases the impact of partitioning on query time with a reduction from 85 seconds to 41 seconds, discusses the efficiency of spark in real-time processing being 100 times faster for large-scale data processing, introduces rdds in spark and their significance in achieving in-memory data sharing, provides an overview of the spark ecosystem and its components, and covers creating a data frame in spark, performing basic operations, applying spark sql queries, and implementing linear regression algorithm with a root mean square error value of 9.92 and r square value of 0.98.', 'chapters': [{'end': 25640.875, 'start': 25590.186, 'title': 'Partitioning impact on query time', 'summary': 'Demonstrates the impact of partitioning on query time, showing a reduction from 85 seconds to 41 seconds when using partitioning.', 'duration': 50.689, 'highlights': ['Using partitioning reduced query time from 85 seconds to 41 seconds, demonstrating a significant impact on performance.', 'The demonstration showcases the effectiveness of partitioning in improving query execution speed.']}, {'end': 26100.284, 'start': 25641.595, 'title': 'Batch vs real-time processing', 'summary': 'Discusses the differences between batch and real-time processing, highlighting the time lag in hadoop batch processing and the efficiency of spark in real-time processing, with spark being 100 times faster for large-scale data processing, used widely in industries like healthcare, media, finance, e-commerce, and travel, and deployed in conjunction with hadoop by various organizations such as yahoo, amazon, nasa, ebay, hortonworks, cloudera, altscale, and uber.', 'duration': 458.689, 'highlights': ['Efficiency of Spark in Real-Time Processing Spark provides real-time computation and low latency due to in-memory computation, and it is 100 times faster for large-scale data processing, making it a preferred choice in various industries.', 'Industries Using Spark Widely Spark is extensively used in industries like healthcare, media, finance, e-commerce, and travel to add value to their business, making it a popular tool for big data applications.', 'Deployment of Hadoop and Spark Together Organizations using Hadoop and Spark together can process billions of events every day at an analytical speed of 
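To make the MLlib steps above concrete, here is a compact sketch of the same pipeline: assembling the independent variables into a single features column, fitting a linear regression, and inspecting the residuals and fit statistics. The file path and column names are assumptions standing in for the e-commerce dataset used in the demo.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Placeholder read of the e-commerce data frame used in the demo.
val ecomDF = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/user/training/ecommerce.csv")

// Collapse the independent variables into one "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("sessionLength", "timeOnApp", "timeOnWebsite", "membershipYears"))
  .setOutputCol("features")
val assembled = assembler.transform(ecomDF)

// Fit a linear regression of the label column on the assembled features.
val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("yearlyAmountSpent")
val model = lr.fit(assembled)

// The training summary exposes the residuals (actual minus predicted values)
// and the fit metrics discussed above.
val summary = model.summary
summary.residuals.show(5)
println(summary.rootMeanSquaredError)   // reported as roughly 9.92 in the demo
println(summary.r2)                     // reported as roughly 0.98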
40 milliseconds per event, reduce costs by almost 40%, and avoid duplication by unifying clusters into one supporting both Hadoop and Spark.', 'Organizations Using Hadoop and Spark Together Various organizations such as Yahoo, Amazon, NASA, eBay, Hortonworks, Cloudera, AltScale, and Uber are already using Hadoop and Spark together for big data processing.']}, {'end': 26799.83, 'start': 26100.704, 'title': 'Introduction to spark rdds', 'summary': 'Introduces rdds in spark, explaining their significance in achieving in-memory data sharing for faster processing, and how to create and perform operations on rdds, including transformations like map, flatmap, filter, intersection, and actions like reduce, first, take, and foreachpartition.', 'duration': 699.126, 'highlights': ['RDDs enable in-memory data sharing for faster processing compared to network and disk sharing. In-memory data sharing achieved through RDDs is faster than network and disk sharing, improving processing efficiency.', 'Three ways to create RDDs: parallelizing existing collection, referencing a dataset in an external storage system, and transforming an existing RDD. Three methods to create RDDs: parallelizing existing collection, referencing a dataset in an external storage system, and transforming an existing RDD.', 'Transformation functions like map, flatMap, filter, and intersection help manipulate and process RDDs. Transformation functions like map, flatMap, filter, and intersection enable manipulation and processing of RDDs.', 'Actions like reduce, first, take, and forEachPartition are used to work with the actual dataset and give non-RDD values. Actions like reduce, first, take, and forEachPartition are used to work with the actual dataset and provide non-RDD values.']}, {'end': 27130.268, 'start': 26800.071, 'title': 'Spark ecosystem overview', 'summary': 'Introduces the various components of the spark ecosystem, including spark core, deployment modes, ecosystem libraries like spark sql, ml lab, graphics, and streaming, its multi-language support, storage options, and a deep dive into spark sql, highlighting its features, advantages, architecture, and scalability.', 'duration': 330.197, 'highlights': ['Spark SQL provides unified data access, allowing the querying of structured data from various sources like CSV, JSON, Hive, and Cassandra, and supports seamless integration of SQL-like queries with Spark programs.', 'Spark SQL architecture includes the Data Source API with built-in support for Hive, Avro, JSON, JDBC, Parkit, and third-party integration through Spark packages, the DataFrame API for distributed collection of data, and the SQL interpreter and optimizer for transformation, analysis, optimization, and planning.', 'Spark SQL execution is much faster than Hive, with queries taking only 50 seconds in Spark SQL compared to 600 seconds in Hive, and it supports real-time querying.', 'Spark ecosystem includes various libraries like Spark SQL for SQL-like queries, ML Lab for scalable machine learning pipelines, Graphics for graph processing, and Streaming for batch processing and streaming of data.', "Spark ecosystem supports multiple deployment modes like Hadoop YARN, Mesos, and Spark's own cluster manager, and is implemented in Scala, R, Python, and Java with Scala being the widely used language.", 'Spark core is responsible for basic IO functions, scheduling, monitoring, and is the foundation of the entire Spark ecosystem.']}, {'end': 27696.518, 'start': 27131.189, 'title': 'Spark data frame and mllib operations', 
'summary': 'Covers creating a data frame in spark, performing basic operations on data frames, applying spark sql queries, and implementing linear regression algorithm using mllib with a root mean square error value of 9.92 and r square value of 0.98.', 'duration': 565.329, 'highlights': ['Implementing linear regression algorithm using MLlib with a root mean square error value of 9.92 and R square value of 0.98 The linear regression algorithm is implemented using MLlib, resulting in a root mean square error value of 9.92 and an R square value of 0.98.', 'Applying Spark SQL queries on the data frame and creating a temporary view Spark SQL queries are applied on the data frame, creating a temporary view and extracting records based on specific conditions.', 'Basic operations on Spark data frames including creating, converting, and extracting columns Basic operations on Spark data frames are performed, including creating, converting, and extracting columns based on specific criteria.']}], 'duration': 2106.332, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4025590186.jpg', 'highlights': ['Using partitioning reduced query time from 85 seconds to 41 seconds, demonstrating a significant impact on performance.', 'Efficiency of Spark in Real-Time Processing Spark provides real-time computation and low latency due to in-memory computation, and it is 100 times faster for large-scale data processing, making it a preferred choice in various industries.', 'RDDs enable in-memory data sharing for faster processing compared to network and disk sharing. In-memory data sharing achieved through RDDs is faster than network and disk sharing, improving processing efficiency.', 'Spark SQL provides unified data access, allowing the querying of structured data from various sources like CSV, JSON, Hive, and Cassandra, and supports seamless integration of SQL-like queries with Spark programs.', 'Implementing linear regression algorithm using MLlib with a root mean square error value of 9.92 and R square value of 0.98 The linear regression algorithm is implemented using MLlib, resulting in a root mean square error value of 9.92 and an R square value of 0.98.']}, {'end': 30076.083, 'segs': [{'end': 28425.442, 'src': 'embed', 'start': 28403.475, 'weight': 3, 'content': [{'end': 28414.26, 'text': 'Now, what is more interesting is that if you look at the schema of this guy, look at the schema, all the column names and their data types,', 'start': 28403.475, 'duration': 10.785}, {'end': 28415.941, 'text': 'everything is automatically inferred.', 'start': 28414.26, 'duration': 1.681}, {'end': 28418.919, 'text': "Don't you think this is really interesting?", 'start': 28417.338, 'duration': 1.581}, {'end': 28425.442, 'text': "Because you don't have to spend any number of time understanding what is the structure, where it is, how many columns are there?", 'start': 28419.479, 'duration': 5.963}], 'summary': 'Schema infers all column names and data types automatically, saving time and effort in understanding structure.', 'duration': 21.967, 'max_score': 28403.475, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4028403475.jpg'}, {'end': 28903.286, 'src': 'embed', 'start': 28850.329, 'weight': 0, 'content': [{'end': 28857.573, 'text': 'So you do that, right? 
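As a quick recap of the schema-inference behaviour described here, the sketch below reads a JSON file, lets Spark infer the nested schema automatically, and then queries it with plain SQL through a temporary view. The path is a placeholder for the Yelp business file used in the demo.

// Spark reads the JSON and infers column names and types, including nested
// structs -- no schema has to be declared up front.
val yelpDF = spark.read.json("/user/training/yelp_business.json")
yelpDF.printSchema()

// Register a temporary view so the same data can be queried with SQL.
yelpDF.createOrReplaceTempView("yelp_business")
spark.sql("SELECT state, COUNT(*) AS businesses FROM yelp_business GROUP BY state").show()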
So that is recorded in something called a VXML or a voice XML.', 'start': 28850.329, 'duration': 7.244}, {'end': 28866.338, 'text': 'So, basically, that XML file will contain the metadata the duration of the call, what the customer pressed one, two,', 'start': 28858.353, 'duration': 7.985}, {'end': 28869.219, 'text': 'three and how long he took to press, et cetera.', 'start': 28866.338, 'duration': 2.881}, {'end': 28876.223, 'text': 'So this data, so they had like millions of records in this data because they had a lot of calls in their business.', 'start': 28869.899, 'duration': 6.324}, {'end': 28884.771, 'text': 'So every day what will happen, we will get this data, this V XML, right? And we were not able to convert it using Spark.', 'start': 28876.704, 'duration': 8.067}, {'end': 28885.932, 'text': "So that's what I'm saying.", 'start': 28885.111, 'duration': 0.821}, {'end': 28888.734, 'text': 'In all the cases, the XML conversion will not work.', 'start': 28885.952, 'duration': 2.782}, {'end': 28892.457, 'text': 'And they wrote their own parser to convert the data.', 'start': 28889.494, 'duration': 2.963}, {'end': 28896.38, 'text': 'But just to show you an example of how to do this.', 'start': 28892.537, 'duration': 3.843}, {'end': 28899.983, 'text': "So this is the XML file that I'm going to use.", 'start': 28897.341, 'duration': 2.642}, {'end': 28903.286, 'text': 'And this is a very simple XML file.', 'start': 28901.044, 'duration': 2.242}], 'summary': "Millions of call records in vxml format parsed using custom parser due to spark's failure.", 'duration': 52.957, 'max_score': 28850.329, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4028850329.jpg'}, {'end': 29143.456, 'src': 'embed', 'start': 29086.309, 'weight': 1, 'content': [{'end': 29098.878, 'text': 'The data frame looks nice, but the problem is that if you look at the first column, the address column, the address column is actually struct,', 'start': 29086.309, 'duration': 12.569}, {'end': 29099.219, 'text': 'I think.', 'start': 29098.878, 'duration': 0.341}, {'end': 29113.459, 'text': 'If I look at the schema, the address column is of a data type called a struct, which means within the address column you have three more columns city,', 'start': 29100.5, 'duration': 12.959}, {'end': 29114.259, 'text': 'country, pin code.', 'start': 29113.459, 'duration': 0.8}, {'end': 29115.98, 'text': 'That is how it will read.', 'start': 29114.9, 'duration': 1.08}, {'end': 29124.944, 'text': 'So if you want to run SQL queries and all one thing you can do is that you can say address.city, address.country.', 'start': 29116.78, 'duration': 8.164}, {'end': 29126.705, 'text': 'like that you can say right?', 'start': 29124.944, 'duration': 1.761}, {'end': 29131.047, 'text': 'But an easier way will be that you know you can again expand it.', 'start': 29127.205, 'duration': 3.842}, {'end': 29141.656, 'text': 'So I can simply say where, you know, I just want to select employee number, employee name, address dot city, address dot country, etc.', 'start': 29131.647, 'duration': 10.009}, {'end': 29143.456, 'text': 'So you see the new data frame.', 'start': 29142.016, 'duration': 1.44}], 'summary': 'The data frame has a nested address column with city, country, and pin code, making it easier to run sql queries and select specific columns.', 'duration': 57.147, 'max_score': 29086.309, 'thumbnail': 
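For the nested address column discussed above, the fragment below shows both ways of reaching the inner fields: dotted references inside a SQL query, and an explicit select that flattens the struct into ordinary columns. The path and column names follow the employee example but are assumptions, not the exact demo file; the XML line at the end assumes the Databricks spark-xml package has been added to the session (for example via --packages).

import spark.implicits._

// The address column is a struct containing city, country and pincode.
val empDF = spark.read.json("/user/training/employees.json")
empDF.printSchema()

// Option 1: dotted column references inside SQL.
empDF.createOrReplaceTempView("employees")
spark.sql("SELECT empno, ename, address.city, address.country FROM employees").show()

// Option 2: flatten the struct into a new data frame with plain columns.
val flatDF = empDF.select($"empno", $"ename", $"address.city", $"address.country", $"address.pincode")
flatDF.printSchema()

// XML needs an external reader; with spark-xml on the classpath, rowTag names
// the repeated element to treat as a row (package and tag name assumed here).
val xmlDF = spark.read.format("com.databricks.spark.xml").option("rowTag", "record")
  .load("/user/training/sample.xml")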
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4029086309.jpg'}, {'end': 29470.125, 'src': 'embed', 'start': 29435.013, 'weight': 10, 'content': [{'end': 29439.036, 'text': "i don't think much people are using it this one.", 'start': 29435.013, 'duration': 4.023}, {'end': 29441.658, 'text': 'mostly people prefer orc or parquet.', 'start': 29439.036, 'duration': 2.622}, {'end': 29444.38, 'text': 'now again there is this competition.', 'start': 29441.658, 'duration': 2.722}, {'end': 29448.176, 'text': 'if you are using hortonworks, you will see ORC more.', 'start': 29444.38, 'duration': 3.796}, {'end': 29451.697, 'text': 'If you are using Cloudera, you will see Parquet more.', 'start': 29448.856, 'duration': 2.841}, {'end': 29459.18, 'text': 'So Hortonworks was originally, I think they were contributing a lot to the ORC project.', 'start': 29452.698, 'duration': 6.482}, {'end': 29470.125, 'text': 'So when you start working on Hortonworks platform, Hortonworks will say that, you know what, if you want to compress your data, please use ORC format.', 'start': 29459.661, 'duration': 10.464}], 'summary': 'Hortonworks promotes orc format over parquet, preferred by cloudera users.', 'duration': 35.112, 'max_score': 29435.013, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4029435013.jpg'}, {'end': 29542.468, 'src': 'embed', 'start': 29511.936, 'weight': 4, 'content': [{'end': 29521.521, 'text': 'And this is really really important, because you never want to store the data as it is on big data clusters,', 'start': 29511.936, 'duration': 9.585}, {'end': 29524.582, 'text': 'because that will consume a lot of space for you, right?', 'start': 29521.521, 'duration': 3.061}, {'end': 29533.246, 'text': 'So people are always searching for ways to reduce the total file size, right?', 'start': 29525.143, 'duration': 8.103}, {'end': 29537.467, 'text': 'And parquet is what you have in your syllabus, because you are learning Cloudera course, right?', 'start': 29533.627, 'duration': 3.84}, {'end': 29540.928, 'text': 'You are learning CCA 175..', 'start': 29538.147, 'duration': 2.781}, {'end': 29542.468, 'text': 'But yeah, ORC is very nice.', 'start': 29540.928, 'duration': 1.54}], 'summary': 'Reduce big data file size using parquet and orc formats for efficient storage.', 'duration': 30.532, 'max_score': 29511.936, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4029511936.jpg'}, {'end': 29616.607, 'src': 'embed', 'start': 29573.385, 'weight': 8, 'content': [{'end': 29574.945, 'text': 'There is no RC reader, I guess.', 'start': 29573.385, 'duration': 1.56}, {'end': 29576.226, 'text': 'RC is no longer there.', 'start': 29575.125, 'duration': 1.101}, {'end': 29581.168, 'text': 'But let me show you a sample so that you understand how this Parquet files looks like.', 'start': 29577.286, 'duration': 3.882}, {'end': 29596.974, 'text': 'So people will be thinking like, what do you mean by a Parquet file? Do we have the file? There is a file called babynames.parquet.', 'start': 29581.688, 'duration': 15.286}, {'end': 29609.943, 'text': 'Can I upload this? babynames.parquet to user training.', 'start': 29599.196, 'duration': 10.747}, {'end': 29616.607, 'text': "Oh, it's already there, babynames.parquet.", 'start': 29614.386, 'duration': 2.221}], 'summary': 'No rc reader available. 
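Since ORC and Parquet keep coming up as the preferred compressed formats, here is a small sketch of writing the same data frame out in both formats and reading one of them back. The paths are placeholders modelled on the babynames example.

// Write one data frame in both columnar formats; both come out heavily
// compressed compared with the raw text input, which is the point above.
val baby = spark.read.option("header", "true").csv("/user/training/babynames.csv")
baby.write.mode("overwrite").parquet("/user/training/babynames_parquet")
baby.write.mode("overwrite").orc("/user/training/babynames_orc")

// Reading Parquet back needs no schema declaration: it travels with the file.
val parquetDF = spark.read.parquet("/user/training/babynames_parquet")
parquetDF.printSchema()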
demonstrating parquet file with babynames.parquet.', 'duration': 43.222, 'max_score': 29573.385, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4029573385.jpg'}, {'end': 29790.627, 'src': 'embed', 'start': 29743.237, 'weight': 6, 'content': [{'end': 29753.985, 'text': 'uh, but where is this rdbms thing, if you guys allow me?', 'start': 29743.237, 'duration': 10.748}, {'end': 29755.686, 'text': 'okay, so we have it.', 'start': 29753.985, 'duration': 1.701}, {'end': 29757.988, 'text': 'uh, i just need to search for a file.', 'start': 29755.686, 'duration': 2.302}, {'end': 29761.55, 'text': 'then i will tell you what we are talking about.', 'start': 29757.988, 'duration': 3.562}, {'end': 29764.092, 'text': 'mysql dash connector.java.', 'start': 29761.55, 'duration': 2.542}, {'end': 29770.083, 'text': 'mysql dash Connector.java.', 'start': 29764.092, 'duration': 5.991}, {'end': 29780.005, 'text': 'Okay Inside this .bin.jar file.', 'start': 29774.524, 'duration': 5.481}, {'end': 29781.405, 'text': 'Okay So we got it.', 'start': 29780.225, 'duration': 1.18}, {'end': 29785.746, 'text': "Right So let's do one thing.", 'start': 29782.745, 'duration': 3.001}, {'end': 29790.627, 'text': 'Let us first verify whether MySQL is running on our cloud lab.', 'start': 29786.046, 'duration': 4.581}], 'summary': 'Discussion about locating and verifying mysql connector file for cloud lab.', 'duration': 47.39, 'max_score': 29743.237, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4029743237.jpg'}, {'end': 30010.17, 'src': 'embed', 'start': 29960.469, 'weight': 7, 'content': [{'end': 29966.332, 'text': 'Now, how do you use it? First, exit from the Spark shell that you have.', 'start': 29960.469, 'duration': 5.863}, {'end': 29971.615, 'text': 'And you need to start this Spark shell with..', 'start': 29968.233, 'duration': 3.382}, {'end': 29974.898, 'text': 'couple of arguments.', 'start': 29973.137, 'duration': 1.761}, {'end': 29977.198, 'text': 'let me show you what it is.', 'start': 29974.898, 'duration': 2.3}, {'end': 29984.52, 'text': 'so we will say spark shell and just copy paste it.', 'start': 29977.198, 'duration': 7.322}, {'end': 29988.601, 'text': 'okay, my sequel connector Java.', 'start': 29984.52, 'duration': 4.081}, {'end': 29991.942, 'text': 'what is a version we are using?', 'start': 29988.601, 'duration': 3.341}, {'end': 30005.568, 'text': 'five, one, eighteen, eighteen, dot jar 5118.', 'start': 29991.942, 'duration': 13.626}, {'end': 30010.17, 'text': "yeah, so what i'm going to do?", 'start': 30005.568, 'duration': 4.602}], 'summary': 'Using spark shell with mysql connector version 5.1.18.18.', 'duration': 49.701, 'max_score': 29960.469, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4029960469.jpg'}], 'start': 27696.518, 'title': 'Working with various data formats in spark', 'summary': 'Discusses the significance of json data in big data and iot, loading and analyzing yelp data in spark, querying nested json data in spark sql, handling big data formats such as orc and parquet, and connecting spark to a mysql database.', 'chapters': [{'end': 27875.583, 'start': 27696.518, 'title': 'Dealing with json data', 'summary': 'Discusses the importance of json data in representing and accessing various types of data, especially in big data and iot applications, with mongodb being a popular nosql database storing data in json format.', 
'duration': 179.065, 'highlights': ['JSON is widely used, especially in big data and IoT applications, and is a standardized way of representing data. JSON is one of the most widely used file formats, especially in the case of big data and IoT applications, and is a popular and standardized way of representing data.', 'MongoDB is popular because it stores data as JSON, allowing programmers to natively work with the data. MongoDB is popular because it stores data as JSON, allowing programmers to natively work with the data, making it extremely popular in the market.', 'JSON data is important in various areas and NoSQL databases like MongoDB are popular due to its use of JSON. JSON data is very important in various areas, and NoSQL databases like MongoDB are extremely popular due to their use of JSON.']}, {'end': 28499.124, 'start': 27875.976, 'title': 'Analyzing yelp data in spark', 'summary': 'Discusses loading yelp data in json format into spark, analyzing the semi-structured data, and creating a data frame with inferred schema, demonstrating the ease of processing and querying.', 'duration': 623.148, 'highlights': ['Spark automatically reads and infers schema from the JSON data, creating a data frame with nested columns, and enabling easy querying. Spark automatically reads the JSON data, infers its schema, and creates a data frame, simplifying the process of understanding and querying the data.', 'Discussion on packaging Spark programs using SBT for production, and the option to use the shell for easier program visualization. Package Spark programs using SBT for production, with the option to use the shell for easier program visualization, providing flexibility and versatility in program execution.', "Exploration of Yelp's business listing and data, highlighting the features such as restaurant details, reviews, reservation information, and the availability of public data for download. 
Exploration of Yelp's business listing, including restaurant details, reviews, reservation information, and the availability of public data for download, showcasing the breadth of available data for analysis."]}, {'end': 29315.869, 'start': 28499.524, 'title': 'Working with json and xml in spark sql', 'summary': 'Covers running sql queries on nested json data, demonstrating the speed and ease of working with json data in spark sql, and the process of reading and working with xml files in spark sql, including the customization required for certain scenarios.', 'duration': 816.345, 'highlights': ['Running SQL queries on nested JSON data in Spark SQL The speaker demonstrates running SQL queries on nested JSON data, showcasing the speed of the queries by retrieving state-wise restaurant data and highlighting the ease of working with JSON data in Spark SQL.', 'Demonstrating the process of reading and working with XML files in Spark SQL The speaker explains the process of reading and working with XML files in Spark SQL, including the identification of root and row tags, the use of the Databricks Spark XML package, and the customization required for certain scenarios.', 'Comparison between JSON and XML data handling in Spark SQL The speaker discusses the differences between handling JSON and XML data in Spark SQL, emphasizing the commonality and challenges of working with XML files and the customization required for reading XML files in certain scenarios.']}, {'end': 29743.237, 'start': 29315.869, 'title': 'Dealing with big data formats', 'summary': "Discusses challenges faced while working with various big data formats, including the best compression formats orc and parquet, with orc achieving the best compression and hive and cloudera's preference for parquet, highlighting the importance of using compressed files to reduce space in big data clusters.", 'duration': 427.368, 'highlights': ['ORC is the best compression format, achieving optimized row columnar format and preferred by Hortonworks platform, while Cloudera prefers Parquet, achieving around 50% compression, and heavily compressed, significantly reducing file size on big data clusters. ORC is the best compression format, achieving optimized row columnar format and preferred by Hortonworks platform, while Cloudera prefers Parquet, achieving around 50% compression, and heavily compressed, significantly reducing file size on big data clusters.', 'The challenges faced while working with big data include encountering files that are difficult to understand and read, which may require writing custom code and using parsers to convert and read the data, highlighting the practical difficulties in real-world data processing. The challenges faced while working with big data include encountering files that are difficult to understand and read, which may require writing custom code and using parsers to convert and read the data, highlighting the practical difficulties in real-world data processing.', 'Spark SQL offers excellent connectivity options, enabling the reading of data from various sources such as RDBMS and Hive, providing flexibility and compatibility with different data platforms. 
Spark SQL offers excellent connectivity options, enabling the reading of data from various sources such as RDBMS and Hive, providing flexibility and compatibility with different data platforms.']}, {'end': 30076.083, 'start': 29743.237, 'title': 'Connecting mysql database to spark', 'summary': 'Introduces the process of connecting to a mysql database from spark, verifying the database and data, and discussing the usage of the mysql connector 5.1.18.jar to read data into spark.', 'duration': 332.846, 'highlights': ['The speaker demonstrates the process of logging into a MySQL database running on the cloud lab, showing the presence of a retail DB with tables like customers and orders, and accessing data from the orders table with around 68,000 rows. The speaker logs into a MySQL database on the cloud lab, verifies the existence of a retail DB with tables like customers and orders, and shows accessing data from the orders table with approximately 68,000 rows.', 'The importance of having the MySQL connector or MySQL driver (JDBC) to interact with RDBMS like MySQL is emphasized, and the speaker verifies the presence of the MySQL connector 5.1.18.jar, explaining its usage to connect to Spark. The speaker emphasizes the significance of having the MySQL connector or driver to interact with RDBMS and verifies the presence of the MySQL connector 5.1.18.jar, explaining its usage to connect to Spark.', 'The speaker introduces the process of starting the Spark shell with the MySQL connector 5.1.18.jar added as a driver to the session, and discusses creating a Java Utilities Properties object to read data from the MySQL database into Spark. The speaker explains starting the Spark shell with the MySQL connector 5.1.18.jar added as a driver and discusses creating a Java Utilities Properties object to read data from the MySQL database into Spark.']}], 'duration': 2379.565, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4027696518.jpg', 'highlights': ['JSON is widely used in big data and IoT applications, and is a standardized way of representing data.', 'MongoDB is popular because it stores data as JSON, allowing programmers to natively work with the data.', 'Spark automatically reads and infers schema from JSON data, creating a data frame with nested columns, and enabling easy querying.', 'Package Spark programs using SBT for production, with the option to use the shell for easier program visualization.', "Exploration of Yelp's business listing and data, highlighting restaurant details, reviews, reservation information, and public data availability.", 'Running SQL queries on nested JSON data in Spark SQL, showcasing the speed of the queries and the ease of working with JSON data.', 'ORC is the best compression format, achieving optimized row columnar format and preferred by Hortonworks platform.', 'Challenges faced while working with big data include encountering difficult-to-understand files, requiring custom code and parsers.', 'Spark SQL offers excellent connectivity options, enabling reading of data from various sources such as RDBMS and Hive.', 'The speaker demonstrates the process of logging into a MySQL database running on the cloud lab and accessing data from the orders table with around 68,000 rows.', 'The importance of having the MySQL connector or driver to interact with RDBMS like MySQL is emphasized.', 'The speaker introduces the process of starting the Spark shell with the MySQL connector 5.1.18.jar added as a driver to the session.']}, 
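Tying together the MySQL steps described above, the sketch below shows the shell invocation with the connector jar on the classpath and a JDBC read of the orders table. The jar path, host, database name and credentials are placeholder values for the cloud-lab setup, not literal values from the demo.

// Shell invocation (one line): the MySQL connector jar must be visible to the driver.
//   spark-shell --jars /usr/share/java/mysql-connector-java-5.1.18-bin.jar

val props = new java.util.Properties()
props.setProperty("user", "retail_user")          // placeholder credentials
props.setProperty("password", "retail_password")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// Read the orders table from the retail database into a data frame.
val ordersDF = spark.read.jdbc(
  "jdbc:mysql://localhost:3306/retail_db",        // placeholder host and port
  "orders",
  props)

println(ordersDF.count())    // roughly 68,000 rows in the demo database
ordersDF.show(5)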
{'end': 32641.032, 'segs': [{'end': 31376.451, 'src': 'embed', 'start': 31336.801, 'weight': 4, 'content': [{'end': 31347.712, 'text': "Now, we already have the Spark session, so I don't have to worry about this, right? Now, if you want to read, all you need to do is this.", 'start': 31336.801, 'duration': 10.911}, {'end': 31353.695, 'text': 'So I can simply go here and I can simply say Spark dot SQL.', 'start': 31348.432, 'duration': 5.263}, {'end': 31358.558, 'text': 'And I can say select star from.', 'start': 31355.236, 'duration': 3.322}, {'end': 31367.383, 'text': 'I think this should work.', 'start': 31358.578, 'duration': 8.805}, {'end': 31376.451, 'text': "See, you don't have to do any of this, like creating hive context and doing this, that and all.", 'start': 31370.467, 'duration': 5.984}], 'summary': 'Using spark session simplifies data querying, eliminating need for creating hive context.', 'duration': 39.65, 'max_score': 31336.801, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4031336801.jpg'}, {'end': 32335.123, 'src': 'embed', 'start': 32303.493, 'weight': 0, 'content': [{'end': 32314.596, 'text': 'what is the advantage of doing a broadcast join is that when you do broadcast join, so this data frame one is 20 gb.', 'start': 32303.493, 'duration': 11.103}, {'end': 32323.499, 'text': 'right, imagine data frame one is lying on 33 machines because it is a very big file.', 'start': 32314.596, 'duration': 8.903}, {'end': 32328.301, 'text': "it is divided into what blocks and let's say, the data is lying on 33 machines.", 'start': 32323.499, 'duration': 4.802}, {'end': 32335.123, 'text': 'DataFrame 2 will be normally relying on only one machine because it is a very small file.', 'start': 32329.781, 'duration': 5.342}], 'summary': 'Broadcast join optimizes data processing by minimizing data movement, e.g. reducing 20gb to 1 machine.', 'duration': 31.63, 'max_score': 32303.493, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4032303493.jpg'}, {'end': 32599.193, 'src': 'embed', 'start': 32572.088, 'weight': 2, 'content': [{'end': 32577.29, 'text': 'So these are some of the things which will become very clear to you when you start working.', 'start': 32572.088, 'duration': 5.202}, {'end': 32586.673, 'text': 'Maybe not immediately you will understand this, but the size of the partition, the number of partitions to be used,', 'start': 32578.45, 'duration': 8.223}, {'end': 32592.356, 'text': 'these all have a very strong impact on your performance.', 'start': 32586.673, 'duration': 5.683}, {'end': 32596.297, 'text': "Any questions on these properties? 
Anything that you don't understand, let me know.", 'start': 32592.756, 'duration': 3.541}, {'end': 32599.193, 'text': 'on these properties we were discussing.', 'start': 32597.392, 'duration': 1.801}], 'summary': 'Partition size and number impact performance significantly.', 'duration': 27.105, 'max_score': 32572.088, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4032572088.jpg'}], 'start': 30076.203, 'title': 'Spark sql and hive integration', 'summary': 'Covers connecting to rdbms like mysql in spark, integrating spark sql with hive, working with spark and hive, spark hive integration, performance tuning in spark sql including caching data, tuning compression, and adjusting batch sizes, and optimizing join operations in spark sql for improved performance.', 'chapters': [{'end': 30608.825, 'start': 30076.203, 'title': 'Connecting to rdbms and hive in spark', 'summary': 'Discusses the process of connecting to an rdbms like mysql in spark, fixing errors related to driver and connection string, and explains the practical scenario of reading data from rdbms into spark. it also touches upon accessing hive from spark for data operations.', 'duration': 532.622, 'highlights': ['Connecting to RDBMS like MySQL in Spark The chapter discusses the process of connecting to an RDBMS like MySQL in Spark, fixing errors related to driver and connection string.', 'Practical scenario of reading data from RDBMS into Spark The practical scenario of reading data from RDBMS into Spark is explained, emphasizing the use case for join operations between large and small datasets.', 'Accessing Hive from Spark for data operations The process of accessing Hive from Spark for data operations is briefly touched upon, including showing databases and tables in Hive.']}, {'end': 31029.886, 'start': 30608.825, 'title': 'Integrating spark sql with hive', 'summary': 'Discusses the process of integrating spark sql with hive, including steps such as copying the hive-site.xml file to the spark configuration folder, creating a hive context to connect with hive from the spark sql session, and creating a hive table using spark sql.', 'duration': 421.061, 'highlights': ['The process of integrating Spark SQL with Hive involves copying the hive-site.xml file to the Spark configuration folder, which is typically done by the administrator in big data infrastructure setups. This process is typically handled by the administrator in big data infrastructures, involving copying the hive-site.xml file from the hive configuration folder into the Spark configuration folder.', 'Creating a hive context object allows connecting with Hive from the Spark SQL session and using it for operations on the hive table. Creating a hive context object enables the connection with Hive from the Spark SQL session, allowing for operations on the hive table.', "The chapter also covers the creation of a hive table using Spark SQL, demonstrated through the 'create table' statement for the Yahoo stocks table. 
The creation of a hive table using Spark SQL is demonstrated through the 'create table' statement for the Yahoo stocks table."]}, {'end': 31278.796, 'start': 31031.275, 'title': 'Working with spark and hive', 'summary': 'Discusses the process of creating a hive context, loading data into hive tables using spark, and the deprecation of using hive context in the latest versions of spark.', 'duration': 247.521, 'highlights': ["The process involves creating a Hive context object using 'new HiveContext' and running Hive queries directly from the Spark SQL command line.", "Loading data into a Hive table is done using commands such as 'myhivecontext.sql.load data in path' and verifying the data in Hive using commands like 'show tables' and 'select star from table'.", 'The technique of creating a Hive context and accessing Hive tables is considered older and deprecated in the latest versions of Spark, as the Spark session itself supports these functionalities.', 'An important point to note is that when creating a Hive context, a deprecation warning is received, indicating that the technique is outdated and should be rerun with deprecation.']}, {'end': 31750.341, 'start': 31279.277, 'title': 'Spark hive integration', 'summary': 'Discusses the process of integrating spark with apache hive, emphasizing on reading and writing data, using spark sql within the spark session, and saving data frames in hive tables with different save modes.', 'duration': 471.064, 'highlights': ['The recommended way in Spark version 2.2 and 2.3 is to use the spark session directly to read any table from hive, providing a simplified approach for hive integration.', 'When saving a data frame in a hive table, different save modes such as append, overwrite, and replace can be used, allowing for flexibility in managing the data being saved.', 'Using Spark SQL within the Spark session allows for simple reading of hive tables by just specifying the table name and running the desired query, streamlining the data retrieval process.', 'The configuration for Spark SQL to read and write data stored in Apache Hive involves placing Hive Site XML or Core Site XML in the Spark configuration, providing an easy setup process for hive integration.', 'By using spark.read.csv to read data into Spark and then saving it in a hive table using save as table with the specified save mode, the data can be easily transferred and stored in hive, enhancing the data management capabilities.', 'The process of integrating Spark with Apache Hive involves utilizing the Spark session for hive integration, simplifying the reading and storing of data in hive tables, ensuring a seamless and efficient data management process.']}, {'end': 32230.108, 'start': 31750.341, 'title': 'Performance tuning in spark sql', 'summary': 'Discusses performance tuning in spark sql, emphasizing the benefits of caching data in memory, tuning compression, and adjusting batch sizes to improve query execution and reduce memory usage, with techniques such as caching tables, tuning compression codecs, setting batch sizes, and adjusting partitioning properties.', 'duration': 479.767, 'highlights': ['Caching tables in Spark SQL can improve performance, with the ability to cache data frames using in-memory columnar format, resulting in faster queries on the same data frame.', 'Tuning compression codecs in Spark SQL can automatically select appropriate compression codecs for columns based on data statistics, optimizing memory usage based on the unique values within each column.', 'Adjusting 
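A condensed sketch of the newer Spark-to-Hive flow described above: with hive-site.xml copied into Spark's configuration directory and Hive support enabled on the session, Hive tables can be read directly and a data frame can be saved back as a Hive table with a chosen save mode. Table names and the file path are placeholders based on the Yahoo stocks example.

import org.apache.spark.sql.{SaveMode, SparkSession}

// In spark-shell on a Hive-enabled cluster the session already has Hive
// support; a standalone application asks for it explicitly.
val spark = SparkSession.builder()
  .appName("hive-integration-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Read straight from Hive -- no separate HiveContext is needed any more.
spark.sql("SHOW TABLES").show()
spark.sql("SELECT * FROM yahoo_stocks LIMIT 10").show()

// Load a CSV file and persist it as a Hive table; the save mode controls
// whether an existing table is appended to or overwritten.
val stocks = spark.read.option("header", "true").csv("/user/training/yahoo_stocks.csv")
stocks.write.mode(SaveMode.Overwrite).saveAsTable("yahoo_stocks_spark")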
batch sizes for columnar caching in Spark SQL can impact memory utilization and compression, with larger batch sizes potentially improving memory usage but also carrying a risk of out-of-memory errors.', 'Setting properties such as maximum partition bytes and estimated cost to open a file in Spark SQL can optimize file reading and partitioning, crucial for efficient query execution.', 'Configuring parameters like broadcast timeout and auto broadcast join threshold in Spark SQL can optimize broadcast joins and table broadcasting for improved query performance.']}, {'end': 32641.032, 'start': 32230.108, 'title': 'Optimizing join operations in spark sql', 'summary': 'Explains how to optimize join operations in spark sql by using broadcast joins, setting broadcast join threshold, configuring spark sql shuffle partitions, and managing partition size to improve performance, which can make join operations two to three times faster.', 'duration': 410.924, 'highlights': ['Using broadcast joins in Spark SQL can make join operations two to three times faster by broadcasting the smaller file to all the machines where the larger file is distributed. When using broadcast joins, the smaller file is broadcasted to all the machines where the larger file is distributed, making join operations two to three times faster.', 'Setting the broadcast join threshold to a value less than 100 MB can speed up the join operation, but attempting broadcast join for files larger than 100 MB can lead to performance issues. Setting the broadcast join threshold to a value less than 100 MB can speed up the join operation, while attempting broadcast join for files larger than 100 MB can lead to performance issues.', 'Configuring the number of partitions to use when shuffling data for joins or aggregations using Spark SQL shuffle partitions is important, with the default number being 200 and sometimes recommended to be more than 200 to avoid shuffle operation failures. Configuring the number of partitions to use when shuffling data for joins or aggregations using Spark SQL shuffle partitions is important, with the default number being 200 and sometimes recommended to be more than 200 to avoid shuffle operation failures.', 'Partition size plays a critical role in shuffle and join operations, as partitions larger than 2 GB cannot support shuffle or join operations, and repartitioning can be used to manage partition size effectively. 
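The tuning knobs discussed here are ordinary Spark SQL settings, so a compact sketch of how they are applied from the shell follows. The numeric values are illustrative rather than recommendations, and largeDF / smallDF stand in for the 20 GB data frame and the small lookup table from the join example.

// Cache a table in the in-memory columnar format; repeated queries on it
// then avoid re-reading the source data.
spark.catalog.cacheTable("orders")
// ... run queries ...
spark.catalog.uncacheTable("orders")

// Columnar cache tuning: compression and batch size per column batch.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

// File splitting, broadcast and shuffle behaviour.
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L)     // 128 MB per partition
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760L)   // broadcast tables under 10 MB
spark.conf.set("spark.sql.broadcastTimeout", 300)                   // seconds
spark.conf.set("spark.sql.shuffle.partitions", 200)                 // partitions used when shuffling

// Forcing a broadcast join explicitly: ship the small data frame to every
// machine that holds a block of the large one, avoiding a full shuffle.
import org.apache.spark.sql.functions.broadcast
val largeDF = spark.table("orders")          // stands in for the big fact table
val smallDF = spark.table("order_status")    // stands in for the small lookup table
val joined  = largeDF.join(broadcast(smallDF), Seq("status"))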
']}], 'duration': 2564.829, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4030076203.jpg', 'highlights': ['Caching tables in Spark SQL can improve performance, with the ability to cache data frames using in-memory columnar format, resulting in faster queries on the same data frame.', 'Using broadcast joins in Spark SQL can make join operations two to three times faster by broadcasting the smaller file to all the machines where the larger file is distributed.', 'The recommended way in Spark version 2.2 and 2.3 is to use the spark session directly to read any table from hive, providing a simplified approach for hive integration.', 'Tuning compression codecs in Spark SQL can automatically select appropriate compression codecs for columns based on data statistics, optimizing memory usage based on the unique values within each column.', 'The process of integrating Spark with Apache Hive involves utilizing the Spark session for hive integration, simplifying the reading and storing of data in hive tables, ensuring a seamless and efficient data management process.']}, {'end': 34897.674, 'segs': [{'end': 32989.705, 'src': 'embed', 'start': 32954.841, 'weight': 4, 'content': [{'end': 32956.042, 'text': 'You have a chat server.', 'start': 32954.841, 'duration': 1.201}, {'end': 32958.383, 'text': 'What is a chat server doing?', 'start': 32957.002, 'duration': 1.381}, {'end': 32962.523, 'text': 'It is storing all the records of the chats right?', 'start': 32958.463, 'duration': 4.06}, {'end': 32967.046, 'text': 'Now, probably the chat server needs to communicate with the database server.', 'start': 32963.345, 'duration': 3.701}, {'end': 32973.788, 'text': "End of the day, probably the data is being pushed to some sort of a, let's say, NoSQL database or something.", 'start': 32968.066, 'duration': 5.722}, {'end': 32979.797, 'text': 'And probably the security system also wants the data from the chat server,', 'start': 32974.713, 'duration': 5.084}, {'end': 32989.705, 'text': 'because the security system is probably analyzing the chat logs and then making sure that there are no threats in the chat or something like that.', 'start': 32979.797, 'duration': 9.908}], 'summary': 'Chat server stores chat records, communicates with database, and feeds data to security system for analysis.', 'duration': 34.864, 'max_score': 32954.841, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4032954841.jpg'}, {'end': 33052.749, 'src': 'embed', 'start': 33004.904, 'weight': 1, 'content': [{'end': 33007.305, 'text': 'You can chat with him and then you can trade with him.', 'start': 33004.904, 'duration': 2.401}, {'end': 33010.608, 'text': 'So this bank was actually analyzing the chat logs.', 'start': 33007.646, 'duration': 2.962}, {'end': 33019.413, 'text': 'To understand whether, you know, there were any threats or if the trade is not happening properly, etc.', 'start': 33011.708, 'duration': 7.705}, {'end': 33022.175, 'text': "Right If I'm placing my order through chat.", 'start': 33019.833, 'duration': 2.342}, {'end': 33025.396, 'text': 'Right Then what is the guarantee?
Right.', 'start': 33022.955, 'duration': 2.441}, {'end': 33026.237, 'text': 'It will happen.', 'start': 33025.617, 'duration': 0.62}, {'end': 33027.678, 'text': 'So the bank had an application.', 'start': 33026.317, 'duration': 1.361}, {'end': 33033.436, 'text': 'which will collect this chat messages, analyze it for any discrepancies.', 'start': 33028.413, 'duration': 5.023}, {'end': 33038.46, 'text': "So first thing, the chat server is communicating with, let's say, the database server.", 'start': 33034.157, 'duration': 4.303}, {'end': 33042.642, 'text': 'It is also communicating with the security system.', 'start': 33039.1, 'duration': 3.542}, {'end': 33051.947, 'text': "Let's say there is a real time monitoring system which is analyzing the chat messages and displaying things like you know how many agents are active,", 'start': 33043.423, 'duration': 8.524}, {'end': 33052.749, 'text': 'so on and so forth.', 'start': 33051.947, 'duration': 0.802}], 'summary': 'Bank uses chat analysis for trade monitoring and security.', 'duration': 47.845, 'max_score': 33004.904, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4033004904.jpg'}, {'end': 33208.32, 'src': 'embed', 'start': 33176.197, 'weight': 0, 'content': [{'end': 33182.48, 'text': 'and kafka can be easily incorporated with any big data tool or framework.', 'start': 33176.197, 'duration': 6.283}, {'end': 33188.981, 'text': 'so basically, what happens in kafka is that there are producers and consumers.', 'start': 33182.48, 'duration': 6.501}, {'end': 33192.262, 'text': "that's what you see in the picture.", 'start': 33188.981, 'duration': 3.281}, {'end': 33198.463, 'text': 'producers are the applications which are sending the data.', 'start': 33192.262, 'duration': 6.201}, {'end': 33201.964, 'text': 'so producers send the data to the kafka cluster.', 'start': 33198.463, 'duration': 3.501}, {'end': 33208.32, 'text': 'consumers subscribe and receive the data from the Kafka cluster.', 'start': 33201.964, 'duration': 6.356}], 'summary': 'Kafka integrates with big data tools, using producers to send data to a kafka cluster, which consumers subscribe to.', 'duration': 32.123, 'max_score': 33176.197, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4033176197.jpg'}], 'start': 32641.032, 'title': 'Kafka in real-time data analytics', 'summary': 'Discusses the need for kafka in real-time data analytics, emphasizing its efficiency in real-time data collection, architecture overview, scalability, offset and retention concepts, data replication, and consumer groups. examples include sensor data, stock market data, and configuration for data retention and capacity.', 'chapters': [{'end': 32895.362, 'start': 32641.032, 'title': 'Kafka and real-time data analytics', 'summary': 'Discusses the need for kafka in real-time data analytics, highlighting the challenges faced by industries, such as the requirement for real-time data collection and analysis, and the examples of sensor data and stock market data. 
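The segments above describe Kafka's publish-subscribe model: producers publish records to the cluster and any number of consumers (a database loader, a security system, a monitoring dashboard) subscribe to them. A minimal Scala producer sketch using the standard Kafka client API follows; the broker address, topic name, key, and message text are assumptions made for illustration.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Sketch of a producer publishing chat-log-like messages to a topic.
// Broker address and topic name are placeholders.
object ChatLogProducer extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)

  // Each chat message becomes one record on the "chat-logs" topic; every
  // subscribed consumer group receives its own copy of the stream.
  producer.send(new ProducerRecord[String, String]("chat-logs", "user42", "placed order for 100 shares"))
  producer.flush()
  producer.close()
}
```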
it also mentions the use of scala with eclipse and sbt for recompiling code for performance improvement.', 'duration': 254.33, 'highlights': ['The need for real-time data analytics is emphasized through examples of sensor data analysis in thermal power plants and real-time stock market data analysis, showcasing the challenges faced by industries.', 'The discussion of using Scala with Eclipse and SBT for recompiling code to enhance performance provides practical insights into optimizing code for better runtime efficiency.']}, {'end': 33428.778, 'start': 32896.384, 'title': 'Real-time data collection with kafka', 'summary': 'Discusses the need for real-time data collection in organizations with multiple front end servers, and how kafka efficiently solves the problem by acting as a message queue, managing and maintaining the real-time stream of data from different applications, and decoupling the data pipeline.', 'duration': 532.394, 'highlights': ['Kafka efficiently collects real-time data from multiple front end servers, such as web servers, app servers, and others, by acting as a message queue, managing and maintaining the real-time stream of data from different applications, and decoupling the data pipeline.', 'Kafka is designed to work on a fairly large amount of data and can be easily incorporated with any big data tool or framework, providing a solution for managing complex data pipelines.', 'Producers send data to the Kafka cluster, while consumers subscribe and receive the data, making it a typical publish-subscribe message queue, enabling asynchronous communication and message sending.', 'Kafka, originally created at LinkedIn, is written in Scala and Java, and is promoted by the company Confluent, which introduced the Kafka Streams project to add processing capabilities to Kafka.', 'Kafka does not interfere with the processing part of the data and does not transform the data, but the new project Kafka Streams aims to add processing capabilities to Kafka.']}, {'end': 33913.427, 'start': 33429.718, 'title': 'Understanding kafka architecture', 'summary': 'Provides an overview of kafka architecture, including the role of brokers, producers, consumers, topics, and spark streaming, emphasizing the flexibility of data formats and sources, and the role of partitions in managing data distribution and replication.', 'duration': 483.709, 'highlights': ['Brokers in a Kafka cluster manage and mediate the conversation between systems, with each node in a cluster becoming a broker, and are responsible for storing messages. In a Kafka cluster, each server or node becomes a broker, managing and mediating the conversation between different systems and storing messages. For instance, in a 100-node Kafka cluster, each server becomes a broker.', 'Kafka is agnostic to the message format, storing messages as byte arrays, enabling flexibility in data formats such as string, JSON, or Avro. Kafka does not impose constraints on the format of messages, treating them as byte arrays, allowing developers to decide the format, whether string, JSON, or other formats.', 'Topics are used to distinguish and organize messages, with each producer sending data to specific topics, preventing data mixing and enabling effective data organization. 
Topics in Kafka are crucial for organizing messages, ensuring that data from different producers is not mixed up, and allowing producers to send data to specific topics for effective organization.', 'Producers are processes that publish data to topics, with examples including sensor systems, Apache Flume, Amazon Kinesis, and Java programs, demonstrating the diverse sources of data. Producers in Kafka can include a wide variety of sources, such as sensor systems, Apache Flume, Amazon Kinesis, and Java programs, highlighting the diverse nature of data sources.', 'Spark streaming serves as a consumer for Kafka, enabling real-time analytics by processing data received from Kafka, exemplifying its role as a real-time analytics engine. Spark streaming acts as a consumer for Kafka, facilitating real-time analytics by processing data received from Kafka, particularly relevant when dealing with high data volumes, such as analyzing sensor data.']}, {'end': 34298.077, 'start': 33913.945, 'title': 'Kafka: scalability and partitions', 'summary': 'Explains the concept of kafka topics, partitions, and brokers, emphasizing the importance of partitioning for scalability, round-robin distribution of data by producers, and the role of zookeeper in storing cluster status and consumer offsets.', 'duration': 384.132, 'highlights': ['Kafka topics are divided into multiple partitions for scalability, with each partition assigned to a broker, allowing for efficient data storage and management. When a topic is divided into multiple partitions, each partition is assigned to a broker, enabling efficient scalability and storage of large volumes of real-time data.', 'Producers utilize a round-robin algorithm to equally distribute data records to different partitions within a topic, and Zookeeper stores information about cluster status and consumer offsets. The round-robin algorithm ensures that data sent by producers is equally distributed among partitions, and Zookeeper is responsible for storing crucial information about cluster status and consumer offsets.', 'Producers can configure the priority of data records and select the partition to send the message per topic, while brokers assign offsets to messages and commit them to storage on disk. Producers have the flexibility to configure the priority of data records and select the target partition, while brokers assign offsets to messages and store them on disk, ensuring efficient data management.', "Kafka supports replication of topic partitions across multiple brokers, ensuring data availability and fault tolerance in case of broker failures. Kafka's support for replication of topic partitions across multiple brokers ensures data availability and fault tolerance, critical for maintaining uninterrupted data processing in case of broker failures."]}, {'end': 34546.827, 'start': 34299.138, 'title': 'Kafka offset and retention', 'summary': "Explains the concept of kafka offset, which allows consumers to resume reading data from a specific point and the retention policy, where kafka retains data for a specified period, with an example of a cluster's 7-day retention and a configuration for a 3-day retention for a 500-terabyte capacity, along with the role of zookeeper in coordinating messages' sequence.", 'duration': 247.689, 'highlights': ["Kafka retains data for one week by default, but this can be configured differently in production clusters, such as retaining data for three days in a 500-terabyte capacity cluster. 
In production clusters, Kafka's retention period can be configured differently, such as retaining data for three days in a 500-terabyte capacity cluster, ensuring efficient data management.", 'Kafka offset allows consumers to resume reading data from a specific point, with each message having an offset ID, and the need for consumers to store this offset, typically in ZooKeeper. Kafka offset allows consumers to resume reading data from a specific point, with each message having an offset ID, and the need for consumers to store this offset, typically in ZooKeeper, ensuring continuity in data consumption.', "ZooKeeper plays a role in coordinating messages' sequence, ensuring that messages are delivered in the sequence based on the producer's order. ZooKeeper plays a role in coordinating messages' sequence, ensuring that messages are delivered in the sequence based on the producer's order, maintaining the integrity of data delivery."]}, {'end': 34897.674, 'start': 34547.287, 'title': 'Understanding kafka data replication and consumer groups', 'summary': 'Explains data replication in kafka, with topics divided into partitions and replicated across multiple brokers, and how consumer groups read messages from different partitions while managing offsets and processing data.', 'duration': 350.387, 'highlights': ['Kafka topics are divided into partitions, which are replicated across multiple brokers, with each partition having a leader broker and additional replicas, allowing for fault tolerance and high availability. Kafka topics are divided into partitions, replicated across multiple brokers, with each partition having a leader broker and additional replicas. This allows for fault tolerance and high availability. Replication can be set to three, similar to Hadoop.', 'Consumer groups in Kafka consist of one or more consumers, and each consumer within a group is responsible for reading messages from specific partitions, managing offsets, and processing the data. Consumer groups in Kafka consist of one or more consumers, and each consumer within a group is responsible for reading messages from specific partitions, managing offsets, and processing the data. Consumer groups ensure efficient message consumption and processing.', 'Consumers store the offset of messages they have read, and messages can be read by multiple consumers, providing fault tolerance and enabling parallel processing of messages. Consumers store the offset of messages they have read, and messages can be read by multiple consumers, providing fault tolerance and enabling parallel processing of messages. 
This ensures efficient and fault-tolerant message processing.']}], 'duration': 2256.642, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4032641032.jpg', 'highlights': ['Kafka efficiently collects real-time data from multiple front end servers, such as web servers, app servers, and others, by acting as a message queue, managing and maintaining the real-time stream of data from different applications, and decoupling the data pipeline.', 'Kafka topics are divided into multiple partitions for scalability, with each partition assigned to a broker, allowing for efficient data storage and management.', 'Kafka supports replication of topic partitions across multiple brokers, ensuring data availability and fault tolerance in case of broker failures.', 'Kafka retains data for one week by default, but this can be configured differently in production clusters, such as retaining data for three days in a 500-terabyte capacity cluster.', 'Consumer groups in Kafka consist of one or more consumers, and each consumer within a group is responsible for reading messages from specific partitions, managing offsets, and processing the data.']}, {'end': 36251.463, 'segs': [{'end': 35016.896, 'src': 'embed', 'start': 34957.315, 'weight': 3, 'content': [{'end': 34959.516, 'text': "And I'm being very honest here.", 'start': 34957.315, 'duration': 2.201}, {'end': 34966.677, 'text': 'Okay Because the problem is that if you ask me, is it a developer tool? No.', 'start': 34959.876, 'duration': 6.801}, {'end': 34970.118, 'text': 'If you ask me whether developers needs it? Yes.', 'start': 34967.717, 'duration': 2.401}, {'end': 34978.662, 'text': 'So, basically, the idea of Zookeeper is that it is a project which was developed at Yahoo right?', 'start': 34971.4, 'duration': 7.262}, {'end': 34984.784, 'text': 'And its original idea was to maintain the status of different services.', 'start': 34979.302, 'duration': 5.482}, {'end': 34990.125, 'text': "In the sense, let's say you have a 100 node Hadoop cluster.", 'start': 34986.304, 'duration': 3.821}, {'end': 34992.426, 'text': 'You have a 100 node Hadoop cluster.', 'start': 34990.505, 'duration': 1.921}, {'end': 34997.387, 'text': 'Now, in the 100 node Hadoop cluster, you will have HDFS.', 'start': 34993.526, 'duration': 3.861}, {'end': 35000.608, 'text': 'So you have name node, data node, and these guys.', 'start': 34997.987, 'duration': 2.621}, {'end': 35005.334, 'text': 'You have Yarn, you have a resource manager, node manager, these guys.', 'start': 35001.493, 'duration': 3.841}, {'end': 35014.476, 'text': 'You may also have Kafka, right? So in Kafka, you have multiple brokers, then there is leader, then there is follower for each partition, et cetera.', 'start': 35006.074, 'duration': 8.402}, {'end': 35016.896, 'text': 'You may also have HBase.', 'start': 35015.376, 'duration': 1.52}], 'summary': 'Zookeeper is a project developed at yahoo to maintain status of different services like hadoop, yarn, kafka, and hbase.', 'duration': 59.581, 'max_score': 34957.315, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4034957315.jpg'}, {'end': 35116.56, 'src': 'embed', 'start': 35090.733, 'weight': 6, 'content': [{'end': 35095.434, 'text': 'Now, anything can happen, right? 
The primary can go down, standby can take over.', 'start': 35090.733, 'duration': 4.701}, {'end': 35096.615, 'text': 'That can also go down.', 'start': 35095.474, 'duration': 1.141}, {'end': 35098.055, 'text': 'Another standby can take over.', 'start': 35096.675, 'duration': 1.38}, {'end': 35104.577, 'text': 'So when I want to connect with my cluster, I want to know who is the current active name node.', 'start': 35098.655, 'duration': 5.922}, {'end': 35106.817, 'text': 'I go and ask Zookeeper.', 'start': 35105.077, 'duration': 1.74}, {'end': 35108.598, 'text': 'That is what it does.', 'start': 35107.738, 'duration': 0.86}, {'end': 35116.56, 'text': 'Also, in terms of when you look at HBase, You want to read and write data from HBase.', 'start': 35109.458, 'duration': 7.102}], 'summary': 'In a cluster, multiple nodes can take over; zookeeper helps identify active name node; hbase for reading and writing data.', 'duration': 25.827, 'max_score': 35090.733, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4035090733.jpg'}, {'end': 35167.517, 'src': 'embed', 'start': 35135.67, 'weight': 9, 'content': [{'end': 35143.71, 'text': 'So typically when we install Zookeeper, we install it on a cluster like either three machines or five machines.', 'start': 35135.67, 'duration': 8.04}, {'end': 35147.251, 'text': 'So in this example, you can see there are five servers.', 'start': 35144.29, 'duration': 2.961}, {'end': 35149.391, 'text': 'They are all Zookeeper servers.', 'start': 35147.851, 'duration': 1.54}, {'end': 35155.753, 'text': 'Now clients can query the information from any Zookeeper servers.', 'start': 35149.872, 'duration': 5.881}, {'end': 35167.517, 'text': 'So Zookeeper is a centralized place where it stores the status of each service in your cluster and the machine.', 'start': 35156.674, 'duration': 10.843}], 'summary': 'Zookeeper is typically installed on a cluster of three to five machines, serving as a centralized place for storing service status in the cluster.', 'duration': 31.847, 'max_score': 35135.67, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4035135670.jpg'}, {'end': 35357.034, 'src': 'embed', 'start': 35326.128, 'weight': 2, 'content': [{'end': 35330.071, 'text': 'Now, one more thing you need to understand is that Kafka clusterfiles.', 'start': 35326.128, 'duration': 3.943}, {'end': 35334.855, 'text': 'So you can have single node, single broker.', 'start': 35330.432, 'duration': 4.423}, {'end': 35337.758, 'text': 'What is this? One server.', 'start': 35335.556, 'duration': 2.202}, {'end': 35339.467, 'text': 'one broker.', 'start': 35338.607, 'duration': 0.86}, {'end': 35340.868, 'text': 'this is so.', 'start': 35339.467, 'duration': 1.401}, {'end': 35344.949, 'text': 'this is only used for learning and testing and all.', 'start': 35340.868, 'duration': 4.081}, {'end': 35349.551, 'text': 'why? 
because you are having only one server, one broker.', 'start': 35344.949, 'duration': 4.602}, {'end': 35355.393, 'text': 'you can also have single node, multiple broker cluster.', 'start': 35349.551, 'duration': 5.842}, {'end': 35357.034, 'text': 'very important.', 'start': 35355.393, 'duration': 1.641}], 'summary': 'Kafka cluster can be single node, single broker or single node, multiple broker cluster, suitable for learning and testing.', 'duration': 30.906, 'max_score': 35326.128, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4035326128.jpg'}, {'end': 35858.465, 'src': 'embed', 'start': 35828.685, 'weight': 8, 'content': [{'end': 35829.986, 'text': 'Yeah, that is one way.', 'start': 35828.685, 'duration': 1.301}, {'end': 35836.89, 'text': 'But Kafka also gives you a very simple utility to test this.', 'start': 35830.786, 'duration': 6.104}, {'end': 35838.53, 'text': 'We just want to test it.', 'start': 35837.47, 'duration': 1.06}, {'end': 35841.952, 'text': 'Right We are just interested in testing this.', 'start': 35838.59, 'duration': 3.362}, {'end': 35846.535, 'text': 'So Kafka gives you a very interesting utility to test it.', 'start': 35842.493, 'duration': 4.042}, {'end': 35850.497, 'text': 'Right So there is something called a console producer.', 'start': 35846.975, 'duration': 3.522}, {'end': 35858.465, 'text': 'and console consumer, meaning from the console you can start a producer.', 'start': 35851.221, 'duration': 7.244}], 'summary': 'Kafka provides a simple utility for testing, including console producer and consumer.', 'duration': 29.78, 'max_score': 35828.685, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4035828685.jpg'}, {'end': 35924.675, 'src': 'embed', 'start': 35893.25, 'weight': 1, 'content': [{'end': 35894.85, 'text': 'This is used for testing purpose.', 'start': 35893.25, 'duration': 1.6}, {'end': 35904.977, 'text': "And when I'm starting a producer, I'm saying that the broker is running on localhost 9092 because we are having only one machine.", 'start': 35895.771, 'duration': 9.206}, {'end': 35909.299, 'text': 'And the topic that I want to send data is example one.', 'start': 35905.477, 'duration': 3.822}, {'end': 35910.54, 'text': 'Okay Key tender.', 'start': 35909.719, 'duration': 0.821}, {'end': 35913.989, 'text': 'And you can see a prompt will come.', 'start': 35912.068, 'duration': 1.921}, {'end': 35918.912, 'text': 'But we also need a consumer, right? So I will take one more session.', 'start': 35914.99, 'duration': 3.922}, {'end': 35920.473, 'text': 'I will open one more session.', 'start': 35918.932, 'duration': 1.541}, {'end': 35924.675, 'text': 'Here I will start a consumer.', 'start': 35922.754, 'duration': 1.921}], 'summary': 'Testing producer and consumer on localhost 9092 for example one topic.', 'duration': 31.425, 'max_score': 35893.25, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4035893250.jpg'}, {'end': 35986.819, 'src': 'embed', 'start': 35960.487, 'weight': 13, 'content': [{'end': 35966.249, 'text': 'bootstrap server is something which is there in the latest versions of kafka.', 'start': 35960.487, 'duration': 5.762}, {'end': 35968.449, 'text': 'what is the idea of a bootstrap server?', 'start': 35966.249, 'duration': 2.2}, {'end': 35978.155, 'text': "nothing. 
you just connect with one of the machine in the kafka cluster and then that machine will internally resolve what you're trying to do.", 'start': 35968.449, 'duration': 9.706}, {'end': 35986.819, 'text': "so right now i have only one bootstrap server, my broker only, and i'm saying that i want to consume from this topic called example one.", 'start': 35978.155, 'duration': 8.664}], 'summary': "Kafka's latest versions have a bootstrap server; connect to one machine in the cluster for resolution.", 'duration': 26.332, 'max_score': 35960.487, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4035960487.jpg'}, {'end': 36193.89, 'src': 'embed', 'start': 36128.98, 'weight': 0, 'content': [{'end': 36133.462, 'text': 'ideally, this has nothing to do with topic, any topic.', 'start': 36128.98, 'duration': 4.482}, {'end': 36166.954, 'text': 'it should be able to process now what is happening from beginning.', 'start': 36133.462, 'duration': 33.492}, {'end': 36178.801, 'text': "well, i can show you, kafka, but that's not the idea, right.", 'start': 36166.954, 'duration': 11.847}, {'end': 36180.503, 'text': 'oh, now it started reading.', 'start': 36178.801, 'duration': 1.702}, {'end': 36183.004, 'text': 'see, now it started reading.', 'start': 36180.503, 'duration': 2.501}, {'end': 36184.385, 'text': 'raghu ramen.', 'start': 36183.004, 'duration': 1.381}, {'end': 36187.807, 'text': 'Here is Raghu Raman.', 'start': 36186.886, 'duration': 0.921}, {'end': 36188.847, 'text': 'You guys can see right?', 'start': 36187.907, 'duration': 0.94}, {'end': 36193.89, 'text': 'So basically, once you start a console producer right?', 'start': 36190.728, 'duration': 3.162}], 'summary': 'Demonstration of kafka processing by raghu raman, starting console producer.', 'duration': 64.91, 'max_score': 36128.98, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4036128980.jpg'}], 'start': 34897.674, 'title': 'Kafka, zookeeper, and spark for real-time sensor data processing', 'summary': "Delves into real-time sensor data processing using a spark streaming application, kafka's role in holding data for multiple consumers, zookeeper's function in maintaining service status within a cluster, managing kafka and zookeeper, creating kafka topics, and transmitting data using kafka.", 'chapters': [{'end': 35043.976, 'start': 34897.674, 'title': 'Real-time sensor data processing with kafka and zookeeper', 'summary': 'Discusses the real-time processing of sensor data using a spark streaming application, the use of kafka to hold data for multiple consumers, and the role of zookeeper in maintaining the status of different services within a cluster.', 'duration': 146.302, 'highlights': ["Kafka can hold the data for a configurable period of time so that multiple consumers can read it if they want. Kafka's capability to hold data for multiple consumers allows for efficient access to sensor data by various systems, enhancing data accessibility and utilization.", 'Spark streaming application processing sensor data in real time and making decisions. Real-time processing by the Spark streaming application enables quick decision-making based on the incoming sensor data, enhancing situational awareness and response capabilities.', "Zookeeper's role in maintaining the status of different services within a cluster, such as Hadoop, Kafka, and HBase. 
Zookeeper's primary function of maintaining service status within a cluster, including Hadoop, Kafka, and HBase, ensures the reliability and availability of various services, contributing to overall system stability and performance."]}, {'end': 35411.131, 'start': 35043.976, 'title': 'Zookeeper and kafka relationship', 'summary': 'Discusses how zookeeper is used to store and retrieve information about the status of services in a cluster, such as determining the active name node in hadoop and coordinating kafka brokers. it also explains the usage of zookeeper to maintain metadata and consumer offsets in kafka, and the configurations of single node, single broker and multiple node, multiple broker clusters in kafka.', 'duration': 367.155, 'highlights': ['Zookeeper allows registering services and storing information about the status of each service in a cluster. Zookeeper enables registering services and storing their status in a cluster, facilitating the retrieval of information about the status of each service.', 'Determining the active name node in Hadoop is made possible through Zookeeper, which stores information about the status of name nodes in the cluster. Zookeeper facilitates determining the active name node in Hadoop by storing and providing information about the status of name nodes in the cluster.', 'Kafka brokers coordinate with each other using Zookeeper for tasks such as determining the status of other brokers and notifying producers and consumers about the presence or failure of a broker. Zookeeper is utilized by Kafka brokers to coordinate tasks, including determining the status of other brokers and notifying producers and consumers about the presence or failure of a broker.', 'Zookeeper maintains metadata for the entire Kafka, similar to the role of the name node in Hadoop for the Hadoop cluster. Zookeeper serves to maintain metadata for the entire Kafka, akin to the role of the name node in Hadoop for the Hadoop cluster.', 'In modern versions of Kafka, the offset can be stored anywhere, as opposed to older versions where the consumer offsets were stored in Zookeeper. Modern versions of Kafka allow the storage of offsets anywhere, contrasting the older versions where consumer offsets were stored in Zookeeper.']}, {'end': 35737.842, 'start': 35411.811, 'title': 'Managing kafka and zookeeper', 'summary': 'Provides a detailed guide on managing kafka and zookeeper, including commands to verify running processes, start zookeeper and kafka, and create a topic with specific replication factor and partitions.', 'duration': 326.031, 'highlights': ['The chapter provides a detailed guide on managing Kafka and Zookeeper, including commands to verify running processes, start Zookeeper and Kafka, and create a topic with specific replication factor and partitions. Commands for verifying running processes, starting Zookeeper and Kafka, and creating a topic are provided.', 'The JPS command displays all the Java processes running in the system, such as data node, node manager, name node, secondary name node, and resource manager. JPS command shows various Java processes running in the system.', 'The default port number for Kafka to listen through is 9092, and for Zookeeper, it is 2181. Default port numbers for Kafka and Zookeeper are 9092 and 2181 respectively.', 'A topic in Kafka can be created using the kafkatopics.sh command, with options to specify the replication factor, number of partitions, and the topic name. 
Creation of a topic in Kafka with options for replication factor, partitions, and topic name is demonstrated.']}, {'end': 35986.819, 'start': 35738.503, 'title': 'Kafka topic creation and data handling', 'summary': 'Covers the creation of kafka topics using kafkatopics.sh, listing topics, and testing with console producer and consumer utilities, emphasizing the simplicity and utility of kafka for testing.', 'duration': 248.316, 'highlights': ['Kafka provides a simple utility for testing with console producer and consumer, allowing data to be sent and read easily. The chapter emphasizes the simplicity and utility of Kafka for testing, providing a built-in console producer and consumer for easily sending and reading data.', "Kafka topics can be listed using the command 'kafkatopics list zookeeper', providing an overview of available topics. The 'kafkatopics list zookeeper' command allows for listing all the topics available, offering an overview of the existing topics.", 'Kafka topics may take time to configure internally after creation, particularly in a local cluster setup. In a local cluster setup, Kafka topics may take some time to configure internally after creation, highlighting a potential delay in topic configuration.']}, {'end': 36251.463, 'start': 35986.819, 'title': 'Kafka data transmission demo', 'summary': 'Demonstrates the process of transmitting and receiving data using kafka, including configuring topics, sending messages, and reading data from the beginning, with a focus on troubleshooting and successful data transmission.', 'duration': 264.644, 'highlights': ['The process of sending and receiving data in Kafka is demonstrated, including configuring topics and successfully transmitting messages.', 'The concept of reading data from the beginning in Kafka is explained, with the option to start from the first or last message.', 'Troubleshooting steps are highlighted, such as addressing issues with data display and ensuring successful data transmission.']}], 'duration': 1353.789, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4034897674.jpg', 'highlights': ['Real-time processing by the Spark streaming application enables quick decision-making based on the incoming sensor data, enhancing situational awareness and response capabilities.', 'Kafka can hold the data for a configurable period of time so that multiple consumers can read it if they want, enhancing data accessibility and utilization.', "Zookeeper's primary function of maintaining service status within a cluster, including Hadoop, Kafka, and HBase, ensures the reliability and availability of various services, contributing to overall system stability and performance.", 'Zookeeper enables registering services and storing their status in a cluster, facilitating the retrieval of information about the status of each service.', 'Zookeeper facilitates determining the active name node in Hadoop by storing and providing information about the status of name nodes in the cluster.', 'Zookeeper is utilized by Kafka brokers to coordinate tasks, including determining the status of other brokers and notifying producers and consumers about the presence or failure of a broker.', 'Commands for verifying running processes, starting Zookeeper and Kafka, and creating a topic are provided.', 'Default port numbers for Kafka and Zookeeper are 9092 and 2181 respectively.', 'Creation of a topic in Kafka with options for replication factor, partitions, and topic name is demonstrated.', 'The 
chapter emphasizes the simplicity and utility of Kafka for testing, providing a built-in console producer and consumer for easily sending and reading data.', "The 'kafkatopics list zookeeper' command allows for listing all the topics available, offering an overview of the existing topics.", 'In a local cluster setup, Kafka topics may take some time to configure internally after creation, highlighting a potential delay in topic configuration.', 'The process of sending and receiving data in Kafka is demonstrated, including configuring topics and successfully transmitting messages.', 'The concept of reading data from the beginning in Kafka is explained, with the option to start from the first or last message.', 'Troubleshooting steps are highlighted, such as addressing issues with data display and ensuring successful data transmission.']}, {'end': 37410.556, 'segs': [{'end': 36280.701, 'src': 'embed', 'start': 36252.223, 'weight': 6, 'content': [{'end': 36254.265, 'text': 'You can see the messages are being exchanged.', 'start': 36252.223, 'duration': 2.042}, {'end': 36263.193, 'text': 'So this is to test Kafka, right? So in reality, you may not be using the console producer to send the data.', 'start': 36255.226, 'duration': 7.967}, {'end': 36267.777, 'text': 'This is just to test saying that, okay, things are working fine.', 'start': 36263.954, 'duration': 3.823}, {'end': 36269.599, 'text': 'We are able to look at it.', 'start': 36267.817, 'duration': 1.782}, {'end': 36276.12, 'text': 'In reality, you may be using something else, another producer.', 'start': 36270.099, 'duration': 6.021}, {'end': 36280.701, 'text': 'And that we will look at in the Spark class.', 'start': 36276.84, 'duration': 3.861}], 'summary': 'Testing kafka messages exchange for spark class.', 'duration': 28.478, 'max_score': 36252.223, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4036252223.jpg'}, {'end': 37159.917, 'src': 'embed', 'start': 37132.625, 'weight': 0, 'content': [{'end': 37138.248, 'text': 'this is one thing where I have personal experience in not in conducting the event.', 'start': 37132.625, 'duration': 5.623}, {'end': 37141.57, 'text': 'but now why this is important?', 'start': 37138.248, 'duration': 3.322}, {'end': 37143.952, 'text': 'because what Flipkart does?', 'start': 37141.57, 'duration': 2.382}, {'end': 37150.713, 'text': 'they will announce a sale and then they will have a rough idea as to how many people will come and what they will buy.', 'start': 37143.952, 'duration': 6.761}, {'end': 37155.815, 'text': 'But they will not really have an exact idea like what will happen.', 'start': 37151.553, 'duration': 4.262}, {'end': 37159.917, 'text': 'So what Flipkart does, they run the analysis in real time.', 'start': 37156.015, 'duration': 3.902}], 'summary': 'Flipkart uses real-time analysis to predict customer behavior during sales.', 'duration': 27.292, 'max_score': 37132.625, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4037132625.jpg'}, {'end': 37263.487, 'src': 'embed', 'start': 37233.915, 'weight': 4, 'content': [{'end': 37238.537, 'text': 'they will say that we have a new deal coming up in next hour.', 'start': 37233.915, 'duration': 4.622}, {'end': 37239.538, 'text': 'some off right?', 'start': 37238.537, 'duration': 1.001}, {'end': 37240.838, 'text': 'You could have seen that right?', 'start': 37239.798, 'duration': 1.04}, {'end': 37245.8, 'text': 'When you go for sales purchase 
and all in online sales event.', 'start': 37240.878, 'duration': 4.922}, {'end': 37249.282, 'text': 'So how these deals are working?', 'start': 37246.761, 'duration': 2.521}, {'end': 37255.424, 'text': "because Amazon knows that in the last half an hour, let's say,", 'start': 37249.282, 'duration': 6.142}, {'end': 37263.487, 'text': '1 million people searched for a particular product or a particular category that is coming as a deal in the next hour.', 'start': 37255.424, 'duration': 8.063}], 'summary': 'Amazon is launching a new deal in the next hour for a product or category that 1 million people searched for in the last half hour.', 'duration': 29.572, 'max_score': 37233.915, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4037233915.jpg'}], 'start': 36252.223, 'title': 'Real-time processing for big data', 'summary': 'Covers essential kafka concepts and practical use cases from ge, discusses the importance of real-time processing in analytics, and highlights the challenges and tools for real-time processing for big data, including examples of credit card fraud detection and online sales events.', 'chapters': [{'end': 36500.491, 'start': 36252.223, 'title': 'Kafka for spark: key concepts and use cases', 'summary': 'Covers essential kafka concepts such as producers, consumers, topics, and partitioning, with practical use cases from ge involving real-time and batch processing, and the integration of kafka with spark.', 'duration': 248.268, 'highlights': ['Kafka concepts: producers, consumers, topics, and partitioning The chapter explains the fundamental concepts of Kafka, including producers, consumers, topics, and partitioning, illustrating their significance in data processing.', 'Use case from GE: integrating Kafka with ERP data for real-time and batch processing The chapter presents a practical use case from GE, demonstrating the integration of Kafka with ERP data for real-time processing, as well as batch processing, highlighting the versatility of Kafka in different scenarios.', 'Partitioning for specific use cases: time series data and log data processing The chapter emphasizes the importance of partitioning for specific use cases, such as processing time series data and log data, outlining the significance of assigning data to specific partitions for tailored processing.', 'Integration with Spark: leveraging Kafka for Spark streaming The chapter discusses the integration of Kafka with Spark for streaming data processing, showcasing the practical application of Kafka in conjunction with Spark for real-time data processing.']}, {'end': 36951.808, 'start': 36501.131, 'title': 'Importance of real-time processing in analytics', 'summary': 'Discusses the importance of real-time processing in analytics, citing examples such as sentiment analysis, credit card fraud detection, and online sales events, and emphasizes the need for real-time decision-making to personalize content for users in online advertising, highlighting the challenges of processing massive amounts of user data in real-time.', 'duration': 450.677, 'highlights': ['Real-time processing is crucial for online advertising to make rapid decisions and personalize content for users based on their browsing behavior and interests. 
Real-time processing is essential for online advertising to quickly assess user behavior, personalize content, and make timely decisions to display relevant ads, demonstrating the need to process massive amounts of user data in real-time.', 'The examples of sentiment analysis, credit card fraud detection, and online sales events illustrate the practical significance of real-time processing in analytics. The examples of sentiment analysis, credit card fraud detection, and online sales events highlight the practical significance of real-time processing in analytics, showcasing its application in different domains to achieve timely insights and decision-making.', 'The challenge of processing vast amounts of user data in real-time, such as filtering millions of users and personalizing content for individual users, demonstrates the complexity and necessity of real-time processing in analytics. The challenge of processing massive amounts of user data in real-time, including filtering millions of users and personalizing content for individual users, underscores the complexity and necessity of real-time processing in analytics to meet the demands of rapid decision-making and personalized user experiences.']}, {'end': 37410.556, 'start': 36952.449, 'title': 'Real-time processing for big data', 'summary': 'Discusses real-time processing for big data, including examples of credit card fraud detection, online sales events, and the need to process millions of messages in real time, highlighting the challenges and tools available such as apache storm, spark streaming, flink, and kafka streams.', 'duration': 458.107, 'highlights': ['Real-time credit card fraud detection for Citibank involved processing 2-3 million transactions per minute using Apache Storm. Citibank processed 2-3 million transactions per minute for credit card fraud detection using Apache Storm.', 'Online sales events like those conducted by Flipkart require real-time analysis every 5-15 minutes to determine customer behavior and optimize profit. Flipkart runs real-time analysis every 5-15 minutes during online sales events to understand customer behavior and optimize profit.', 'Processing millions and billions of messages in real time poses a challenge that traditional big data techniques (Hadoop, MapReduce, etc.) cannot address. Traditional big data techniques are inadequate for processing millions and billions of messages in real time.', 'Spark Streaming, Flink, Storm, and Kafka Streams are popular real-time processing engines for big data analytics. 
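To make the "real-time processing engine" idea concrete, here is a minimal Structured Streaming sketch that counts events per key in one-minute windows, the basic shape of the rolling aggregations a fraud-detection or flash-sale pipeline would run. It is illustrative only: the broker, topic, and key semantics are assumptions, not details of the Citibank or Flipkart systems described above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

// Sketch: a rolling per-key count over a Kafka stream.
object RollingCount extends App {
  val spark = SparkSession.builder().appName("rolling-count").getOrCreate()

  val events = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")                 // hypothetical topic
    .load()

  // The Kafka source exposes key, value and a timestamp column; count events
  // per key in one-minute windows, updated continuously as data arrives.
  val counts = events
    .selectExpr("CAST(key AS STRING) AS card", "timestamp")
    .groupBy(window(col("timestamp"), "1 minute"), col("card"))
    .count()

  counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
    .awaitTermination()
}
```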
Popular real-time processing engines for big data analytics include Spark Streaming, Flink, Storm, and Kafka Streams.']}], 'duration': 1158.333, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4036252223.jpg', 'highlights': ['Real-time credit card fraud detection for Citibank involved processing 2-3 million transactions per minute using Apache Storm.', 'Online sales events like those conducted by Flipkart require real-time analysis every 5-15 minutes to determine customer behavior and optimize profit.', 'The examples of sentiment analysis, credit card fraud detection, and online sales events illustrate the practical significance of real-time processing in analytics.', 'Real-time processing is crucial for online advertising to make rapid decisions and personalize content for users based on their browsing behavior and interests.', 'Integration with Spark: leveraging Kafka for Spark streaming', 'The challenge of processing vast amounts of user data in real-time, such as filtering millions of users and personalizing content for individual users, demonstrates the complexity and necessity of real-time processing in analytics.', 'Partitioning for specific use cases: time series data and log data processing', 'Use case from GE: integrating Kafka with ERP data for real-time and batch processing', 'Kafka concepts: producers, consumers, topics, and partitioning']}, {'end': 42775.465, 'segs': [{'end': 37718.716, 'src': 'embed', 'start': 37653.754, 'weight': 0, 'content': [{'end': 37659.539, 'text': 'now say, please help me connect.', 'start': 37653.754, 'duration': 5.785}, {'end': 37671.438, 'text': 'say, yes, so I am in my second machine.', 'start': 37659.539, 'duration': 11.899}, {'end': 37677.903, 'text': "now let's go and connect to my third machine.", 'start': 37671.438, 'duration': 6.465}, {'end': 37680.784, 'text': 'I say connect, get the command.', 'start': 37677.903, 'duration': 2.881}, {'end': 37700.297, 'text': 'here I say all right, awesome.', 'start': 37680.784, 'duration': 19.513}, {'end': 37707.545, 'text': "so I'm connected to my name, node, data node 1 and data node 2.", 'start': 37700.297, 'duration': 7.248}, {'end': 37713.17, 'text': 'now what you see is well, this is my name, node.', 'start': 37707.545, 'duration': 5.625}, {'end': 37718.395, 'text': 'this is my data, node 2, and just lost sight of my data, node 1.', 'start': 37713.17, 'duration': 5.225}, {'end': 37718.716, 'text': 'that is.', 'start': 37718.395, 'duration': 0.321}], 'summary': 'Connecting to three machines, data node 1 and 2.', 'duration': 64.962, 'max_score': 37653.754, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4037653754.jpg'}, {'end': 38362.013, 'src': 'embed', 'start': 38294.123, 'weight': 5, 'content': [{'end': 38295.304, 'text': 'So wget is installed.', 'start': 38294.123, 'duration': 1.181}, {'end': 38324.924, 'text': 'now i am downloading java, so i am doing nothing just in just downloading java from the internet, nothing else.', 'start': 38304.358, 'duration': 20.566}, {'end': 38327.005, 'text': 'just downloading java from the internet.', 'start': 38324.924, 'duration': 2.081}, {'end': 38350.569, 'text': 'java is getting downloaded from the internet, nothing else.', 'start': 38336.184, 'duration': 14.385}, {'end': 38359.692, 'text': 'now, when you install java, there are certain files that are of not useful to us and they are not present.', 'start': 38350.569, 'duration': 9.123}, {'end': 
38362.013, 'text': "so java may complain, but that's okay.", 'start': 38359.692, 'duration': 2.321}], 'summary': 'Java is being downloaded from the internet and installed; some unnecessary files may cause complaints.', 'duration': 67.89, 'max_score': 38294.123, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4038294123.jpg'}, {'end': 39530.574, 'src': 'embed', 'start': 39500.195, 'weight': 1, 'content': [{'end': 39505.899, 'text': 'all I have done is I have created a new configuration which is, as of this very moment, empty.', 'start': 39500.195, 'duration': 5.704}, {'end': 39507.34, 'text': "that's okay.", 'start': 39505.899, 'duration': 1.441}, {'end': 39516.606, 'text': 'and I have set a priority which is 99, and because I have the highest priority, my configuration is what hadoop will run with.', 'start': 39507.34, 'duration': 9.266}, {'end': 39523.39, 'text': 'is that clear, awesome.', 'start': 39516.606, 'duration': 6.784}, {'end': 39530.574, 'text': "now, let's do that on all three, because we want our configuration to be a distributed configuration is highest priority.", 'start': 39523.39, 'duration': 7.184}], 'summary': 'Configured new empty configuration with priority 99 for distributed setup.', 'duration': 30.379, 'max_score': 39500.195, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4039500195.jpg'}, {'end': 42402.601, 'src': 'embed', 'start': 42341.2, 'weight': 4, 'content': [{'end': 42356.428, 'text': 'how do fs mkdir test directory from data node?', 'start': 42341.2, 'duration': 15.228}, {'end': 42367.648, 'text': "how do fs ls do you see that It's a distributed file system?", 'start': 42356.428, 'duration': 11.22}, {'end': 42381.893, 'text': 'So now you can safely say your HDFS or your multi-node setup is 100% ready and raring to go.', 'start': 42373.27, 'duration': 8.623}, {'end': 42394.457, 'text': "Now let's get to the MapReduce part.", 'start': 42387.755, 'duration': 6.702}, {'end': 42397.917, 'text': 'now in a multinode cluster.', 'start': 42396.636, 'duration': 1.281}, {'end': 42402.601, 'text': "we've already created temp, we've created our root directory, we've created input directory.", 'start': 42397.917, 'duration': 4.684}], 'summary': 'Setting up a multinode hdfs cluster with directories, ready for mapreduce.', 'duration': 61.401, 'max_score': 42341.2, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4042341200.jpg'}, {'end': 42774.685, 'src': 'heatmap', 'start': 42341.2, 'weight': 0.911, 'content': [{'end': 42356.428, 'text': 'how do fs mkdir test directory from data node?', 'start': 42341.2, 'duration': 15.228}, {'end': 42367.648, 'text': "how do fs ls do you see that It's a distributed file system?", 'start': 42356.428, 'duration': 11.22}, {'end': 42381.893, 'text': 'So now you can safely say your HDFS or your multi-node setup is 100% ready and raring to go.', 'start': 42373.27, 'duration': 8.623}, {'end': 42394.457, 'text': "Now let's get to the MapReduce part.", 'start': 42387.755, 'duration': 6.702}, {'end': 42397.917, 'text': 'now in a multinode cluster.', 'start': 42396.636, 'duration': 1.281}, {'end': 42402.601, 'text': "we've already created temp, we've created our root directory, we've created input directory.", 'start': 42397.917, 'duration': 4.684}, {'end': 42407.246, 'text': "also, now it's time for us to create a MapReduce directory.", 'start': 42402.601, 'duration': 4.645}, {'end': 
42412.43, 'text': "that's a system directory for us to run our program.", 'start': 42407.246, 'duration': 5.184}, {'end': 42433.631, 'text': 'what we do is we say hadoop, fs, mkdir, MapRed system.', 'start': 42412.43, 'duration': 21.201}, {'end': 42442.918, 'text': 'We give the ownership of that directory to MapRed user.', 'start': 42439.535, 'duration': 3.383}, {'end': 42453.146, 'text': 'We say ch own MapRed.', 'start': 42446.561, 'duration': 6.585}, {'end': 42461.433, 'text': 'so MapRed owns this particular directory.', 'start': 42455.751, 'duration': 5.682}, {'end': 42467.316, 'text': 'all right, all done.', 'start': 42461.433, 'duration': 5.883}, {'end': 42478.602, 'text': "now it's time for us to run a MapReduce program.", 'start': 42467.316, 'duration': 11.286}, {'end': 42483.825, 'text': 'so now we are all set to run our program.', 'start': 42478.602, 'duration': 5.223}, {'end': 42514.095, 'text': "let's do it say Hadoop jar, user lib, Hadoop examples, input grep, input output.", 'start': 42486.212, 'duration': 27.883}, {'end': 42522.729, 'text': 'wait, we have not started job tracker and task tracker.', 'start': 42516.365, 'duration': 6.364}, {'end': 42523.87, 'text': 'we have not started job track.', 'start': 42522.729, 'duration': 1.141}, {'end': 42527.933, 'text': "fantastic, let's start, etc.", 'start': 42523.87, 'duration': 4.063}, {'end': 42538.861, 'text': 'and i did start.', 'start': 42527.933, 'duration': 10.928}, {'end': 42540.802, 'text': 'so i have started job tracker on the master.', 'start': 42538.861, 'duration': 1.941}, {'end': 42551.927, 'text': 'So what you see is name node, secondary name, node, job tracker on the master.', 'start': 42546.926, 'duration': 5.001}, {'end': 42559.948, 'text': 'Here we will start the task tracker.', 'start': 42552.827, 'duration': 7.121}, {'end': 42580.791, 'text': 'Alright, If I do a JPS, what you see is data node and task tracker is up here.', 'start': 42570.67, 'duration': 10.121}, {'end': 42586.794, 'text': 'if i do a jps here, i see data node and task tracker is up here.', 'start': 42580.791, 'duration': 6.003}, {'end': 42589.455, 'text': 'now. 
our map reduce is also started.', 'start': 42586.794, 'duration': 2.661}, {'end': 42593.977, 'text': "now it's time to really do the.", 'start': 42589.455, 'duration': 4.522}, {'end': 42629.304, 'text': 'what we say is we say hadoop jar user lib hadoop 0.20 group examples dot char grip input output, dfs red dot plus.', 'start': 42593.977, 'duration': 35.327}, {'end': 42637.928, 'text': 'so for guys who have already done the first, hands-on you guys easily.', 'start': 42629.304, 'duration': 8.624}, {'end': 42639.209, 'text': 'so now i am running the job.', 'start': 42637.928, 'duration': 1.281}, {'end': 42642.71, 'text': "It's a multi-node setup and I'm running this job.", 'start': 42640.109, 'duration': 2.601}, {'end': 42649.194, 'text': 'I hope it runs.', 'start': 42648.534, 'duration': 0.66}, {'end': 42650.234, 'text': "No, it doesn't.", 'start': 42649.474, 'duration': 0.76}, {'end': 42652.896, 'text': 'It says there are some issues.', 'start': 42651.115, 'duration': 1.781}, {'end': 42653.816, 'text': "Let's see what.", 'start': 42653.296, 'duration': 0.52}, {'end': 42661.64, 'text': 'It says user root has no permissions to write on the temp directory.', 'start': 42654.957, 'duration': 6.683}, {'end': 42663.001, 'text': 'My mistake.', 'start': 42662.481, 'duration': 0.52}, {'end': 42673.539, 'text': 'Because when I was doing chmod on temp I should have done hyphen r.', 'start': 42663.802, 'duration': 9.737}, {'end': 42675.181, 'text': 'I did not give recursive permission.', 'start': 42673.539, 'duration': 1.642}, {'end': 42691.753, 'text': 'Oops Now I am running this here.', 'start': 42681.846, 'duration': 9.907}, {'end': 42694.115, 'text': 'Start it.', 'start': 42693.755, 'duration': 0.36}, {'end': 42732.454, 'text': 'All right, job is done.', 'start': 42730.834, 'duration': 1.62}, {'end': 42734.995, 'text': 'I can verify from here also.', 'start': 42733.615, 'duration': 1.38}, {'end': 42745.818, 'text': 'Classy output, say Hadoop, fs cat output.', 'start': 42739.416, 'duration': 6.402}, {'end': 42752.079, 'text': 'Got it.', 'start': 42751.819, 'duration': 0.26}, {'end': 42757.821, 'text': 'So in this case, there are only four variables which are having BFS.', 'start': 42753.08, 'duration': 4.741}, {'end': 42759.001, 'text': 'Just a quick info guys.', 'start': 42758.101, 'duration': 0.9}, {'end': 42764.503, 'text': 'If you also want to become a certified Big Data Hadoop professional, Intellipaat offers a complete course on the same.', 'start': 42759.281, 'duration': 5.222}, {'end': 42766.883, 'text': 'The links for which are given in the description box below.', 'start': 42764.783, 'duration': 2.1}, {'end': 42768.463, 'text': 'We hope this video was helpful to you.', 'start': 42767.143, 'duration': 1.32}, {'end': 42771.424, 'text': 'If you have any further queries, let us know in the comment section below.', 'start': 42768.723, 'duration': 2.701}, {'end': 42772.604, 'text': "We'll be happy to help you.", 'start': 42771.704, 'duration': 0.9}, {'end': 42774.685, 'text': 'So thank you so much for watching this video.', 'start': 42772.864, 'duration': 1.821}], 'summary': 'Setting up a multi-node hadoop cluster, running a mapreduce program, encountering permission issues, and resolving them.', 'duration': 433.485, 'max_score': 42341.2, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/CZOSvw3hx40/pics/CZOSvw3hx4042341200.jpg'}], 'start': 37411.476, 'title': 'Setting up hadoop on aws and multi-node cluster', 'summary': 'Discusses setting up three machines on aws, 
Section: Setting up Hadoop on AWS and a multi-node cluster. This part discusses setting up three machines on AWS, renaming their hosts, the preference for Kinesis and Kafka Streams over Storm, and configuring a multi-node cluster step by step, covering Hadoop cluster configuration, the alternatives framework, HDFS and MapReduce configuration, and troubleshooting of port connectivity and file permission issues.

Setting up machines on AWS and renaming hosts: this chapter walks through purchasing three machines on AWS, connecting to them, and renaming their host names to something meaningful so they can be identified easily. It also explains why Kinesis and Kafka Streams are preferred over Storm for AWS usage, and specifies the server configuration for the machines. Key points:
- Kinesis is very popular for AWS usage, while Kafka Streams is the latest addition and has gained popularity over the last year; both are preferred over Storm.
- The speaker purchases three machines on AWS, choosing a version 6.3 image with 4 GB of RAM for each machine and deploying them in the west-coast region.
- The machines are named name node, data node 1 and data node 2, and their host names are changed accordingly so they are easy to identify.

Setting up the multi-node cluster: this chapter covers the step-by-step preparation of the nodes, including installing Java, setting up the Cloudera repository, installing Hadoop, and verifying the installation. Key points:
- Java is installed on the blank machines, starting with a simple yum install wget.
- The Cloudera repository is set up and Hadoop is installed from it, with most of the installation handled automatically.
- The installations are verified with java -version and the corresponding Hadoop version command.
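A rough sketch of that node preparation is below. The package names, the Java version, and the Cloudera repository URL are assumptions (the URL is shown as a placeholder); the video uses whatever CDH release was current at the time, so adjust accordingly.

```bash
# Hypothetical node-preparation sketch for a CentOS/RHEL-style machine.
sudo yum install -y wget                          # download tool used in the walkthrough
sudo yum install -y java-1.7.0-openjdk-devel      # a JDK; the exact Java package is an assumption

# register the Cloudera yum repository (placeholder URL) and install Hadoop from it
sudo wget -O /etc/yum.repos.d/cloudera.repo "<cloudera-repo-url>"
sudo yum install -y hadoop-0.20                   # package name varies by CDH release

java -version        # verify Java
hadoop version       # verify Hadoop
```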
Understanding the alternatives framework: this chapter explains the alternatives framework, which lets a piece of software keep multiple configurations and lets you prioritize and switch between them. It is demonstrated through the Hadoop configuration setup, followed by the networking setup, including IP-to-hostname translation and firewall configuration. Key points:
- The alternatives framework allows software to run with several configurations and selects one based on priority, demonstrated by registering multiple Hadoop configurations with different priorities.
- A new distributed configuration is created and registered with a higher priority so that it becomes the active one.
- The networking setup maps each machine's IP to a name in the hosts file so the machines can reach each other, and the firewall rules are adjusted to allow ICMP so the machines can ping one another.

Configuring the Hadoop cluster: this chapter details the key configuration files, hdfs-site, core-site and mapred-site, with specific details on data storage locations and access points, ultimately aiming to get the cluster ready for operation. Key points:
- The key Hadoop configurations, HDFS, core-site and MapReduce, are set up.
- Data storage locations are defined for the name node and data nodes, such as /home/disk1/dfs/nn and /home/disk1/dfs/dn.
- Access points are specified for the name node, such as port 50070 for its web interface.
- Configurations are standardized across the cluster to keep administration consistent and simple.

Setting up the HDFS and MapReduce directories: this chapter covers creating the local directories, giving them the right permissions, and formatting the name node for the cluster. Key points:
- The name node is formatted with sudo -u hdfs hadoop namenode -format, which is essential before the cluster can start.
- Directories such as the dfs/nn directories on disk1 and disk2 are created, and ownership is given to the hdfs and mapred users with sudo mkdir and sudo chown -R.
- mkdir -p creates any missing parent directories, and the chown assigns ownership to the hdfs user in the hadoop group.
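The sketch below ties these chapters together in command form. The directory paths, the configuration directory name, and the priority value are assumptions drawn from the spoken names; the alternatives invocation follows the standard alternatives --install <link> <name> <path> <priority> form.

```bash
# Local storage for the name node and data nodes (paths assumed from the spoken names).
sudo mkdir -p /home/disk1/dfs/nn /home/disk1/dfs/dn   # -p also creates missing parents
sudo chown -R hdfs:hadoop /home/disk1/dfs             # HDFS directories belong to hdfs (hadoop group)
sudo mkdir -p /home/disk1/mapred/local
sudo chown -R mapred:hadoop /home/disk1/mapred        # MapReduce scratch space belongs to mapred

# Register a "distributed" Hadoop configuration with a higher priority than the default.
sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.distributed
sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.distributed 50
sudo alternatives --display hadoop-conf               # confirm which configuration is active

# Format the name node once the directories and configuration are in place.
sudo -u hdfs hadoop namenode -format
```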
Configuring the multi-node Hadoop cluster: this chapter discusses bringing the multi-node cluster up, including configuring the name node, secondary name node and data nodes and running a MapReduce program, with emphasis on troubleshooting port connectivity and file permission issues. Key points:
- The name node, secondary name node and data nodes are configured to establish a multi-node Hadoop cluster that simulates real-world scenarios.
- Port connectivity issues are resolved by checking and opening ports, and by temporarily disabling the firewall so the nodes can communicate.
- Proper file permissions matter: permission issues are resolved, for example by recursively setting permissions on directories to avoid write-permission errors.
- A MapReduce program is run, the job tracker and task trackers are started, and the execution and output of the program are verified.

Section highlights: setting up a multi-node Hadoop cluster with detailed steps; Kinesis and Kafka Streams are preferred over Storm for AWS usage; configuring the name node, secondary name node and data nodes; setting up the key Hadoop configurations including HDFS, core-site and MapReduce; troubleshooting port connectivity and firewall configuration; formatting the name node.

Overall course highlights:
- Real-time credit card fraud detection for Citibank involved processing 2-3 million transactions per minute using Apache Storm.
- Online sales events like those conducted by Flipkart require real-time analysis every 5-15 minutes to determine customer behaviour and optimize profit.
- The examples of sentiment analysis, credit card fraud detection and online sales events illustrate the practical significance of real-time processing in analytics.
- Real-time processing is crucial for online advertising in order to make rapid decisions and personalize content for users based on their browsing behaviour and interests.
- Integration with Spark: leveraging Kafka for Spark streaming.
- The challenge of processing vast amounts of user data in real time, such as filtering millions of users and personalizing content for individual users, demonstrates the complexity and necessity of real-time processing in analytics.
- Partitioning for specific use cases: time-series data and log-data processing.
- A use case from GE: integrating Kafka with ERP data for real-time and batch processing.
- Kafka concepts: producers, consumers, topics and partitioning.
- Real-time processing by the Spark streaming application enables quick decision-making based on incoming sensor data, enhancing situational awareness and response capabilities.
- Kafka can hold data for a configurable period of time so that multiple consumers can read it if they want, enhancing data accessibility and utilization.
- Zookeeper's primary function is maintaining service status within a cluster, including Hadoop, Kafka and HBase, which keeps the various services reliable and available and contributes to overall system stability and performance.
- Zookeeper lets services register themselves and store their status in the cluster, so the status of each service can be looked up.
- Zookeeper helps determine the active name node in Hadoop by storing and providing the status of the name nodes in the cluster.
- Zookeeper is used by the Kafka brokers to coordinate tasks, including determining the status of other brokers and notifying producers and consumers when a broker appears or fails.
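Since the multi-node chapter leans on checking ports and the firewall, here is a hedged sketch of those checks. The host names match the renamed machines, and a CentOS 6-style iptables service is assumed; 8020 as the HDFS RPC port is also an assumption.

```bash
# Connectivity checks between the renamed nodes (CentOS 6-era tooling assumed).
ping datanode1                 # works only once ICMP is allowed through the firewall
telnet namenode 50070          # name node web UI port from the configuration above
telnet namenode 8020           # HDFS RPC port (assumed default)

sudo service iptables stop     # temporarily disable the firewall for lab testing
sudo chkconfig iptables off    # optional: keep it disabled across reboots (lab only)

jps                            # list the Hadoop daemons running on this node
```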
- Commands are provided for verifying the running processes, starting Zookeeper and Kafka, and creating a topic.
- The default port numbers for Kafka and Zookeeper are 9092 and 2181 respectively.
- Creating a topic in Kafka is demonstrated, with options for the replication factor, the number of partitions and the topic name.
- Kafka's simplicity and utility for testing are emphasized: it ships with a console producer and a console consumer for easily sending and reading data.
- The kafka-topics --list --zookeeper command lists all the available topics, giving an overview of what exists on the cluster.
- In a local cluster setup, Kafka topics may take a little time to be configured internally after creation, so there can be a short delay before a new topic is usable.
- Sending and receiving data in Kafka is demonstrated, including configuring topics and successfully transmitting messages.
- Reading data from the beginning of a topic is explained, with the option to start from the first or from the latest message.
- Troubleshooting steps are highlighted, such as addressing issues with data not being displayed and ensuring that data is transmitted successfully.
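To make those Kafka highlights concrete, here is a minimal command sketch using the stock Kafka shell scripts. The script names and the --zookeeper flag on the consumer match older Kafka releases of the kind used with this stack; newer releases use --bootstrap-server for the consumer, and packaged installs may drop the .sh suffix.

```bash
# Start the services (default ports: Zookeeper 2181, Kafka broker 9092).
zookeeper-server-start.sh config/zookeeper.properties &
kafka-server-start.sh config/server.properties &
jps                                                       # verify both JVMs are running

# Create a topic and list what exists.
kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic test
kafka-topics.sh --list --zookeeper localhost:2181

# Built-in console producer and consumer for a quick end-to-end test.
kafka-console-producer.sh --broker-list localhost:9092 --topic test
kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
```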