title
Data Engineering Principles - Build frameworks not pipelines - Gatis Seja

description
PyData London Meetup #54 Tuesday, March 5, 2019 Data pipelines are necessary for the flow of information from its source to its consumers, typically data scientists, analysts and software developers. Managing data flow from many sources is a complex task where the maintenance cost limits scale of being able to build a large reliable data warehouse. This presentation proposes a number of applied data engineering principles that can be used to build robust easily manageable data pipelines and data products. Examples will be shown using Python on AWS. Sponsored & Hosted by Man AHL **** www.pydata.org PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

detail
{'title': 'Data Engineering Principles - Build frameworks not pipelines - Gatis Seja', 'heatmap': [{'end': 630.349, 'start': 612.372, 'weight': 0.841}, {'end': 901.065, 'start': 879.974, 'weight': 0.72}, {'end': 1024.586, 'start': 985.106, 'weight': 0.833}, {'end': 1120.75, 'start': 1076.717, 'weight': 0.824}], 'summary': 'Covers historical data standardization challenges, data transport formats, data warehouse best practices, data consumer needs, technology and best practices, as well as challenges and best practices in data pipelines, emphasizing the use of airflow and jenkins for monitoring and standardizing data extraction.', 'chapters': [{'end': 463.784, 'segs': [{'end': 316.488, 'src': 'embed', 'start': 270.299, 'weight': 0, 'content': [{'end': 272.4, 'text': 'what was the measurement system?', 'start': 270.299, 'duration': 2.101}, {'end': 272.96, 'text': 'What was the quality?', 'start': 272.44, 'duration': 0.52}, {'end': 273.501, 'text': 'What was the price?', 'start': 272.98, 'duration': 0.521}, {'end': 274.681, 'text': 'Language, dialect, currencies?', 'start': 273.541, 'duration': 1.14}, {'end': 276.142, 'text': 'Were there any levies?', 'start': 275.142, 'duration': 1}, {'end': 276.883, 'text': 'What was the storage?', 'start': 276.182, 'duration': 0.701}, {'end': 277.783, 'text': 'How was he going to store it?', 'start': 276.923, 'duration': 0.86}, {'end': 280.965, 'text': 'What are the laws and religious customs of 1431?', 'start': 277.803, 'duration': 3.162}, {'end': 287.568, 'text': "And that, you'd have to agree, is a massive job, just to sell a bit of cheese and a bit of quicklime.", 'start': 280.965, 'duration': 6.603}, {'end': 289.009, 'text': 'A massive job.', 'start': 288.469, 'duration': 0.54}, {'end': 297.173, 'text': "So what's the answer to John's problem? 
Standardization.", 'start': 291.089, 'duration': 6.084}, {'end': 303.598, 'text': "The problem is Hampshire and London don't have the same standards.", 'start': 297.714, 'duration': 5.884}, {'end': 309.523, 'text': 'John had to understand the price of the goods in his destination before even buying the goods.', 'start': 304.059, 'duration': 5.464}, {'end': 312.285, 'text': 'He had to have his whole planned out, all of it planned out.', 'start': 309.763, 'duration': 2.522}, {'end': 316.488, 'text': 'And the transport of information back in that time was very slow.', 'start': 312.385, 'duration': 4.103}], 'summary': 'John faced challenges with measurement, quality, price, and standards in trading goods in 1431.', 'duration': 46.189, 'max_score': 270.299, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg270299.jpg'}, {'end': 434.625, 'src': 'embed', 'start': 400.749, 'weight': 2, 'content': [{'end': 406.471, 'text': 'In 1215, the Magna Carta tried to standardize weights and I think it was length.', 'start': 400.749, 'duration': 5.722}, {'end': 412.313, 'text': 'The biggest change happened in 1824 with a Weights and Measures Act.', 'start': 407.251, 'duration': 5.062}, {'end': 416.894, 'text': 'However, I would like to read out the top one for you in 1924, which is quite late.', 'start': 413.133, 'duration': 3.761}, {'end': 430.144, 'text': 'Um, For there are still in use 25 local corn weights and measures, 12 different bushels, 13 different pounds,', 'start': 420.063, 'duration': 10.081}, {'end': 434.625, 'text': '10 different stone and nine different tons in 1924..', 'start': 430.144, 'duration': 4.481}], 'summary': 'The 1924 report revealed 25 local corn weights, 12 bushels, 13 pounds, 10 stone, and 9 tons still in use.', 'duration': 33.876, 'max_score': 400.749, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg400749.jpg'}], 'start': 0.349, 'title': 
'Historical data standardization', 'summary': 'Delves into the challenges of standardization in historical data, using the example of 1431 merchant John Beer, highlighting the relevance of history in addressing modern data standardization issues.', 'chapters': [{'end': 463.784, 'start': 0.349, 'title': 'Standardization and historical data', 'summary': 'Discusses the challenges of standardization in historical data, citing the example of a 1431 merchant, John Beer, and emphasizes the importance of understanding history to solve modern problems related to data and standardization.', 'duration': 463.435, 'highlights': ["John Beer's challenges in selling products in 1431 due to different measurement systems, currencies, languages, and laws, emphasizing the massive effort required for a simple transaction. The transcript details the extensive challenges John Beer faced in 1431 in selling his products, including understanding different measurement systems, currencies, languages, laws, and transportation logistics, highlighting the significant effort required for a seemingly simple transaction.", 'The need for standardization to address the varying standards between different regions, exemplified by the difficulties John Beer faced in ensuring quality and pricing in different locations. It emphasizes the necessity of standardization to address the varying standards between different regions, as shown by the difficulties John Beer encountered in ensuring quality and pricing in distinct locations.', "Historical attempts at standardization, such as King Edgar's ordinance in 959 and the Weights and Measures Act in 1824, demonstrating the ongoing efforts to establish uniform standards. 
The transcript mentions historical attempts at standardization, including King Edgar's ordinance in 959 and the Weights and Measures Act in 1824, indicating continuous efforts to establish uniform standards over time."]}], 'duration': 463.435, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg349.jpg', 'highlights': ["John Beer's challenges in selling products in 1431 due to different measurement systems, currencies, languages, and laws, emphasizing the massive effort required for a simple transaction.", 'The need for standardization to address the varying standards between different regions, exemplified by the difficulties John Beer faced in ensuring quality and pricing in different locations.', "Historical attempts at standardization, such as King Edgar's ordinance in 959 and the Weights and Measures Act in 1824, demonstrating the ongoing efforts to establish uniform standards."]}, {'end': 1087.467, 'segs': [{'end': 509.251, 'src': 'embed', 'start': 464.744, 'weight': 0, 'content': [{'end': 472.691, 'text': 'So the problem is not a technological one, the problem is a human one.', 'start': 464.744, 'duration': 7.947}, {'end': 482.078, 'text': 'How can you convince these people that you should standardize the way that you transport goods and that you have the same language throughout that process?', 'start': 473.811, 'duration': 8.267}, {'end': 490.584, 'text': "So this is very similar to a data engineer's problem, or the problem that you face.", 'start': 485.282, 'duration': 5.302}, {'end': 498.187, 'text': 'When you get data, you get them from many different sources, databases, FTP, API, S3, file shares, HTML.', 'start': 490.864, 'duration': 7.323}, {'end': 500.508, 'text': 'You might do web scraping, all these areas to get your data.', 'start': 498.207, 'duration': 2.301}, {'end': 502.809, 'text': 'They come in many different formats.', 'start': 501.408, 'duration': 1.401}, {'end': 509.251, 'text': "You've 
got JSON, XML, CSV, ORC, or some weird format that somebody at some vendor created just to put in their LinkedIn profile.", 'start': 502.829, 'duration': 6.422}], 'summary': 'Challenges of standardizing data formats, akin to transport logistics.', 'duration': 44.507, 'max_score': 464.744, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg464744.jpg'}, {'end': 557.731, 'src': 'embed', 'start': 532.233, 'weight': 6, 'content': [{'end': 537.077, 'text': 'And then what you really have is you can have all these different sources with all these different formats,', 'start': 532.233, 'duration': 4.844}, {'end': 538.478, 'text': 'with all these different compression types.', 'start': 537.077, 'duration': 1.401}, {'end': 540.319, 'text': "And it's really a cross-join problem.", 'start': 538.838, 'duration': 1.481}, {'end': 544.062, 'text': 'And what you really want is to get your data into your data warehouse in a two-dimensional format.', 'start': 540.739, 'duration': 3.323}, {'end': 546.283, 'text': 'On the left, you have multi-dimensional data.', 'start': 544.122, 'duration': 2.161}, {'end': 548.305, 'text': 'On the right, you have two dimensions.', 'start': 546.543, 'duration': 1.762}, {'end': 550.926, 'text': 'This is where your stakeholders are consuming your data.', 'start': 548.325, 'duration': 2.601}, {'end': 552.007, 'text': "And it's even more complicated.", 'start': 551.167, 'duration': 0.84}, {'end': 553.328, 'text': "It's not just a relational database.", 'start': 552.027, 'duration': 1.301}, {'end': 554.909, 'text': 'You can have analytical column stores.', 'start': 553.648, 'duration': 1.261}, {'end': 556.35, 'text': 'You can have schema on read.', 'start': 554.929, 'duration': 1.421}, {'end': 557.731, 'text': 'Which ones do you use?', 'start': 556.631, 'duration': 1.1}], 'summary': 'Challenges in integrating diverse data sources into a two-dimensional format for a data warehouse, complicating data 
consumption by stakeholders.', 'duration': 25.498, 'max_score': 532.233, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg532233.jpg'}, {'end': 598.824, 'src': 'embed', 'start': 568.893, 'weight': 4, 'content': [{'end': 574.075, 'text': 'A data warehouse has at least two things that make it work well.', 'start': 568.893, 'duration': 5.182}, {'end': 577.016, 'text': 'One, it has to have a large variety of data.', 'start': 574.755, 'duration': 2.261}, {'end': 580.237, 'text': 'The data warehouse is the modern paradigm of a library.', 'start': 577.877, 'duration': 2.36}, {'end': 586.38, 'text': 'If a library only has one topic, not many people are going to go to that library and read its books.', 'start': 580.577, 'duration': 5.803}, {'end': 590.461, 'text': 'The second thing, it has to be a trusted source of information.', 'start': 587.02, 'duration': 3.441}, {'end': 598.824, 'text': "If I'm not getting my data when I expect it to be there, then I'm not really going to use that for my systems that I produce later on.", 'start': 591.141, 'duration': 7.683}], 'summary': 'A successful data warehouse needs diverse data and reliability.', 'duration': 29.931, 'max_score': 568.893, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg568893.jpg'}, {'end': 661.622, 'src': 'heatmap', 'start': 612.372, 'weight': 2, 'content': [{'end': 615.774, 'text': 'So the only bit of Python I actually have in this talk is that.', 'start': 612.372, 'duration': 3.402}, {'end': 623.698, 'text': "Anybody know where that's from? 
Yeah, good.", 'start': 619.576, 'duration': 4.122}, {'end': 627.32, 'text': "It's a bit more complicated than simple, but we'll get into it.", 'start': 625.038, 'duration': 2.282}, {'end': 630.349, 'text': "So let's start off what we have to do.", 'start': 628.587, 'duration': 1.762}, {'end': 632.03, 'text': 'We have our source data.', 'start': 630.709, 'duration': 1.321}, {'end': 636.274, 'text': "And this is our data infrastructure, where we'll be doing our work.", 'start': 633.632, 'duration': 2.642}, {'end': 641.038, 'text': 'And we have our data warehouse, which is typically the place where people consume their data.', 'start': 637.275, 'duration': 3.763}, {'end': 643.901, 'text': 'They write their SQL queries, join different data sets together, and get their data out.', 'start': 641.218, 'duration': 2.683}, {'end': 646.616, 'text': 'You have your data lake.', 'start': 645.675, 'duration': 0.941}, {'end': 650.897, 'text': 'Who here knows what a data lake is or has used a data lake? Show of hands, please.', 'start': 647.036, 'duration': 3.861}, {'end': 653.539, 'text': "Okay, that's quite a lot.", 'start': 650.917, 'duration': 2.622}, {'end': 655.94, 'text': "Okay, who hasn't? 
Maybe I should have asked that question.", 'start': 653.639, 'duration': 2.301}, {'end': 658.561, 'text': 'Okay Oh, 50-50, okay.', 'start': 656.78, 'duration': 1.781}, {'end': 661.622, 'text': 'So a quick introduction to data lakes.', 'start': 658.961, 'duration': 2.661}], 'summary': 'Introduction to data infrastructure, data warehouse, and data lake in python talk.', 'duration': 42.046, 'max_score': 612.372, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg612372.jpg'}, {'end': 732.489, 'src': 'embed', 'start': 706.969, 'weight': 7, 'content': [{'end': 711.875, 'text': "Now, this by far, please, if you're going to remember anything, please remember this bit right there.", 'start': 706.969, 'duration': 4.906}, {'end': 717.36, 'text': 'When you extract your data, save your data in its raw form.', 'start': 713.256, 'duration': 4.104}, {'end': 719.282, 'text': 'somewhere in your data lake.', 'start': 718.121, 'duration': 1.161}, {'end': 723.424, 'text': 'It can be in your folder file system, but save it in its raw form.', 'start': 719.602, 'duration': 3.822}, {'end': 732.489, 'text': "That means if you get a HTML website, if you're scraping a website, you should save that website in its raw HTML form to get it later.", 'start': 723.444, 'duration': 9.045}], 'summary': 'Save extracted data in raw form in data lake or file system for future use.', 'duration': 25.52, 'max_score': 706.969, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg706969.jpg'}, {'end': 906.97, 'src': 'heatmap', 'start': 879.974, 'weight': 0.72, 'content': [{'end': 886.758, 'text': 'I expect on that website to have these fields, that this data would exist, that table would exist in that HTML website.', 'start': 879.974, 'duration': 6.784}, {'end': 889.919, 'text': "If it doesn't,", 'start': 888.318, 'duration': 1.601}, {'end': 901.065, 'text': 'then save that data into an extracted failed 
area so you can have a look at what the problem was and change your functional methodology to understand how things have changed throughout the process.', 'start': 889.919, 'duration': 11.146}, {'end': 906.97, 'text': "Next thing what you should do is validate your data before it's given to your stakeholder.", 'start': 903.004, 'duration': 3.966}], 'summary': 'Expect fields and table on website. save failed data for analysis. validate data before sharing with stakeholder.', 'duration': 26.996, 'max_score': 879.974, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg879974.jpg'}, {'end': 964.241, 'src': 'embed', 'start': 935.578, 'weight': 5, 'content': [{'end': 940.502, 'text': 'So for example, I should be expecting these three values in my data.', 'start': 935.578, 'duration': 4.924}, {'end': 943.604, 'text': 'Am I sure that that data is there? No.', 'start': 940.682, 'duration': 2.922}, {'end': 945.185, 'text': "If it's no, then you fail it.", 'start': 944.024, 'duration': 1.161}, {'end': 947.387, 'text': "If it's yes, then it carries on.", 'start': 945.766, 'duration': 1.621}, {'end': 952.571, 'text': 'And then you need monitoring and all these different processes to say where is it breaking?', 'start': 947.888, 'duration': 4.683}, {'end': 953.572, 'text': 'Where is it working?', 'start': 952.611, 'duration': 0.961}, {'end': 956.094, 'text': 'And you should always be informed of that.', 'start': 953.972, 'duration': 2.122}, {'end': 959.777, 'text': 'In fact, what you should be doing is, this goes more into data as a product.', 'start': 956.154, 'duration': 3.623}, {'end': 964.241, 'text': 'Your final consumer should be able to see the monitoring that happens on your data.', 'start': 960.237, 'duration': 4.004}], 'summary': 'Ensure data quality by monitoring for failures and successes, and provide consumer visibility into data monitoring.', 'duration': 28.663, 'max_score': 935.578, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg935578.jpg'}, {'end': 1024.586, 'src': 'heatmap', 'start': 985.106, 'weight': 0.833, 'content': [{'end': 989.887, 'text': 'Remember that data is dirty, and you have to think of a good process to clean that data.', 'start': 985.106, 'duration': 4.781}, {'end': 993.608, 'text': 'And when it fails, you go back to it, and you clean it, and you understand what the problems are.', 'start': 990.147, 'duration': 3.461}, {'end': 999.355, 'text': 'So as a result of this, you have two situations.', 'start': 996.874, 'duration': 2.481}, {'end': 1003.117, 'text': "As time goes on, this is what you'd like to do.", 'start': 1000.175, 'duration': 2.942}, {'end': 1006.238, 'text': "You'd like your data to be trickling into your database.", 'start': 1003.157, 'duration': 3.081}, {'end': 1009.84, 'text': "But let's say that we have a validation fail in this area.", 'start': 1007.559, 'duration': 2.281}, {'end': 1011.4, 'text': 'So two situations can happen.', 'start': 1010.14, 'duration': 1.26}, {'end': 1015.342, 'text': 'So we get our data, we stop it, and no more data goes in.', 'start': 1012.521, 'duration': 2.821}, {'end': 1019.824, 'text': 'Or it failed some number of processes, but we get snippets of that data in.', 'start': 1016.382, 'duration': 3.442}, {'end': 1021.505, 'text': 'Which one do we use?', 'start': 1020.464, 'duration': 1.041}, {'end': 1024.586, 'text': 'Do we use the middle one or the bottom one?', 'start': 1021.605, 'duration': 2.981}], 'summary': 'Data cleaning process is essential for continuous database updates and error handling.', 'duration': 39.48, 'max_score': 985.106, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg985106.jpg'}], 'start': 464.744, 'title': 'Data transport and warehouse best practices', 'summary': 'Discusses challenges of standardizing data transport formats such as JSON, XML, CSV, and 
various compression formats. It also emphasizes data warehouse best practices including maintaining a trusted and reliable data warehouse, using a two-dimensional format, and implementing data validation and monitoring processes.', 'chapters': [{'end': 530.349, 'start': 464.744, 'title': 'Data transport standardization', 'summary': 'Discusses the challenge of standardizing data transport formats and the complexity of working with diverse data sources, including JSON, XML, CSV, and various compression formats, urging against the use of certain formats.', 'duration': 65.605, 'highlights': ['Working with diverse data sources, including JSON, XML, CSV, and various compression formats such as Zip, tar.gz, and gzip.', 'Challenges of standardizing data transport formats and the need to convince stakeholders to adopt a unified approach.', 'The complexity of deciphering and understanding data from different sources, leading to weeks of effort in data engineering.', 'The problem of dealing with non-standard formats created by vendors, causing inefficiencies in data processing.']}, {'end': 1087.467, 'start': 532.233, 'title': 'Data warehouse best practices', 'summary': 'Discusses the complexities of managing data from various sources and formats, emphasizing the importance of maintaining a trusted and reliable data warehouse, utilizing a two-dimensional format, and implementing data validation and monitoring processes to ensure data accuracy and reliability.', 'duration': 555.234, 'highlights': ['The importance of maintaining a trusted and reliable data warehouse Emphasizes the need for a data warehouse to have a large variety of data and be a trusted source of information, highlighting the modern paradigm of a library and the impact of data reliability on stakeholder usage.', 'Implementing data validation and monitoring processes Discusses the necessity of adding validation stages at the extract phase and the importance of validating data before consumption, along with the 
need for constant monitoring and informing stakeholders about the reliability and potential failures of data processes.', 'Utilizing a two-dimensional format for data in the data warehouse Stresses the importance of transforming data into a two-dimensional format for consumption by stakeholders and the challenges of dealing with multi-dimensional and two-dimensional data.', 'Storing data in its raw form and the importance of maintaining data integrity Highlights the significance of saving data in its raw form to enable backward tracking and stresses the importance of maintaining data integrity throughout the transformation and loading process.', 'Introduction to data lakes and their benefits Provides a brief introduction to data lakes, emphasizing their scalability, accessibility, and centralized nature, highlighting popular examples such as Amazon Web Services S3 and Digital Ocean Spaces.']}], 'duration': 622.723, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg464744.jpg', 'highlights': ['Challenges of standardizing data transport formats and convincing stakeholders to adopt a unified approach.', 'Working with diverse data sources, including JSON, XML, CSV, and various compression formats.', 'The complexity of deciphering and understanding data from different sources, leading to weeks of effort in data engineering.', 'The problem of dealing with non-standard formats created by vendors, causing inefficiencies in data processing.', 'The importance of maintaining a trusted and reliable data warehouse.', 'Implementing data validation and monitoring processes.', 'Utilizing a two-dimensional format for data in the data warehouse.', 'Storing data in its raw form and the importance of maintaining data integrity.', 'Introduction to data lakes and their benefits.']}, {'end': 1453.219, 'segs': [{'end': 1131.716, 'src': 'embed', 'start': 1089.005, 'weight': 0, 'content': [{'end': 1091.627, 'text': 'So the main 
important thing is understanding your data consumer.', 'start': 1089.005, 'duration': 2.622}, {'end': 1093.589, 'text': 'By far.', 'start': 1092.748, 'duration': 0.841}, {'end': 1094.23, 'text': 'What do they want?', 'start': 1093.589, 'duration': 0.641}, {'end': 1095.151, 'text': 'Do they want streaming?', 'start': 1094.27, 'duration': 0.881}, {'end': 1098.033, 'text': 'How quick do they need their information and their insights?', 'start': 1095.831, 'duration': 2.202}, {'end': 1099.234, 'text': 'Do they need it in seconds?', 'start': 1098.294, 'duration': 0.94}, {'end': 1100.476, 'text': 'Do they need it in hours?', 'start': 1099.254, 'duration': 1.222}, {'end': 1101.416, 'text': 'Do they need it in days?', 'start': 1100.496, 'duration': 0.92}, {'end': 1103.378, 'text': 'That will affect the technology that you use.', 'start': 1101.697, 'duration': 1.681}, {'end': 1106.921, 'text': "If you look at the diagram, I haven't put really any technology on there.", 'start': 1103.598, 'duration': 3.323}, {'end': 1108.323, 'text': "I've just put cogs.", 'start': 1107.282, 'duration': 1.041}, {'end': 1111.045, 'text': 'All right, I put an S3 bucket, but you can use whatever you want for that.', 'start': 1108.763, 'duration': 2.282}, {'end': 1114.007, 'text': 'This has just been from principles, building that up.', 'start': 1111.986, 'duration': 2.021}, {'end': 1120.75, 'text': 'And understanding your data consumer and your data will determine what you use in those cog processes.', 'start': 1114.847, 'duration': 5.903}, {'end': 1124.752, 'text': 'Keep your data in its raw form.', 'start': 1123.432, 'duration': 1.32}, {'end': 1125.833, 'text': 'I think we understand why.', 'start': 1124.952, 'duration': 0.881}, {'end': 1128.134, 'text': "We don't want to lose any of our data.", 'start': 1125.853, 'duration': 2.281}, {'end': 1131.716, 'text': "Don't delete or move your raw data.", 'start': 1129.214, 'duration': 2.502}], 'summary': 'Understanding data consumers is key for 
choosing technology and preserving raw data.', 'duration': 42.711, 'max_score': 1089.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg1089005.jpg'}, {'end': 1218.403, 'src': 'embed', 'start': 1151.543, 'weight': 2, 'content': [{'end': 1155.905, 'text': 'my transformation function extracts the data I want from my extracted layer into my main table.', 'start': 1151.543, 'duration': 4.362}, {'end': 1156.525, 'text': 'perfectly well.', 'start': 1155.905, 'duration': 0.62}, {'end': 1160.826, 'text': "And that is the data engineer's problem, really, is to maintain that transformation stuff.", 'start': 1157.025, 'duration': 3.801}, {'end': 1163.167, 'text': 'You do not want to maintain other things, just that.', 'start': 1161.006, 'duration': 2.161}, {'end': 1167.344, 'text': 'Separate out your extract and transform load process.', 'start': 1164.623, 'duration': 2.721}, {'end': 1169.944, 'text': "You don't want one to fail and the other to continue.", 'start': 1167.584, 'duration': 2.36}, {'end': 1176.526, 'text': 'You want them to be separate processes that happen by themselves.', 'start': 1171.785, 'duration': 4.741}, {'end': 1180.327, 'text': 'Minimize the number of data and compute nodes.', 'start': 1178.287, 'duration': 2.04}, {'end': 1186.169, 'text': 'If I have too many nodes in my system, I have more chance for bugs to crop up, more error to happen.', 'start': 1180.487, 'duration': 5.682}, {'end': 1191.25, 'text': 'So by having only two, I have less of a chance of anything actually going wrong.', 'start': 1186.469, 'duration': 4.781}, {'end': 1194.528, 'text': "I'll skip that one.", 'start': 1193.387, 'duration': 1.141}, {'end': 1199.551, 'text': 'Make your ETL acyclical, which means data should only flow in one direction.', 'start': 1194.608, 'duration': 4.943}, {'end': 1206.796, 'text': "I've seen some companies have databases which refer to other databases and they cycle around and refer back to 
itself.", 'start': 1200.312, 'duration': 6.484}, {'end': 1210.538, 'text': 'What is the true source of your data anymore? It could actually be lost.', 'start': 1207.316, 'duration': 3.222}, {'end': 1211.358, 'text': 'You have no idea.', 'start': 1210.578, 'duration': 0.78}, {'end': 1217.182, 'text': 'Even in databases, how you structure a database, if you create a new table, it should be in an acyclical nature.', 'start': 1211.619, 'duration': 5.563}, {'end': 1218.403, 'text': 'It should only go in one direction.', 'start': 1217.202, 'duration': 1.201}], 'summary': 'Maintain separate extract, transform and load processes to minimize errors and ensure data flows acyclically.', 'duration': 66.86, 'max_score': 1151.543, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg1151543.jpg'}, {'end': 1272.228, 'src': 'embed', 'start': 1245.708, 'weight': 4, 'content': [{'end': 1254.176, 'text': 'He had to understand the data, understand his product, understand where his destination and where he was going to.', 'start': 1245.708, 'duration': 8.468}, {'end': 1261.122, 'text': "He shouldn't have to delete or move his product too much around lots of different areas.", 'start': 1254.636, 'duration': 6.486}, {'end': 1263.864, 'text': 'He was looking at.', 'start': 1262.063, 'duration': 1.801}, {'end': 1266.285, 'text': 'He was validating the data to make sure that,', 'start': 1263.864, 'duration': 2.421}, {'end': 1272.228, 'text': 'or validating his product to make sure it was a good enough quality to give to his final consumer.', 'start': 1266.285, 'duration': 5.943}], 'summary': 'The entrepreneur focused on understanding data and product, minimizing product movement, and validating product quality.', 'duration': 26.52, 'max_score': 1245.708, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg1245708.jpg'}], 'start': 1089.005, 'title': 'Data consumer, technology, 
and best practices for data engineering', 'summary': 'Emphasizes understanding data consumer needs, streaming requirements, and speed of insights delivery to determine technology used, while highlighting best practices for data engineering such as maintaining data transformations, minimizing nodes for error reduction, ensuring acyclical data flow, and treating data as a product.', 'chapters': [{'end': 1131.716, 'start': 1089.005, 'title': 'Understanding data consumer and technology', 'summary': "Emphasizes the importance of understanding the data consumer's needs, such as streaming requirements and speed of insights delivery, which determines the technology used, while also emphasizing the preservation of raw data to avoid loss.", 'duration': 42.711, 'highlights': ["Understanding the data consumer's needs, such as streaming requirements and speed of insights delivery, is crucial for determining the technology used (e.g., seconds, hours, days).", 'Preserving raw data is essential to avoid data loss and should not be deleted or moved.']}, {'end': 1453.219, 'start': 1131.816, 'title': 'Best practices for data engineering', 'summary': 'Highlights the best practices for data engineering, including maintaining data transformations over time, separating etl processes, minimizing nodes for error reduction, ensuring acyclical data flow, validating data before consumer delivery, and treating data as a product.', 'duration': 321.403, 'highlights': ['Maintain data transformations over time Data transformations should be maintained over time to ensure the extraction of data from the extract layer into the main table, enabling the data engineer to maintain the transformation process and minimize the chance of errors.', 'Separate ETL processes Separating extract, transform, and load processes minimizes the risk of failure in one process affecting the others, enhancing the efficiency and reliability of each individual process.', 'Validate data before consumer delivery Validating 
data before it reaches consumers ensures the quality of the data and enhances trust in the final product, contributing to customer satisfaction and minimizing potential issues.', 'Ensure acyclical data flow: Data flow should only occur in one direction to prevent confusion about the true source of the data, maintaining a clear and unidirectional flow of information to avoid potential data loss and ambiguity.', 'Minimize nodes for error reduction: Reducing the number of data and compute nodes minimizes the chances of bugs and errors, thereby enhancing the reliability and stability of the system for efficient data processing and management.']}], 'duration': 364.214, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg1089005.jpg', 'highlights': ["Understanding the data consumer's needs, such as how quickly insights must be delivered (seconds, hours, or days), is crucial for determining the technology used.", 'Preserving raw data is essential to avoid data loss; raw data should not be deleted or moved.', 'Maintain data transformations over time to ensure the extraction of data into the main table and minimize the chance of errors.', 'Separating extract, transform, and load processes minimizes the risk of failure in one process affecting the others, enhancing efficiency and reliability.', 'Validating data before it reaches consumers ensures the quality of the data and enhances trust in the final product.', 'Ensure acyclical data flow to prevent confusion about the true source of the data and avoid potential data loss and ambiguity.', 'Minimize nodes for error reduction to enhance the reliability and stability of the system for efficient data processing and management.']}, {'end': 1786.697, 'segs': [{'end': 1479.442, 'src': 'embed', 'start': 1453.619, 'weight': 0, 'content': [{'end': 1459.113, 'text': 'Is there any parallel to that in the data pipeline? 
I think so, yes.', 'start': 1453.619, 'duration': 5.494}, {'end': 1464.876, 'text': "Like, if you make a process easier, if you automate a process, there'll be less people doing jobs.", 'start': 1459.193, 'duration': 5.683}, {'end': 1465.956, 'text': "There'll be less jobs for it.", 'start': 1465.016, 'duration': 0.94}, {'end': 1468.818, 'text': "There'll be less data engineers doing the work.", 'start': 1466.116, 'duration': 2.702}, {'end': 1471.759, 'text': 'So yes, the more standardization.', 'start': 1470.439, 'duration': 1.32}, {'end': 1479.442, 'text': "But frankly, I had a very good talk that said, this is something that we shouldn't be doing.", 'start': 1471.779, 'duration': 7.663}], 'summary': 'Automating processes in the data pipeline may reduce jobs for data engineers, despite greater standardization.', 'duration': 25.823, 'max_score': 1453.619, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg1453619.jpg'}, {'end': 1560.106, 'src': 'embed', 'start': 1529.912, 'weight': 1, 'content': [{'end': 1534.633, 'text': 'My curiosity is all about the very end when you talk about the main data that are already in the database.', 'start': 1529.912, 'duration': 4.721}, {'end': 1543.595, 'text': 'Do you envisage at all an additional validation stage where you can check actually that what you query,', 'start': 1535.253, 'duration': 8.342}, {'end': 1546.316, 'text': 'that the query itself gives what you would expect?', 'start': 1543.595, 'duration': 2.721}, {'end': 1552.343, 'text': "So I normally, when I'm making data engineering, I do that at that stage.", 'start': 1547.88, 'duration': 4.463}, {'end': 1560.106, 'text': 'So I will make a union of the two data sets and compare the, if I run a query, that gives me the same result as it should be getting out at the end.', 'start': 1552.823, 'duration': 7.283}], 'summary': 'Suggests adding a validation stage to check query results, to ensure accurate data retrieval.', 
'duration': 30.194, 'max_score': 1529.912, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg1529912.jpg'}, {'end': 1613.995, 'src': 'embed', 'start': 1586.504, 'weight': 3, 'content': [{'end': 1595.01, 'text': 'So, usually, what would you recommend as the best practice, to monitor after each and every step, or monitor at the end,', 'start': 1586.504, 'duration': 8.506}, {'end': 1600.073, 'text': "after everything is done? If you're starting from scratch? Yeah,", 'start': 1595.01, 'duration': 5.063}, {'end': 1607.71, 'text': 'Airflow is a good way to go.', 'start': 1604.167, 'duration': 3.543}, {'end': 1609.411, 'text': 'Use Jenkins to do it.', 'start': 1608.33, 'duration': 1.081}, {'end': 1610.772, 'text': 'Not as good, in my opinion.', 'start': 1609.671, 'duration': 1.101}, {'end': 1613.995, 'text': 'You can have monitoring that.', 'start': 1612.333, 'duration': 1.662}], 'summary': 'Airflow is recommended for monitoring data pipelines; Jenkins can also be used, though the speaker considers it not as good.', 'duration': 27.491, 'max_score': 1586.504, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg1586504.jpg'}, {'end': 1722.911, 'src': 'embed', 'start': 1694.522, 'weight': 2, 'content': [{'end': 1697.783, 'text': 'In fact, what you should be doing is even standardizing the way that you extract data.', 'start': 1694.522, 'duration': 3.261}, {'end': 1704.085, 'text': "So what I've taught my data engineers and what I've implemented is, for all these different databases,", 'start': 1698.043, 'duration': 6.042}, {'end': 1710.287, 'text': 'I have created Python code that extracts data from databases, FTP sites, APIs, S3, in a standardized format.', 'start': 1704.085, 'duration': 6.202}, {'end': 1714.008, 'text': 'You give it some configuration, and then it will spin out the monitoring for you.', 'start': 1710.327, 'duration': 3.681}, {'end': 1716.429, 'text': 'or the validation stuff for you during that 
process.', 'start': 1714.548, 'duration': 1.881}, {'end': 1718.829, 'text': "So you don't have to code it a thousand times for the same thing.", 'start': 1716.449, 'duration': 2.38}, {'end': 1722.911, 'text': 'Better? Yeah.', 'start': 1720.87, 'duration': 2.041}], 'summary': 'Standardized data extraction using Python code for different databases and sources', 'duration': 28.389, 'max_score': 1694.522, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg1694522.jpg'}], 'start': 1453.619, 'title': 'Challenges and best practices in data pipelines', 'summary': 'Covers the impact of automation on job roles, desire for meaningful work, and importance of data validation. It also discusses best practices for monitoring data pipelines, including the use of Airflow and Jenkins, creating custom exceptions, and standardizing data extraction.', 'chapters': [{'end': 1585.804, 'start': 1453.619, 'title': 'Challenges in data pipelines', 'summary': 'Discusses the impact of automation on job roles, the desire for more meaningful work in the data industry, and the importance of data validation, with a brief mention of adopting European units of measurement.', 'duration': 132.185, 'highlights': ['The impact of automation on job roles in the data pipeline is discussed, emphasizing that making processes easier and automating them may result in fewer jobs for data engineers.', 'The speaker expresses a desire to work on more interesting tasks, describing the current role as mundane, and expresses a wish to focus on gaining insights from data.', 'The importance of data validation is highlighted, with a specific focus on the need for validating queries and ensuring that the results align with expectations, mentioning the use of union and comparison of data sets during the validation process.', 'A brief mention is made about the adoption of European units of measurement, with the speaker humorously dismissing any recommendations for English 
people.', 'The talk briefly touches on the topic of monitoring in the data pipeline, indicating an interest in discussing this aspect further.']}, {'end': 1786.697, 'start': 1586.504, 'title': 'Best practices for monitoring data pipelines', 'summary': 'Discusses best practices for monitoring data pipelines, including the use of Airflow and Jenkins for monitoring, creating custom exceptions in Python code, and standardizing data extraction to streamline monitoring and validation.', 'duration': 200.193, 'highlights': ['Standardizing data extraction in Python code for different databases, FTP sites, APIs, and S3 to streamline monitoring and validation: The speaker emphasizes the importance of creating Python code for standardized data extraction from various sources such as databases, FTP sites, APIs, and S3, which streamlines the monitoring and validation process.', 'Using Airflow as a recommended tool for monitoring data pipelines: The speaker recommends Airflow as a good tool for monitoring data pipelines, highlighting its effectiveness in the monitoring process.', 'Creating custom exceptions in Python code for different problems to facilitate identifying and addressing issues: The speaker suggests creating custom exceptions in Python code to identify different problems and facilitate easier issue identification and resolution.', 'Recommendation to monitor data extraction and standardize the process throughout the development phase: The speaker recommends monitoring data extraction and standardizing the process throughout the development phase to streamline the monitoring and validation process.']}], 'duration': 333.078, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pzfgbSfzhXg/pics/pzfgbSfzhXg1453619.jpg', 'highlights': ['The impact of automation on job roles in the data pipeline is discussed, emphasizing that making processes easier and automating them may result in fewer jobs for data engineers.', 'The importance of data validation is 
highlighted, with a specific focus on the need for validating queries and ensuring that the results align with expectations, mentioning the use of union and comparison of data sets during the validation process.', 'Standardizing data extraction in Python code for different databases, FTP sites, APIs, and S3 to streamline monitoring and validation: The speaker emphasizes the importance of creating Python code for standardized data extraction from various sources such as databases, FTP sites, APIs, and S3, which streamlines the monitoring and validation process.', 'Using Airflow as a recommended tool for monitoring data pipelines: The speaker recommends Airflow as a good tool for monitoring data pipelines, highlighting its effectiveness in the monitoring process.']}], 'highlights': ['The need for standardization to address the varying standards between different regions, exemplified by the difficulties John Beer faced in ensuring quality and pricing in different locations.', "Historical attempts at standardization, such as King Edgar's ordinance in 959 and the Weights and Measures Act in 1824, demonstrating the ongoing efforts to establish uniform standards.", 'Challenges of standardizing data transport formats and convincing stakeholders to adopt a unified approach.', 'Working with diverse data sources, including JSON, XML, CSV, and various compression formats.', 'The importance of maintaining a trusted and reliable data warehouse.', "Understanding the data consumer's needs, such as how quickly insights must be delivered (seconds, hours, or days), is crucial for determining the technology used.", 'Preserving raw data is essential to avoid data loss; raw data should not be deleted or moved.', 'Maintain data transformations over time to ensure the extraction of data into the main table and minimize the chance of errors.', 'Separating extract, transform, and load processes minimizes the risk of failure in one process affecting the others, enhancing efficiency and reliability.', 'The impact of automation on job roles in the 
data pipeline is discussed, emphasizing that making processes easier and automating them may result in fewer jobs for data engineers.', 'The importance of data validation is highlighted, with a specific focus on the need for validating queries and ensuring that the results align with expectations, mentioning the use of union and comparison of data sets during the validation process.', 'Standardizing data extraction in Python code for different databases, FTP sites, APIs, and S3 to streamline monitoring and validation: The speaker emphasizes the importance of creating Python code for standardized data extraction from various sources such as databases, FTP sites, APIs, and S3, which streamlines the monitoring and validation process.', 'Using Airflow as a recommended tool for monitoring data pipelines: The speaker recommends Airflow as a good tool for monitoring data pipelines, highlighting its effectiveness in the monitoring process.']}