Coursnap

title
18. Quality and Safety for LLM Applications| Andrew Ng | DeepLearning.ai - Full Course

description
The course comes from [https://learn.deeplearning.ai/quality-safety-llm-applications/lesson/1/introduction](https://learn.deeplearning.ai/quality-safety-llm-applications/lesson/1/introduction) created by Andrew Ng This short course on "Quality and Safety for LLM Applications" by Andrew Ng, in partnership with Wabs, focuses on ensuring the reliability and safety of Language Model (LM) applications. The course addresses common issues such as data leakage, prompt injections, hallucinations, and toxicity in LM outputs. It introduces tools and techniques to detect, measure, and mitigate these problems, utilizing open-source packages like PyEM, Lanets, and YLogs, along with insights from practitioners and researchers. The course emphasizes the ongoing process of ensuring quality and safety, particularly in long-term deployment of LM-powered applications. Get free course notes: https://t.me/NoteForYoutubeCourse

detail
{'title': '18. Quality and Safety for LLM Applications| Andrew Ng | DeepLearning.ai - Full Course', 'heatmap': [], 'summary': 'Discusses the importance of using metrics to ensure the quality and safety of llm-powered apps, highlighting common risks like prompt injections, hallucinations, data leakage, and toxicity. it also covers techniques, tools, and best practices for detection, assessment, and mitigation of these issues, leveraging insights from industry experts.', 'chapters': [{'end': 67.21, 'segs': [{'end': 67.21, 'src': 'embed', 'start': 23.928, 'weight': 0, 'content': [{'end': 26.609, 'text': 'Maybe you can throw something together in days or weeks.', 'start': 23.928, 'duration': 2.681}, {'end': 33.153, 'text': "But the process of then understanding if it's safe to deploy then holds up as getting into actual usage.", 'start': 27.129, 'duration': 6.024}, {'end': 38.738, 'text': 'This short course goes over the most common ways an LLM application can go wrong.', 'start': 34.113, 'duration': 4.625}, {'end': 47.207, 'text': 'You hear about prompt injections, hallucinations, data leakage, and toxicity, plus tools to mitigate the risks.', 'start': 39.319, 'duration': 7.888}, {'end': 55.117, 'text': "I'm delighted to introduce the instructor for this course, Bernice Renan, who is Senior Data Scientist at YLABS.", 'start': 48.251, 'duration': 6.866}, {'end': 60.102, 'text': 'Bernice has worked for the last six years on evaluation and metrics for AI systems,', 'start': 55.818, 'duration': 4.284}, {'end': 66.208, 'text': "and I've had the pleasure of collaborating with her a few times already, since YLABS is a portfolio company of my team, AI Fund.", 'start': 60.102, 'duration': 6.106}, {'end': 67.21, 'text': 'Thanks, Andrew.', 'start': 66.51, 'duration': 0.7}], 'summary': 'Llm application risks and mitigation strategies by bernice renan, senior data scientist at ylabs.', 'duration': 43.282, 'max_score': 23.928, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns23928.jpg'}], 'start': 0.089, 'title': 'Quality and safety for llm applications', 'summary': 'Discusses the importance of using metrics to ensure the quality and safety of llm-powered apps, highlighting common risks like prompt injections, hallucinations, data leakage, and toxicity, with insights from bernice renan, senior data scientist at ylabs.', 'chapters': [{'end': 67.21, 'start': 0.089, 'title': 'Quality and safety for llm applications', 'summary': 'Discusses the importance of using metrics to ensure the quality and safety of llm-powered apps, highlighting the common risks such as prompt injections, hallucinations, data leakage, and toxicity, with insights from bernice renan, senior data scientist at ylabs.', 'duration': 67.121, 'highlights': ['The process of understanding the safety of deploying an LLM app can be slow after the quick proof of concept, often taking days or weeks (quantifiable data: time taken for safety evaluation).', 'Bernice Renan, Senior Data Scientist at YLABS, with six years of experience, shares insights on the evaluation and metrics for AI systems, emphasizing the importance of understanding the risks associated with LLM applications (quantifiable data: years of experience).', 'Common risks associated with LLM applications include prompt injections, hallucinations, data leakage, and toxicity, and the course provides tools to mitigate these risks (quantifiable data: types of risks and risk mitigation tools).']}], 'duration': 67.121, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns89.jpg', 'highlights': ['Common risks associated with LLM applications include prompt injections, hallucinations, data leakage, and toxicity, and the course provides tools to mitigate these risks (quantifiable data: types of risks and risk mitigation tools).', 'The process of understanding the safety of deploying an LLM app can be slow after the quick proof of concept, often taking days or weeks (quantifiable data: time taken for safety evaluation).', 'Bernice Renan, Senior Data Scientist at YLABS, with six years of experience, shares insights on the evaluation and metrics for AI systems, emphasizing the importance of understanding the risks associated with LLM applications (quantifiable data: years of experience).']}, {'end': 689.794, 'segs': [{'end': 96.404, 'src': 'embed', 'start': 67.851, 'weight': 3, 'content': [{'end': 75.754, 'text': "I've been seeing a lot of LLM safety and quality issues across a lot of companies, and I'm excited to share best practices from the field.", 'start': 67.851, 'duration': 7.903}, {'end': 77.134, 'text': 'In this course,', 'start': 76.134, 'duration': 1}, {'end': 86.658, 'text': 'you learn to look for data leakage where personal information such as names and URL addresses might appear in either the input prompts or the output responses of the LLM.', 'start': 77.134, 'duration': 9.524}, {'end': 96.404, 'text': 'You also learn to detect prompt injections where a prompt attempts to get an element to output a response that it is supposed to refuse, for example,', 'start': 87.558, 'duration': 8.846}], 'summary': 'Llm safety and quality issues observed across companies, sharing best practices.', 'duration': 28.553, 'max_score': 67.851, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns67851.jpg'}, {'end': 163.831, 'src': 'embed', 'start': 139.371, 'weight': 2, 'content': [{'end': 145.897, 'text': 'Practitioners and researchers have been experimenting with countless LLM applications that could benefit society,', 'start': 139.371, 'duration': 6.526}, {'end': 150.141, 'text': 'but measuring how well the system works is a necessary step to the development process.', 'start': 145.897, 'duration': 4.244}, {'end': 158.447, 'text': 'In fact, even after a system is deployed, ensuring quality and safety of your AI application will continue to be an ongoing process.', 'start': 150.862, 'duration': 7.585}, {'end': 163.831, 'text': 'Ensuring your system works long-term requires techniques that work at scale.', 'start': 159.068, 'duration': 4.763}], 'summary': 'Llm applications are being experimented with, emphasizing the importance of measuring system performance and ensuring ongoing quality and safety.', 'duration': 24.46, 'max_score': 139.371, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns139371.jpg'}, {'end': 210.906, 'src': 'embed', 'start': 183.381, 'weight': 1, 'content': [{'end': 188.784, 'text': 'From DeepLearning.ai, Ellie Xu and Diala Azadeh had also contributed to this course.', 'start': 183.381, 'duration': 5.403}, {'end': 196.868, 'text': "The first lesson will give you a hands-on overview of methods and tools that you'll see throughout the course to help you detect data leakage,", 'start': 189.264, 'duration': 7.604}, {'end': 198.509, 'text': 'jailbreaks and hallucinations.', 'start': 196.868, 'duration': 1.641}, {'end': 200.437, 'text': 'That sounds great.', 'start': 199.636, 'duration': 0.801}, {'end': 203.139, 'text': "Let's go on to the next video and get started.", 'start': 200.837, 'duration': 2.302}, {'end': 210.906, 'text': "In this lesson, I'll introduce the dataset of LLM prompts and responses that we'll use throughout the course.", 'start': 204.46, 'duration': 6.446}], 'summary': 'Introduction to course contributors and datasets for detecting data leakage, jailbreaks, and hallucinations.', 'duration': 27.525, 'max_score': 183.381, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns183381.jpg'}, {'end': 575.576, 'src': 'embed', 'start': 529.523, 'weight': 0, 'content': [{'end': 535.485, 'text': 'A hallucination looks readable and coherent and looks like it could be a valid response to the prompt.', 'start': 529.523, 'duration': 5.962}, {'end': 542.087, 'text': "Hallucinations are really interesting because they're hard to measure, and there's many different ways people have proposed to measure them.", 'start': 536.105, 'duration': 5.982}, {'end': 544.668, 'text': "We're only going to look at two in this course.", 'start': 542.687, 'duration': 1.981}, {'end': 548.758, 'text': "Right now, let's look at prompt response relevance.", 'start': 545.836, 'duration': 2.922}, {'end': 557.964, 'text': 'A common way for practitioners to measure relevance is by looking at how similar the response from an LLM is to the prompt it was given.', 'start': 549.519, 'duration': 8.445}, {'end': 563.228, 'text': 'We use the cosine similarity of the sentence embeddings to do that in LaneKit.', 'start': 558.325, 'duration': 4.903}, {'end': 572.115, 'text': "We'll import the input output module from LaneKit.", 'start': 569.073, 'duration': 3.042}, {'end': 575.576, 'text': "We'll use one of the helper methods to help us visualize this.", 'start': 572.615, 'duration': 2.961}], 'summary': 'Hallucinations are hard to measure; relevance is measured using cosine similarity in lanekit.', 'duration': 46.053, 'max_score': 529.523, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns529523.jpg'}], 'start': 67.851, 'title': 'Llm safety and detection', 'summary': 'Discusses llm safety, quality issues, and best practices to detect data leakage, prompt injections, and implicit toxicity using techniques like self-checked gpt framework. it also covers the importance of measuring llm application performance, techniques to detect issues, and demonstrates the use of open-source python packages, nand codes, ylogs, and huggingfix tools for mitigation.', 'chapters': [{'end': 121.929, 'start': 67.851, 'title': 'Llm safety best practices', 'summary': 'Discusses llm safety and quality issues, and best practices to detect data leakage, prompt injections, and implicit toxicity using techniques like self-checked gpt framework.', 'duration': 54.078, 'highlights': ['You learn to look for data leakage where personal information might appear in either the input prompts or the output responses of the LLM.', 'You also learn to detect prompt injections where a prompt attempts to get an element to output a response that it is supposed to refuse, for example, reviewing instructions for causing harm.', 'Implicit toxicity models go beyond identifying toxic words and can detect more subtle forms of toxicity where the words may sound innocent but are meaningless not.', 'Identify when responses are more likely to be hallucinations, using the self-checked GPT framework.']}, {'end': 689.794, 'start': 121.929, 'title': 'Detecting llm issues with python', 'summary': 'Discusses the importance of measuring the performance of llm applications, introduces techniques to detect issues such as data leakage, jailbreaks, and hallucinations, and demonstrates the use of open-source python packages, nand codes, ylogs, and huggingfix tools for mitigating these issues.', 'duration': 567.865, 'highlights': ["Measuring LLM Performance Practitioners and researchers have been experimenting with countless LLM applications, emphasizing the necessity of measuring the system's performance, particularly in detecting issues like data leakage, prompt injections, and hallucinations.", 'Detecting LLM Issues Introducing techniques to detect data leakage, jailbreaks, and hallucinations, using open-source Python packages, NAND codes, YLOGs, and HuggingFix tools to mitigate these issues.', 'Importance of Relevance in LLM Responses Emphasizing the significance of relevance in LLM responses, including the use of cosine similarity of sentence embeddings to measure relevance, with low scores indicating a higher likelihood of hallucinations.']}], 'duration': 621.943, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns67851.jpg', 'highlights': ['Identify when responses are more likely to be hallucinations, using the self-checked GPT framework.', 'Detecting LLM Issues Introducing techniques to detect data leakage, jailbreaks, and hallucinations, using open-source Python packages, NAND codes, YLOGs, and HuggingFix tools to mitigate these issues.', "Measuring LLM Performance Practitioners and researchers have been experimenting with countless LLM applications, emphasizing the necessity of measuring the system's performance, particularly in detecting issues like data leakage, prompt injections, and hallucinations.", 'You learn to look for data leakage where personal information might appear in either the input prompts or the output responses of the LLM.', 'Importance of Relevance in LLM Responses Emphasizing the significance of relevance in LLM responses, including the use of cosine similarity of sentence embeddings to measure relevance, with low scores indicating a higher likelihood of hallucinations.']}, {'end': 1131.81, 'segs': [{'end': 792.231, 'src': 'embed', 'start': 762.106, 'weight': 1, 'content': [{'end': 769.89, 'text': 'Phone numbers, email addresses, and other personally identifiable information tend to have a lot of structure that lends well to regex.', 'start': 762.106, 'duration': 7.784}, {'end': 773.532, 'text': "We're going to import lane kit metrics for data leakage.", 'start': 770.51, 'duration': 3.022}, {'end': 778.554, 'text': "Now we'll use the same helper function to visualize the metric it creates.", 'start': 773.992, 'duration': 4.562}, {'end': 792.231, 'text': 'We see email addresses, phone numbers, mailing addresses, and social security numbers in our dataset.', 'start': 786.789, 'duration': 5.442}], 'summary': 'Using regex to identify structured personal information for data leakage detection.', 'duration': 30.125, 'max_score': 762.106, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns762106.jpg'}, {'end': 874.101, 'src': 'embed', 'start': 838.428, 'weight': 0, 'content': [{'end': 841.148, 'text': 'We can see prompt toxicity is really long-tailed.', 'start': 838.428, 'duration': 2.72}, {'end': 845.769, 'text': 'Most of the toxicity is at zero, and only a few have higher values.', 'start': 841.788, 'duration': 3.981}, {'end': 856.281, 'text': 'we see a similar trend for response toxicity.', 'start': 853.356, 'duration': 2.925}, {'end': 864.297, 'text': "So you sometimes see an LLM respond with, sorry, I can't answer that, or I can't help you with that kind of request.", 'start': 857.614, 'duration': 6.683}, {'end': 874.101, 'text': "This is a refusal, where the LLM detects that the prompt may ask it to do something it's not programmed to do, so it provides a non-response.", 'start': 864.697, 'duration': 9.404}], 'summary': 'Prompt toxicity is long-tailed with most at zero; similar trend for response toxicity.', 'duration': 35.673, 'max_score': 838.428, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns838428.jpg'}, {'end': 963.549, 'src': 'embed', 'start': 930.426, 'weight': 2, 'content': [{'end': 935.488, 'text': "If you look at the distribution of jailbreaks, you'll see lots of near ones and zeros.", 'start': 930.426, 'duration': 5.062}, {'end': 938.789, 'text': "That's because the model is pretty confident in many examples.", 'start': 936.048, 'duration': 2.741}, {'end': 947.132, 'text': 'This particular dataset over-represents jailbreaks for learning purposes, but these would normally be very rare in real-world datasets.', 'start': 939.349, 'duration': 7.783}, {'end': 953.341, 'text': "Now let's look at the examples that are most likely to be prompt injections.", 'start': 949.398, 'duration': 3.943}, {'end': 963.549, 'text': 'We can see here very complicated prompts, prompts that have lots of redirections, such as, I am a programmer, and please answer in certain ways.', 'start': 955.342, 'duration': 8.207}], 'summary': 'Model is confident in jailbreak examples; over-represented for learning. complicated prompts for injection.', 'duration': 33.123, 'max_score': 930.426, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns930426.jpg'}, {'end': 1008.927, 'src': 'embed', 'start': 976.431, 'weight': 4, 'content': [{'end': 981.175, 'text': "To do this, I made a dashboard using Y logs that we'll use to see how well we're doing.", 'start': 976.431, 'duration': 4.744}, {'end': 985.278, 'text': 'To use it, we just pass in the examples that we believe are problematic.', 'start': 981.795, 'duration': 3.483}, {'end': 998.729, 'text': "You see here that we're still failing all of our objectives except for one.", 'start': 994.585, 'duration': 4.144}, {'end': 1004.746, 'text': 'And our final objective, we just need less than five total false positives.', 'start': 999.625, 'duration': 5.121}, {'end': 1008.927, 'text': "Because we haven't passed in any data, we definitely haven't gotten five yet.", 'start': 1005.286, 'duration': 3.641}], 'summary': "Dashboard shows we're failing most objectives, need <5 false positives.", 'duration': 32.496, 'max_score': 976.431, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns976431.jpg'}, {'end': 1103.631, 'src': 'embed', 'start': 1078.407, 'weight': 5, 'content': [{'end': 1085.012, 'text': 'The next lessons are all about discovering and creating new metrics to identify these issues and get all of those tests green.', 'start': 1078.407, 'duration': 6.605}, {'end': 1095.141, 'text': 'In this lesson, we will detect hallucinations in our data, which represent an inaccurate or irrelevant response to a prompt.', 'start': 1087.054, 'duration': 8.087}, {'end': 1101.389, 'text': 'How do we determine if an LLM is hallucinating? We start by measuring text similarity.', 'start': 1096.104, 'duration': 5.285}, {'end': 1103.631, 'text': "Let's take a look at how to do that now.", 'start': 1102.129, 'duration': 1.502}], 'summary': 'Lessons focus on creating new metrics to detect data hallucinations and measuring text similarity.', 'duration': 25.224, 'max_score': 1078.407, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1078407.jpg'}], 'start': 690.274, 'title': 'Llm metrics & security assessment', 'summary': 'Delves into llm metrics encompassing response relevance, self-similarity, data leakage, and toxicity, emphasizing the skewed toxicity value distribution. additionally, it explores jailbreak prompt injections, assessing security metrics, data quality, and hallucination detection, with emphasis on text similarity measurement.', 'chapters': [{'end': 864.297, 'start': 690.274, 'title': 'Llm metrics & discoveries', 'summary': 'Discusses llm metrics including prompt response relevance, response self-similarity, data leakage through regex pattern matching, and toxicity metrics for prompts and responses, highlighting the long-tailed distribution of toxicity values.', 'duration': 174.023, 'highlights': ['The chapter discusses LLM metrics including prompt response relevance, response self-similarity, data leakage through regex pattern matching, and toxicity metrics for prompts and responses. The chapter covers various LLM metrics such as prompt response relevance, response self-similarity, data leakage through regex pattern matching, and toxicity metrics for prompts and responses.', 'The toxicity metric for prompts and responses shows a long-tailed distribution with most toxicity values at zero and only a few having higher values. The toxicity metric for prompts and responses reveals a long-tailed distribution with the majority of toxicity values at zero and a few instances with higher values.', 'Data leakage is addressed through regex pattern matching, particularly for identifying personally identifiable information like email addresses, phone numbers, mailing addresses, and social security numbers. The discussion includes addressing data leakage through regex pattern matching to identify personally identifiable information such as email addresses, phone numbers, mailing addresses, and social security numbers.']}, {'end': 1131.81, 'start': 864.697, 'title': 'Jailbreak prompt injections', 'summary': 'Explores the jailbreak prompt injections, discusses the distribution of jailbreaks, evaluates security and data quality metrics, and addresses the detection of hallucinations in llm, with a focus on text similarity measurement.', 'duration': 267.113, 'highlights': ['The model is pretty confident in many examples, but jailbreaks would normally be very rare in real-world datasets. The distribution of jailbreaks shows that the model is confident in many examples, but these prompt injections would typically be rare in real-world datasets.', 'The dataset over-represents jailbreaks for learning purposes. The dataset contains an over-representation of jailbreaks for educational purposes.', 'The course aims to detect problematic examples and build metrics for security and data quality. The course focuses on identifying problematic examples and constructing metrics to assess security and data quality.', 'The dashboard shows that the objectives are not being met, except for one, and the final objective requires less than five total false positives. The dashboard indicates that most objectives are not being achieved, except for one, and the final objective sets a threshold of less than five total false positives.', 'The course encourages trying different filters to identify various problematic examples in the data, such as filtering for long prompts. The course recommends experimenting with different filters, like filtering for long prompts, to uncover different types of problematic examples in the data.', 'The lesson focuses on detecting hallucinations in the data by measuring text similarity. The lesson emphasizes the detection of hallucinations by measuring text similarity to identify low-quality and inaccurate responses.']}], 'duration': 441.536, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns690274.jpg', 'highlights': ['The toxicity metric for prompts and responses reveals a long-tailed distribution with the majority of toxicity values at zero and a few instances with higher values.', 'Data leakage is addressed through regex pattern matching, particularly for identifying personally identifiable information such as email addresses, phone numbers, mailing addresses, and social security numbers.', 'The distribution of jailbreaks shows that the model is confident in many examples, but these prompt injections would typically be rare in real-world datasets.', 'The dataset contains an over-representation of jailbreaks for educational purposes.', 'The dashboard indicates that most objectives are not being achieved, except for one, and the final objective sets a threshold of less than five total false positives.', 'The lesson emphasizes the detection of hallucinations by measuring text similarity to identify low-quality and inaccurate responses.']}, {'end': 1559.333, 'segs': [{'end': 1223.275, 'src': 'embed', 'start': 1156.665, 'weight': 0, 'content': [{'end': 1160.748, 'text': "We'll look at four different metrics, which you can see here, all of different characteristics,", 'start': 1156.665, 'duration': 4.083}, {'end': 1163.709, 'text': "And we'll talk about the details of each when we go along.", 'start': 1161.388, 'duration': 2.321}, {'end': 1165.811, 'text': "First, let's get started with setup.", 'start': 1164.25, 'duration': 1.561}, {'end': 1172.575, 'text': "We'll import our helpers module that we've been using throughout the course.", 'start': 1168.032, 'duration': 4.543}, {'end': 1177.247, 'text': "And now we'll import Evaluate.", 'start': 1175.225, 'duration': 2.022}, {'end': 1187.034, 'text': 'So Evaluate is a Hugging Face library that includes a number of different evaluation metrics for machine learning.', 'start': 1177.887, 'duration': 9.147}, {'end': 1192.579, 'text': 'So my own research is largely centered around evaluation metrics for machine learning.', 'start': 1187.455, 'duration': 5.124}, {'end': 1198.844, 'text': "There's something really painful about using evaluation metrics and implementing those evaluation metrics.", 'start': 1193.039, 'duration': 5.805}, {'end': 1203.566, 'text': "Often, they're not fully described in the papers or resources when they're first created.", 'start': 1199.284, 'duration': 4.282}, {'end': 1209.389, 'text': "And additionally, they're rarely implemented exactly the same way across open source tools.", 'start': 1203.586, 'duration': 5.803}, {'end': 1214.711, 'text': "That's why it can be really helpful to have packages like evaluate gain popularity.", 'start': 1209.929, 'duration': 4.782}, {'end': 1223.275, 'text': 'When one package with a single implementation gains popularity, we start to find more of a consensus on the implementation details.', 'start': 1215.611, 'duration': 7.664}], 'summary': 'Discussion on evaluating machine learning models using various metrics and challenges in their implementation and description.', 'duration': 66.61, 'max_score': 1156.665, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1156665.jpg'}, {'end': 1271.533, 'src': 'embed', 'start': 1245.5, 'weight': 4, 'content': [{'end': 1251.782, 'text': "Blue scores give us a score from 0 to 1, but the score that's given really depends on the dataset.", 'start': 1245.5, 'duration': 6.282}, {'end': 1261.688, 'text': 'For example, the original paper that introduced the metric saw blue scores between 0.05 and 0.26.', 'start': 1252.503, 'duration': 9.185}, {'end': 1264.71, 'text': 'Other instances have blue scores up to 0.8.', 'start': 1261.688, 'duration': 3.022}, {'end': 1271.533, 'text': "It really depends on the data set that you're using, and they're not easily comparable across data sets or tasks.", 'start': 1264.71, 'duration': 6.823}], 'summary': 'Blue scores range from 0.05 to 0.8 depending on dataset.', 'duration': 26.033, 'max_score': 1245.5, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1245500.jpg'}, {'end': 1381.512, 'src': 'embed', 'start': 1329.706, 'weight': 5, 'content': [{'end': 1337.67, 'text': "If you're curious about where those precision scores came from, they're all about comparing tokens across the two text references.", 'start': 1329.706, 'duration': 7.964}, {'end': 1342.393, 'text': "So for unigrams, we're looking for a single token.", 'start': 1338.53, 'duration': 3.863}, {'end': 1345.755, 'text': "And tokens are often words, although they don't have to be.", 'start': 1342.553, 'duration': 3.202}, {'end': 1347.557, 'text': 'A single token.', 'start': 1345.775, 'duration': 1.782}, {'end': 1354.742, 'text': 'And do we see the presence of that token in both text examples? A bigram is one step up from that.', 'start': 1347.577, 'duration': 7.165}, {'end': 1359.406, 'text': "We're not looking for individual words, but we're looking for pairs of words together.", 'start': 1355.122, 'duration': 4.284}, {'end': 1361.427, 'text': 'in both examples.', 'start': 1360.467, 'duration': 0.96}, {'end': 1371.49, 'text': 'So despite there being a lot of common language between these two, the only true bigram match of these two are in B.', 'start': 1361.847, 'duration': 9.643}, {'end': 1374.831, 'text': 'And BlueScore is calculated using these comparisons.', 'start': 1371.49, 'duration': 3.341}, {'end': 1381.512, 'text': 'So we progressively measure unigrams, bigrams, trigrams, and other engrams.', 'start': 1375.291, 'duration': 6.221}], 'summary': 'Comparing tokens for precision scores, including unigrams and bigrams, to calculate bluescore.', 'duration': 51.806, 'max_score': 1329.706, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1329706.jpg'}, {'end': 1545.392, 'src': 'embed', 'start': 1516.253, 'weight': 7, 'content': [{'end': 1521.316, 'text': 'BERT score uses embeddings to find a semantic match between words.', 'start': 1516.253, 'duration': 5.063}, {'end': 1529.601, 'text': 'So how does this work? We take our two text samples and we calculate contextual embeddings for each of the specific words.', 'start': 1521.876, 'duration': 7.725}, {'end': 1537.647, 'text': 'Contextual embeddings are different from static embeddings because they give different embedding values depending on the context around the word.', 'start': 1529.821, 'duration': 7.826}, {'end': 1545.392, 'text': 'You can see the difference most easily for words like bank, which could mean snow bank or a bank that you take your money to.', 'start': 1538.047, 'duration': 7.345}], 'summary': 'Bert score uses contextual embeddings to find semantic matches between words.', 'duration': 29.139, 'max_score': 1516.253, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1516253.jpg'}], 'start': 1132.19, 'title': 'Evaluating ml metrics and blue scores', 'summary': 'Explores the use of different metrics for evaluating machine learning models, emphasizing challenges in implementation and comparison. it also analyzes blue scores for prompt response relevance, focusing on precision values and token comparisons, and discusses the calculation, visualization, and limitations of blue scores, along with the semantic matching approach of bert scores using contextual embeddings.', 'chapters': [{'end': 1223.275, 'start': 1132.19, 'title': 'Exploring evaluation metrics for machine learning', 'summary': 'Discusses the approach of using different metrics for evaluating machine learning models, highlighting the challenges in implementing and comparing evaluation metrics in research.', 'duration': 91.085, 'highlights': ['Evaluate is a Hugging Face library that includes a number of different evaluation metrics for machine learning, which can help in evaluating the performance of machine learning models.', 'The research focuses on the challenges of using evaluation metrics in machine learning, including the lack of comprehensive descriptions and varying implementations across different tools.', 'The chapter emphasizes the importance of having standardized packages for evaluation metrics to establish a consensus on implementation details and improve the reliability of model evaluations.', 'The chapter discusses the use of four different metrics for evaluating machine learning models, each with unique characteristics, providing a comprehensive approach to model assessment.']}, {'end': 1381.512, 'start': 1224.275, 'title': 'Analyzing blue scores for prompt response relevance', 'summary': 'Discusses the use of blue scores in evaluating prompt response relevance, highlighting its limitations and variability in scores, with a focus on precision values and comparisons of tokens.', 'duration': 157.237, 'highlights': ['Blue scores are used to measure prompt response relevance and yield scores between 0 and 1, but their variability across different datasets and tasks makes them challenging to compare. Blue scores provide a measure of prompt response relevance, with scores ranging from 0 to 1, but their comparability is hindered by variability across datasets.', 'The precision values are used to compare tokens across text references, with consideration for unigrams, bigrams, and other engrams. Precision values are employed to compare tokens across text references, focusing on unigrams, bigrams, and other engrams.', 'The calculation of blue scores involves comparing tokens across text examples using unigrams, bigrams, trigrams, and other engrams. The process of calculating blue scores entails comparing tokens across text examples through the evaluation of unigrams, bigrams, trigrams, and other engrams.']}, {'end': 1559.333, 'start': 1382.253, 'title': 'Calculating blue score and bert score metrics', 'summary': "Discusses the calculation and visualization of blue scores, showcasing a heavy tail distribution with many low scores and a few close to 0.5, followed by an explanation of bert score's semantic matching approach using contextual embeddings.", 'duration': 177.08, 'highlights': ['The blue score calculation is demonstrated, revealing a heavy tail distribution with many low scores and some close to 0.5.', "The BERT score's semantic matching approach using contextual embeddings is explained, highlighting the differentiation from static embeddings based on contextual meaning."]}], 'duration': 427.143, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1132190.jpg', 'highlights': ['The chapter emphasizes the importance of having standardized packages for evaluation metrics to establish a consensus on implementation details and improve the reliability of model evaluations.', 'Evaluate is a Hugging Face library that includes a number of different evaluation metrics for machine learning, which can help in evaluating the performance of machine learning models.', 'The research focuses on the challenges of using evaluation metrics in machine learning, including the lack of comprehensive descriptions and varying implementations across different tools.', 'The chapter discusses the use of four different metrics for evaluating machine learning models, each with unique characteristics, providing a comprehensive approach to model assessment.', 'Blue scores are used to measure prompt response relevance and yield scores between 0 and 1, but their comparability is hindered by variability across datasets.', 'The precision values are used to compare tokens across text references, focusing on unigrams, bigrams, and other engrams.', 'The calculation of blue scores involves comparing tokens across text examples using unigrams, bigrams, trigrams, and other engrams.', "The BERT score's semantic matching approach using contextual embeddings is explained, highlighting the differentiation from static embeddings based on contextual meaning.", 'The blue score calculation is demonstrated, revealing a heavy tail distribution with many low scores and some close to 0.5.']}, {'end': 2963.116, 'segs': [{'end': 1647.551, 'src': 'embed', 'start': 1612.313, 'weight': 2, 'content': [{'end': 1618.276, 'text': 'OK, so our results are a precision, a recall value, an F1 score.', 'start': 1612.313, 'duration': 5.963}, {'end': 1624.519, 'text': 'And for those who are unfamiliar, F1 scores are a weighted average of precision and recall.', 'start': 1618.296, 'duration': 6.223}, {'end': 1627.941, 'text': "Let's go ahead and create a new metric for BERT scores.", 'start': 1624.939, 'duration': 3.002}, {'end': 1631.603, 'text': "First, we'll add our decorator.", 'start': 1630.302, 'duration': 1.301}, {'end': 1637.965, 'text': "Then we'll add our new bird score function.", 'start': 1635.703, 'duration': 2.262}, {'end': 1647.551, 'text': "And we'll make sure to return a list of the F1 scores as our metric.", 'start': 1642.428, 'duration': 5.123}], 'summary': 'Results include precision, recall, and f1 score, with plans to create new metric for bert scores.', 'duration': 35.238, 'max_score': 1612.313, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1612313.jpg'}, {'end': 1701.227, 'src': 'embed', 'start': 1674.599, 'weight': 0, 'content': [{'end': 1680.105, 'text': 'This one looks much more like a bell curve with the highest frequency values being in the middle.', 'start': 1674.599, 'duration': 5.506}, {'end': 1685.01, 'text': "So now let's look at some of the queries that give us low BERT scores.", 'start': 1680.705, 'duration': 4.305}, {'end': 1699.845, 'text': "So if we have a low BERT score, we're more concerned about this response being a hallucination, because the prompt is different from the response,", 'start': 1691.218, 'duration': 8.627}, {'end': 1701.227, 'text': 'at least according to this metric.', 'start': 1699.845, 'duration': 1.382}], 'summary': 'Data analysis: bell curve distribution, focus on low bert scores and response hallucinations.', 'duration': 26.628, 'max_score': 1674.599, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1674599.jpg'}, {'end': 1844.736, 'src': 'embed', 'start': 1811.892, 'weight': 3, 'content': [{'end': 1815.255, 'text': "And because it's a little long, I will push this onto the next line.", 'start': 1811.892, 'duration': 3.363}, {'end': 1829.972, 'text': "Now what do we compare it with? I say we give a threshold of, let's start with 0.75.", 'start': 1817.396, 'duration': 12.576}, {'end': 1831.472, 'text': 'less than 0.75.', 'start': 1829.972, 'duration': 1.5}, {'end': 1835.833, 'text': 'So remember that if we have a low BERT score,', 'start': 1831.472, 'duration': 4.361}, {'end': 1844.736, 'text': "this means that we're more concerned that a particular prompt and response may represent a hallucination on the part of the LLM.", 'start': 1835.833, 'duration': 8.903}], 'summary': 'Comparing responses with a threshold of 0.75 bert score to detect potential hallucination.', 'duration': 32.844, 'max_score': 1811.892, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1811892.jpg'}, {'end': 1936.228, 'src': 'embed', 'start': 1907.444, 'weight': 4, 'content': [{'end': 1918.173, 'text': 'One place that this became popular is the self-check GPT paper, which is a comparison of the response to multiple responses using a number of metrics,', 'start': 1907.444, 'duration': 10.729}, {'end': 1923.597, 'text': "including the ones that we've just used, like blue score and BERT score, as well as others.", 'start': 1918.173, 'duration': 5.424}, {'end': 1928.762, 'text': 'To use this multiple response paradigm, we need to download some new data.', 'start': 1924.258, 'duration': 4.504}, {'end': 1936.228, 'text': "Let's call this data set chatsExtended, and it's in our chatsExtended CSV.", 'start': 1930.902, 'duration': 5.326}], 'summary': 'Comparison of multiple responses using metrics like blue score and bert score in the self-check gpt paper.', 'duration': 28.784, 'max_score': 1907.444, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1907444.jpg'}, {'end': 2051.71, 'src': 'embed', 'start': 2020.888, 'weight': 5, 'content': [{'end': 2025.369, 'text': "If we want to compare two embeddings, we'll need to calculate a cosine similarity between them.", 'start': 2020.888, 'duration': 4.481}, {'end': 2031.23, 'text': "There's many ways to do this, but let's use a utility function from the sentence transformers package.", 'start': 2025.989, 'duration': 5.241}, {'end': 2039.272, 'text': "So now let's put in our decorator where we're looking at response and the two responses, response two and response three.", 'start': 2031.93, 'duration': 7.342}, {'end': 2044.413, 'text': "We'll create a metric called response.sentenceEmbeddingSelfSimilarity.", 'start': 2040.472, 'duration': 3.941}, {'end': 2051.71, 'text': 'So our decorator needs a function.', 'start': 2049.85, 'duration': 1.86}], 'summary': 'Comparing embeddings using cosine similarity and sentence transformers package', 'duration': 30.822, 'max_score': 2020.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns2020888.jpg'}, {'end': 2208.44, 'src': 'embed', 'start': 2184.822, 'weight': 6, 'content': [{'end': 2193.689, 'text': "So now that we're comparing responses to other responses, The differences that we capture are much more likely to be about the model.", 'start': 2184.822, 'duration': 8.867}, {'end': 2197.632, 'text': "We'd always suspect that there's some differences between prompt and response.", 'start': 2193.889, 'duration': 3.743}, {'end': 2204.897, 'text': 'So while that comparison is a good analogy, self-similarity across multiple responses are even better.', 'start': 2198.092, 'duration': 6.805}, {'end': 2208.44, 'text': "Let's look at which examples have the lowest self-similarity.", 'start': 2205.357, 'duration': 3.083}], 'summary': "Comparing responses to identify differences in the model's performance.", 'duration': 23.618, 'max_score': 2184.822, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns2184822.jpg'}, {'end': 2362.92, 'src': 'embed', 'start': 2332.861, 'weight': 7, 'content': [{'end': 2334.882, 'text': "Okay, so it's a pretty large prompt.", 'start': 2332.861, 'duration': 2.021}, {'end': 2340.185, 'text': "So the prompt we'll use asks for the first text passage, which is the first response.", 'start': 2335.302, 'duration': 4.883}, {'end': 2346.048, 'text': 'Can the LLM rate the consistency of that text to the provided context?', 'start': 2340.585, 'duration': 5.463}, {'end': 2347.729, 'text': 'which are the other two responses?', 'start': 2346.048, 'duration': 1.681}, {'end': 2352.411, 'text': 'The reason that we use the word consistency here is largely a choice.', 'start': 2348.189, 'duration': 4.222}, {'end': 2355.073, 'text': 'Another word might be similarity or things like this.', 'start': 2352.572, 'duration': 2.501}, {'end': 2362.92, 'text': 'But we find that consistency tends to be more about whether or not two sentences logically can be true at the same time.', 'start': 2355.253, 'duration': 7.667}], 'summary': 'The prompt asks for the llm to rate the consistency of the first text passage to the provided context and requests the other two responses, emphasizing the importance of logical truth.', 'duration': 30.059, 'max_score': 2332.861, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns2332861.jpg'}, {'end': 2573.966, 'src': 'embed', 'start': 2480.054, 'weight': 8, 'content': [{'end': 2485.617, 'text': "One thing that's difficult is getting a calibrated response when we ask an LLM for a number like this.", 'start': 2480.054, 'duration': 5.563}, {'end': 2493.462, 'text': "If we ask for numbers between 0 and 1, it's really difficult to understand what a 0.5 might mean or a 0.25 might mean.", 'start': 2486.257, 'duration': 7.205}, {'end': 2497.444, 'text': 'And those might change depending on slight nuances in your response.', 'start': 2493.882, 'duration': 3.562}, {'end': 2500.405, 'text': 'or between prompt to prompt.', 'start': 2498.124, 'duration': 2.281}, {'end': 2506.609, 'text': 'One approach in practice is to actually ask about specific sentences in the response.', 'start': 2500.966, 'duration': 5.643}, {'end': 2514.533, 'text': 'Is this specific sentence, the first sentence of our response, consistent with the whole second response?', 'start': 2507.249, 'duration': 7.284}, {'end': 2517.895, 'text': 'Some other ways you might change this prompt.', 'start': 2515.193, 'duration': 2.702}, {'end': 2525.499, 'text': 'instead of asking for a number between 0 and 1, we may try to calibrate by asking for categorical information.', 'start': 2517.895, 'duration': 7.604}, {'end': 2528.66, 'text': 'Maybe high, medium, low consistency.', 'start': 2525.799, 'duration': 2.861}, {'end': 2533.483, 'text': "Let's create a filter to look at self-similarity scores that are less than 0.8.", 'start': 2529.301, 'duration': 4.182}, {'end': 2543.368, 'text': "We'll pass in as our variable, the response dot prompted self-similarity.", 'start': 2533.483, 'duration': 9.885}, {'end': 2550.711, 'text': "Okay Let's see what we get.", 'start': 2543.388, 'duration': 7.323}, {'end': 2564.817, 'text': 'Okay, We have prompts here, such as this discover credit card issue,', 'start': 2550.731, 'duration': 14.086}, {'end': 2573.966, 'text': 'where some of the responses give a format for the credit card and other responses give some more details about the sorts of numbers.', 'start': 2564.817, 'duration': 9.149}], 'summary': 'Challenges in interpreting llm responses, suggest categorical information, filter for self-similarity scores <0.8', 'duration': 93.912, 'max_score': 2480.054, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns2480054.jpg'}, {'end': 2663.388, 'src': 'embed', 'start': 2629.511, 'weight': 11, 'content': [{'end': 2634.156, 'text': "Now we'll move on to the next lesson, lesson three, about data leakage and toxicity.", 'start': 2629.511, 'duration': 4.645}, {'end': 2635.137, 'text': 'See you there.', 'start': 2634.616, 'duration': 0.521}, {'end': 2644.205, 'text': "In this lesson, you'll practice detecting data leakage, which is where private data appears in either the prompt or the LLMs response.", 'start': 2636.538, 'duration': 7.667}, {'end': 2647.909, 'text': "You'll go from simple metrics to state-of-the-art methods.", 'start': 2645.026, 'duration': 2.883}, {'end': 2649.771, 'text': "Let's try this out together.", 'start': 2648.53, 'duration': 1.241}, {'end': 2654.566, 'text': "Let's get started with data leakage and a bonus section on toxicity.", 'start': 2651.345, 'duration': 3.221}, {'end': 2663.388, 'text': 'Unlike our previous lesson on hallucinations, which can be considered largely quality metrics, data leakage is more of a safety issue.', 'start': 2655.326, 'duration': 8.062}], 'summary': 'Lesson three covers detecting data leakage and toxicity in ai models, progressing from simple metrics to advanced methods.', 'duration': 33.877, 'max_score': 2629.511, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns2629511.jpg'}, {'end': 2746.014, 'src': 'embed', 'start': 2720.094, 'weight': 12, 'content': [{'end': 2725.899, 'text': 'For a third type of data leakage, we have leakage of our test data into our training data set.', 'start': 2720.094, 'duration': 5.805}, {'end': 2735.206, 'text': 'Since many of the LLMs that we use are either proprietary or difficult to nail down exactly what the training data set is,', 'start': 2726.379, 'duration': 8.827}, {'end': 2741.13, 'text': 'it can be nearly impossible to know if the data we want to use to test a model has been seen in training.', 'start': 2735.206, 'duration': 5.924}, {'end': 2746.014, 'text': 'And that would invalidate our tests for generalization and accuracy of the model.', 'start': 2741.61, 'duration': 4.404}], 'summary': 'Data leakage of test data into training set can invalidate model tests.', 'duration': 25.92, 'max_score': 2720.094, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns2720094.jpg'}], 'start': 1560.074, 'title': 'Analyzing bert scores and llm self-similarity analysis', 'summary': 'Explores bert score computation, evaluation, and response comparison for semantic matching, focusing on identifying hallucinations and model differences. it also discusses llm self-similarity analysis, challenges, and potential approaches, and covers detecting data leakage scenarios and using regular expressions to identify sensitive data.', 'chapters': [{'end': 1750.114, 'start': 1560.074, 'title': 'Analyzing bert scores for semantic matching', 'summary': 'Explores the computation of bert scores for semantic matching, emphasizing the differences from blue scores and highlighting the challenges and limitations of using bert scores to identify hallucinations.', 'duration': 190.04, 'highlights': ['The BERT score distribution forms a bell curve, indicating a different pattern from the blue score distribution, with the highest frequency values in the middle.', 'Low BERT scores can indicate potential hallucinations, particularly when the prompt and response significantly differ in length and topic, as illustrated by specific examples.', 'The implementation of the BERT score function differs significantly from the blue score function, taking in lists of predictions and references in a different manner.', 'The chapter introduces a new metric for BERT scores, involving the computation of precision, recall value, and F1 score, which is a weighted average of precision and recall.']}, {'end': 2376.532, 'start': 1750.834, 'title': 'Bert score evaluation and response comparison', 'summary': 'Covers evaluating bert score metrics, thresholding scores, comparing responses, and using self-similarity metrics to evaluate responses, with a focus on identifying hallucinations and model differences.', 'duration': 625.698, 'highlights': ['The chapter covers evaluating BERT score metrics and using a threshold of 0.75 to identify hallucinations based on low scores. Using a threshold of 0.75 to identify hallucinations based on low BERT scores.', 'The chapter discusses comparing multiple responses using metrics like blue score and BERT score, and introduces the concept of self-similarity across responses. Introducing the concept of self-similarity across responses and comparing multiple responses using metrics like blue score and BERT score.', 'The chapter explains the process of calculating sentence embeddings and using cosine similarity to measure self-similarity across multiple responses. The process of calculating sentence embeddings and using cosine similarity to measure self-similarity across multiple responses.', 'The chapter emphasizes the use of self-similarity metrics to evaluate responses and identifies model differences and hallucinations. Emphasizing the use of self-similarity metrics to evaluate responses, identify model differences, and detect hallucinations.', 'The chapter introduces the use of OpenAI for evaluating the consistency and entailment of responses, using prompts to compare responses. Introducing the use of OpenAI for evaluating the consistency and entailment of responses, using prompts to compare responses.']}, {'end': 2629.191, 'start': 2376.892, 'title': 'Llm self-similarity analysis', 'summary': 'Discusses the implementation of llm self-similarity analysis, including the challenges of obtaining calibrated responses and potential approaches for refining the analysis, while showcasing examples of self-similarity scores in specific use cases.', 'duration': 252.299, 'highlights': ['The implementation of LLM self-similarity analysis and the challenges of obtaining calibrated responses The chapter discusses the challenges of obtaining calibrated responses when using LLM for self-similarity analysis and explores the difficulties in interpreting numerical outputs between 0 and 1.', 'Potential approaches for refining the analysis and obtaining categorical information The chapter suggests potential approaches for refining the analysis, such as asking for categorical information instead of numerical values, and exploring the use of specific sentences in the response for comparison.', 'Examples of self-similarity scores in specific use cases The chapter showcases examples of self-similarity scores in specific use cases, including filtering responses with self-similarity scores less than 0.8 and examples of hallucination in response to a prompt for code translation.']}, {'end': 2963.116, 'start': 2629.511, 'title': 'Detecting data leakage and toxicity', 'summary': 'Covers the identification of data leakage scenarios in llms, such as pii and confidential information appearing in prompts or model responses, and demonstrates the use of regular expressions to identify sensitive data like email addresses, social security numbers, and credit card numbers, with a focus on practical examples and tools.', 'duration': 333.605, 'highlights': ['The chapter covers the identification of data leakage scenarios in LLMs, such as PII and confidential information appearing in prompts or model responses. It discusses three data leakage scenarios: user sharing PII or confidential information in prompts, model returning PII or confidential information in responses, and leakage of test data into the training dataset, highlighting the safety implications.', 'Demonstrates the use of regular expressions to identify sensitive data like email addresses, social security numbers, and credit card numbers. The use of regular expressions to detect patterns of sensitive data in the text is showcased, with specific examples of identifying email addresses, phone numbers, mailing addresses, social security numbers, and credit card numbers in prompts and responses.', 'Focuses on practical examples and tools for detecting data leakage. Emphasizes the practicality of using simple tools like regular expressions to detect data leakage, with a demonstration of identifying sensitive information in example data, and mentions the customization of patterns using JSON files in a later lesson.']}], 'duration': 1403.042, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns1560074.jpg', 'highlights': ['The BERT score distribution forms a bell curve, indicating a different pattern from the blue score distribution, with the highest frequency values in the middle.', 'Low BERT scores can indicate potential hallucinations, particularly when the prompt and response significantly differ in length and topic, as illustrated by specific examples.', 'The chapter introduces a new metric for BERT scores, involving the computation of precision, recall value, and F1 score, which is a weighted average of precision and recall.', 'The chapter covers evaluating BERT score metrics and using a threshold of 0.75 to identify hallucinations based on low scores.', 'The chapter discusses comparing multiple responses using metrics like blue score and BERT score, and introduces the concept of self-similarity across responses.', 'The chapter explains the process of calculating sentence embeddings and using cosine similarity to measure self-similarity across multiple responses.', 'The chapter emphasizes the use of self-similarity metrics to evaluate responses and identifies model differences and hallucinations.', 'The chapter introduces the use of OpenAI for evaluating the consistency and entailment of responses, using prompts to compare responses.', 'The implementation of LLM self-similarity analysis and the challenges of obtaining calibrated responses when using LLM for self-similarity analysis and explores the difficulties in interpreting numerical outputs between 0 and 1.', 'Potential approaches for refining the analysis and obtaining categorical information The chapter suggests potential approaches for refining the analysis, such as asking for categorical information instead of numerical values, and exploring the use of specific sentences in the response for comparison.', 'Examples of self-similarity scores in specific use cases The chapter showcases examples of self-similarity scores in specific use cases, including filtering responses with self-similarity scores less than 0.8 and examples of hallucination in response to a prompt for code translation.', 'The chapter covers the identification of data leakage scenarios in LLMs, such as PII and confidential information appearing in prompts or model responses.', 'Demonstrates the use of regular expressions to identify sensitive data like email addresses, social security numbers, and credit card numbers.', 'Focuses on practical examples and tools for detecting data leakage.']}, {'end': 3699.912, 'segs': [{'end': 2993.918, 'src': 'embed', 'start': 2967.731, 'weight': 0, 'content': [{'end': 2974.713, 'text': 'Okay, so now we see our prompt and response as we had before, but now our prompt has patterns and our response has patterns.', 'start': 2967.731, 'duration': 6.982}, {'end': 2981.795, 'text': "And you'll see, while there are many nones, there's also phone number and different types where we do find a pattern.", 'start': 2975.333, 'duration': 6.462}, {'end': 2987.596, 'text': 'So now we need to filter this data.', 'start': 2985.696, 'duration': 1.9}, {'end': 2991.477, 'text': "Let's go ahead and define some filter using just the nulls.", 'start': 2987.876, 'duration': 3.601}, {'end': 2993.918, 'text': 'So I will copy this over here.', 'start': 2992.137, 'duration': 1.781}], 'summary': 'Data has patterns, including phone numbers. need to filter using nulls.', 'duration': 26.187, 'max_score': 2967.731, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns2967731.jpg'}, {'end': 3064.883, 'src': 'embed', 'start': 3032.637, 'weight': 1, 'content': [{'end': 3035.435, 'text': "And we're going to set our scope to leakage.", 'start': 3032.637, 'duration': 2.798}, {'end': 3052.797, 'text': 'OK, so what do we see? We see that just this simple rule using the patterns that comes and link it will pass all of our easier data leakage examples.', 'start': 3043.072, 'duration': 9.725}, {'end': 3060.24, 'text': 'But I put in some very difficult examples for this problem so that we can learn to make more complex metrics.', 'start': 3053.377, 'duration': 6.863}, {'end': 3064.883, 'text': 'Another thing you might notice is that we have several false positives.', 'start': 3060.861, 'duration': 4.022}], 'summary': 'Scope set to leakage, simple rule passes easier leakage examples, with some false positives.', 'duration': 32.246, 'max_score': 3032.637, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns3032637.jpg'}, {'end': 3135.803, 'src': 'embed', 'start': 3107.344, 'weight': 5, 'content': [{'end': 3110.465, 'text': 'especially when working within the context of a company.', 'start': 3107.344, 'duration': 3.121}, {'end': 3114.187, 'text': "So here's an example on screen of the entity recognition task.", 'start': 3110.845, 'duration': 3.342}, {'end': 3129.298, 'text': 'we have a sentence or multiple sentences where we want to go and label individual tokens or words or spans of multiple words that represent particular nouns or particular entities.', 'start': 3114.787, 'duration': 14.511}, {'end': 3135.803, 'text': 'Seattle is a place, Bill Gates is a person, October 28th, 1955 is a date.', 'start': 3129.478, 'duration': 6.325}], 'summary': 'Demonstrating entity recognition task for labeling tokens, words, and spans representing specific entities.', 'duration': 28.459, 'max_score': 3107.344, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns3107344.jpg'}], 'start': 2967.731, 'title': 'Data filtering, evaluation, leakage, and toxicity metrics', 'summary': 'Covers the process of filtering data to identify and evaluate data leakage issues using specific patterns, null values, entity recognition, and pattern matching. it also explores creating complex toxicity metrics using existing models and datasets.', 'chapters': [{'end': 3025.938, 'start': 2967.731, 'title': 'Data filtering and evaluation', 'summary': 'Discusses the process of filtering data based on specific patterns in prompts and responses, aiming to identify and evaluate data leakage issues, involving the use of null values and patterns.', 'duration': 58.207, 'highlights': ['The process involves filtering for annotated chats where prompt has patterns is not null and annotated chats where response has patterns is not null, to identify data leakage issues.', 'Patterns and null values are used to filter the data in order to identify the presence of data leakage issues.', 'Evaluation is performed using an evaluation helper function to assess the identified examples with data leakage issues.']}, {'end': 3699.912, 'start': 3032.637, 'title': 'Data leakage and toxicity metrics', 'summary': 'Discusses creating complex metrics for data leakage by using entity recognition and pattern matching, and then delves into handling toxicity by using existing models and datasets to create metrics.', 'duration': 667.275, 'highlights': ['The chapter discusses creating complex metrics for data leakage by using entity recognition and pattern matching The chapter emphasizes the use of entity recognition and pattern matching to create complex metrics for data leakage.', 'The existing model for entity recognition identifies entities in the data and creates a metric from that An existing model is used to identify entities in the data, such as person, product, and organization, and creates a metric for data leakage.', 'Using the Toxigen dataset and models built on top of it to create metrics for handling toxicity The Toxigen dataset and models built on it are used to create metrics for handling toxicity, including both explicit and implicit toxicity.']}], 'duration': 732.181, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns2967731.jpg', 'highlights': ['The process involves filtering for annotated chats to identify data leakage issues.', 'Patterns and null values are used to filter the data to identify data leakage issues.', 'Evaluation is performed using an evaluation helper function to assess the identified examples with data leakage issues.', 'The chapter emphasizes the use of entity recognition and pattern matching to create complex metrics for data leakage.', 'An existing model is used to identify entities in the data and creates a metric for data leakage.', 'The Toxigen dataset and models built on it are used to create metrics for handling toxicity.']}, {'end': 4629.703, 'segs': [{'end': 3728.235, 'src': 'embed', 'start': 3700.132, 'weight': 3, 'content': [{'end': 3703.974, 'text': "So this is saying that they're both not toxic with pretty high scores.", 'start': 3700.132, 'duration': 3.842}, {'end': 3711.722, 'text': "So the second sentence here will sometimes trigger toxicity models that aren't about implicit toxicity,", 'start': 3704.957, 'duration': 6.765}, {'end': 3717.407, 'text': 'just by the inclusion of a keyword like women races this sort of thing.', 'start': 3711.722, 'duration': 5.685}, {'end': 3720.689, 'text': "Okay, let's go ahead and make a quick metric for that.", 'start': 3717.427, 'duration': 3.262}, {'end': 3726.274, 'text': 'Feel free to copy this metric and use it in your applications or change it as you wish.', 'start': 3721.87, 'duration': 4.404}, {'end': 3728.235, 'text': 'Just a quick explanation here.', 'start': 3726.914, 'duration': 1.321}], 'summary': "Non-toxic with high scores, trigger toxicity models by keywords like 'women' or 'races'.", 'duration': 28.103, 'max_score': 3700.132, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns3700132.jpg'}, {'end': 3905.848, 'src': 'embed', 'start': 3877.754, 'weight': 2, 'content': [{'end': 3883.178, 'text': "For our purposes as application developers, we'll use this term for third-party rejections.", 'start': 3877.754, 'duration': 5.424}, {'end': 3893.866, 'text': "So that means when the LLM refuses us, that does not apply to the overall application built around the LLM, which we'll talk about in the next lesson.", 'start': 3883.278, 'duration': 10.588}, {'end': 3896.108, 'text': "Let's look at an example refusal.", 'start': 3894.387, 'duration': 1.721}, {'end': 3905.848, 'text': 'Okay, so we see a prompt that says, tell me how to hotwire a car.', 'start': 3900.703, 'duration': 5.145}], 'summary': "As application developers, we'll use third-party rejections, such as llm refusing us, not affecting the overall application. example: prompt to hotwire a car.", 'duration': 28.094, 'max_score': 3877.754, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns3877754.jpg'}, {'end': 3971.998, 'src': 'embed', 'start': 3946.832, 'weight': 0, 'content': [{'end': 3961.015, 'text': "Knowing how often your LLM fails to respond to your message is really helpful in understanding your application's use and for redirecting the responses from the LLM to give a more custom experience,", 'start': 3946.832, 'duration': 14.183}, {'end': 3963.656, 'text': 'perhaps a more positive experience for your users.', 'start': 3961.015, 'duration': 2.641}, {'end': 3967.077, 'text': "So to create our metric, we're going to do the same thing that we've done before.", 'start': 3964.156, 'duration': 2.921}, {'end': 3971.998, 'text': "We're going to import the register dataset UDF, so from ylogs.", 'start': 3967.377, 'duration': 4.621}], 'summary': 'Understanding llm response rate is crucial for improving user experience.', 'duration': 25.166, 'max_score': 3946.832, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns3946832.jpg'}, {'end': 4116.732, 'src': 'embed', 'start': 4076.786, 'weight': 1, 'content': [{'end': 4081.387, 'text': 'And maybe ahead of time, before we even do this, we might have some thoughts about how well this might work.', 'start': 4076.786, 'duration': 4.601}, {'end': 4090.75, 'text': "Will this capture many false positives? So cases where the response says, sorry, or I can't, but it's actually not a refusal.", 'start': 4082.468, 'duration': 8.282}, {'end': 4094.231, 'text': "Perhaps maybe I've asked for a script or dialogue.", 'start': 4091.51, 'duration': 2.721}, {'end': 4097.051, 'text': "Or cases where there's false negatives.", 'start': 4095.231, 'duration': 1.82}, {'end': 4101.933, 'text': "So where there are refusals, but they don't use the word sorry or I can't.", 'start': 4097.551, 'duration': 4.382}, {'end': 4116.732, 'text': "So now to look at our annotated data, so the data using these metrics to see all the values on our individual data points, we'll import UDF schema.", 'start': 4105.7, 'duration': 11.032}], 'summary': 'Exploring potential issues in capturing refusals and false positives and negatives in annotated data.', 'duration': 39.946, 'max_score': 4076.786, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns4076786.jpg'}, {'end': 4185.292, 'src': 'embed', 'start': 4146.734, 'weight': 5, 'content': [{'end': 4148.015, 'text': "That's why we use an underscore.", 'start': 4146.734, 'duration': 1.281}, {'end': 4151.377, 'text': "And then we'll say UDF schema.", 'start': 4149.176, 'duration': 2.201}, {'end': 4162.461, 'text': 'Okay, so now we have our results.', 'start': 4160.16, 'duration': 2.301}, {'end': 4163.982, 'text': "Let's look at annotated chats.", 'start': 4162.541, 'duration': 1.441}, {'end': 4165.182, 'text': "I'm going to scroll here.", 'start': 4164.142, 'duration': 1.04}, {'end': 4176.728, 'text': 'Okay, so we see our prompt, our response, our response refusal match, which we just created.', 'start': 4165.202, 'duration': 11.526}, {'end': 4185.292, 'text': "So we have trues when we do see I'm sorry, and falses where we didn't see I'm sorry or I can't.", 'start': 4176.747, 'duration': 8.545}], 'summary': 'Using udf schema to analyze annotated chats for prompt and response refusal match.', 'duration': 38.558, 'max_score': 4146.734, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns4146734.jpg'}], 'start': 3700.132, 'title': 'Toxicity metrics and refusal analysis', 'summary': 'Covers the creation of a metric for prompt.implicittoxicity, addressing challenges in identifying subtle toxicity and potential false positives. it also delves into techniques for detecting refusals in llm responses and the use of refusal analysis to filter annotated chats and create a secondary sentiment metric, along with the concept of prompt injections and their potential impacts on the llm system.', 'chapters': [{'end': 3770.039, 'start': 3700.132, 'title': 'Implicit toxicity metric creation', 'summary': 'Discusses the creation of a metric for prompt.implicittoxicity, highlighting the challenge of identifying subtle toxicity and the potential for false positives.', 'duration': 69.907, 'highlights': ['A metric for prompt.implicitToxicity is created by taking the last value of the label, casting it to an integer, and using it as a result.', 'The challenge of using subtle metrics is highlighted due to the concern of possibly having many false positives.', "The inclusion of keywords like 'women races' can sometimes trigger toxicity models that aren't about implicit toxicity."]}, {'end': 4248.633, 'start': 3770.58, 'title': 'Data leakage and toxicity in llm', 'summary': 'Discusses techniques for detecting refusals in llm responses using string matching and the importance of detecting refusals in understanding application use and providing a more custom experience for users.', 'duration': 478.053, 'highlights': ['Techniques for detecting refusals using string matching The chapter explores the use of string matching as a metric for detecting refusals in LLM responses, emphasizing the structured nature of responses and the potential for detecting refusals using just string matching.', 'Importance of detecting refusals in understanding application use Understanding how often the LLM fails to respond to messages is highlighted as a crucial factor in understanding application use and redirecting responses to provide a more custom and positive experience for users.', 'Refusal detection metric using UDF schema The chapter discusses the implementation of a refusal detection metric using UDF schema, showcasing the process of applying the metric to annotated data and addressing the potential for false positives and false negatives.']}, {'end': 4629.703, 'start': 4253.146, 'title': 'Refusal analysis and prompt injections', 'summary': 'Discusses using refusal analysis to filter annotated chats and creating a secondary sentiment metric to identify refusals, as well as the concept of prompt injections and their potential impacts on the llm system.', 'duration': 376.557, 'highlights': ['Creating a secondary sentiment metric for identifying refusals based on the sentiment range between 0 and -0.4. The sentiment for refusals is often in the very slight negative sentiment reading, between 0 and -0.4, leading to the creation of a new secondary metric for identifying refusals.', 'Explanation of prompt injections as a malicious attempt to manipulate the LLM system and the potential consequences of altering training data. Prompt injections are described as malicious attempts to manipulate the LLM system, including scenarios where false or harmful data is integrated into the model weights through scraping public websites.', 'Utilizing refusal analysis to filter annotated chats using a combination of metrics and tracking. The chapter emphasizes the importance of using multiple metrics for refusal analysis and tracking, highlighting the use of sentiment analysis in combination with other techniques to filter annotated chats.']}], 'duration': 929.571, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns3700132.jpg', 'highlights': ['Creation of a metric for prompt.implicitToxicity by casting the last label value to an integer.', 'Challenges in using subtle metrics due to concerns about false positives.', "Inclusion of keywords like 'women races' can trigger toxicity models not related to implicit toxicity.", 'Exploration of techniques for detecting refusals using string matching as a metric.', 'Importance of understanding application use by detecting refusals in LLM responses.', 'Implementation of a refusal detection metric using UDF schema to address false positives and negatives.', 'Creation of a secondary sentiment metric for identifying refusals based on sentiment range.', 'Explanation of prompt injections as malicious attempts to manipulate the LLM system.', 'Utilization of refusal analysis to filter annotated chats using a combination of metrics and tracking.']}, {'end': 5837.35, 'segs': [{'end': 4655.181, 'src': 'embed', 'start': 4630.723, 'weight': 4, 'content': [{'end': 4640.309, 'text': "So the user experiences this by calling into the LLM as normal, either through an application that we're creating around it or the LLM directly.", 'start': 4630.723, 'duration': 9.586}, {'end': 4648.755, 'text': 'And because it has been affected by this poor data, may get responses that are incorrect or actively harmful.', 'start': 4641.17, 'duration': 7.585}, {'end': 4655.181, 'text': "We're going to focus on a specific type of prompt injection, which is actually much simpler and related to refusals.", 'start': 4649.275, 'duration': 5.906}], 'summary': 'Llm users may receive incorrect or harmful responses due to poor data, focusing on prompt injection.', 'duration': 24.458, 'max_score': 4630.723, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns4630723.jpg'}, {'end': 4702.29, 'src': 'embed', 'start': 4679.147, 'weight': 1, 'content': [{'end': 4686.594, 'text': "And our LLM notices that this is not something it wants to answer and responds, I'm sorry, I can't assist or provide information on this.", 'start': 4679.147, 'duration': 7.447}, {'end': 4691.719, 'text': 'But there are many clever ways for people to get around this response.', 'start': 4687.254, 'duration': 4.465}, {'end': 4697.308, 'text': "So for example, a popular one is saying, hey, here's a hypothetical situation.", 'start': 4692.467, 'duration': 4.841}, {'end': 4702.29, 'text': "Let's say you're describing a character who's planning to hotwire a car.", 'start': 4697.868, 'duration': 4.422}], 'summary': 'Llm refuses to answer, but people find clever ways to ask hypothetical questions.', 'duration': 23.143, 'max_score': 4679.147, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns4679147.jpg'}, {'end': 4807.556, 'src': 'embed', 'start': 4770, 'weight': 3, 'content': [{'end': 4775.001, 'text': 'but often applies more broadly is just the length and complexity of the prompt.', 'start': 4770, 'duration': 5.001}, {'end': 4780.923, 'text': "So let's start off with a very, very, very simple metric, just comparing the length of the prompt.", 'start': 4775.201, 'duration': 5.722}, {'end': 4788.065, 'text': "So we'll use our same register dataset UDF.", 'start': 4785.725, 'duration': 2.34}, {'end': 4793.127, 'text': "We'll make sure that we're capturing the prompt.", 'start': 4791.046, 'duration': 2.081}, {'end': 4798.249, 'text': "And then we'll call it prompt.textLength.", 'start': 4795.287, 'duration': 2.962}, {'end': 4807.556, 'text': "And then we'll return our text prompt.", 'start': 4803.493, 'duration': 4.063}], 'summary': 'Analyzing prompt length and complexity to measure dataset usability.', 'duration': 37.556, 'max_score': 4770, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns4770000.jpg'}, {'end': 5406.27, 'src': 'embed', 'start': 5380.259, 'weight': 0, 'content': [{'end': 5391.191, 'text': "where we learn how to use LaneKit and our custom metrics that we've created across all of the previous lessons on more realistic data sets for both active and passive monitoring settings.", 'start': 5380.259, 'duration': 10.932}, {'end': 5392.573, 'text': "Let's take a look.", 'start': 5391.992, 'duration': 0.581}, {'end': 5395.963, 'text': 'To ensure safety and quality.', 'start': 5393.922, 'duration': 2.041}, {'end': 5406.27, 'text': 'you can use the metrics in this course on collected data from your LLM application what we call passive monitoring, or apply them in real time,', 'start': 5395.963, 'duration': 10.307}], 'summary': 'Learn to use lanekit and custom metrics on realistic data for active and passive monitoring.', 'duration': 26.011, 'max_score': 5380.259, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns5380259.jpg'}, {'end': 5734.515, 'src': 'embed', 'start': 5712.359, 'weight': 2, 'content': [{'end': 5723.706, 'text': 'This way of looking at the data after the process has happened for our application and then analyzing to find potential issues or understand the usage is called passive monitoring.', 'start': 5712.359, 'duration': 11.347}, {'end': 5734.515, 'text': "So we might do things like look and see that there's an increase in refusals and toxicity on this particular date, as well as other things,", 'start': 5724.427, 'duration': 10.088}], 'summary': 'Passive monitoring involves analyzing post-process data for identifying issues, such as an increase in refusals and toxicity.', 'duration': 22.156, 'max_score': 5712.359, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns5712359.jpg'}], 'start': 4630.723, 'title': 'Data impact on user experience', 'summary': 'Delves into the impact of poor data on user experience, citing examples of prompt injection, jailbreak attempts, and refusal prompts. it discusses detecting prompt injection attempts using heuristic approach and lanekit metrics, emphasizing the importance of active and passive monitoring settings in addressing these issues.', 'chapters': [{'end': 4678.746, 'start': 4630.723, 'title': 'Llm prompt injection and jailbreaks', 'summary': 'Highlights the impact of poor data on user experience, particularly in the context of prompt injection and jailbreaks, with a specific example of a refusal prompt related to hotwiring a car.', 'duration': 48.023, 'highlights': ['The user experiences poor data impact when using the LLM, leading to incorrect or harmful responses.', 'Specific focus on prompt injection, particularly related to refusals and jailbreaks.', 'An example of a refusal prompt related to hotwiring a car is provided in the code.']}, {'end': 5178.821, 'start': 4679.147, 'title': 'Detecting prompt injection attempts', 'summary': 'Discusses the use of prompt injections to bypass ai language model restrictions, and proposes a heuristic based on prompt length and complexity, as well as utilizing lenkit to define and compare phrases for detecting prompt injection attempts.', 'duration': 499.674, 'highlights': ['Proposing a heuristic based on prompt length and complexity for detecting jailbreak attempts, with a simple metric comparing the length of the prompt and using a threshold of 200 or 300 characters as a bar for determining potential jailbreak attempts. Threshold of 200 or 300 characters', 'Utilizing LENKIT to define and compare phrases, importing LENKIT themes and JSON to specify the values for comparison, aiming to identify prompt injections and visualize the results through similarity values between 0 and 1. Visualizing similarity values between 0 and 1', 'Suggesting jailbreakchat.com as a source to collect prompt injection ideas, indicating how the community-collected set of jailbreak attempts can be used to add more examples for detecting and preventing prompt injections. Recommendation of jailbreakchat.com as a source for prompt injection ideas']}, {'end': 5565.582, 'start': 5179.061, 'title': 'Lanekit metrics and prompt injections', 'summary': 'Discusses the use of lanekit metrics for prompt injections, jailbreak attempts, and refusals, emphasizing the importance of using these metrics for both active and passive monitoring settings, and provides guidance on setting up and initializing metrics, importing datasets, and using udf schema for logging in realistic settings.', 'duration': 386.521, 'highlights': ['The chapter discusses the use of LaneKit metrics for prompt injections, jailbreak attempts, and refusals This includes the importance of using these metrics for both active and passive monitoring settings.', 'Provides guidance on setting up and initializing metrics, importing datasets, and using UDF schema for logging in realistic settings The guidance includes installing default metrics from the LaneKit library, using the init function to initialize metrics, importing datasets, copying metrics from earlier lessons, and using UDF schema for logging in production settings.', 'Emphasizes the importance of using these metrics for both active and passive monitoring settings ']}, {'end': 5837.35, 'start': 5566.162, 'title': 'Rolling logger and passive vs active monitoring', 'summary': 'Discusses the concept of a rolling logger compressing data over time and introduces passive monitoring, which involves analyzing data after interactions, and active monitoring, which occurs during the process of the application.', 'duration': 271.188, 'highlights': ['The rolling logger compresses all data seen in every hour into a single profile, set with an interval of one for every hour. The rolling logger compresses data seen in every hour into a single profile with an interval of one for every hour.', 'Passive monitoring involves analyzing combined data after interactions, while active monitoring occurs during the process of the application. Passive monitoring involves analyzing combined data after interactions, while active monitoring occurs during the process of the application.', 'In passive monitoring, the data is profiled over time and potential issues are identified by analyzing the usage patterns. In passive monitoring, data is profiled over time, and potential issues are identified by analyzing the usage patterns.', 'Active monitoring involves filtering and auditing messages in real-time, logging information during the process, and making decisions about user interaction. Active monitoring involves filtering and auditing messages in real-time, logging information during the process, and making decisions about user interaction.']}], 'duration': 1206.627, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns4630723.jpg', 'highlights': ['The chapter emphasizes the importance of using LaneKit metrics for both active and passive monitoring settings.', 'The rolling logger compresses all data seen in every hour into a single profile, set with an interval of one for every hour.', 'Utilizing LENKIT to define and compare phrases, importing LENKIT themes and JSON to specify the values for comparison, aiming to identify prompt injections and visualize the results through similarity values between 0 and 1.', 'Passive monitoring involves analyzing combined data after interactions, while active monitoring occurs during the process of the application.', 'The user experiences poor data impact when using the LLM, leading to incorrect or harmful responses.']}, {'end': 7172.839, 'segs': [{'end': 6297.86, 'src': 'embed', 'start': 6270.987, 'weight': 6, 'content': [{'end': 6280.512, 'text': 'Okay. So then what this is going to do is we will continue to loop through until we get either a keyboard interrupt or this LLM application validation error.', 'start': 6270.987, 'duration': 9.525}, {'end': 6283.893, 'text': 'You might be tempted to capture all exceptions.', 'start': 6281.192, 'duration': 2.701}, {'end': 6285.394, 'text': 'Oh, I apologize.', 'start': 6284.594, 'duration': 0.8}, {'end': 6286.695, 'text': "One thing we're missing here.", 'start': 6285.854, 'duration': 0.841}, {'end': 6297.86, 'text': 'So in that case, we want to use our user reply failure and pass in the request that we have if we have one.', 'start': 6287.495, 'duration': 10.365}], 'summary': 'Loop through until keyboard interrupt or llm application validation error, use user reply failure and pass in the request.', 'duration': 26.873, 'max_score': 6270.987, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns6270987.jpg'}, {'end': 6356.973, 'src': 'embed', 'start': 6326.724, 'weight': 3, 'content': [{'end': 6329.186, 'text': "Uh, let's ask for a recipe, something like spaghetti.", 'start': 6326.724, 'duration': 2.462}, {'end': 6335.7, 'text': 'Okay, so it looks like we had some success.', 'start': 6332.918, 'duration': 2.782}, {'end': 6340.384, 'text': "Here's a recipe for spaghetti and we pass in six instructions.", 'start': 6336.261, 'duration': 4.123}, {'end': 6342.005, 'text': "Great Let's go ahead and quit.", 'start': 6340.584, 'duration': 1.421}, {'end': 6345.287, 'text': 'So this is really exciting and helpful.', 'start': 6343.206, 'duration': 2.081}, {'end': 6349.851, 'text': 'But the question is is when might we have other issues??', 'start': 6346.168, 'duration': 3.683}, {'end': 6355.315, 'text': "When might we want to break our process as a result of some of the metrics that we've created?", 'start': 6350.011, 'duration': 5.304}, {'end': 6356.973, 'text': 'Okay,', 'start': 6356.713, 'duration': 0.26}], 'summary': 'Successfully obtained a spaghetti recipe with six instructions, considering potential issues and metrics for process improvement.', 'duration': 30.249, 'max_score': 6326.724, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns6326724.jpg'}, {'end': 6428.612, 'src': 'embed', 'start': 6404.767, 'weight': 4, 'content': [{'end': 6414.15, 'text': 'So, in a realistic setting, something we may wanna do is well, change our functionality of our prompt system, as we wanna do here,', 'start': 6404.767, 'duration': 9.383}, {'end': 6421.051, 'text': "but we may also want to send an alert to the data scientists to note that we've had this really bad issue.", 'start': 6414.15, 'duration': 6.901}, {'end': 6428.612, 'text': "Or we may wanna email the user and say, hey, sorry, you've used this application incorrectly or in a way we didn't expect.", 'start': 6421.651, 'duration': 6.961}], 'summary': 'Consider modifying prompt system and sending alerts or emails in response to issues.', 'duration': 23.845, 'max_score': 6404.767, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns6404767.jpg'}, {'end': 6685.028, 'src': 'embed', 'start': 6656.601, 'weight': 1, 'content': [{'end': 6658.223, 'text': "And let's go ahead and call this refusal.", 'start': 6656.601, 'duration': 1.622}, {'end': 6662.254, 'text': "validator, we'll rename this.", 'start': 6659.793, 'duration': 2.461}, {'end': 6668.958, 'text': "And in our case, we're actually okay with the conditions being exactly the same.", 'start': 6665.116, 'duration': 3.842}, {'end': 6677.203, 'text': "So we're going to use two metrics one metric which gives a toxicity score and a score greater than 0.3,.", 'start': 6669.499, 'duration': 7.704}, {'end': 6685.028, 'text': "we might consider to be toxic, maybe 0.5 or 0.6, but in our application we'll be squeaky, clean and we'll look for a 0.3..", 'start': 6677.203, 'duration': 7.825}], 'summary': 'Using two metrics, a toxicity score > 0.3 will be considered toxic in our application.', 'duration': 28.427, 'max_score': 6656.601, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns6656601.jpg'}, {'end': 6748.652, 'src': 'embed', 'start': 6713.086, 'weight': 2, 'content': [{'end': 6721.213, 'text': "Okay, so now that we've defined our two validators, we need to go ahead and pass a dictionary of the two in.", 'start': 6713.086, 'duration': 8.127}, {'end': 6726.117, 'text': 'We want to determine which metrics that these validators apply to.', 'start': 6721.654, 'duration': 4.463}, {'end': 6729.1, 'text': "So I'm gonna go ahead and call this LLM validators.", 'start': 6726.558, 'duration': 2.542}, {'end': 6739.149, 'text': "And the first we're gonna apply to prompt.toxicity, spelled correctly.", 'start': 6732.203, 'duration': 6.946}, {'end': 6748.652, 'text': "And the only validator that we'll have for prompt toxicity is the one here, toxicityValidator.", 'start': 6742.867, 'duration': 5.785}], 'summary': 'Defining two validators for prompt toxicity in llm', 'duration': 35.566, 'max_score': 6713.086, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns6713086.jpg'}, {'end': 7019.241, 'src': 'embed', 'start': 6993.105, 'weight': 0, 'content': [{'end': 7000.63, 'text': 'So, without any additional if statements or things like this, we can just, using Ylogs,', 'start': 6993.105, 'duration': 7.525}, {'end': 7006.955, 'text': 'capture any of these issues that come up with the metrics that we log with Ylogs and take actions.', 'start': 7000.63, 'duration': 6.325}, {'end': 7009.475, 'text': 'Okay, so finally,', 'start': 7008.314, 'duration': 1.161}, {'end': 7019.241, 'text': "what I'll do is I'll copy the same code that we had earlier into a new cell so that we can run and play with our new application with the validation.", 'start': 7009.475, 'duration': 9.766}], 'summary': 'Ylogs captures issues with logged metrics, enabling action. code will be copied for validation.', 'duration': 26.136, 'max_score': 6993.105, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns6993105.jpg'}], 'start': 5837.931, 'title': 'Using openai and llm for application development', 'summary': 'Covers setting up openai, creating a basic logger, building a recipe application with language model, implementing validators, conditions, and exception handling, emphasizing on capturing and handling issues like toxicity and refusals.', 'chapters': [{'end': 5908.013, 'start': 5837.931, 'title': 'Setting up openai and creating a simple logger', 'summary': 'Demonstrates the process of setting up openai in a notebook and creating a simple logger for an application, with a focus on importing openai, obtaining an openai key, and setting up a basic logger.', 'duration': 70.082, 'highlights': ['Importing OpenAI and obtaining an OpenAI key', 'Creating a simple logger for the application', 'Using helper functions to obtain OpenAI key', 'Setting up a semi-realistic example inside the notebook']}, {'end': 6326.384, 'start': 5908.553, 'title': 'Recipe application with llm', 'summary': 'Details the creation of a recipe application using a language model (llm) to process user requests, prompt the llm for a response, handle success and failure scenarios, and implement logic using exceptions.', 'duration': 417.831, 'highlights': ['The chapter details the creation of a recipe application using a Language Model (LLM) to process user requests This is the primary goal of the application.', 'The application prompts the LLM for a response with a transformed version of the user request Describes the process of prompting the LLM for a response using user requests.', 'The chapter explains the handling of success and failure scenarios in the application, with custom messages for each Details the handling of success and failure scenarios, including custom messages for each scenario.', 'The logic for the application is implemented using exceptions, and a custom exception class is created Discusses the implementation of application logic using exceptions and the creation of a custom exception class.']}, {'end': 6476.096, 'start': 6326.724, 'title': 'Creating validators and conditions', 'summary': 'Discusses replicating thresholds, creating validators and conditions using y logs, and defining actions based on conditions, with a focus on raising exceptions. it highlights the potential actions to be taken based on the conditions and the process of creating a new function for raising exceptions.', 'duration': 149.372, 'highlights': ['The chapter discusses the process of creating a new function for raising exceptions, with a focus on using Y logs and replicating thresholds.', 'It highlights the potential actions to be taken based on the conditions, such as changing functionality, sending alerts, emailing users, logging additional information, or sending data out for human judgment.', 'The chapter introduces the concept of validators for specific conditions and actions to be taken if the conditions are not met, such as raising exceptions or sending alerts.']}, {'end': 7172.839, 'start': 6476.236, 'title': 'Creating validators for llm applications', 'summary': 'Discusses the process of creating validators for llm applications, including defining conditions, actions, and applying them to metrics, resulting in the ability to capture and handle issues such as toxicity and refusals, as demonstrated through examples.', 'duration': 696.603, 'highlights': ['The process of creating validators for LLM applications involves defining conditions, actions, and applying them to metrics to capture and handle issues such as toxicity and refusals. The chapter describes the steps involved in creating validators for LLM applications, which includes defining conditions and actions, and applying them to specific metrics to capture and handle issues such as toxicity and refusals.', 'The example demonstrates the use of validators to capture and handle issues such as toxicity and refusals, with specific quantifiable data like refusal value being greater than 0.3. The example showcases the practical application of validators to capture and handle issues such as toxicity and refusals, with specific quantifiable data, such as the refusal value being greater than 0.3, resulting in a failure of the refusal validator.', 'The chapter emphasizes the ability to capture and handle issues without additional if statements, using Ylogs to capture any issues with the logged metrics and take actions. The chapter highlights the convenience of capturing and handling issues without additional if statements, utilizing Ylogs to capture any issues with the logged metrics and take necessary actions.']}], 'duration': 1334.908, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/Fmn_HaoQ9Ns/pics/Fmn_HaoQ9Ns5837931.jpg', 'highlights': ['The chapter emphasizes capturing and handling issues like toxicity and refusals without additional if statements, using Ylogs to capture any issues with the logged metrics and take actions.', 'The example demonstrates the use of validators to capture and handle issues such as toxicity and refusals, with specific quantifiable data like refusal value being greater than 0.3.', 'The process of creating validators for LLM applications involves defining conditions, actions, and applying them to metrics to capture and handle issues such as toxicity and refusals.', 'The logic for the application is implemented using exceptions, and a custom exception class is created.', 'The chapter introduces the concept of validators for specific conditions and actions to be taken if the conditions are not met, such as raising exceptions or sending alerts.', 'The chapter discusses the process of creating a new function for raising exceptions, with a focus on using Y logs and replicating thresholds.', 'The application prompts the LLM for a response with a transformed version of the user request, detailing the process of prompting the LLM for a response using user requests.', 'The chapter details the creation of a recipe application using a Language Model (LLM) to process user requests, which is the primary goal of the application.', 'Importing OpenAI and obtaining an OpenAI key, creating a simple logger for the application, using helper functions to obtain OpenAI key, and setting up a semi-realistic example inside the notebook.']}], 'highlights': ['Common risks associated with LLM applications include prompt injections, hallucinations, data leakage, and toxicity, and the course provides tools to mitigate these risks.', 'The process of understanding the safety of deploying an LLM app can be slow after the quick proof of concept, often taking days or weeks.', 'Bernice Renan, Senior Data Scientist at YLABS, with six years of experience, shares insights on the evaluation and metrics for AI systems, emphasizing the importance of understanding the risks associated with LLM applications.', 'Identify when responses are more likely to be hallucinations, using the self-checked GPT framework.', 'Detecting LLM Issues Introducing techniques to detect data leakage, jailbreaks, and hallucinations, using open-source Python packages, NAND codes, YLOGs, and HuggingFix tools to mitigate these issues.', 'The toxicity metric for prompts and responses reveals a long-tailed distribution with the majority of toxicity values at zero and a few instances with higher values.', 'The chapter emphasizes the importance of having standardized packages for evaluation metrics to establish a consensus on implementation details and improve the reliability of model evaluations.', 'The BERT score distribution forms a bell curve, indicating a different pattern from the blue score distribution, with the highest frequency values in the middle.', 'The process involves filtering for annotated chats to identify data leakage issues.', 'Creation of a metric for prompt.implicitToxicity by casting the last label value to an integer.', 'The chapter emphasizes the importance of using LaneKit metrics for both active and passive monitoring settings.', 'The chapter emphasizes capturing and handling issues like toxicity and refusals without additional if statements, using Ylogs to capture any issues with the logged metrics and take actions.']}