title
10. Building and Evaluating Advanced RAG | Andrew Ng | DeepLearning.AI - Full Course

description
The course comes from [https://learn.deeplearning.ai/building-evaluating-advanced-rag/lesson/1/introduction](https://learn.deeplearning.ai/building-evaluating-advanced-rag/lesson/1/introduction), created by Andrew Ng. This course introduces methods for building and evaluating advanced Retrieval-Augmented Generation (RAG) systems. RAG has become a key way to have a large language model (LLM) answer questions over a user's own data. To build and productionize a high-quality RAG system, it is essential to use effective retrieval techniques, to have the LLM generate answers from the highly relevant retrieved context, and to establish an evaluation framework that supports iterative improvement during initial development as well as during deployment and maintenance. The course covers two advanced retrieval methods, sentence window retrieval and auto-merging retrieval, which provide better context than naive approaches, and introduces three evaluation metrics, context relevance, groundedness, and answer relevance, for evaluating LLM question-answering systems.

detail

Covers the development and evaluation of advanced RAG systems, including sentence window retrieval, auto-merging retrieval, TruLens evaluation with feedback functions, app and RAG performance analysis, node retrieval, and application evaluation and improvement, with specific metrics and scores discussed.

1. RAG system development and RAG-based LLM apps overview (0:02-10:21)
- Building and productionizing a high-quality RAG system is costly, which makes two things essential: effective retrieval techniques, and an evaluation framework that helps you efficiently iterate and improve the system, both during initial development and during post-deployment maintenance.
- The course covers two advanced retrieval methods, sentence window retrieval and auto-merging retrieval, which deliver significantly better context to the LLM than simpler methods, and three metrics for evaluating an LLM question-answering system: context relevance, groundedness, and answer relevance.
- Instructors: Jerry Liu, co-founder and CEO of LlamaIndex, and Anupam Datta, co-founder and chief scientist of TruEra, both with extensive experience in RAG practice and trustworthy-AI research.
- Sentence window retrieval gives the LLM better context by retrieving not just the most relevant sentence but the window of sentences that occur before and after it in the document.
- Auto-merging retrieval organizes the document into a tree-like structure in which each parent node's text is divided among its child nodes. When enough child nodes are identified as relevant to a user's question, the entire text of the parent node is provided as context to the LLM. This sounds like a lot of steps, but the lessons go over it in detail in code.
- To evaluate RAG-based LLM apps, the RAG triad, a triad of metrics for the three main steps of a RAG's execution, is quite effective. For example, context relevance measures how relevant the retrieved chunks of text are to the user's question, and the triad helps identify and debug issues with the retrieval context the LLM sees in the QA system.
- The ingestion phase of the RAG pipeline loads a set of documents, splits them into text chunks, generates embeddings for each chunk, and loads them into an index. This is the crucial first step in setting up both the basic and the advanced RAG pipelines with LlamaIndex, and in using TruLens to set up an evaluation benchmark; a sketch of the baseline follows.
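As a concrete starting point, here is a minimal sketch of the ingestion and query steps described above, written against the legacy LlamaIndex API from the course era (imports moved in later releases); the PDF path is a placeholder, not a file named in the course.

```python
# Baseline RAG pipeline sketch (legacy LlamaIndex ~0.9 API, as used in the course;
# newer releases moved these imports). The input file path is a placeholder.
from llama_index import Document, ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI

# Ingestion: load the pages, merge them into one Document, then chunk, embed,
# and load the chunks into a vector index.
documents = SimpleDirectoryReader(input_files=["./ebook.pdf"]).load_data()
document = Document(text="\n\n".join(doc.text for doc in documents))

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    embed_model="local:BAAI/bge-small-en-v1.5",  # compact local embedding model
)
index = VectorStoreIndex.from_documents([document], service_context=service_context)

# Query: retrieve the most similar chunks and synthesize an answer with the LLM.
query_engine = index.as_query_engine()
print(query_engine.query("How do I get started on a personal project in AI?"))
```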
2. Evaluating AI applications with TruLens (10:22-26:12)
- Initialize the TruLens modules, reset the database, and then initialize the evaluation modules to begin the evaluation process.
- LLM evals are growing into a standard mechanism for evaluating generative AI applications at scale. Rather than relying on expensive human evaluation or fixed benchmarks, they let you evaluate applications in a way that is custom to the domain in which you operate and dynamic to the changing demands on your application.
- The course ships a prebuilt TruLens recorder that bundles the standard triad of evaluations for RAGs: groundedness, context relevance, and answer relevance. A usage sketch follows.
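A minimal sketch of initializing TruLens and recording a query. The `get_prebuilt_trulens_recorder` helper is from the course's utils module (an assumption about its exact signature), and the trulens_eval API shown is the version used in the course; later releases renamed parts of it.

```python
from trulens_eval import Tru

tru = Tru()
tru.reset_database()  # start from a clean local evaluation database

# Course helper (from its utils module) that bundles the RAG triad:
# groundedness, context relevance, and answer relevance.
from utils import get_prebuilt_trulens_recorder

tru_recorder = get_prebuilt_trulens_recorder(query_engine, app_id="Direct Query Engine")

# Every query made inside the recording context is logged and evaluated.
with tru_recorder as recording:
    response = query_engine.query("What are the keys to building a career in AI?")
```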
- The TruLens dashboard UI shows the evaluation metrics (context relevance, answer relevance, and groundedness) for each record, as well as average latency, total cost, and more.
- For the baseline pipeline, answer relevance and groundedness are decently high, but context relevance is pretty low. The next question is whether more advanced retrieval techniques, sentence window retrieval and auto-merging retrieval, can improve these metrics.
- The first advanced technique is sentence window retrieval. It works by embedding and retrieving single sentences, i.e., more granular chunks, which aims to improve both retrieval and synthesis performance by matching more precisely while still providing enough context for better query answering.
- Setup: use OpenAI GPT-3.5 Turbo, construct the sentence window index over the given document with a helper function (the next few lessons do a deep dive into how this works under the hood), and, similar to before, get a query engine from the sentence window index; see the sketch after this list.
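A sketch of the sentence window setup using the course's helper functions (`build_sentence_window_index` and `get_sentence_window_query_engine` are names from the course's utils module; a later lesson unpacks their internals):

```python
from llama_index.llms import OpenAI

# Helpers from the course's utils module; a later lesson unpacks what they do.
from utils import build_sentence_window_index, get_sentence_window_query_engine

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

sentence_index = build_sentence_window_index(
    document,                                   # the merged Document from ingestion
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="sentence_index",
)
sentence_window_engine = get_sentence_window_query_engine(sentence_index)
```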
- The auto-merging retriever works by merging retrieved nodes into larger parent nodes: during retrieval, if a parent node has a majority of its child nodes retrieved, the child nodes are replaced with the parent node, so the retrieved nodes merge hierarchically. The combination of all the child nodes contains the same text as the parent node.
- Example query: "How do I build a portfolio of AI projects?" The logs show the merging process in action: nodes are merged into a parent node so the parent, rather than the children, is retrieved. The answer (start with simple undertakings and gradually progress to more complex ones) confirms the pipeline works; see the sketch after this list.
- Benchmarked with TruLens on the evaluation questions, the auto-merging engine scores 100% on groundedness, 94% on answer relevance, and 43% on context relevance, higher than both the sentence window and the baseline RAG pipelines, at roughly equivalent total cost to the sentence window query engine, implying that the retrieval here is more efficient at equivalent latency.
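For reference, this is roughly what the auto-merging construction looks like when written out against the legacy LlamaIndex API rather than the course helper. The chunk sizes for the hierarchy are an assumption (this section describes the parent/child structure but does not state the sizes):

```python
from llama_index import ServiceContext, StorageContext, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import AutoMergingRetriever

# Parse into a hierarchy; the chunk sizes (parent -> child -> leaf) are assumed.
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = node_parser.get_nodes_from_documents([document])
leaf_nodes = get_leaf_nodes(nodes)

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    embed_model="local:BAAI/bge-small-en-v1.5",
)

# Store all nodes so parents can be looked up, but embed only the leaves.
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
automerging_index = VectorStoreIndex(
    leaf_nodes, storage_context=storage_context, service_context=service_context
)

# Wrap the base retriever: when a majority of a parent's children are retrieved,
# the children are swapped out for the parent node.
base_retriever = automerging_index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)
auto_merging_engine = RetrieverQueryEngine.from_args(retriever)

print(auto_merging_engine.query("How do I build a portfolio of AI projects?"))
```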
- Comparative results across the pipelines: the sentence window retriever outperforms the baseline RAG pipeline by 8 percentage points on groundedness, with better context relevance and lower total cost; the auto-merging retriever (GPT-3.5 Turbo as the LLM, a BGE model for embeddings) posts the 100% groundedness, 94% answer relevance, and 43% context relevance figures above.
- Setting up the pipelines involves the OpenAI key, TruLens, and LlamaIndex: the OpenAI key is used for the completion step of the RAG and to implement the evaluations with TruLens, and the query engine construction with LlamaIndex is recapped (Jerry walked through it in lesson one in some detail).

3. Implementing feedback functions and evaluations (26:12-52:53)
- A provider, by default OpenAI GPT-3.5 Turbo, is used to implement the different feedback functions, or evaluations: context relevance, answer relevance, and groundedness. The lesson goes deeper into each evaluation of the RAG triad, moving between slides and the notebook for full context.
- First up is answer relevance: how relevant is the final response to the question the user asked? In the worked example, a highly relevant answer scores 0.9, and the evaluation provides supporting evidence indicating that it is a meaningful and relevant answer. A sketch of this feedback function follows.
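A sketch of the provider and the answer relevance feedback function (trulens_eval API as of the version used in the course; the `relevance_with_cot_reasons` variant adds the chain-of-thought justification discussed below):

```python
import os
from trulens_eval import Feedback, OpenAI as fOpenAI

os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; used by the RAG and the evals

provider = fOpenAI()  # default evaluation provider: OpenAI GPT-3.5 Turbo

# Answer relevance: score the final response against the user's question,
# with a chain-of-thought justification for the score.
f_qa_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()   # the user's question
    .on_output()  # the app's final response
)
```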
- Answer relevance is a concrete example of the more general abstraction of a feedback function: a feedback function provides a score on a scale of 0 to 1 after reviewing an LLM app's inputs, outputs, and intermediate results. Answer relevance uses only the original input (the prompt) and the final response from the RAG.
- Context relevance also uses intermediate results: it takes the input (the user's prompt) together with the set of retrieved contexts, scores each retrieved chunk against the question to assess the quality of the retrieval, and aggregates the per-chunk scores into a final score. In the worked example, the context relevance feedback function gives one piece of retrieved context a score of 0.7 on the zero-to-one scale, and because chain-of-thought reasoning is invoked on the evaluation LLM, it provides a justification for why the score is 0.7.
- Groundedness, set up with a similar code snippet, checks whether the statements in the final response are supported by the retrieved context. Sketches of both feedback functions follow this list.
- The overall workflow: evaluate the basic RAG, iterate with an advanced RAG technique, re-evaluate with a focus on context relevance and groundedness, and experiment with parameters such as the context window size to improve the metrics.
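Sketches of context relevance and groundedness (same trulens_eval vintage as above; `select_source_nodes` points the evaluation at the retrieved chunks, i.e., the app's intermediate results):

```python
import numpy as np
from trulens_eval import Feedback, TruLlama, OpenAI as fOpenAI
from trulens_eval.feedback import Groundedness

provider = fOpenAI()

# The retrieved chunks of context: the app's intermediate results.
context_selection = TruLlama.select_source_nodes().node.text

# Context relevance: score each retrieved chunk against the question,
# then aggregate the per-chunk scores (here with the mean).
f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)

# Groundedness: break the response into statements and check each one for
# supporting evidence in the retrieved context.
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
```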
- In the notebook, all three feedback functions are implemented with LLM evaluations, but feedback functions can be implemented in different ways. Human evaluations are helpful but hard to scale; ground-truth evals with expert-curated ratings are expensive to collect. LLM evaluations are quite comparable to human evaluations on the benchmark datasets to which both have been applied, with about 80% agreement in ratings. Feedback functions can also implement traditional NLP metrics such as ROUGE and BLEU scores, but those are syntactic in nature and limited compared with contextual evaluations using large language models such as GPT-4 or BERT-style models.
- While the course works through three feedback functions (answer relevance, context relevance, and groundedness), TruLens provides a much broader set of evaluations to ensure that the apps you build are honest, harmless, and helpful.
- Integration with LlamaIndex: import the TruLlama class, create a recorder object from it with the three RAG triad feedback functions, and run the LlamaIndex application through it so that its execution is recorded with deep instrumentation into the application.
- The evaluation questions are loaded from a text file, custom questions can be appended, and the sentence window engine is then executed on each question; a sketch follows.
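A sketch of the evaluation loop, reusing the feedback functions defined above (the questions file name is a placeholder):

```python
from trulens_eval import TruLlama

# Load the evaluation questions and append a custom one.
eval_questions = []
with open("eval_questions.txt") as f:  # placeholder file name
    for line in f:
        eval_questions.append(line.strip())
eval_questions.append("How can I be successful in AI?")

# Recorder: wraps the engine and attaches the RAG triad feedback functions.
tru_recorder = TruLlama(
    sentence_window_engine,
    app_id="Sentence Window Query Engine",
    feedbacks=[f_qa_relevance, f_qs_relevance, f_groundedness],
)

# Each query is recorded (prompt, response, intermediate results) and evaluated.
for question in eval_questions:
    with tru_recorder as recording:
        sentence_window_engine.query(question)
```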
4. App and RAG performance analysis (52:53-61:09)
- The recorded prompts, responses, intermediate results, and evaluation results are quite valuable for identifying failure modes in the app and for informing iteration and improvement. All of this information is available in a flexible JSON format, so it can be exported and consumed by downstream processes, and there are also more human-readable views of prompts, responses, and the feedback function evaluations.
- The aggregate leaderboard view across the 10 evaluation questions shows an average context relevance of 0.56, along with average scores for groundedness, answer relevance, and latency, and the total cost in dollars across those records. This aggregate view is useful for seeing how well the app performs and at what level of latency and cost.
- TruLens also provides a local Streamlit dashboard for examining applications in detail. In this run, the average latency is 3.55 seconds, shown alongside the total cost, the total number of tokens processed by the LLMs, and the RAG triad scores: context relevance 0.56, groundedness 0.86, and answer relevance 0.92. Selecting the app gives a record-level view: for each record, the user input (the prompt), the response, metadata, the timestamp, and the per-metric scores.
- Drilling into a strong record, answer relevance is 1.0 on the zero-to-one scale (quite a relevant answer to the question asked), and the average context relevance is 0.8, with both retrieved pieces of context individually scoring 0.8.
- Drilling into a weak record, both sentences of the final response get a low groundedness score: the overall response breaks down into four statements, and while the top two are good, the bottom two do not have good supporting evidence in the retrieved pieces of context. A sketch of pulling these views programmatically follows.
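A sketch of the inspection APIs (these `Tru` methods are from the trulens_eval version used in the course):

```python
from trulens_eval import Tru

tru = Tru()

# Raw records plus their feedback scores, as a pandas DataFrame.
records, feedback_cols = tru.get_records_and_feedback(app_ids=[])
print(records[["input", "output"] + feedback_cols].head())

# Aggregate metric scores, latency, and cost per app version.
print(tru.get_leaderboard(app_ids=[]))

# Local Streamlit dashboard with the leaderboard and record-level drill-downs.
tru.run_dashboard()
```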
- Looking at records where the app does not do so well gives a feeling for the kinds of failure modes that are quite common with RAG applications; some of these get addressed in the sessions on more advanced RAG techniques, which do better at avoiding them. That wraps lesson two; the next lesson walks through the mechanism of sentence window retrieval and shows how to evaluate the advanced technique with the RAG triad and TruLens.

5. Advanced RAG techniques: sentence window retrieval (61:10-68:51)
- The sentence window retrieval method decouples embedding from synthesis: it retrieves based on smaller sentences to better match the relevant context, and then synthesizes based on an expanded window of context around each sentence.
- The various components are covered in detail, and at the end Anupam shows how to experiment with parameters and evaluate the results with TruLens. The setup is the same as in previous lessons: install the relevant packages (llama-index and trulens-eval), and provide an OpenAI key, which is used for embeddings, LLMs, and the evaluation piece.
- Next, set up and inspect the documents used for iteration and experimentation; a loading sketch follows.
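A sketch of the loading and merging step (the PDF path is a placeholder):

```python
from llama_index import Document, SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./ebook.pdf"]  # placeholder path
).load_data()
print(type(documents), len(documents))  # a list of Document objects, one per page

# Merge the per-page documents into a single Document so that sentence windows
# and chunks can span page boundaries.
document = Document(text="\n\n".join(doc.text for doc in documents))
```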
- The source document has 41 pages, each loaded as a Document object, with sample text from the first page shown for inspection. The pages are merged into a single document because that helps with overall text blending accuracy when using the more advanced retrievers.
- The sentence window retrieval method is set up with a window size of 3 and a top-k value of 6. The sentence window node parser is an object that splits a document into individual sentences and then augments each sentence chunk with the surrounding context around that sentence; the notebook demonstrates how the node parser works with a small example, as in the sketch below.
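The parser demonstration, using legacy LlamaIndex imports (the toy text is the kind of small example the lesson uses):

```python
from llama_index import Document
from llama_index.node_parser import SentenceWindowNodeParser

# Split into single sentences; store a window of surrounding sentences
# (3 on each side) in each node's metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

text = "hello. how are you? I am fine!  "
nodes = node_parser.get_nodes_from_documents([Document(text=text)])

print([x.text for x in nodes])      # one node per sentence
print(nodes[1].metadata["window"])  # the sentence plus its surrounding window
```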
- The embedding model is bge-small, which is compact, fast, and accurate for its size; other embedding models can be used instead, for instance the related bge-large model, which appears in commented-out code in the notebook.
- The next step is to build a VectorStoreIndex over the source document. Because the node parser is defined as part of the service context, this takes the source document, transforms it into a series of sentences augmented with surrounding context, embeds them, and loads them into the vector store.
- The index can be saved to disk so it can be loaded later without rebuilding; a handy block of code loads the index from the existing files if they exist and otherwise builds it, as in the sketch below. With the index built, the next step is to set up and run the query engine, starting with a metadata replacement post-processor, which takes a value stored in a node's metadata and replaces the node's text with that value.
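A sketch of the build-or-load pattern (legacy LlamaIndex API; the persist directory name is a placeholder):

```python
import os

from llama_index import (
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.llms import OpenAI

sentence_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    embed_model="local:BAAI/bge-small-en-v1.5",  # or "local:BAAI/bge-large-en-v1.5"
    node_parser=node_parser,  # the SentenceWindowNodeParser from above
)

# Build the index once and persist it; reload it on subsequent runs.
if not os.path.exists("./sentence_index"):
    sentence_index = VectorStoreIndex.from_documents(
        [document], service_context=sentence_context
    )
    sentence_index.storage_context.persist(persist_dir="./sentence_index")
else:
    sentence_index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir="./sentence_index"),
        service_context=sentence_context,
    )
```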
{'end': 4002.931, 'start': 3881.624, 'title': 'Document object schemas and sentence window retrieval method', 'summary': 'Discusses the document object schemas, merging texts into a single document for enhanced text splitting accuracy, and setting up the sentence window retrieval method with a window size of three and a top-k value of six, along with the demonstration of how the node parser works.', 'duration': 121.307, 'highlights': ['The document object schemas and merging texts into a single document for improved text splitting accuracy are discussed.', 'The process of setting up the sentence window retrieval method with a window size of three and a top-k value of six is explained.', 'The demonstration of how the node parser works and splits a document into individual sentences, augmenting each sentence chunk with the surrounding context, is provided.']}, {'end': 4131.115, 'start': 4003.792, 'title': 'Setting up vectorstore index and query engine', 'summary': 'Details the process of setting up a VectorStore index using an LLM, an embedding model, and a node parser, and running the query engine, with mention of the specific models used and the option to save the index to disk.', 'duration': 127.323, 'highlights': ['The chapter details the process of setting up a VectorStore index using an LLM, an embedding model, and a node parser.', 'Mention of the embedding model specified as the BGE small model and the option to use other models like bge-large.', 'Description of transforming the source document into sentences with surrounding context, embedding it, and loading it into the VectorStore.', 'Option to save the index to disk for later use without rebuilding it.', 'Explanation of defining a metadata replacement post processor for the query engine.']}], 'duration': 460.768, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg3670347.jpg', 'highlights': ['The chapter introduces an advanced RAG technique - the sentence window retrieval method for better context matching.', 'The chapter provides guidance on setting up and evaluating the advanced technique, addressing common failure modes.', 'The chapter encourages experimentation with parameters and evaluation using TruLens, requiring the installation of relevant packages such as LlamaIndex and TruLens-Eval.', 'The document object schemas and merging texts into a single document for improved text splitting accuracy are discussed.', 'The process of setting up the sentence window retrieval method with a window size of three and a top-k value of six is explained.', 'The chapter details the process of setting up a VectorStore index using an LLM, an embedding model, and a node parser.', 'Mention of the embedding model specified as the BGE small model and the option to use other models like bge-large.', 'Option to save the index to disk for later use without rebuilding it.', 'Explanation of defining a metadata replacement post processor for the query engine.']}, {'end': 4852.244, 'segs': [{'end': 4243.533, 'src': 'embed', 'start': 4132.256, 'weight': 0, 'content': [{'end': 4137.078, 'text': 'And so this is done after retrieving the nodes and before sending the nodes to the LLM.', 'start': 4132.256, 'duration': 4.822}, {'end': 4139.96, 'text': "We'll first walk through how this works.", 'start': 4138.658, 'duration': 1.302}, {'end': 4154.448, 'text': 'Using the nodes we created with the sentence window node parser, we can test this post-processor.', 'start': 4149.564, 'duration': 4.884}, {'end': 4157.964, 'text': 'Note that we made a backup of the original nodes.', 'start': 4155.942, 'duration': 2.022}, {'end': 4160.426, 'text': "Let's take a look at the second node again.", 'start': 4158.804, 'duration': 1.622}, {'end': 4169.493, 'text': "Great. Now let's apply the post processor on top of these nodes.", 'start': 4165.069, 'duration': 4.424}, {'end': 4179.661, 'text': "If we now take a look at the text of the second node, we see that it's been replaced with a full context,", 'start': 4172.636, 'duration': 7.025}, {'end': 4182.884, 'text': 'including the sentences that occurred before and after the current node.', 'start': 4179.661, 'duration': 3.223}, {'end': 4187.671, 'text': 'The next step is to add the sentence transformer re-rank model.', 'start': 4184.827, 'duration': 2.844}, {'end': 4194.72, 'text': 'This takes the query and retrieved nodes and reorders the nodes in order of relevance using a specialized model for the task.', 'start': 4188.532, 'duration': 6.188}, {'end': 4204.36, 'text': 'Generally you would make the initial similarity top-k larger and then the reranker will rescore the nodes and return a smaller top-n,', 'start': 4196.058, 'duration': 8.302}, {'end': 4205.86, 'text': "so it'll filter out a smaller set.", 'start': 4204.36, 'duration': 1.5}, {'end': 4209.081, 'text': 'An example of a reranker is bge-reranker-base.', 'start': 4206.2, 'duration': 2.881}, {'end': 4212.262, 'text': 'This is a reranker based on the bge embeddings.', 'start': 4209.901, 'duration': 2.361}, {'end': 4217.863, 'text': "This string represents the model's name on Hugging Face, and you can find more details on the model from Hugging Face.", 'start': 4212.562, 'duration': 5.301}, {'end': 4220.886, 'text': "Let's take a look at how this re-ranker works.", 'start': 4219.025, 'duration': 1.861}, {'end': 4229.331, 'text': "We'll input some toy data and then see how the re-ranker can actually re-rank the initial set of nodes to a new set of nodes.", 'start': 4221.666, 'duration': 7.665}, {'end': 4232.653, 'text': "Let's assume the original query is I want a dog.", 'start': 4229.731, 'duration': 2.922}, {'end': 4240.077, 'text': 'And the initial set of scored nodes is this is a cat with a score of 0.6.', 'start': 4234.193, 'duration': 5.884}, {'end': 4243.533, 'text': 'And then this is a dog with a score of 0.4.', 'start': 4240.077, 'duration': 3.456}], 'summary': 'Post-processor adds context to nodes, reranker reorders nodes using a specialized model.', 'duration': 111.277, 'max_score': 4132.256, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg4132256.jpg'}, {'end': 4296.973, 'src': 'embed', 'start': 4271.744, 'weight': 4, 'content': [{'end': 4277.186, 'text': 'we chose for the re-ranker in order to give the re-ranker a fair chance at surfacing the proper information.', 'start': 4271.744, 'duration': 5.442}, {'end': 4281.887, 'text': 'We set the topK equal to six and the topN equal to two,', 'start': 4278.306, 'duration': 3.581}, {'end': 4291.251, 'text': 'which means that we first fetch the six most similar chunks using the sentence window retrieval and then we filter for the top two most relevant chunks using the sentence re-ranker.', 'start': 4281.887, 'duration': 9.364}, {'end': 4296.973, 'text': "Now that we have the full query engine set up, let's run through a basic example.", 'start': 4292.851, 'duration': 4.122}], 'summary': 'The re-ranker is given a fair chance with top-k=6 and top-n=2 for retrieving and filtering relevant chunks.', 'duration': 25.229, 'max_score': 4271.744, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg4271744.jpg'},
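The re-ranker and query-engine wiring just described can be sketched as follows, again assuming the legacy llama_index (v0.x) API; `sentence_index` stands in for the sentence window index built earlier, and the sample question is an assumption in the spirit of the course material.

```python
# Sketch of the sentence transformer re-ranker and the full sentence-window
# query engine (legacy llama_index v0.x); `sentence_index` is assumed built.
from llama_index.indices.postprocessor import (
    MetadataReplacementPostProcessor,
    SentenceTransformerRerank,
)

postproc = MetadataReplacementPostProcessor(target_metadata_key="window")

# BAAI/bge-reranker-base is the Hugging Face cross-encoder named above; it
# rescores (query, node) pairs and keeps only the top_n highest-scoring nodes.
rerank = SentenceTransformerRerank(top_n=2, model="BAAI/bge-reranker-base")

# Fetch the six most similar sentence chunks, expand each to its window,
# then filter down to the two most relevant chunks with the re-ranker.
sentence_window_engine = sentence_index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[postproc, rerank],
)
response = sentence_window_engine.query(
    "What are the keys to building a career in AI?"
)
print(str(response))
```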
example.", 'start': 4292.851, 'duration': 4.122}], 'summary': 'The re-ranker is given a fair chance with topk=6 and topn=2 for retrieving and filtering relevant chunks.', 'duration': 25.229, 'max_score': 4271.744, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg4271744.jpg'}, {'end': 4476.257, 'src': 'embed', 'start': 4441.711, 'weight': 5, 'content': [{'end': 4464.507, 'text': 'Let us now see how we can evaluate and iterate on the sentence window size parameter to make the right trade-offs between the evaluation metrics or the quality of the app and the cost of running the application and evaluation.', 'start': 4441.711, 'duration': 22.796}, {'end': 4469.95, 'text': 'We will gradually increase the sentence window size, starting with one.', 'start': 4465.305, 'duration': 4.645}, {'end': 4476.257, 'text': 'Evaluate the successive app versions with TrueLens and the RAG triad.', 'start': 4469.97, 'duration': 6.287}], 'summary': 'Iterate on sentence window size to optimize evaluation metrics and app quality.', 'duration': 34.546, 'max_score': 4441.711, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg4441711.jpg'}, {'end': 4716.473, 'src': 'embed', 'start': 4681.772, 'weight': 6, 'content': [{'end': 4686.415, 'text': 'for each question in that preloaded set of evaluation questions.', 'start': 4681.772, 'duration': 4.643}, {'end': 4699.004, 'text': 'And then with the true recorder object, we record the prompts, the responses, the intermediate results of the application,', 'start': 4687.356, 'duration': 11.648}, {'end': 4703.608, 'text': 'as well as the evaluation results in the true database.', 'start': 4699.004, 'duration': 4.604}, {'end': 4716.473, 'text': "Let's now adjust the sentence window size parameter and look at the impact of that on the different RAG triad evaluation metrics.", 'start': 4705.31, 'duration': 11.163}], 'summary': 'Recording prompts, responses, and evaluation results in the true database for analysis.', 'duration': 34.701, 'max_score': 4681.772, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg4681772.jpg'}, {'end': 4852.244, 'src': 'embed', 'start': 4824.069, 'weight': 7, 'content': [{'end': 4836.723, 'text': 'so the app leaderboard shows us the aggregate metrics for all the 21 records that we ran through and evaluated with TrueLens.', 'start': 4824.069, 'duration': 12.654}, {'end': 4840.684, 'text': 'The average latency here is 4.57 seconds.', 'start': 4837.543, 'duration': 3.141}, {'end': 4844.446, 'text': 'The total cost is about two cents.', 'start': 4842.245, 'duration': 2.201}, {'end': 4850.322, 'text': 'Total number of tokens processed is about 9, 000.', 'start': 4846.38, 'duration': 3.942}, {'end': 4852.244, 'text': 'And you can see the evaluation metrics.', 'start': 4850.322, 'duration': 1.922}], 'summary': 'Truelens app processed 21 records, with average latency of 4.57 seconds, costing two cents for 9,000 tokens.', 'duration': 28.175, 'max_score': 4824.069, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg4824069.jpg'}], 'start': 4132.256, 'title': 'Node retrieval and rag triad optimization', 'summary': 'Covers post-processing and re-ranking in node retrieval, focusing on full context inclusion and relevance-based re-ranking. 
'chapters': [{'end': 4291.251, 'start': 4132.256, 'title': 'Post-processing and re-ranking in node retrieval', 'summary': 'Discusses the post-processing of nodes to include full context and the re-ranking of nodes based on relevance using a specialized model, with specific details on the process and parameters.', 'duration': 158.995, 'highlights': ['The post-processor replaces the text of the second node with full context, including sentences before and after, after creating nodes with the sentence window node parser. Post-processor enhances nodes with full context, created using sentence window node parser.', 'The sentence transformer re-rank model reorders nodes based on relevance using a specialized model, with the ability to adjust the top-k and top-n parameters to filter nodes. Sentence transformer re-rank model reorders nodes based on relevance, with adjustable top-k and top-n parameters.', 'The example of a reranker, bge-reranker-base, is based on bge embeddings and can be accessed through Hugging Face for more details. Example of a reranker, bge-reranker-base, based on bge embeddings accessible through Hugging Face.', 'The re-ranker is demonstrated using toy data, showing how it re-ranks an initial set of nodes to a new set based on relevance to a given query. Demonstration of the re-ranker using toy data to show re-ranking of nodes based on query relevance.', 'Specific parameters such as topK and topN are set to control the node retrieval and filtering process for the re-ranker, ensuring proper surfacing of relevant information. Setting specific parameters like topK and topN for node retrieval and filtering in the re-ranker process.']}, {'end': 4852.244, 'start': 4292.851, 'title': 'Optimizing sentence window size for rag triad', 'summary': 'Introduces the process of setting up the sentence window query engine and optimizing the sentence window size for the RAG triad, gradually increasing it to observe its impact on evaluation metrics, with an emphasis on context relevance, groundedness, and cost of running the application.', 'duration': 559.393, 'highlights': ['Gradually increasing the sentence window size to observe its impact on evaluation metrics, emphasizing context relevance, groundedness, and cost of running the application. Observing impact on evaluation metrics, emphasizing context relevance, groundedness, and cost of running the application.', 'Running evaluations with TruLens for different sentence window sizes, recording prompts, responses, and evaluation results in the TruLens database. Running evaluations with TruLens, recording prompts, responses, and evaluation results in the TruLens database.', 'Setting up the sentence window size to one, running evaluations and logging relevant data into the TruLens database, resulting in an average latency of 4.57 seconds, a total cost of about two cents, and processing about 9,000 tokens. Setting up the sentence window size to one, resulting in average latency of 4.57 seconds, total cost of about two cents, and processing about 9,000 tokens.']}],
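A sweep over the sentence window size, as described above, could be organized like this; build_sentence_window_index and get_sentence_window_query_engine are hypothetical helper names standing in for the index and query-engine setup sketched earlier, not library API.

```python
# Illustrative sweep over the sentence window size; the two helpers are
# hypothetical wrappers around the setup steps sketched earlier.
from trulens_eval import TruLlama

for window_size in [1, 3, 5]:
    sentence_index = build_sentence_window_index(
        document,
        llm,
        sentence_window_size=window_size,          # parameter under study
        save_dir=f"sentence_index_{window_size}",  # persist each variant
    )
    engine = get_sentence_window_query_engine(sentence_index)

    # One TruLens app version per window size, so the leaderboard can
    # compare RAG-triad metrics, latency, and cost across variants.
    recorder = TruLlama(
        engine,
        app_id=f"sentence window engine {window_size}",
        feedbacks=feedbacks,  # RAG-triad feedback functions, defined earlier
    )
    for question in eval_questions:
        with recorder as recording:
            engine.query(question)
```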
'duration': 719.988, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg4132256.jpg', 'highlights': ['The post-processor enhances nodes with full context, created using sentence window node parser.', 'The sentence transformer re-rank model reorders nodes based on relevance, with adjustable top-k and top-n parameters.', 'Example of a reranker, bge-reranker-base, based on bge embeddings accessible through Hugging Face.', 'Demonstration of the re-ranker using toy data to show re-ranking of nodes based on query relevance.', 'Setting specific parameters like topK and topN for node retrieval and filtering in the re-ranker process.', 'Observing impact on evaluation metrics, emphasizing context relevance, groundedness, and cost of running the application.', 'Running evaluations with TruLens, recording prompts, responses, and evaluation results in the TruLens database.', 'Setting up the sentence window size to one, resulting in average latency of 4.57 seconds, total cost of about two cents, and processing about 9,000 tokens.']}, {'end': 5998.1, 'segs': [{'end': 5130.892, 'src': 'embed', 'start': 5102.821, 'weight': 1, 'content': [{'end': 5116.465, 'text': 'groundedness also becomes low because the LLM starts making use of its pre-existing knowledge from its training phase to start answering questions instead of just relying on the supplied context.', 'start': 5102.821, 'duration': 13.644}, {'end': 5121.247, 'text': "Now that I've shown you a failure mode with sentence windows set to one,", 'start': 5117.306, 'duration': 3.941}, {'end': 5130.892, 'text': 'I want to walk through a few more steps to see how the metrics improve as we change the sentence window size.', 'start': 5122.682, 'duration': 8.21}], 'summary': "The LLM's groundedness decreases as it relies on pre-existing knowledge, demonstrating a failure mode with a sentence window size of one, and aims to improve metrics by changing the window size.", 'duration': 28.071, 'max_score': 5102.821, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg5102821.jpg'}, {'end': 5366.348, 'src': 'embed', 'start': 5329.605, 'weight': 0, 'content': [{'end': 5339.032, 'text': "We'd see that by finding supporting evidence across these two pieces of highly relevant context,", 'start': 5329.605, 'duration': 9.427}, {'end': 5342.755, 'text': 'the groundedness score actually goes up all the way to one.', 'start': 5339.032, 'duration': 3.723}, {'end': 5354.304, 'text': 'So increasing the sentence window size from one to three led to a substantial improvement in the evaluation metrics', 'start': 5344.396, 'duration': 9.908}, {'end': 5356.485, 'text': 'of the RAG triad.', 'start': 5355.445, 'duration': 1.04}, {'end': 5366.348, 'text': 'Both groundedness and context relevance went up significantly, as did answer relevance.', 'start': 5357.266, 'duration': 9.082}], 'summary': 'Increasing sentence window size from 1 to 3 led to a groundedness score of 1, improving RAG triad metrics significantly.', 'duration': 36.743, 'max_score': 5329.605, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg5329605.jpg'}, {'end': 5732.096, 'src': 'embed', 'start': 5685.399, 'weight': 2, 'content': [{'end': 5687.16, 'text': 'This will consist of a few different components.', 'start': 5685.399, 'duration': 1.761},
{'end': 5691.182, 'text': 'And the first step is to define what we call a hierarchical node parser.', 'start': 5687.38, 'duration': 3.802}, {'end': 5697.081, 'text': 'In order to use an auto-merging retriever, we need to parse our nodes in a hierarchical fashion.', 'start': 5692.515, 'duration': 4.566}, {'end': 5703.048, 'text': 'This means that nodes are parsed in decreasing sizes and contain relationships to their parent node.', 'start': 5697.862, 'duration': 5.186}, {'end': 5707.033, 'text': 'Here we demonstrate how the node parser works with a small example.', 'start': 5704.11, 'duration': 2.923}, {'end': 5710.91, 'text': 'We create a toy parser with small chunk sizes to demonstrate.', 'start': 5707.967, 'duration': 2.943}, {'end': 5717.077, 'text': 'Note that the chunk sizes we use are 2048, 512, and 128.', 'start': 5711.631, 'duration': 5.446}, {'end': 5720.941, 'text': "You can change the chunk sizes to any sort of decreasing order that you'd like.", 'start': 5717.077, 'duration': 3.864}, {'end': 5723.024, 'text': 'Here we do it by a factor of four.', 'start': 5721.522, 'duration': 1.502}, {'end': 5726.928, 'text': "Now let's get the set of nodes from the document.", 'start': 5725.186, 'duration': 1.742}, {'end': 5732.096, 'text': 'What this does is this actually returns all nodes.', 'start': 5729.554, 'duration': 2.542}], 'summary': 'Demonstration of hierarchical node parser with chunk sizes 2048, 512, and 128, parsed in decreasing order.', 'duration': 46.697, 'max_score': 5685.399, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg5685399.jpg'}, {'end': 5858.405, 'src': 'embed', 'start': 5828.945, 'weight': 3, 'content': [{'end': 5834.687, 'text': "We'll also define a service context object containing the LLM, embedding model, and the hierarchical node parser.", 'start': 5828.945, 'duration': 5.742}, {'end': 5841.698, 'text': "As with the previous notebooks, we'll use the BGE small EN embedding model.", 'start': 5837.856, 'duration': 3.842}, {'end': 5844.699, 'text': 'The next step is to construct our index.', 'start': 5842.358, 'duration': 2.341}, {'end': 5851.102, 'text': 'The way the index works is that we actually construct a vector index on specifically the leaf nodes.', 'start': 5845.759, 'duration': 5.343}, {'end': 5858.405, 'text': 'All other intermediate and parent nodes are stored in a doc store and are retrieved dynamically during retrieval.', 'start': 5852.222, 'duration': 6.183}], 'summary': 'Constructing index with bge-small-en embedding model, storing intermediate and parent nodes in doc store.', 'duration': 29.46, 'max_score': 5828.945, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg5828945.jpg'}], 'start': 4852.864, 'title': 'Application evaluation and improvement', 'summary': "Evaluates an application's performance in answer relevance, groundedness, and context relevance, highlighting specific weaknesses and providing examples of low scores, including an increase in context relevance from 0.57 to 0.9 when the sentence window size is changed from one to three in the RAG triad evaluation metrics.", 'chapters': [{'end': 5051.482, 'start': 4852.864, 'title': 'Application evaluation and improvement', 'summary': "Discusses the evaluation of an application's performance in answer relevance, groundedness, and context relevance, highlighting the specific weaknesses in context relevance and providing examples of low scores in context relevance and groundedness.", 'duration': 198.618, 'highlights': ["The application's performance in context relevance is quite poor, as highlighted by specific examples of low context relevance scores.", 'The groundedness scores for the retrieved context are quite low, with specific examples showing a mix of high and low scores based on the supporting evidence in the retrieved context.', 'The chapter emphasizes the importance of context relevance and groundedness in the evaluation process, with specific examples demonstrating the impact of the size and supporting evidence of the retrieved context on the scores.']}, {'end': 5638.721, 'start': 5052.861, 'title': 'Rag technique: auto-merging', 'summary': 'Discusses the impact of changing the sentence window size on the RAG triad evaluation metrics, the increase in context relevance from 0.57 to 0.9 when the sentence window size is changed from one to three, and the reduction in groundedness with a further increase to a size of five.', 'duration': 585.86, 'highlights': ['The increase in context relevance from 0.57 to 0.9 when the sentence window size is changed from one to three. The context relevance score increased from 0.57 to 0.9 when the sentence window size was changed from one to three, indicating a substantial improvement in contextual understanding.', 'The reduction in groundedness with a further increase to a size of five. The groundedness score dropped with the increase in the sentence window size to five, indicating that beyond a certain point, the LLM can get overwhelmed with too much information, leading to a decrease in groundedness.', 'Discussion of auto-merging retrieval as an advanced RAG technique to address issues with fragmented context chunks. The transcript provides a deep dive into auto-merging retrieval as an advanced RAG technique to address issues with fragmented context chunks, ultimately ensuring a more coherent context for the LLM to synthesize over.']}, {'end': 5998.1, 'start': 5639.581, 'title': 'Auto-merging retriever setup', 'summary': 'Covers the setup of an auto-merging retriever, which includes defining a hierarchical node parser, constructing a vector index on leaf nodes, and setting up the retriever to control the merging logic with a large top-k for leaf nodes and applying a re-ranker to reduce token usage.', 'duration': 358.519, 'highlights': ['The chapter covers the setup of an auto-merging retriever, which includes defining a hierarchical node parser, constructing a vector index on leaf nodes, and setting up the retriever to control the merging logic with a large top-k for leaf nodes and applying a re-ranker to reduce token usage.', 'We load in 41 document objects and merge them into a single large document, making it more amenable for text splitting with advanced retrieval methods.', 'The auto-merging retriever controls the merging logic by swapping retrieved child nodes out for their parent node when a majority of the children are retrieved, and applies a re-ranker to reduce token usage after merging.', 'The leaf nodes are specifically indexed and embedded using the BGE small EN embedding model, while intermediate and parent nodes are stored in a doc store and retrieved dynamically during retrieval.', 'The importance of networking in AI is highlighted, stating that it allows individuals to build a strong professional network and more.', 'A toy parser with small chunk sizes (2048, 512, 128) demonstrates how the node parser works hierarchically, with the ability to change chunk sizes in decreasing order.', 'The leaf nodes, intermediate nodes, and parent nodes are retrieved, with a function available to retrieve only the leaf nodes, demonstrating a decent amount of overlap of information and content between them.', 'An example of a leaf node containing a small chunk size of 128 tokens is shown, emphasizing the smallest chunk size in the hierarchy and how it can be utilized for specific tasks.', 'The relationships between leaf and parent nodes are explored, showcasing the hierarchical nature of the nodes and how parent nodes contain multiple leaf nodes based on chunk sizes.', 'The construction of the index involves initializing the LLM, embedding model, and hierarchical node parser in a service context object, and using the BGE small EN embedding model for leaf nodes in the index.']}], 'duration': 1145.236, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg4852864.jpg', 'highlights': ['The context relevance score increased from 0.57 to 0.9 when the sentence window size was changed from one to three, indicating a substantial improvement in contextual understanding.', 'The groundedness score dropped with the increase in the sentence window size to five, indicating that beyond a certain point, the LLM can get overwhelmed with too much information, leading to a decrease in groundedness.', 'The chapter covers the setup of an auto-merging retriever, which includes defining a hierarchical node parser, constructing a vector index on leaf nodes, and setting up the retriever to control the merging logic with a large top-k for leaf nodes and applying a re-ranker to reduce token usage.', 'The leaf nodes are specifically indexed and embedded using the BGE small EN embedding model, while intermediate and parent nodes are stored in a doc store and retrieved dynamically during retrieval.', 'A toy parser with small chunk sizes (2048, 512, 128) demonstrates how the node parser works hierarchically, with the ability to change chunk sizes in decreasing order.']}, {'end': 6900.089, 'segs': [{'end': 6180.026, 'src': 'embed', 'start': 6148.536, 'weight': 1, 'content': [{'end': 6156.78, 'text': "One reason you may want to experiment with the two-layer auto-merging structure is that it's simpler.", 'start': 6148.536, 'duration': 8.244}, {'end': 6171.927, 'text': 'Less work is needed to create the index, and there is less work in the retrieval step as well, because all the third-layer checks go away.', 'start': 6158.1, 'duration': 13.827}, {'end': 6180.026, 'text': 'If it performs comparatively well, then ideally we want to work with a simpler structure.', 'start': 6172.827, 'duration': 7.199}], 'summary': 'Experimenting with a two-layer auto-merging structure for simpler indexing and retrieval steps may reduce third-layer checks.', 'duration': 31.49, 'max_score': 6148.536, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg6148536.jpg'}, {'end': 6360.589, 'src': 'embed', 'start': 6332.507, 'weight': 2, 'content': [{'end': 6334.689, 'text': "Let's examine the app leaderboard.", 'start': 6332.507, 'duration': 2.182}, {'end': 6345.54, 'text': 'You can see here that after processing 24 records, the context relevance at an aggregate level is quite low,', 'start': 6336.311, 'duration': 9.229}, {'end': 6349.323, 'text': 'although the app is doing better on answer relevance and groundedness.', 'start': 6345.54, 'duration': 3.783}, {'end': 6351.466, 'text': 'I can select the app.', 'start': 6350.405, 'duration': 1.061},
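The hierarchical parsing and auto-merging retrieval steps summarized above can be sketched end to end as follows, assuming legacy llama_index (v0.x) imports; `document`, `llm`, and the `rerank` post-processor are assumed to be defined as in the earlier sections.

```python
# Hedged sketch of the auto-merging setup (legacy llama_index v0.x):
# hierarchical parsing, a vector index over leaf nodes only, and a
# retriever that merges children into their parent at query time.
from llama_index import ServiceContext, StorageContext, VectorStoreIndex
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Chunk sizes decrease by a factor of four, so each parent has four children.
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = node_parser.get_nodes_from_documents([document])  # all three levels
leaf_nodes = get_leaf_nodes(nodes)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    node_parser=node_parser,
)

# Only leaf nodes are embedded; intermediate and parent nodes live in the
# docstore and are swapped in dynamically during retrieval.
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
automerging_index = VectorStoreIndex(
    leaf_nodes,
    storage_context=storage_context,
    service_context=service_context,
)

# A large top-k over leaf nodes gives the merging logic room to work;
# the re-ranker then cuts token usage after merging.
base_retriever = automerging_index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever, automerging_index.storage_context, verbose=True
)
auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[rerank]
)
```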
{'end': 6360.589, 'text': "Let's now look at the individual records of app zero and see how the evaluation scores are for the various records.", 'start': 6352.486, 'duration': 8.103}], 'summary': 'App leaderboard shows low context relevance after processing 24 records.', 'duration': 28.082, 'max_score': 6332.507, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg6332507.jpg'}, {'end': 6503.326, 'src': 'embed', 'start': 6453.29, 'weight': 0, 'content': [{'end': 6462.294, 'text': "Let's now compare the previous app to the auto-merging setup that Jerry introduced earlier.", 'start': 6453.29, 'duration': 9.004}, {'end': 6470.918, 'text': 'We will have three layers now in the hierarchy, starting with 128 tokens at the leaf node level, 512 one layer up, and 2,048 at the highest layer.', 'start': 6463.094, 'duration': 7.824}, {'end': 6473.379, 'text': 'So at each layer, each parent has four children.', 'start': 6470.938, 'duration': 2.441}, {'end': 6503.326, 'text': "Now let's set up the query engine for this app setup and the TruLens recorder, with all steps identical to those for the previous app.", 'start': 6486.539, 'duration': 16.787}], 'summary': 'Comparison between the previous app and the auto-merging setup with three layers and specified token counts, followed by setting up the query engine.', 'duration': 50.036, 'max_score': 6453.29, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg6453290.jpg'}, {'end': 6656.13, 'src': 'embed', 'start': 6631.382, 'weight': 3, 'content': [{'end': 6639.465, 'text': 'We walked you through an approach to evaluate and iterate with the auto-merging retrieval advanced RAG technique.', 'start': 6631.382, 'duration': 8.083}, {'end': 6648.468, 'text': 'And, in particular, we showed you how to iterate with different hierarchical structures: the number of levels,', 'start': 6640.925, 'duration': 7.543}, {'end': 6651.469, 'text': 'the number of child nodes, and chunk sizes.', 'start': 6648.468, 'duration': 3.001}, {'end': 6656.13, 'text': 'And for these different app versions you can,', 'start': 6653.969, 'duration': 2.161}]}, {'end': 6847.977, 'src': 'embed', 'start': 6822.812, 'weight': 7, 'content': [{'end': 6829.519, 'text': "you'll need to learn some of these core development principles so that you can be a rock star AI engineer who can build robust LLM software systems.", 'start': 6822.812, 'duration': 6.707}, {'end': 6837.287, 'text': 'Reducing LLM hallucination is going to be the top priority for every developer as the field evolves.', 'start': 6830.2, 'duration': 7.087}, {'end': 6847.977, 'text': 'We are excited to see the base models get better and larger scale evaluations become cheaper and more accessible for everyone to set up and run.', 'start': 6837.928, 'duration': 10.049}], 'summary': 'Learn core development principles to be a top AI engineer, prioritize reducing LLM hallucination, and anticipate improved base models and cheaper evaluations.', 'duration': 25.165, 'max_score': 6822.812, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg6822812.jpg'},
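The two- and three-layer comparisons described here could be reproduced with a sketch like the following; build_automerging_index and get_automerging_query_engine are hypothetical helper names wrapping the index and engine construction sketched earlier, and the app ids and chunk_sizes argument are assumptions.

```python
# Hedged sketch comparing the two app variants; helper names are assumed.
from trulens_eval import TruLlama

# Two layers: 2048-token parents with 512-token leaves (four children each).
index_2layer = build_automerging_index(
    documents, llm, save_dir="merging_index_0", chunk_sizes=[2048, 512]
)
engine_2layer = get_automerging_query_engine(index_2layer, similarity_top_k=12)

# Three layers: 2048 -> 512 -> 128, again four children per parent.
index_3layer = build_automerging_index(
    documents, llm, save_dir="merging_index_1", chunk_sizes=[2048, 512, 128]
)
engine_3layer = get_automerging_query_engine(index_3layer, similarity_top_k=12)

# Identical TruLens recorders for both variants make the leaderboard a
# like-for-like comparison of tokens, cost, and the RAG-triad metrics.
for app_id, engine in [("app_0", engine_2layer), ("app_1", engine_3layer)]:
    recorder = TruLlama(engine, app_id=app_id, feedbacks=feedbacks)
    for question in eval_questions:
        with recorder as recording:
            engine.query(question)
```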
{'end': 6897.767, 'src': 'embed', 'start': 6868.589, 'weight': 5, 'content': [{'end': 6874.673, 'text': 'The RAG triad is an excellent place to start with evaluating your RAG-based LLM apps.', 'start': 6868.589, 'duration': 6.084}, {'end': 6882.938, 'text': 'As a next step, I encourage you to dig deeper into the area of evaluating LLMs and the apps that they power.', 'start': 6875.513, 'duration': 7.425}, {'end': 6892.684, 'text': 'This includes topics such as assessing model confidence, calibration, uncertainty, explainability, privacy,', 'start': 6883.838, 'duration': 8.846}, {'end': 6897.767, 'text': 'fairness and toxicity in both benign and adversarial settings.', 'start': 6892.684, 'duration': 5.083}], 'summary': 'Evaluate RAG-based LLM apps using the RAG triad, then assess model confidence, calibration, uncertainty, explainability, privacy, fairness, and toxicity in various settings.', 'duration': 29.178, 'max_score': 6868.589, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg6868589.jpg'}], 'start': 5998.48, 'title': 'Advanced rag techniques', 'summary': 'Demonstrates evaluating and iterating with advanced RAG techniques to improve applications, showcasing significant improvement and a call to explore further techniques and areas for evaluation.', 'chapters': [{'end': 6145.861, 'start': 5998.48, 'title': 'Auto-merging index and query engine', 'summary': 'Demonstrates the creation of high-level functions for building the auto-merging index and query engine, leveraging the hierarchical node parser, vector store index creation, the auto-merging retriever, and parameter iteration, with a recommendation to experiment and provide feedback.', 'duration': 147.381, 'highlights': ['The second function, getAutoMergingQueryEngine, leverages our auto-merging retriever, which is able to dynamically merge leaf nodes into parent nodes, and also uses our re-rank module and then combines it with the overall retriever query engine.', 'The first function, buildAutoMergingIndex, involves using the hierarchical node parser to parse out the hierarchy of child to parent nodes, defining the service context, creating a vector store index from the leaf nodes, and linking to the document store of all the nodes.', 'The chapter encourages experimentation and iteration on the parameters of auto-merging retrieval, such as changing the chunk sizes, top-k, or top-n for the re-ranker, with a call to try out your own questions and provide feedback.', 'Setting up the new auto-merging index involves creating two layers, where the lowest layer has a chunk size of 512 and the next layer in the hierarchy has a chunk size of 2048, with each parent having four leaf nodes of 512 tokens each.', 'The next step includes evaluating the auto-merging retriever and iterating on parameters using TruLens, while comparing its performance to the basic RAG with experiment tracking.']}, {'end': 6630.542, 'start': 6148.536, 'title': 'Auto merging structure evaluation', 'summary': 'Discusses the evaluation of a two-layer auto-merging structure, comparing it to a three-layer hierarchy, showcasing reduced processing cost and increased context relevance by approximately 20% in the latter.', 'duration': 482.006, 'highlights': ['The three-layer auto-merging hierarchy app setup processed about half the number of tokens and incurred about half the total cost as compared to the two-layer auto-merging structure, resulting in a cost reduction.', 'Context relevance increased by about 20% in the three-layer auto-merging hierarchy setup, indicating improved performance
in merging and evaluation of context relevance.', 'The detailed evaluation of individual records in the three-layer auto-merging hierarchy showed considerable improvement in context relevance and groundedness, reflecting the effectiveness of the new app setup.']}, {'end': 6900.089, 'start': 6631.382, 'title': 'Advanced rag techniques evaluation', 'summary': 'Demonstrated how to evaluate and iterate with advanced RAG techniques using auto-merging retrieval, hierarchical structures, and experiment tracking to significantly improve RAG applications, with a call to explore further techniques and areas for evaluation.', 'duration': 268.707, 'highlights': ['The chapter demonstrated how to evaluate and iterate with advanced RAG techniques using auto-merging retrieval, hierarchical structures, and experiment tracking to significantly improve RAG applications. The process showed how to iterate with different hierarchical structures, number of levels, number of child nodes, and chunk sizes for evaluating app versions, aiming to pick the best structure for the use case.', 'The chapter emphasized the importance of evaluating LLM applications and exploring additional techniques and areas for evaluation. It highlighted the need to explore other evaluations, such as assessing model confidence, calibration, uncertainty, explainability, privacy, fairness, and toxicity in benign and adversarial settings.', 'The chapter stressed the need for developers to focus on reducing LLM hallucination and improving RAG performance. It emphasized the importance of understanding the data pipeline, retrieval strategy, and LLM prompts to help improve RAG performance, with a recommendation to delve deeper into evaluating LLMs and the apps they power.']}], 'duration': 901.609, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/rrW1U7tt_Xg/pics/rrW1U7tt_Xg5998480.jpg', 'highlights': ['The three-layer auto-merging hierarchy app setup processed about half the number of tokens and incurred about half the total cost as compared to the two-layer auto-merging structure, resulting in a cost reduction.', 'Context relevance increased by about 20% in the three-layer auto-merging hierarchy setup, indicating improved performance in merging and evaluation of context relevance.', 'The detailed evaluation of individual records in the three-layer auto-merging hierarchy showed considerable improvement in context relevance and groundedness, reflecting the effectiveness of the new app setup.', 'The chapter demonstrated how to evaluate and iterate with advanced RAG techniques using auto-merging retrieval and hierarchical structures.', 'The process showed how to iterate with different hierarchical structures, number of levels, number of child nodes, and chunk sizes for evaluating app versions, aiming to pick the best structure for the use case.', 'The chapter emphasized the importance of evaluating LLM applications and exploring additional techniques and areas for evaluation.', 'It highlighted the need to explore other evaluations, such as assessing model confidence, calibration, and fairness.', 'The chapter stressed the need for developers to focus on reducing LLM hallucination and improving RAG performance.']}], 'highlights': ['1 Introduction of Jerry Liu and Anupam Datta, experienced in RAG practices and trustworthy AI research.', '1 Emphasis on the high cost of building and productionizing a high-quality RAG system.', '1 Introduction of sentence window retrieval and auto-merging retrieval for coherent text chunks.', '1 Outlines evaluation metrics for RAG-based LLM apps: context relevance, groundedness, and answer relevance.', '2 TruLens as a standard mechanism for evaluating generative AI applications at scale.', '2 Discussion of the TruLens recorder and dashboard displaying evaluation metrics and costs.', '2 Advanced retrieval techniques: sentence window retrieval, OpenAI GPT 3.5 Turbo, and the auto-merging retriever.', '2 Demonstration of advanced RAG pipeline outperforming baseline in groundedness, context relevance, and answer relevance.', '3 Implementation of feedback functions: groundedness, answer relevance, and context relevance.', '3 Integration of TruLens with LlamaIndex and setting up feedback functions for app instrumentation.', '4 Value of evaluation results in identifying failure modes, informing iteration, and available in flexible JSON format.', '4 Analysis of aggregate performance scores and costs across 10 records, including latency and answer relevance.', '5 Introduction of advanced RAG technique: sentence window retrieval method for better context matching.', '5 Guidance on setting up and evaluating the advanced technique, addressing common failure modes.', '6 Enhancement of nodes with full context using the post-processor and re-rank model for node relevance.', '6 Observation of impact on evaluation metrics, emphasizing context relevance, groundedness, and cost.', '7 Setup of the auto-merging retriever, impact of sentence window size on context relevance and groundedness.', '8 Demonstration of evaluating and iterating with advanced RAG techniques using auto-merging retrieval and hierarchical structures.', '8 Importance of exploring additional evaluations, such as model confidence, calibration, and fairness.', '8 Emphasis on reducing LLM hallucination and improving RAG performance.']}
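As a closing illustration of the RAG-triad feedback functions referenced throughout these highlights, here is a hedged sketch assuming the pre-1.0 trulens_eval API from the course era; the provider choice and selectors follow that package's conventions, and the resulting `feedbacks` list feeds the recorders sketched above.

```python
# Sketch of the RAG-triad feedback functions (pre-1.0 trulens_eval API).
import numpy as np
from trulens_eval import Feedback, TruLlama
from trulens_eval import OpenAI as fOpenAI
from trulens_eval.feedback import Groundedness

provider = fOpenAI()  # LLM provider that scores the feedback functions

# Answer relevance: is the final response relevant to the user's question?
f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()

# Context relevance: is each retrieved chunk relevant to the question?
context_selection = TruLlama.select_source_nodes().node.text
f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)

# Groundedness: is the answer supported by the retrieved context?
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(
        grounded.groundedness_measure_with_cot_reasons, name="Groundedness"
    )
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

feedbacks = [f_qa_relevance, f_qs_relevance, f_groundedness]
```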