title
Markov Decision Processes 1 - Value Iteration | Stanford CS221: AI (Autumn 2019)

description
For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3pUNqG7

Topics: MDP1, Search review, Project

Percy Liang, Associate Professor & Dorsa Sadigh, Assistant Professor - Stanford University
http://onlinehub.stanford.edu/

Associate Professor Percy Liang
Associate Professor of Computer Science and Statistics (courtesy)
https://profiles.stanford.edu/percy-liang

Assistant Professor Dorsa Sadigh
Assistant Professor in the Computer Science Department & Electrical Engineering Department
https://profiles.stanford.edu/dorsa-sadigh

To follow along with the course schedule and syllabus, visit: https://stanford-cs221.github.io/autumn2019/#schedule

Chapters:
0:00 intro
2:12 Course Plan
3:45 Applications
10:48 Rewards
18:46 Markov Decision process
19:33 Transitions
20:45 Transportation Example
29:28 What is a Solution?
30:58 Roadmap
36:36 Evaluating a policy: volcano crossing
37:38 Discounting
53:21 Policy evaluation computation
55:23 Complexity
57:10 Summary so far

#artificialintelligencecourse

detail
{'title': 'Markov Decision Processes 1 - Value Iteration | Stanford CS221: AI (Autumn 2019)', 'heatmap': [{'end': 3093.708, 'start': 3040.825, 'weight': 0.951}, {'end': 4976.098, 'start': 4936.742, 'weight': 1}], 'summary': 'Series covers mdps, decision making under uncertainty, expected utility, mdp formalism, value iteration, and policy evaluation, providing insights into their applications in transportation and robotics.', 'chapters': [{'end': 278.132, 'segs': [{'end': 172.424, 'src': 'embed', 'start': 141.965, 'weight': 0, 'content': [{'end': 143.607, 'text': 'which are Markov decision processes.', 'start': 141.965, 'duration': 1.642}, {'end': 148.993, 'text': 'And the idea of it is you take actions but you might not actually end up where you expect it to,', 'start': 143.927, 'duration': 5.066}, {'end': 155.038, 'text': "because there's this nature around you and there's this world around you that's going to be uncertain and do stuff that you didn't expect.", 'start': 148.993, 'duration': 6.045}, {'end': 158.239, 'text': "Okay? So, so, so far we've talked about search problems.", 'start': 155.058, 'duration': 3.181}, {'end': 165.441, 'text': 'The idea of it is you start with a state and then you take an action and you deterministically end up in a new state.', 'start': 158.699, 'duration': 6.742}, {'end': 172.424, 'text': 'If you remember the successor function, successor of S and A would always give us S prime and we would deterministically end up in S prime.', 'start': 165.761, 'duration': 6.663}], 'summary': 'Markov decision processes deal with uncertain environments and deterministic state transitions.', 'duration': 30.459, 'max_score': 141.965, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co141965.jpg'}, {'end': 240.538, 'src': 'embed', 'start': 214.351, 'weight': 1, 'content': [{'end': 218.593, 'text': 'And again, because there is just so many other things that are happening in the world and you need to.', 'start': 214.351, 'duration': 4.242}, {'end': 222.154, 'text': 'you need to worry about that randomness and make decisions based on that.', 'start': 218.593, 'duration': 3.561}, {'end': 226.976, 'text': 'And- and this actually comes up pretty much like everywhere in every application.', 'start': 223.655, 'duration': 3.321}, {'end': 229.196, 'text': 'So, uh, this comes up in robotics.', 'start': 227.436, 'duration': 1.76}, {'end': 234.717, 'text': 'So for example, if you have a robot that wants to go and pick up an object, you decide on your strategy, everything is great.', 'start': 229.256, 'duration': 5.461}, {'end': 240.538, 'text': 'But like when it comes to actually moving the robot and getting the robot to do the task, like the actuators can fail,', 'start': 235.037, 'duration': 5.501}], 'summary': 'Randomness impacts decision-making in robotics applications.', 'duration': 26.187, 'max_score': 214.351, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co214351.jpg'}, {'end': 268.808, 'src': 'embed', 'start': 243.779, 'weight': 2, 'content': [{'end': 250.6, 'text': "So there is uncertainty about the environment or uncertainty about your model, like your actuators, that- that you didn't necessarily think about.", 'start': 243.779, 'duration': 6.821}, {'end': 254.341, 'text': "And in reality, they're affecting your decisions and where you're ending up at.", 'start': 250.92, 'duration': 3.421}, {'end': 257.723, 'text': 'This comes up in other settings like resource 
allocation.', 'start': 254.902, 'duration': 2.821}, {'end': 261.524, 'text': "So in resource allocation, maybe you're deciding what to produce.", 'start': 258.123, 'duration': 3.401}, {'end': 263.305, 'text': 'What is a product you would want to produce?', 'start': 261.745, 'duration': 1.56}, {'end': 266.707, 'text': 'And and that kind of depends on what is the customer demand.', 'start': 263.746, 'duration': 2.961}, {'end': 268.808, 'text': 'and and you might not have a good model of that.', 'start': 266.707, 'duration': 2.101}], 'summary': 'Uncertainty in environment and model affects decisions. example: resource allocation and product demand.', 'duration': 25.029, 'max_score': 243.779, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co243779.jpg'}], 'start': 5.792, 'title': 'Mdps and project plan discussion', 'summary': 'Covers catching up with mdps in the first hour, reviewing previous lecture for 10 minutes, and discussing the project plan for the remaining time. it also explains mdps, addressing uncertainties in decision-making and its applications in transportation and robotics, emphasizing the need to consider randomness in decision-making.', 'chapters': [{'end': 49.203, 'start': 5.792, 'title': 'Mdps and project plan discussion', 'summary': 'Focuses on catching up with a plan to cover markov decision processes in the first hour, review previous lecture for 10 minutes, and discuss the project plan for the remaining time.', 'duration': 43.411, 'highlights': ['The plan is to cover Markov Decision Processes in the first hour.', 'Review of the previous lecture will take place for 10 minutes.', 'Discussion on the project plan will occur in the last 10 minutes.']}, {'end': 278.132, 'start': 49.263, 'title': 'Markov decision processes', 'summary': 'Explains markov decision processes, addressing uncertainties in decision-making and how it applies to various scenarios such as transportation and robotics, emphasizing the need to consider randomness in decision-making.', 'duration': 228.869, 'highlights': ['Markov Decision Processes involve uncertainties in decision-making due to unpredictable outcomes in various scenarios such as transportation and robotics. uncertainties, decision-making, transportation, robotics', 'The chapter emphasizes the need to consider randomness in decision-making, which is applicable in various real-life scenarios, including resource allocation and robotics. randomness, decision-making, resource allocation, robotics', 'Examples of uncertainties are highlighted, such as the possibility of a flat tire when biking, traffic when driving, and actuator failure in robotics, demonstrating the relevance of considering uncertainties in decision-making processes. 
uncertainties examples, flat tire, traffic, actuator failure']}], 'duration': 272.34, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co5792.jpg', 'highlights': ['Markov Decision Processes involve uncertainties in decision-making due to unpredictable outcomes in transportation and robotics.', 'The chapter emphasizes the need to consider randomness in decision-making, applicable in real-life scenarios like resource allocation and robotics.', 'Examples of uncertainties are highlighted, such as the possibility of a flat tire when biking, traffic when driving, and actuator failure in robotics.']}, {'end': 1133.781, 'segs': [{'end': 327.4, 'src': 'embed', 'start': 297.534, 'weight': 0, 'content': [{'end': 300.216, 'text': "So there's a lot of uncertainty in these decisions that we make.", 'start': 297.534, 'duration': 2.682}, {'end': 308.643, 'text': 'And and they make these problems to to go beyond search problems and become problems where where we have uncertainty and we need to make decisions under uncertainty.', 'start': 300.657, 'duration': 7.986}, {'end': 311.105, 'text': 'Okay? All right.', 'start': 309.123, 'duration': 1.982}, {'end': 312.787, 'text': "So let's look at another example.", 'start': 311.185, 'duration': 1.602}, {'end': 315.389, 'text': 'So this is a volcano crossing example.', 'start': 312.947, 'duration': 2.442}, {'end': 318.372, 'text': 'So, so we have an island and we are on one side of the island.', 'start': 315.429, 'duration': 2.943}, {'end': 321.034, 'text': "And what we wanna do, so we're in that black square over there.", 'start': 318.812, 'duration': 2.222}, {'end': 327.4, 'text': 'And what we wanna do is we wanna go from this black square to this side of the island, where we have the scenic view,', 'start': 321.495, 'duration': 5.905}], 'summary': 'Decisions under uncertainty, illustrated by a volcano crossing example.', 'duration': 29.866, 'max_score': 297.534, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co297534.jpg'}, {'end': 501.233, 'src': 'embed', 'start': 475.917, 'weight': 1, 'content': [{'end': 482.241, 'text': 'What we want to do in this lecture is we are going to, like again, model these, uh, types of systems as Markov decision processes.', 'start': 475.917, 'duration': 6.324}, {'end': 485.142, 'text': 'Then we are going to talk about inference type algorithm.', 'start': 482.761, 'duration': 2.381}, {'end': 486.443, 'text': 'So how do we do inference??', 'start': 485.162, 'duration': 1.281}, {'end': 488.405, 'text': 'How do we come up with this best strategy path?', 'start': 486.463, 'duration': 1.942}, {'end': 495.929, 'text': "Um, and in the middle, I'm going to talk about policy evaluation, which is not an inference algorithm, but it's kind of a step towards it.", 'start': 489.085, 'duration': 6.844}, {'end': 501.233, 'text': "And it's basically this idea of if someone tells me this is a policy, can I evaluate how good it is?", 'start': 496.009, 'duration': 5.224}], 'summary': 'Model systems as markov decision processes, discuss inference algorithms, and evaluate policies.', 'duration': 25.316, 'max_score': 475.917, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co475917.jpg'}, {'end': 760.521, 'src': 'embed', 'start': 712.818, 'weight': 2, 'content': [{'end': 716.539, 'text': 'But- but in general, in expectation, you should decide to stay okay?', 'start': 712.818, 'duration': 
3.721}, {'end': 721.519, 'text': 'And- and we actually wanna spend a little bit of time in this lecture thinking about how we get that 12..', 'start': 716.939, 'duration': 4.58}, {'end': 729.645, 'text': 'and then how to go about computing this expected utility, and- and based on that, how to decide what policy to use, right? Okay.', 'start': 721.519, 'duration': 8.126}, {'end': 733.548, 'text': 'And then, if you decide to- to quit then- then expected utility.', 'start': 730.246, 'duration': 3.302}, {'end': 734.849, 'text': 'there is kind of obvious right?', 'start': 733.548, 'duration': 1.301}, {'end': 739.393, 'text': "Because that- that you're quitting and that's with probability one, you're getting $10..", 'start': 734.889, 'duration': 4.504}, {'end': 743.175, 'text': "So you're just gonna get $10 and that is the expected utility of quitting.", 'start': 739.393, 'duration': 3.782}, {'end': 757.518, 'text': 'Yes Uh, so, so when you, when I said, when you roll a die, I said if you get one or two, you, you, you say, yeah.', 'start': 743.816, 'duration': 13.702}, {'end': 760.521, 'text': 'And then if you get the other, so the two-thirds of it, you continue.', 'start': 757.578, 'duration': 2.943}], 'summary': 'Calculating expected utility for decision making, $10 for quitting, and 2/3 chance to continue.', 'duration': 47.703, 'max_score': 712.818, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co712818.jpg'}, {'end': 1010.365, 'src': 'embed', 'start': 983.507, 'weight': 4, 'content': [{'end': 987.329, 'text': 'So- so more formally, we have a bunch of things when we define an MDP.', 'start': 983.507, 'duration': 3.822}, {'end': 991.832, 'text': 'Similar to search problems, we- like we- we now need to define the same set of things.', 'start': 987.75, 'duration': 4.082}, {'end': 993.593, 'text': 'So- so we have a set of states.', 'start': 991.892, 'duration': 1.701}, {'end': 998.937, 'text': 'In this case, my states are in and end, okay? We have a start state.', 'start': 994.114, 'duration': 4.823}, {'end': 1001.619, 'text': "I'm starting with in, so that's my start state.", 'start': 999.297, 'duration': 2.322}, {'end': 1004.701, 'text': 'I have actions as a function of states.', 'start': 1002.519, 'duration': 2.182}, {'end': 1010.365, 'text': 'So when I ask what are the actions of the state, my actions are going to be stay or quit.', 'start': 1005.121, 'duration': 5.244}], 'summary': 'Mdp defined with states (in, end), start state (in), and actions (stay, quit).', 'duration': 26.858, 'max_score': 983.507, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co983507.jpg'}], 'start': 278.132, 'title': 'Decision making under uncertainty and expected utility', 'summary': 'Covers decision making under uncertainty in resource allocation, providing examples from agriculture and a volcano crossing scenario. it also delves into the concept of expected utility in decision-making, using a game example to illustrate. 
additionally, it introduces tools like markov decision processes and reinforcement learning for modeling and solving such problems.', 'chapters': [{'end': 674.051, 'start': 278.132, 'title': 'Decision making under uncertainty', 'summary': 'Discusses decision making under uncertainty in resource allocation, using examples such as agriculture and a volcano crossing scenario, and introduces markov decision processes and reinforcement learning as tools for modeling and solving such problems.', 'duration': 395.919, 'highlights': ['The chapter introduces the concept of decision making under uncertainty in resource allocation, using examples from agriculture and a volcano crossing scenario, where the uncertainty of weather and crop yield impact decision making (e.g., 10% chance of falling into the volcano affecting the decision to cross the island).', 'The lecturer explains the concept of Markov decision processes (MDP) as a tool for modeling decision-making problems under uncertainty, and discusses the value iteration algorithm for determining the best policy to take in such scenarios.', 'An example of decision-making in a game scenario is presented, where participants can choose to stay or quit, and the lecturer emphasizes the calculation of expected utility to determine the optimal policy for decision-making.']}, {'end': 1133.781, 'start': 674.572, 'title': 'Expected utility in decision making', 'summary': 'Discusses the concept of expected utility in decision-making, using an example of a game where staying yields an expected utility of 12, while quitting gives an obvious expected utility of $10. it also introduces the formalization of the problem using markov decision processes (mdp) and the key components such as states, actions, transition probabilities, and rewards.', 'duration': 459.209, 'highlights': ['The expected utility of staying in the game is 12, while quitting yields an obvious expected utility of $10. The chapter emphasizes that in this particular problem, the expected utility of staying is 12, which is the value obtained in expectation. This provides a clear comparison with the expected utility of quitting, which is $10.', 'Introduction of the formalization of the problem using Markov decision processes (MDP) and defining key components such as states, actions, transition probabilities, and rewards. The transcript introduces the formalization of the problem using Markov decision processes (MDP) and defines key components such as states, actions, transition probabilities, and rewards. 
This formalization enables a structured approach to decision-making in the given scenario.']}], 'duration': 855.649, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co278132.jpg', 'highlights': ['The chapter introduces decision making under uncertainty in resource allocation, using examples from agriculture and a volcano crossing scenario.', 'The lecturer explains Markov decision processes (MDP) as a tool for modeling decision-making problems under uncertainty.', 'An example of decision-making in a game scenario is presented, emphasizing the calculation of expected utility to determine the optimal policy.', 'The expected utility of staying in the game is 12, while quitting yields an obvious expected utility of $10.', 'Introduction of the formalization of the problem using Markov decision processes (MDP) and defining key components such as states, actions, transition probabilities, and rewards.']}, {'end': 1523.084, 'segs': [{'end': 1212.366, 'src': 'embed', 'start': 1149.231, 'weight': 0, 'content': [{'end': 1156.018, 'text': 'these Ts that- that basically term was the probability of starting in S, taking action A and ending up in S prime,', 'start': 1149.231, 'duration': 6.787}, {'end': 1157.579, 'text': 'and then the cost just became reward.', 'start': 1156.018, 'duration': 1.561}, {'end': 1164.804, 'text': 'Okay? So- so those are kind of the major differences between search and MDP, because things are- things are not deterministic here.', 'start': 1157.879, 'duration': 6.925}, {'end': 1167.527, 'text': 'Okay? All right.', 'start': 1165.445, 'duration': 2.082}, {'end': 1169.228, 'text': 'So- so that was the formalism.', 'start': 1167.627, 'duration': 1.601}, {'end': 1174.072, 'text': 'Now- now I can define any- any MDP model, uh, any Markov decision process.', 'start': 1169.308, 'duration': 4.764}, {'end': 1175.962, 'text': 'And then one thing.', 'start': 1175.222, 'duration': 0.74}, {'end': 1178.763, 'text': 'just one thing to point out is this transition probabilities.', 'start': 1175.962, 'duration': 2.801}, {'end': 1187.126, 'text': 'this T basically specifies the probability of ending up in in state S, prime if you take action A in state S.', 'start': 1178.763, 'duration': 8.363}, {'end': 1194.288, 'text': "So, so these are probabilities, right? So, so for example, again, like we have done this example, but let's just do it on the, on the slides again.", 'start': 1187.126, 'duration': 7.162}, {'end': 1198.609, 'text': "If I'm in state N, I take action quit, I end up in end.", 'start': 1194.608, 'duration': 4.001}, {'end': 1201.05, 'text': "What's the probability of that? One.", 'start': 1199.049, 'duration': 2.001}, {'end': 1207.705, 'text': "And then if I'm state n, I take action stay, I end up in state n again.", 'start': 1201.943, 'duration': 5.762}, {'end': 1212.366, 'text': "What's the probability of that? 
I ended up in n again, two-thirds.", 'start': 1207.845, 'duration': 4.521}], 'summary': 'Mdp models involve transition probabilities for actions and states.', 'duration': 63.135, 'max_score': 1149.231, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co1149231.jpg'}, {'end': 1300.98, 'src': 'embed', 'start': 1274.758, 'weight': 2, 'content': [{'end': 1279.261, 'text': "these new states that I'm gonna end up at the transition probabilities need to add up to 1, okay?", 'start': 1274.758, 'duration': 4.503}, {'end': 1286.026, 'text': "Because they're basically probabilities that tell me what are the-, what are the things that can happen if I take an action?", 'start': 1279.902, 'duration': 6.124}, {'end': 1294.253, 'text': 'okay?. And then, uh, this transition probabilities are going to be, uh, non-negative, because they are probabilities.', 'start': 1286.026, 'duration': 8.227}, {'end': 1299.317, 'text': "So that's also another property, okay? So usual things.", 'start': 1294.473, 'duration': 4.844}, {'end': 1300.98, 'text': 'All right.', 'start': 1300.6, 'duration': 0.38}], 'summary': 'Transition probabilities must add up to 1, non-negative, and indicate possible outcomes of actions.', 'duration': 26.222, 'max_score': 1274.758, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co1274758.jpg'}, {'end': 1362.324, 'src': 'embed', 'start': 1318.715, 'weight': 5, 'content': [{'end': 1326.401, 'text': 'I can either walk from state s to state s plus 1, or I can take the magic tram that takes me from state s to state 2s.', 'start': 1318.715, 'duration': 7.686}, {'end': 1333.369, 'text': 'If I walk, that costs one minutes, okay? means reward of that is minus 1.', 'start': 1326.821, 'duration': 6.548}, {'end': 1339.415, 'text': 'If I- if I take the tram that costs two minutes uh, that means that the reward of that is minus 2, okay?', 'start': 1333.369, 'duration': 6.046}, {'end': 1346.48, 'text': 'Uh. 
and then the question was how- like how do we want to travel from- from 1 to n in a- in the least amount of time?', 'start': 1340.539, 'duration': 5.941}, {'end': 1349.781, 'text': 'So- so nothing here is- is probabilistic yet right?', 'start': 1346.54, 'duration': 3.241}, {'end': 1356.443, 'text': "So I'm gonna add an extra thing here, which says the tram is going to fail with probability 0.5..", 'start': 1350.181, 'duration': 6.262}, {'end': 1362.324, 'text': "So I'm gonna decide maybe take- take a tram at some point and that tram can- can fail with probability 0.5.", 'start': 1356.443, 'duration': 5.881}], 'summary': 'Travel from state 1 to n in least time, with tram failure probability 0.5.', 'duration': 43.609, 'max_score': 1318.715, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co1318715.jpg'}, {'end': 1415.513, 'src': 'embed', 'start': 1386.35, 'weight': 6, 'content': [{'end': 1387.491, 'text': "We're gonna just copy that.", 'start': 1386.35, 'duration': 1.141}, {'end': 1391.792, 'text': 'So, all right.', 'start': 1389.051, 'duration': 2.741}, {'end': 1394.524, 'text': 'So this was what we had from last time.', 'start': 1392.622, 'duration': 1.902}, {'end': 1399.529, 'text': 'We had this transportation problem, and we had all these like algorithms to solve the search problem.', 'start': 1394.564, 'duration': 4.965}, {'end': 1401.631, 'text': "We don't really need them because we have a new problem.", 'start': 1399.569, 'duration': 2.062}, {'end': 1402.912, 'text': "So let's just get rid of them.", 'start': 1401.951, 'duration': 0.961}, {'end': 1406.556, 'text': 'And now I just wanna formalize an MDP.', 'start': 1403.933, 'duration': 2.623}, {'end': 1412.032, 'text': "So- so it's a transportation MDP, okay? 
The initialization looks okay.", 'start': 1406.916, 'duration': 5.116}, {'end': 1414.032, 'text': 'Start state looks okay.', 'start': 1412.552, 'duration': 1.48}, {'end': 1415.513, 'text': "I'm starting from 1.", 'start': 1414.092, 'duration': 1.421}], 'summary': 'Formalizing an mdp for a transportation problem.', 'duration': 29.163, 'max_score': 1386.35, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co1386350.jpg'}], 'start': 1134.001, 'title': 'Mdp formalism and transition probabilities', 'summary': 'Explains the transition to transition probabilities and rewards in markov decision processes, and discusses the probabilities of state transition, formalizing a transportation mdp with a tram problem and aiming to minimize travel time from state 1 to state n.', 'chapters': [{'end': 1187.126, 'start': 1134.001, 'title': 'Mdp formalism and transition probabilities', 'summary': 'Explains the transition from successor function and cost function to transition probabilities and rewards in markov decision processes (mdp), highlighting the shift from deterministic to probabilistic nature of the process.', 'duration': 53.125, 'highlights': ['The transition from successor function and cost function to transition probabilities and rewards is a major difference in MDP formalism, as it shifts from deterministic to probabilistic nature of the process.', 'Transition probabilities in MDP specify the probability of ending up in state S prime if action A is taken in state S.']}, {'end': 1299.317, 'start': 1187.126, 'title': 'Probability in state transition', 'summary': 'Discusses the probabilities of transitioning between different states in a system, with examples showing the probabilities adding up to 1 and the transition probabilities being non-negative.', 'duration': 112.191, 'highlights': ['The probabilities of transitioning to different states need to add up to 1, ensuring that all possible outcomes are accounted for.', 'Transition probabilities are non-negative, following the standard property of probabilities.', "When in state N and taking action 'stay', the probability of ending up in state N again is two-thirds, while taking action 'quit' results in a probability of one.", 'The probabilities of transitioning to different states need to add up to 1, ensuring that all possible outcomes are accounted for.', 'Transition probabilities are non-negative, following the standard property of probabilities.']}, {'end': 1523.084, 'start': 1300.6, 'title': 'Formalizing transportation mdp', 'summary': 'Discusses formalizing a transportation mdp to model a tram problem with two possible actions, walking and taking the tram, and adding a probability for the tram to fail with 0.5 chance, aiming to minimize travel time from state 1 to state n.', 'duration': 222.484, 'highlights': ['The chapter discusses formalizing a transportation MDP to model a tram problem with two possible actions, walking and taking the tram, and adding a probability for the tram to fail with 0.5 chance. The problem involves modeling a tram problem with two possible actions, walking and taking the tram, and introducing a 0.5 probability for the tram to fail.', 'Aiming to minimize travel time from state 1 to state n. The objective is to minimize travel time from state 1 to state n within the formalized transportation MDP.', 'Walking from state s to state s+1 costs one minute, while taking the tram costs two minutes. 
Walking from state s to state s+1 incurs a cost of one minute, while taking the tram costs two minutes.']}], 'duration': 389.083, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co1134001.jpg', 'highlights': ['The transition from successor function and cost function to transition probabilities and rewards is a major difference in MDP formalism, as it shifts from deterministic to probabilistic nature of the process.', 'Transition probabilities in MDP specify the probability of ending up in state S prime if action A is taken in state S.', 'The probabilities of transitioning to different states need to add up to 1, ensuring that all possible outcomes are accounted for.', 'Transition probabilities are non-negative, following the standard property of probabilities.', "When in state N and taking action 'stay', the probability of ending up in state N again is two-thirds, while taking action 'quit' results in a probability of one.", 'The chapter discusses formalizing a transportation MDP to model a tram problem with two possible actions, walking and taking the tram, and adding a probability for the tram to fail with 0.5 chance.', 'Aiming to minimize travel time from state 1 to state n. The objective is to minimize travel time from state 1 to state n within the formalized transportation MDP.', 'Walking from state s to state s+1 costs one minute, while taking the tram costs two minutes.']}, {'end': 2446.307, 'segs': [{'end': 1572.406, 'src': 'embed', 'start': 1523.985, 'weight': 0, 'content': [{'end': 1532.189, 'text': "What happens for action walk? what's new state I'm gonna end up at? Well, I'm gonna end up at S plus 1.", 'start': 1523.985, 'duration': 8.204}, {'end': 1536.21, 'text': "It's a deterministic action, so I'm gonna end up there with probability 1.", 'start': 1532.189, 'duration': 4.021}, {'end': 1542.691, 'text': "And what's the reward of that? 
Minus 1, because it's one minute cost, so it's minus 1 reward.", 'start': 1536.21, 'duration': 6.481}, {'end': 1548.013, 'text': 'Then for action tram, we kind of do the same thing, but we have two options here.', 'start': 1544.232, 'duration': 3.781}, {'end': 1550.921, 'text': 'I can- I can end up in 2S.', 'start': 1549.301, 'duration': 1.62}, {'end': 1552.042, 'text': "Tram doesn't fail.", 'start': 1551.241, 'duration': 0.801}, {'end': 1554.842, 'text': 'I end up in 2S with probability 0.5.', 'start': 1552.282, 'duration': 2.56}, {'end': 1558.463, 'text': 'That cost- that reward of that is minus 2.', 'start': 1554.842, 'duration': 3.621}, {'end': 1562.464, 'text': "Or the other option is I'm going to end up in state S because I didn't go anywhere.", 'start': 1558.463, 'duration': 4.001}, {'end': 1565.204, 'text': 'Because with probability 0.5, the tram did fail.', 'start': 1562.744, 'duration': 2.46}, {'end': 1569.565, 'text': 'And that- that- the reward of that is minus 2.', 'start': 1566.205, 'duration': 3.36}, {'end': 1570.285, 'text': "And that's pretty much it.", 'start': 1569.565, 'duration': 0.72}, {'end': 1572.406, 'text': 'That- that is my- my MDP.', 'start': 1570.425, 'duration': 1.981}], 'summary': 'In the mdp, action walk leads to state s plus 1 with a reward of -1, while action tram has two outcomes with probabilities 0.5 and rewards -2.', 'duration': 48.421, 'max_score': 1523.985, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co1523985.jpg'}, {'end': 2055.665, 'src': 'embed', 'start': 2023.287, 'weight': 2, 'content': [{'end': 2027.709, 'text': 'So, so the value of a policy is the expected utility of that policy.', 'start': 2023.287, 'duration': 4.422}, {'end': 2029.871, 'text': "and then that's not a random variable anymore.", 'start': 2028.129, 'duration': 1.742}, {'end': 2033.093, 'text': "that's actually like a number and I can- I can compute that number.", 'start': 2029.871, 'duration': 3.222}, {'end': 2037.257, 'text': 'I can compute that number for every state and then just play around with value.', 'start': 2033.093, 'duration': 4.164}, {'end': 2046.265, 'text': 'Okay? Does that policy mean defined for all possible states or a particular state? 
For all possible.', 'start': 2037.277, 'duration': 8.988}, {'end': 2055.665, 'text': 'So- so the question is yeah, so, when you say value of policy, is the policy basically telling me um, is a policy basically telling me uh, what,', 'start': 2046.325, 'duration': 9.34}], 'summary': 'The value of a policy is the expected utility, computed for every state.', 'duration': 32.378, 'max_score': 2023.287, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co2023287.jpg'}, {'end': 2381.089, 'src': 'embed', 'start': 2353.986, 'weight': 4, 'content': [{'end': 2358.429, 'text': "It actually discounts values in the future because future maybe doesn't have the same value as now.", 'start': 2353.986, 'duration': 4.443}, {'end': 2363.992, 'text': 'But, um, but we still value things in the future, like $4 is still something in the future.', 'start': 2358.889, 'duration': 5.103}, {'end': 2367.746, 'text': "And then that's where we pick like a gamma that's between 0 and 1.", 'start': 2364.505, 'duration': 3.241}, {'end': 2369.726, 'text': 'So, so that is kind of a design choice.', 'start': 2367.746, 'duration': 1.98}, {'end': 2372.687, 'text': "Like depending on what problem you're in, you might want to choose a different gamma.", 'start': 2369.766, 'duration': 2.921}, {'end': 2376.488, 'text': 'Question?. So, is discounting usually like that?', 'start': 2373.207, 'duration': 3.281}, {'end': 2381.089, 'text': 'Is it a assessment of risk or is there like a different way where you can assess how much risk you want to take??', 'start': 2376.548, 'duration': 4.541}], 'summary': 'Discounting values in the future, using gamma between 0 and 1, is a design choice depending on the problem.', 'duration': 27.103, 'max_score': 2353.986, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co2353986.jpg'}], 'start': 1523.985, 'title': 'Markov decision processes', 'summary': "Explains mdp state transitions and rewards, with 'walk' resulting in a deterministic transition to state s plus 1 with a reward of -1, and 'tram' offering two potential outcomes - either reaching state 2s with a reward of -2 or remaining in state s with a reward of -2. 
it also discusses formalization of mdp for a city with 10 blocks, mdp evaluation and utility, definition of utility, the concept of value, and the role of discount factor in determining future values in mdp.", 'chapters': [{'end': 1572.406, 'start': 1523.985, 'title': 'Mdp state transition and rewards', 'summary': "Explains the state transitions and rewards in a markov decision process (mdp), with action 'walk' resulting in a deterministic transition to state s plus 1 with a reward of minus 1, and action 'tram' offering two potential outcomes - either reaching state 2s with a reward of minus 2 or remaining in state s with a reward of minus 2 due to a tram failure.", 'duration': 48.421, 'highlights': ["The action 'walk' results in a deterministic transition to state S plus 1 with a reward of minus 1.", "The action 'tram' offers two potential outcomes - reaching state 2S with a reward of minus 2 or remaining in state S with a reward of minus 2 due to a tram failure."]}, {'end': 1881.66, 'start': 1573.466, 'title': 'Defining markov decision process', 'summary': 'Discusses the formalization of a markov decision process (mdp) for a city with 10 blocks, defining actions, successor functions, probabilities, and rewards, and the concept of policies as solutions.', 'duration': 308.194, 'highlights': ['The chapter discusses the formalization of a Markov Decision Process (MDP) for a city with 10 blocks, defining actions, successor functions, probabilities, and rewards. The transcript involves defining a Markov Decision Process (MDP) for a city with 10 blocks, discussing actions from states, successor functions, and evaluating probabilities and rewards for different actions.', 'Introducing the concept of policies as solutions for MDPs, where a policy is defined as a function mapping each state to an action. The concept of policies as solutions for Markov Decision Processes is introduced, where a policy is described as a function mapping each state to a potential action, emphasizing the probabilistic nature of MDP solutions.', 'Discussing the evaluation of policies to determine their effectiveness in solving MDPs, highlighting the intention to assess and compare policies. The discussion involves the intention to evaluate and compare policies to determine their effectiveness in solving Markov Decision Processes, indicating a focus on assessing the quality of policies.']}, {'end': 2446.307, 'start': 1881.68, 'title': 'Mdp evaluation and utility', 'summary': 'Discusses the evaluation of policies in markov decision processes (mdp), including the definition of utility, the concept of value, and the role of discount factor in determining future values in mdp, with emphasis on the expected utility as a key metric, and the design choice of the discount factor.', 'duration': 564.627, 'highlights': ["The expected utility of a policy is the key metric for evaluating policies in MDP, computed as the sum of rewards over random paths generated by the policy. The expected utility of a policy is the sum of rewards over random paths, providing a measure of the policy's effectiveness in achieving desirable outcomes.", 'The value of a policy, as the expected utility, becomes a non-random number that can be computed for every state, enabling optimization and comparison of policies. 
The value of a policy, as the expected utility, provides a non-random measure that can be computed for each state, facilitating policy optimization and comparison.', 'The discount factor, denoted as gamma, influences the balance between present and future rewards, with a gamma close to 1 indicating a focus on long-term goals, while a gamma of 0 signifies a myopic view with no consideration for future rewards. The discount factor, gamma, determines the trade-off between present and future rewards, with a value close to 1 emphasizing future rewards and 0 indicating a myopic perspective with no regard for the future.', 'The choice of gamma is a design decision based on the problem context, with common values such as 0.9 used for typical problems, and the selection of gamma tailored to the specific problem statement. The choice of the discount factor, gamma, is a design decision influenced by the problem context, with common values like 0.9 used for typical problems, and the selection tailored to the specific problem statement.']}], 'duration': 922.322, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co1523985.jpg', 'highlights': ["The action 'walk' results in a deterministic transition to state S plus 1 with a reward of minus 1.", "The action 'tram' offers two potential outcomes - reaching state 2S with a reward of minus 2 or remaining in state S with a reward of minus 2 due to a tram failure.", 'The expected utility of a policy is the key metric for evaluating policies in MDP, computed as the sum of rewards over random paths generated by the policy.', 'The value of a policy, as the expected utility, becomes a non-random number that can be computed for every state, enabling optimization and comparison of policies.', 'The discount factor, denoted as gamma, influences the balance between present and future rewards, with a gamma close to 1 indicating a focus on long-term goals, while a gamma of 0 signifies a myopic view with no consideration for future rewards.']}, {'end': 3569.588, 'segs': [{'end': 2620.873, 'src': 'embed', 'start': 2594.875, 'weight': 1, 'content': [{'end': 2605.749, 'text': "And I've defined this new function, this q function q, Pi of sa, which is just the expected utility from the chance node, okay?", 'start': 2594.875, 'duration': 10.874}, {'end': 2607.75, 'text': "So so we've talked about value.", 'start': 2605.909, 'duration': 1.841}, {'end': 2610.23, 'text': 'value is expected utility from my actual states.', 'start': 2607.75, 'duration': 2.48}, {'end': 2615.132, 'text': "I'm gonna talk about Q values as expected utilities from the chance nodes.", 'start': 2610.91, 'duration': 4.222}, {'end': 2620.873, 'text': "So, after you have committed that, you, you, you have taken action A and and you're following policy Pi,", 'start': 2615.392, 'duration': 5.481}], 'summary': 'Defining q function q, pi of sa as expected utility from chance nodes', 'duration': 25.998, 'max_score': 2594.875, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co2594875.jpg'}, {'end': 2948.114, 'src': 'embed', 'start': 2921.045, 'weight': 3, 'content': [{'end': 2924.607, 'text': 'That is just equal to sum of transition probabilities.', 'start': 2921.045, 'duration': 3.562}, {'end': 2934.839, 'text': "S A S prime over S primes times immediate reward that I'm going to get plus discounting v pi of S prime.", 'start': 2924.607, 'duration': 10.232}, {'end': 2936.921, 'text': 'Okay So this is kind of a 
recurrence that I have.', 'start': 2935.059, 'duration': 1.862}, {'end': 2939.324, 'text': 'I- I literally just combined these two and wrote it in green.', 'start': 2936.981, 'duration': 2.343}, {'end': 2941.747, 'text': "Okay? If you're not in an n state.", 'start': 2939.344, 'duration': 2.403}, {'end': 2944.53, 'text': "So if you're not in an n state, this is the recurrence I have.", 'start': 2941.967, 'duration': 2.563}, {'end': 2946.132, 'text': 'I have V pi here.', 'start': 2945.251, 'duration': 0.881}, {'end': 2948.114, 'text': 'I have V pi on this side too.', 'start': 2946.152, 'duration': 1.962}], 'summary': 'The recurrence equation involves transition probabilities, immediate rewards, and discounting in value iteration.', 'duration': 27.069, 'max_score': 2921.045, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co2921045.jpg'}, {'end': 3024.576, 'src': 'embed', 'start': 2993.531, 'weight': 2, 'content': [{'end': 2994.652, 'text': 'I wanna do policy evaluation.', 'start': 2993.531, 'duration': 1.121}, {'end': 2998.754, 'text': "When you're doing policy evaluation, you gotta compute that V-pi for all states.", 'start': 2995.152, 'duration': 3.602}, {'end': 3001.956, 'text': "So let's start with V-pi of n.", 'start': 2999.294, 'duration': 2.662}, {'end': 3006.727, 'text': 'Well, that is equal to 0 because we know V-pi at n state is just equal to 0.', 'start': 3001.956, 'duration': 4.771}, {'end': 3010.87, 'text': "Now I want to know what's V pi of, of n, okay? State n.", 'start': 3006.727, 'duration': 4.143}, {'end': 3021.598, 'text': "What is that equal to? That's just equal to Q pi of n and state, right? V pi is just equal to Q pi of n and state.", 'start': 3010.87, 'duration': 10.728}, {'end': 3024.576, 'text': "So I'm gonna replace that.", 'start': 3022.955, 'duration': 1.621}], 'summary': 'Policy evaluation involves computing v-pi for all states, starting with v-pi of n equal to 0.', 'duration': 31.045, 'max_score': 2993.531, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co2993531.jpg'}, {'end': 3093.708, 'src': 'heatmap', 'start': 3040.825, 'weight': 0.951, 'content': [{'end': 3044.207, 'text': 'Okay? So, so that is just that sum that we have there.', 'start': 3040.825, 'duration': 3.382}, {'end': 3047.649, 'text': 'All right? V pi of n is 0.', 'start': 3044.507, 'duration': 3.142}, {'end': 3048.95, 'text': 'So let me just put that 0 there.', 'start': 3047.649, 'duration': 1.301}, {'end': 3050.716, 'text': "I'm gonna put 0 there.", 'start': 3049.875, 'duration': 0.841}, {'end': 3056.801, 'text': 'I only have one state here too, right? 
So I just have this as a function of this one state n.', 'start': 3051.256, 'duration': 5.545}, {'end': 3058.102, 'text': 'So I have an- an equation.', 'start': 3056.801, 'duration': 1.301}, {'end': 3061.405, 'text': 'I can find the closed-form solution of V pi of n.', 'start': 3058.222, 'duration': 3.183}, {'end': 3068.703, 'text': "I'm just gonna move things around a little bit, and then I'll find out that V pi of n is just equal to 12.", 'start': 3061.405, 'duration': 7.298}, {'end': 3071.125, 'text': "So, so that's how you get that 12 that I've been talking about.", 'start': 3068.703, 'duration': 2.422}, {'end': 3077.089, 'text': 'So so you just found that that if you tell me the policy to follow a state, if that is the policy,', 'start': 3071.165, 'duration': 5.924}, {'end': 3080.251, 'text': 'then the value of that policy from state in is equal to 12..', 'start': 3077.089, 'duration': 3.162}, {'end': 3086.963, 'text': 'Do you think that somebody always chooses the same? So are you always choosing a state? Yeah.', 'start': 3080.251, 'duration': 6.712}, {'end': 3088.484, 'text': 'So, so the policy is a function of state.', 'start': 3087.024, 'duration': 1.46}, {'end': 3093.708, 'text': "I only have this one state that's interesting here, right? That, that's one state is n.", 'start': 3088.865, 'duration': 4.843}], 'summary': 'The closed-form solution of v pi of n is 12, based on the policy for a single state.', 'duration': 52.883, 'max_score': 3040.825, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3040825.jpg'}, {'end': 3141.755, 'src': 'embed', 'start': 3112.023, 'weight': 4, 'content': [{'end': 3114.784, 'text': 'So, so here, like in the previous example, it was kind of simple.', 'start': 3112.023, 'duration': 2.761}, {'end': 3116.824, 'text': 'I just solved the closed-form solution.', 'start': 3115.104, 'duration': 1.72}, {'end': 3123.006, 'text': 'But in, in reality, like you might have different states and then it might be a little bit more complicated.', 'start': 3117.144, 'duration': 5.862}, {'end': 3127.767, 'text': 'So we can actually have an iterative algorithm that allows us to find these V pies.', 'start': 3123.346, 'duration': 4.421}, {'end': 3135.649, 'text': 'So the way we do that is we start with the values, uh, for all states to be equal to 0.', 'start': 3128.247, 'duration': 7.402}, {'end': 3139.133, 'text': 'And, and this 0 that I put here is the first iteration.', 'start': 3135.649, 'duration': 3.484}, {'end': 3141.755, 'text': "So, so I'm going to count my iterations here.", 'start': 3139.193, 'duration': 2.562}], 'summary': 'Iterative algorithm finds v pies for different states, starting with values for all states equal to 0.', 'duration': 29.732, 'max_score': 3112.023, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3112023.jpg'}, {'end': 3271.119, 'src': 'embed', 'start': 3248.588, 'weight': 5, 'content': [{'end': 3256.354, 'text': "So, so if the difference is below some threshold, you can kind of call it, call it done and then say, well, I've, I've found the right values.", 'start': 3248.588, 'duration': 7.766}, {'end': 3263.916, 'text': 'And then in this case, we are basically looking at the difference between value at iteration t versus value at iteration t minus 1.', 'start': 3256.974, 'duration': 6.942}, {'end': 3270.258, 'text': 'And then we are taking the max of that over all possible states because I want the values to be close for all 
states.', 'start': 3263.916, 'duration': 6.342}, {'end': 3271.119, 'text': 'Okay Yes.', 'start': 3270.718, 'duration': 0.401}], 'summary': 'Iteratively compare values at different iterations to find right values.', 'duration': 22.531, 'max_score': 3248.588, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3248588.jpg'}, {'end': 3355.053, 'src': 'embed', 'start': 3333.376, 'weight': 6, 'content': [{'end': 3341.763, 'text': "Because I'm iterating over t times steps and I'm iterating over all my states and I'm summing over all s primes, right?", 'start': 3333.376, 'duration': 8.387}, {'end': 3344.104, 'text': "So- so because of that, that's a complexity I get.", 'start': 3341.783, 'duration': 2.321}, {'end': 3349.048, 'text': "And one thing to notice here is it doesn't depend on actions, right? It doesn't depend on the size of actions.", 'start': 3344.445, 'duration': 4.603}, {'end': 3353.492, 'text': "And the reason it doesn't depend on the size of actions is you have given me the policy.", 'start': 3349.789, 'duration': 3.703}, {'end': 3355.053, 'text': "You're telling me follow this policy.", 'start': 3353.512, 'duration': 1.541}], 'summary': "Iterating over t time steps and all states, complexity doesn't depend on actions.", 'duration': 21.677, 'max_score': 3333.376, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3333376.jpg'}, {'end': 3455.397, 'src': 'embed', 'start': 3425.838, 'weight': 8, 'content': [{'end': 3430.021, 'text': 'So just imagine the size of S.', 'start': 3425.838, 'duration': 4.183}, {'end': 3431.181, 'text': 'Okay All right.', 'start': 3430.021, 'duration': 1.16}, {'end': 3435.386, 'text': 'So summary so far, where are we? So we have talked about MDPs.', 'start': 3431.201, 'duration': 4.185}, {'end': 3440.467, 'text': 'These are graphs with states and chance nodes and transition probabilities and rewards.', 'start': 3435.426, 'duration': 5.041}, {'end': 3448.23, 'text': 'And we have talked about policy as the solution to an MDP, which is this function that takes a state and gives us an action.', 'start': 3441.228, 'duration': 7.002}, {'end': 3451.154, 'text': 'Okay? We talked about the value of a policy.', 'start': 3448.772, 'duration': 2.382}, {'end': 3455.397, 'text': 'So value of a policy is the expected utility of, of that policy.', 'start': 3451.294, 'duration': 4.103}], 'summary': 'Mdps are graphs with states, chance nodes, transition probabilities, and rewards. policy is a function that gives an action for a state, and the value of a policy is the expected utility.', 'duration': 29.559, 'max_score': 3425.838, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3425838.jpg'}, {'end': 3491.968, 'src': 'embed', 'start': 3468.86, 'weight': 0, 'content': [{'end': 3476.703, 'text': "And so far we've talked about this idea of policy evaluation, which is just an iterative algorithm to compute what's the value of a state,", 'start': 3468.86, 'duration': 7.843}, {'end': 3477.883, 'text': 'if you give me some policy.', 'start': 3476.703, 'duration': 1.18}, {'end': 3483.985, 'text': "Like how good is that policy? What's the value I'm gonna get at every state? Okay? 
All right.", 'start': 3478.003, 'duration': 5.982}, {'end': 3484.385, 'text': 'So, okay.', 'start': 3484.185, 'duration': 0.2}, {'end': 3487.726, 'text': 'That has been all assuming you give me the policy.', 'start': 3485.206, 'duration': 2.52}, {'end': 3491.968, 'text': 'Now, the thing I wanna spend a little bit of time on is, is figuring out how to find that policy.', 'start': 3487.787, 'duration': 4.181}], 'summary': 'Policy evaluation computes state value; now focus on finding the policy.', 'duration': 23.108, 'max_score': 3468.86, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3468860.jpg'}, {'end': 3579.255, 'src': 'embed', 'start': 3552.667, 'weight': 7, 'content': [{'end': 3558.009, 'text': 'like when we have transition probabilities that are not known or reward functions that are not known and how we go about learning them.', 'start': 3552.667, 'duration': 5.342}, {'end': 3559.93, 'text': 'And that- that would be the reinforcement learning lecture.', 'start': 3558.049, 'duration': 1.881}, {'end': 3561.53, 'text': 'So next lecture I might address some of this.', 'start': 3559.95, 'duration': 1.58}, {'end': 3563.223, 'text': 'Okay All right.', 'start': 3562.042, 'duration': 1.181}, {'end': 3564.784, 'text': "So let's talk about value iteration.", 'start': 3563.243, 'duration': 1.541}, {'end': 3566.625, 'text': 'So, so that was policy evaluation.', 'start': 3564.824, 'duration': 1.801}, {'end': 3569.588, 'text': 'So like that whole thing was policy evaluation.', 'start': 3566.665, 'duration': 2.923}, {'end': 3579.255, 'text': 'So now what I would like to do is I want to try to get the maximum expected utility and find the set of policies that gets me the maximum expected utility,', 'start': 3569.968, 'duration': 9.287}], 'summary': 'Lecture on reinforcement learning and value iteration for finding maximum expected utility.', 'duration': 26.588, 'max_score': 3552.667, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3552667.jpg'}], 'start': 2447.027, 'title': 'Mdp policy evaluation', 'summary': 'Explains value and policy evaluation in a markov decision process, focusing on expected utility, q values, v pi and q pi computation, iterative algorithm, and convergence criteria.', 'chapters': [{'end': 2709.273, 'start': 2447.027, 'title': 'Markov decision process and policy evaluation', 'summary': 'Explains the concept of value and policy evaluation in a markov decision process, emphasizing the expected utility and q values from states and chance nodes, with a focus on evaluating policies.', 'duration': 262.246, 'highlights': ['The concept of value and policy evaluation in a Markov decision process The chapter discusses the value of a policy as the expected utility received by following the policy from a given state, highlighting the importance of evaluating policies in a Markov decision process.', 'Expected utility and Q values from states and chance nodes It explains the expected utility as the value of a state and Q values as the expected utilities from chance nodes, providing insight into the evaluation of different states and chance nodes in the process.', 'Focus on evaluating policies The chapter emphasizes the significance of evaluating policies by calculating their value and expected utility, highlighting the importance of determining the effectiveness of different policies in a Markov decision process.']}, {'end': 2965.016, 'start': 2709.813, 'title': 'Value iteration 
in policy evaluation', 'summary': 'Discusses the process of evaluating a policy through the computation of v pi and q pi values, where v pi of s is equal to the sum of transition probabilities of reaching different states multiplied by the immediate rewards and discounted future values.', 'duration': 255.203, 'highlights': ['The computation of V pi and Q pi values is essential for evaluating the effectiveness of a given policy, as it involves assessing the expected utility and rewards based on transition probabilities and immediate rewards.', 'The recurrence formula for V pi of S, when not in an end state, involves the sum of transition probabilities of reaching different states multiplied by the immediate rewards and discounted future values, providing a basis for iterative computation or finding closed-form solutions.']}, {'end': 3291.987, 'start': 2965.036, 'title': 'Policy evaluation and iterative algorithm', 'summary': 'Discusses policy evaluation, computing v-pi for all states, an iterative algorithm for finding v-pi, and determining convergence through value difference, with an emphasis on finding the true value and determining algorithm stopping point.', 'duration': 326.951, 'highlights': ['The value of policy pi for state n is evaluated to be 12, illustrating the outcome of following a specific policy from a state.', 'An iterative algorithm is used to compute V-pi for all states, starting with values initialized to 0 and updating based on the previous timestamp, allowing for convergence to the true value.', 'Determining convergence in the iterative algorithm involves tracking the difference between values at different iterations, with the decision to stop based on the difference being below a certain threshold, considering the properties of the MDP.', 'The chapter also emphasizes the relevance of the policy to the state, the need for a closed-form solution for V-pi, and the heuristic for deciding the stopping point of the algorithm based on the value difference over all possible states.']}, {'end': 3569.588, 'start': 3292.007, 'title': 'Mdps and policy evaluation', 'summary': "Discusses markov decision processes, policy evaluation using an iterative algorithm to compute the value of a state for a given policy, with a complexity of o(t * s * s'), and the upcoming lecture on reinforcement learning addressing unknown settings.", 'duration': 277.581, 'highlights': ["The complexity for policy evaluation is O(t * s * s') as it iterates over t time steps, all states, and sums over all s prime.", 'The value of a policy is the expected utility, and policy evaluation computes the value of a state for a given policy to determine its effectiveness.', 'The upcoming lecture will address reinforcement learning and unknown settings with transition probabilities and reward functions that are not known.', 'The chapter discusses MDPs, the solution to an MDP as a policy, and the concept of value iteration for policy evaluation.']}], 'duration': 1122.561, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co2447027.jpg', 'highlights': ['The concept of value and policy evaluation in a Markov decision process, emphasizing the importance of evaluating policies', 'Expected utility and Q values from states and chance nodes, providing insight into the evaluation of different states and chance nodes', 'The computation of V pi and Q pi values for evaluating policy effectiveness based on transition probabilities and immediate rewards', 'The recurrence formula 
for V pi of S, involving the sum of transition probabilities and discounted future values', 'The iterative algorithm used to compute V-pi for all states, allowing for convergence to the true value', 'Determining convergence in the iterative algorithm based on the difference being below a certain threshold', "The complexity for policy evaluation is O(t * s * s') as it iterates over t time steps, all states, and sums over all s prime", 'The upcoming lecture will address reinforcement learning and unknown settings with transition probabilities and reward functions', 'The chapter discusses MDPs, the solution to an MDP as a policy, and the concept of value iteration for policy evaluation']}, {'end': 4407.499, 'segs': [{'end': 3591.384, 'src': 'embed', 'start': 3569.968, 'weight': 5, 'content': [{'end': 3579.255, 'text': 'So now what I would like to do is I want to try to get the maximum expected utility and find the set of policies that gets me the maximum expected utility,', 'start': 3569.968, 'duration': 9.287}, {'end': 3584.097, 'text': "okay?. So to do that, I'm gonna define this thing that's called an optimal value.", 'start': 3579.255, 'duration': 4.842}, {'end': 3591.384, 'text': 'So instead of value of a particular policy, I just want to be opt of S, which is the maximum value attained by any policy.', 'start': 3584.317, 'duration': 7.067}], 'summary': 'Seeking maximum expected utility by defining optimal value for policies.', 'duration': 21.416, 'max_score': 3569.968, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3569968.jpg'}, {'end': 3737.41, 'src': 'embed', 'start': 3710.882, 'weight': 1, 'content': [{'end': 3723.798, 'text': 'What is that equal to? What was Q pi? Q pi was just sum of transition probabilities times rewards, right? So, so what is Q opt? Yeah.', 'start': 3710.882, 'duration': 12.916}, {'end': 3728.422, 'text': "So, so it would just be basically this equation except for I'm gonna replace v-pi with v-opt.", 'start': 3723.818, 'duration': 4.604}, {'end': 3732.966, 'text': 'So, so from q-opt, I can end up anywhere like based on the transition probability.', 'start': 3728.802, 'duration': 4.164}, {'end': 3737.41, 'text': "So I'm gonna sum up over S primes and all possible places that I can end up at.", 'start': 3732.986, 'duration': 4.424}], 'summary': 'Exploring the q-opt equation in reinforcement learning with transition probabilities and potential outcomes.', 'duration': 26.528, 'max_score': 3710.882, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3710882.jpg'}, {'end': 3890.054, 'src': 'embed', 'start': 3863.414, 'weight': 0, 'content': [{'end': 3868.816, 'text': 'With that policy, I was able to compute v, I was able to compute q, I was able to write this recurrence.', 'start': 3863.414, 'duration': 5.402}, {'end': 3871.538, 'text': 'Then I had an iterative algorithm to do things.', 'start': 3869.297, 'duration': 2.241}, {'end': 3873.939, 'text': 'This is called value iteration.', 'start': 3871.998, 'duration': 1.941}, {'end': 3877.801, 'text': 'This is to find the right policy iteration.', 'start': 3875.099, 'duration': 2.702}, {'end': 3880.282, 'text': 'This is to find the policy.', 'start': 3879.061, 'duration': 1.221}, {'end': 3885.244, 'text': "How do I do that? 
Well, I have a value that's for the optimal, optimal value that I can get.", 'start': 3880.362, 'duration': 4.882}, {'end': 3890.054, 'text': "and it's going to be maximum over all possible actions I can take of the Q values.", 'start': 3885.731, 'duration': 4.323}], 'summary': 'Describing value iteration and policy iteration with q values.', 'duration': 26.64, 'max_score': 3863.414, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3863414.jpg'}, {'end': 4311.99, 'src': 'embed', 'start': 4266.885, 'weight': 2, 'content': [{'end': 4270.267, 'text': "If you pick different fail probabilities, you're gonna get different policies.", 'start': 4266.885, 'duration': 3.382}, {'end': 4275.289, 'text': 'So, for example, if you pick a fail probability that is large, then probably,', 'start': 4270.527, 'duration': 4.762}, {'end': 4280.432, 'text': 'like the policy is going to be just- just walk and never take the tram because the tram is failing all the time.', 'start': 4275.289, 'duration': 5.143}, {'end': 4288.064, 'text': "But if- if you decide to take, uh a pro- failed probability that's close to 0, then- then this is your optimal policy,", 'start': 4280.919, 'duration': 7.145}, {'end': 4289.345, 'text': 'which is close to the search problem.', 'start': 4288.064, 'duration': 1.281}, {'end': 4291.846, 'text': "So it's basically the solution to the search problem.", 'start': 4289.725, 'duration': 2.121}, {'end': 4293.868, 'text': 'So play around with this.', 'start': 4292.267, 'duration': 1.601}, {'end': 4294.849, 'text': 'The code is online.', 'start': 4293.988, 'duration': 0.861}, {'end': 4306.877, 'text': "This was just value iteration, um, value iteration, um, use- on this tram problem, okay? So I'm gonna skip this one too.", 'start': 4296.93, 'duration': 9.947}, {'end': 4308.578, 'text': 'All right.', 'start': 4308.218, 'duration': 0.36}, {'end': 4311.99, 'text': 'So, um, Yeah.', 'start': 4308.738, 'duration': 3.252}], 'summary': 'Different fail probabilities lead to different policies and optimal solutions.', 'duration': 45.105, 'max_score': 4266.885, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co4266885.jpg'}, {'end': 4380.568, 'src': 'embed', 'start': 4352.804, 'weight': 3, 'content': [{'end': 4356.585, 'text': "And if you go even like higher iterations after that point, it's just fine-tuning.", 'start': 4352.804, 'duration': 3.781}, {'end': 4358.626, 'text': 'So the values are around 13 still.', 'start': 4356.745, 'duration': 1.881}, {'end': 4361.467, 'text': 'So you can play around with the volcano problem.', 'start': 4359.326, 'duration': 2.141}, {'end': 4372.172, 'text': 'Okay So when does this converge? 
So, um, If the discount factor is less than 1 or your MDP graph is acyclic, then this is going to converge.', 'start': 4362.247, 'duration': 9.925}, {'end': 4374.973, 'text': "So if MDP graph is acyclic, that's kind of obvious.", 'start': 4372.252, 'duration': 2.721}, {'end': 4378.015, 'text': "You're just doing dynamic programming over your full thing.", 'start': 4375.053, 'duration': 2.962}, {'end': 4380.568, 'text': "So, so that's going to, that's going to convert.", 'start': 4378.075, 'duration': 2.493}], 'summary': 'Mdp graph converges with discount factor less than 1 and acyclic graph.', 'duration': 27.764, 'max_score': 4352.804, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co4352804.jpg'}], 'start': 3569.968, 'title': 'Markov decision processes and value iteration', 'summary': 'Discusses finding optimal policies in markov decision processes through computation of maximum expected utility and iterative algorithm value iteration, as well as the implementation of value iteration on a probabilistic tram problem showcasing convergence to the optimal policy and the impact of fail probability.', 'chapters': [{'end': 3977.627, 'start': 3569.968, 'title': 'Finding optimal policies in markov decision processes', 'summary': 'Discusses the concept of finding the optimal policy in markov decision processes through the computation of the maximum expected utility and the iterative algorithm value iteration, which involves the calculation of optimal values v-opt and q-opt.', 'duration': 407.659, 'highlights': ['The chapter introduces the concept of finding the optimal policy in Markov Decision Processes through the computation of the maximum expected utility. The chapter discusses the concept of finding the optimal policy in Markov Decision Processes through the computation of the maximum expected utility and the iterative algorithm value iteration.', 'The iterative algorithm value iteration involves the calculation of optimal values V-opt and Q-opt. The iterative algorithm value iteration involves the calculation of optimal values V-opt and Q-opt, which are used to determine the maximum value attained by any policy and the optimal policy.', 'The calculation of Q-opt involves summing up over S primes and all possible places that can be reached, considering the immediate reward and discounting the future value. The calculation of Q-opt involves summing up over S primes and all possible places that can be reached, considering the immediate reward and discounting the future value, leading to a time complexity of order of T times S times A times S prime.']}, {'end': 4407.499, 'start': 3977.687, 'title': 'Value iteration for tram problem', 'summary': 'Explains the implementation of value iteration algorithm on a probabilistic tram problem, showcasing how it converges to the optimal policy over multiple iterations and the impact of fail probability on the policy, with the convergence criteria stated.', 'duration': 429.812, 'highlights': ['The value iteration algorithm is implemented on a probabilistic tram problem to obtain the optimal policy. The chapter discusses coding the value iteration algorithm for the tram problem to derive the best optimal policy.', 'The impact of fail probability on the optimal policy is demonstrated, showing how different fail probabilities result in different optimal policies. 
The chapter highlights how varying fail probabilities lead to distinct optimal policies, influencing the decision to take the tram or walk in the problem scenario.', 'The convergence of the value iteration algorithm is illustrated, indicating that convergence is achieved when the discount factor is less than 1 or the MDP graph is acyclic. The convergence criteria for the value iteration algorithm are explained, stating that convergence occurs when the discount factor is less than 1 or when the MDP graph is acyclic.']}], 'duration': 837.531, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co3569968.jpg', 'highlights': ['The iterative algorithm value iteration involves the calculation of optimal values V-opt and Q-opt, which are used to determine the maximum value attained by any policy and the optimal policy.', 'The calculation of Q-opt involves summing up over S primes and all possible places that can be reached, considering the immediate reward and discounting the future value, leading to a time complexity of order of T times S times A times S prime.', 'The chapter discusses coding the value iteration algorithm for the tram problem to derive the best optimal policy.', 'The convergence criteria for the value iteration algorithm are explained, stating that convergence occurs when the discount factor is less than 1 or when the MDP graph is acyclic.', 'The impact of fail probability on the optimal policy is demonstrated, showing how different fail probabilities result in different optimal policies.', 'The chapter introduces the concept of finding the optimal policy in Markov Decision Processes through the computation of the maximum expected utility.']}, {'end': 4979.824, 'segs': [{'end': 4474.609, 'src': 'embed', 'start': 4407.539, 'weight': 0, 'content': [{'end': 4413.203, 'text': "So- so just a good rule of thumb is pick a Gamma that's less than 1, then- then you kind of get this convergence property.", 'start': 4407.539, 'duration': 5.664}, {'end': 4416.365, 'text': 'Okay? 
All right.', 'start': 4413.223, 'duration': 3.142}, {'end': 4419.607, 'text': 'So yeah, summary so far is we have MDPs.', 'start': 4416.986, 'duration': 2.621}, {'end': 4425.849, 'text': "Now we've talked about, um, finding policies, uh, rather than path.", 'start': 4420.347, 'duration': 5.502}, {'end': 4430.151, 'text': 'Policy evaluation is just a way of computing like how good a policy is.', 'start': 4426.289, 'duration': 3.862}, {'end': 4436.653, 'text': "And the reason I talked about policy evaluation is there's this other algorithm called policy iteration, which uses policy evaluation.", 'start': 4430.611, 'duration': 6.042}, {'end': 4438.394, 'text': "And we didn't discuss that in the class.", 'start': 4436.994, 'duration': 1.4}, {'end': 4444.236, 'text': "Uh, but it's kind of like equi- not equivalent, but you could use it in a similar manner as value iteration.", 'start': 4438.854, 'duration': 5.382}, {'end': 4445.377, 'text': 'It has its pros and cons.', 'start': 4444.276, 'duration': 1.101}, {'end': 4448.378, 'text': 'Uh, so policy evaluation u- is used in those settings.', 'start': 4445.917, 'duration': 2.461}, {'end': 4449.759, 'text': 'Do not leave, please.', 'start': 4448.779, 'duration': 0.98}, {'end': 4451.34, 'text': 'We have more stuff to cover.', 'start': 4449.779, 'duration': 1.561}, {'end': 4459.444, 'text': 'And then, uh, we have value iteration, uh, which, uh, computes this optimal value, which is the maximum expected utility.', 'start': 4452.34, 'duration': 7.104}, {'end': 4464.447, 'text': "Okay? And next time, we are going to talk about reinforcement learning, and that's gonna be awesome.", 'start': 4459.784, 'duration': 4.663}, {'end': 4466.588, 'text': "So we'll talk about unknown rewards.", 'start': 4464.587, 'duration': 2.001}, {'end': 4468.339, 'text': 'All right.', 'start': 4467.958, 'duration': 0.381}, {'end': 4474.609, 'text': 'So that was MDPs doing inference and, and, uh, kind of defining them.', 'start': 4469.661, 'duration': 4.948}], 'summary': "Gamma less than 1 yields convergence. policy evaluation for computing policy's quality. 
next: reinforcement learning.", 'duration': 67.07, 'max_score': 4407.539, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co4407539.jpg'}, {'end': 4617.538, 'src': 'embed', 'start': 4588.582, 'weight': 3, 'content': [{'end': 4593.168, 'text': 'Another example is when you remove constraints, you have an easier search problem.', 'start': 4588.582, 'duration': 4.586}, {'end': 4596.592, 'text': "So you don't have closed form solutions, but you have an easier search problem.", 'start': 4593.288, 'duration': 3.304}, {'end': 4600.958, 'text': 'So you might have a really difficult search problem with a bunch of constraints that are hard to do.', 'start': 4596.632, 'duration': 4.326}, {'end': 4602.319, 'text': 'Remove the constraints.', 'start': 4601.458, 'duration': 0.861}, {'end': 4605.744, 'text': 'So when you remove the constraints, you have a relaxed problem.', 'start': 4602.74, 'duration': 3.004}, {'end': 4610.089, 'text': "which is just the original problem without the constraint, that's a search problem.", 'start': 4606.144, 'duration': 3.945}, {'end': 4617.538, 'text': 'You can solve that search problem using uniform cost search or dynamic programming, and- and solving that allows you to find the heuristic.', 'start': 4610.169, 'duration': 7.369}], 'summary': 'Removing constraints simplifies search problem, enabling easier solution with uniform cost search or dynamic programming.', 'duration': 28.956, 'max_score': 4588.582, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co4588582.jpg'}, {'end': 4976.098, 'src': 'heatmap', 'start': 4936.742, 'weight': 1, 'content': [{'end': 4939.083, 'text': 'In the example that we have shown in a tram example.', 'start': 4936.742, 'duration': 2.341}, {'end': 4944.425, 'text': "I don't think we are able to get two paths that look like this because of the nature of the example.", 'start': 4939.083, 'duration': 5.342}, {'end': 4948.889, 'text': 'So So, in general, things to remember from structure perceptor on this.', 'start': 4944.746, 'duration': 4.143}, {'end': 4949.849, 'text': 'it does converge.', 'start': 4948.889, 'duration': 0.96}, {'end': 4957.275, 'text': "it does converge in a way that it can recover the true Ys, but it doesn't necessarily get the exact Ws as we saw last time, right?", 'start': 4949.849, 'duration': 7.426}, {'end': 4962.68, 'text': 'Like you might get 2 and 4, you might get 4 and 8, like as long as you have the same relationships, that is, that is enough.', 'start': 4957.295, 'duration': 5.385}, {'end': 4966.343, 'text': 'But, but you are going to be able to get the actual Ys and it does converge.', 'start': 4962.76, 'duration': 3.583}, {'end': 4971.25, 'text': 'So with that, um, project conversation is going to be next time.', 'start': 4966.923, 'duration': 4.327}, {'end': 4974.075, 'text': 'Do take a look at- do take a look at the website.', 'start': 4971.571, 'duration': 2.504}, {'end': 4976.098, 'text': 'So all the information on the project is on the website.', 'start': 4974.095, 'duration': 2.003}], 'summary': 'Structure perceptron converges to recover true ys but not exact ws, allowing for same relationships. 
project details on website.', 'duration': 39.356, 'max_score': 4936.742, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co4936742.jpg'}], 'start': 4407.539, 'title': 'Mdps, policies, and search', 'summary': 'Covers mdps, policy evaluation, policy iteration, and value iteration for finding policies and computing optimal values, as well as relaxation techniques in search problems, illustrating trade-offs between efficiency and tightness in heuristic estimation.', 'chapters': [{'end': 4474.609, 'start': 4407.539, 'title': 'Mdps, policies, and value iteration', 'summary': 'Covers markov decision processes, policy evaluation, policy iteration, and value iteration, with a focus on finding policies and computing optimal values.', 'duration': 67.07, 'highlights': ['The chapter introduces Markov Decision Processes (MDPs) and discusses the importance of selecting a Gamma value less than 1 for convergence.', 'Policy evaluation is explained as a method for computing the effectiveness of a policy, which is crucial for algorithms like policy iteration and value iteration.', 'The class briefly mentions policy iteration and its relationship with value iteration, highlighting its pros and cons and its use in certain settings.', 'Value iteration is described as a method for computing the optimal value, which represents the maximum expected utility in a given scenario.', 'The upcoming topic of reinforcement learning, focusing on unknown rewards, is teased as an exciting area for discussion in the next session.']}, {'end': 4979.824, 'start': 4474.93, 'title': 'Relaxation techniques for search problems', 'summary': 'Discusses the concept of relaxation in search problems, including examples of removing constraints to solve easier problems and finding consistent heuristics, emphasizing the trade-off between efficiency and tightness in heuristic estimation.', 'duration': 504.894, 'highlights': ['Relaxation technique involves removing constraints to solve easier problems, such as removing walls to compute Manhattan distance as a heuristic, or solving relaxed search problems using uniform cost search or dynamic programming to find heuristics. ', 'The future cost of a relaxed problem can be used as a heuristic, which is also consistent, providing a trade-off between the efficiency of finding heuristics and the tightness of the estimates. ', 'Structured perceptron converges and can recover the true Ys, but may not necessarily get the exact Ws, with a reminder about the upcoming project conversation. 
']}], 'duration': 572.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/9g32v7bK3Co/pics/9g32v7bK3Co4407539.jpg', 'highlights': ['The chapter introduces Markov Decision Processes (MDPs) and discusses the importance of selecting a Gamma value less than 1 for convergence.', 'Policy evaluation is explained as a method for computing the effectiveness of a policy, crucial for algorithms like policy iteration and value iteration.', 'Value iteration is described as a method for computing the optimal value, representing the maximum expected utility in a given scenario.', 'Relaxation technique involves removing constraints to solve easier problems, providing a trade-off between the efficiency of finding heuristics and the tightness of the estimates.']}], 'highlights': ['Series covers mdps, decision making under uncertainty, expected utility, mdp formalism, value iteration, and policy evaluation, providing insights into their applications in transportation and robotics.', 'The iterative algorithm value iteration involves the calculation of optimal values V-opt and Q-opt, which are used to determine the maximum value attained by any policy and the optimal policy.', 'The chapter introduces decision making under uncertainty in resource allocation, using examples from agriculture and a volcano crossing scenario.', 'The transition from successor function and cost function to transition probabilities and rewards is a major difference in MDP formalism, as it shifts from deterministic to probabilistic nature of the process.', 'The concept of value and policy evaluation in a Markov decision process, emphasizing the importance of evaluating policies', "The action 'walk' results in a deterministic transition to state S plus 1 with a reward of minus 1.", 'The expected utility of a policy is the key metric for evaluating policies in MDP, computed as the sum of rewards over random paths generated by the policy.', 'The discount factor, denoted as gamma, influences the balance between present and future rewards, with a gamma close to 1 indicating a focus on long-term goals, while a gamma of 0 signifies a myopic view with no consideration for future rewards.', 'The chapter discusses formalizing a transportation MDP to model a tram problem with two possible actions, walking and taking the tram, and adding a probability for the tram to fail with 0.5 chance.', 'The chapter emphasizes the need to consider randomness in decision-making, applicable in real-life scenarios like resource allocation and robotics.', 'The probabilities of transitioning to different states need to add up to 1, ensuring that all possible outcomes are accounted for.']}
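
The highlights above define the value of a policy as its expected utility: the sum of discounted rewards over the random paths the policy generates. For concreteness, here is a minimal Python sketch of that definition; the sample_transition and is_end interfaces are illustrative assumptions, not the course's posted code.

def sampled_value(start_state, pi, sample_transition, is_end, gamma=1.0, num_paths=10000):
    # sample_transition(s, a) -> (s_prime, reward), drawn according to T(s, a, .)
    # (an assumed interface for this sketch).
    total = 0.0
    for _ in range(num_paths):
        s, discount, utility = start_state, 1.0, 0.0
        while not is_end(s):
            s, reward = sample_transition(s, pi(s))
            utility += discount * reward      # utility = r1 + gamma*r2 + gamma^2*r3 + ...
            discount *= gamma
        total += utility
    return total / num_paths                  # Monte Carlo estimate of V_pi(start_state)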
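
The policy-evaluation recurrence summarized above, V_pi(s) = sum over s' of T(s, pi(s), s') * (Reward(s, pi(s), s') + gamma * V_pi(s')), with V_pi = 0 at end states, becomes the iterative algorithm described in the chapter: initialize every value to 0, sweep all states using the previous iteration's values, and stop once the largest change over all states falls below a threshold. A hedged sketch, assuming a successor interface that returns (s_prime, prob, reward) triples (not the course's posted code); each sweep costs O(S * S'), so t sweeps give the stated O(t * S * S').

def policy_evaluation(states, succ_prob_reward, pi, gamma=1.0, epsilon=1e-10):
    # succ_prob_reward(s, a) -> list of (s_prime, prob, reward); end states return [].
    V = {s: 0.0 for s in states}              # initialize V_pi to 0 for every state
    while True:
        newV = {}
        for s in states:
            newV[s] = sum(prob * (reward + gamma * V[s_prime])
                          for s_prime, prob, reward in succ_prob_reward(s, pi(s)))
        # stop when max_s |V_t(s) - V_{t-1}(s)| drops below the threshold
        if max(abs(newV[s] - V[s]) for s in states) < epsilon:
            return newV
        V = newV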
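
Value iteration, as summarized, replaces the fixed policy with a max over actions: V_opt(s) = max over a of Q_opt(s, a), where Q_opt(s, a) = sum over s' of T(s, a, s') * (Reward(s, a, s') + gamma * V_opt(s')). The sketch below uses the same illustrative interfaces (again, not the course's posted code); it also reads the optimal policy off as the argmax action, and each sweep costs O(S * A * S'), giving the stated O(t * S * A * S') overall.

def value_iteration(states, actions, succ_prob_reward, gamma=1.0, epsilon=1e-10):
    # Q_opt under the current value estimate V
    def Q(V, s, a):
        return sum(prob * (reward + gamma * V[s_prime])
                   for s_prime, prob, reward in succ_prob_reward(s, a))

    V = {s: 0.0 for s in states}              # V_opt initialized to 0
    while True:
        # per the summary above, convergence is guaranteed when gamma < 1
        # or the MDP graph is acyclic
        newV = {s: max((Q(V, s, a) for a in actions(s)), default=0.0) for s in states}
        if max(abs(newV[s] - V[s]) for s in states) < epsilon:
            V = newV
            break
        V = newV
    # the optimal policy takes the argmax action in every non-end state
    pi_opt = {s: max(actions(s), key=lambda a: Q(V, s, a)) for s in states if actions(s)}
    return V, pi_opt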
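
To reproduce the fail-probability effect described in the tram highlights, the transportation MDP can be fed to the value_iteration sketch above. The numbers here are assumptions consistent with the summaries (walk: s to s+1 with reward -1; tram: s to 2s with reward -2, failing with probability fail_prob and leaving you in place), not the tram code posted online. Sweeping fail_prob shows the optimal policy shifting from tram-heavy (close to the plain search-problem solution) to walk-only.

N = 10                                        # blocks 1..N; N is the end state

def tram_actions(s):
    acts = []
    if s + 1 <= N:
        acts.append('walk')
    if 2 * s <= N:
        acts.append('tram')
    return acts                               # empty at the end state

def make_succ_prob_reward(fail_prob):
    def succ_prob_reward(s, a):
        if a == 'walk':
            return [(s + 1, 1.0, -1.0)]
        # tram: reaches 2s with prob 1 - fail_prob, otherwise stays at s; reward -2 either way
        return [(2 * s, 1.0 - fail_prob, -2.0), (s, fail_prob, -2.0)]
    return succ_prob_reward

states = list(range(1, N + 1))
for fail_prob in (0.0, 0.5, 0.9):
    # gamma = 1 as in the lecture demo; walking guarantees reaching block N eventually,
    # so the sweeps still settle even though the failed tram adds a self-loop
    V, pi_opt = value_iteration(states, tram_actions,
                                make_succ_prob_reward(fail_prob), gamma=1.0)
    print('failProb =', fail_prob, '->', pi_opt)   # high failProb pushes the policy toward always walking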
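
The relaxation recipe in the last chapter (drop constraints, solve the easier problem, use its future cost as a consistent heuristic) also admits a small sketch. Everything below, the goal cell, function names, and the reversible-moves assumption, is illustrative rather than from the course materials: once walls are removed, Manhattan distance is the closed-form relaxed future cost, and when no closed form exists the relaxed problem can be solved once with uniform cost search and its future costs cached as the heuristic.

import heapq

GOAL = (5, 5)                                 # illustrative goal cell on a grid

def manhattan_heuristic(state):
    # relaxed problem with walls removed: the agent moves freely, so the future cost
    # from (r, c) to GOAL is exactly the Manhattan distance, a consistent heuristic
    r, c = state
    return abs(r - GOAL[0]) + abs(c - GOAL[1])

def relaxed_future_costs(relaxed_neighbors, goal):
    # relaxed_neighbors(s) -> list of (s_prime, cost).  Assuming the relaxed problem's
    # moves are reversible with symmetric costs, uniform cost search (Dijkstra) from the
    # goal yields the cost-to-goal of the relaxed problem for every reachable state.
    dist, frontier = {goal: 0.0}, [(0.0, goal)]
    while frontier:
        d, s = heapq.heappop(frontier)
        if d > dist.get(s, float('inf')):
            continue                          # stale queue entry
        for s_prime, cost in relaxed_neighbors(s):
            nd = d + cost
            if nd < dist.get(s_prime, float('inf')):
                dist[s_prime] = nd
                heapq.heappush(frontier, (nd, s_prime))
    return dist                               # dist[s] is usable as h(s) in the original problem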