title

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 2 - Given a Model of the World

description

For more information about Stanfordâ€™s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai
Professor Emma Brunskill, Stanford University
https://stanford.io/3eJW8yT
Professor Emma Brunskill
Assistant Professor, Computer Science
Stanford AI for Human Impact Lab
Stanford Artificial Intelligence Lab
Statistical Machine Learning Group
To follow along with the course schedule and syllabus, visit: http://web.stanford.edu/class/cs234/index.html
0:00 Introduction
2:55 Full Observability: Markov Decision Process (MDP)
3:55 Recall: Markov Property
4:50 Markov Processor Markov Chain
5:53 Example: Mars Rover Markov Chain Transition Matrix, P
12:06 Example: Mars Rover Markov Chain Episodes
13:05 Markov Reward Process (MRP)
14:37 Return & Value Function
16:32 Discount Factor
18:23 Example: Mars Rover MRP
23:19 Matrix Form of Bellman Equation for MRP
26:52 Iterative Algorithm for Computing Value of a MRP
33:29 MDP Policy Evaluation, Iterative Algorithm
34:44 Policy Evaluation: Example & Check Your Understanding
36:39 Practice: MDP 1 Iteration of Policy Evaluation, Mars Rover Example
50:48 MDP Policy Iteration (PI)
55:44 Delving Deeper into Policy Improvement Step

detail

{'title': 'Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 2 - Given a Model of the World', 'heatmap': [{'end': 1238.883, 'start': 1191.226, 'weight': 0.85}, {'end': 2033.395, 'start': 1896.677, 'weight': 0.867}, {'end': 3183.806, 'start': 3090.08, 'weight': 1}, {'end': 3977.328, 'start': 3927.216, 'weight': 0.852}], 'summary': 'Lecture series on reinforcement learning covers key concepts such as value, model, and policy, markov processes, markov reward processes, markov decision processes, policy iteration, q function computation, and policy improvement, providing practical and mathematical implications for each topic.', 'chapters': [{'end': 286.339, 'segs': [{'end': 80.007, 'src': 'embed', 'start': 43.107, 'weight': 0, 'content': [{'end': 49.234, 'text': 'And it might be a good policy or a bad policy, and the way we evaluate that is in terms of its expected discounted sum of rewards.', 'start': 43.107, 'duration': 6.127}, {'end': 52.177, 'text': 'Does anybody remember what a model was? Yeah.', 'start': 49.895, 'duration': 2.282}, {'end': 57.143, 'text': "A model is like a representation of the world and how that changes in response to agent's action.", 'start': 52.498, 'duration': 4.645}, {'end': 58.561, 'text': 'Yes Right.', 'start': 58.021, 'duration': 0.54}, {'end': 66.003, 'text': 'So normally we think of a model of incorporating either reward model or a decision uh or or a dynamics model which specifies,', 'start': 58.601, 'duration': 7.402}, {'end': 70.564, 'text': 'in response to the current state and uh an action, how the world might change.', 'start': 66.003, 'duration': 4.561}, {'end': 73.005, 'text': 'It could be a stochastic model or a deterministic model.', 'start': 70.584, 'duration': 2.421}, {'end': 80.007, 'text': 'Um, and the reward model specifies what is the expected reward, um, that the agent receives from taking a state in a particular action.', 'start': 73.585, 'duration': 6.422}], 'summary': 'Evaluating policies based on expected rewards and models of world response.', 'duration': 36.9, 'max_score': 43.107, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is43107.jpg'}, {'end': 132.829, 'src': 'embed', 'start': 100.626, 'weight': 1, 'content': [{'end': 106.831, 'text': "we're just gonna talk about the problem of figuring out what is the right thing to do when your actions may have delayed consequences.", 'start': 100.626, 'duration': 6.205}, {'end': 112.181, 'text': 'which means that you may have to sacrifice immediate reward in order to maximize long-term reward.', 'start': 107.432, 'duration': 4.749}, {'end': 123.226, 'text': "So as we just stated, um, the model generally we're gonna think about are statistical or mathematical models of the dynamics and the reward function.", 'start': 115.464, 'duration': 7.762}, {'end': 132.829, 'text': 'Um. a policy is a function that maps the state and states to actions, and the value function is the expected discounted sum of rewards,', 'start': 123.767, 'duration': 9.062}], 'summary': 'Balancing immediate rewards with long-term consequences in decision-making using statistical and mathematical models.', 'duration': 32.203, 'max_score': 100.626, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is100626.jpg'}, {'end': 214.305, 'src': 'embed', 'start': 185.029, 'weight': 4, 'content': [{'end': 188.393, 'text': 'So the agent gets to take actions typically denoted by A.', 'start': 185.029, 'duration': 3.364}, {'end': 190.194, 'text': 'those affect the state of the world in some way.', 'start': 188.393, 'duration': 1.801}, {'end': 193.958, 'text': 'um, and then the agent receives back a state and a reward.', 'start': 190.194, 'duration': 3.764}, {'end': 199.123, 'text': 'So, last time we talked about the fact that this could in fact be an observation instead of a state.', 'start': 194.498, 'duration': 4.625}, {'end': 206.258, 'text': "but that when we think about the world being Markov, we're gonna think of an agent just focusing on the current state.", 'start': 200.814, 'duration': 5.444}, {'end': 208.68, 'text': 'Um, so the most recent observation,', 'start': 206.839, 'duration': 1.841}, {'end': 214.305, 'text': "like you know whether or not the robot's laser range finder is saying that there are walls to the left or right of it,", 'start': 208.68, 'duration': 5.625}], 'summary': "An agent takes actions denoted by a, affecting the world's state, and receives a state and a reward in return.", 'duration': 29.276, 'max_score': 185.029, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is185029.jpg'}, {'end': 256.024, 'src': 'embed', 'start': 226.908, 'weight': 3, 'content': [{'end': 230.21, 'text': "Um, but most of the time today we'll be thinking about sort of immediate sensors.", 'start': 226.908, 'duration': 3.302}, {'end': 232.051, 'text': "If it's not clear, feel free to reach out.", 'start': 230.53, 'duration': 1.521}, {'end': 236.814, 'text': 'So what did the Markov process mean?', 'start': 235.113, 'duration': 1.701}, {'end': 243.579, 'text': 'The Markov process is to say that the state that the agent is using to make their decisions is a sufficient statistic of the history.', 'start': 237.015, 'duration': 6.564}, {'end': 256.024, 'text': "Which means that, in order to predict the future distribution of states on the next time step here we're using t to denote time step that,", 'start': 244.499, 'duration': 11.525}], 'summary': "Today's focus is on immediate sensors and markov process for predicting future state distribution.", 'duration': 29.116, 'max_score': 226.908, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is226908.jpg'}], 'start': 5.216, 'title': 'Reinforcement learning concepts', 'summary': 'Introduces value, model, and policy in reinforcement learning, and discusses the evaluation of policies in terms of expected discounted sum of rewards. it also covers the markov decision process, where an agent aims to maximize its expected discounted sum of rewards.', 'chapters': [{'end': 149.487, 'start': 5.216, 'title': 'Reinforcement learning overview', 'summary': 'Introduces the concepts of value, model, and policy in reinforcement learning, and discusses the evaluation of policies in terms of the expected discounted sum of rewards. it also highlights the importance of models as representations of the world and the problem of making decisions when actions may have delayed consequences.', 'duration': 144.271, 'highlights': ['Reinforcement learning involves the concepts of a model, a value, and a policy, and the evaluation of policies is based on the expected discounted sum of rewards.', 'A model is a representation of the world and can incorporate a reward model or a dynamics model, specifying the expected reward from taking a certain action in a state, and how the world might change in response to an action.', 'The planning problem in reinforcement learning involves making decisions when actions may have delayed consequences, requiring the sacrifice of immediate reward to maximize long-term reward.']}, {'end': 286.339, 'start': 149.487, 'title': 'Markov decision process', 'summary': 'Discusses the markov decision process, where an agent interacts with the world by taking actions, affecting the state, and receiving rewards, aiming to maximize its expected discounted sum of rewards.', 'duration': 136.852, 'highlights': ["The Markov process states that the agent's current state is a sufficient statistic of the history, enabling prediction of future state distributions given the current state and action, independent of the past.", "The agent's actions affect the state of the world, and the agent receives back a state and a reward, with the goal of maximizing its expected discounted sum of rewards."]}], 'duration': 281.123, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is5216.jpg', 'highlights': ['Reinforcement learning involves the concepts of a model, a value, and a policy, and the evaluation of policies is based on the expected discounted sum of rewards.', 'The planning problem in reinforcement learning involves making decisions when actions may have delayed consequences, requiring the sacrifice of immediate reward to maximize long-term reward.', 'A model is a representation of the world and can incorporate a reward model or a dynamics model, specifying the expected reward from taking a certain action in a state, and how the world might change in response to an action.', "The Markov process states that the agent's current state is a sufficient statistic of the history, enabling prediction of future state distributions given the current state and action, independent of the past.", "The agent's actions affect the state of the world, and the agent receives back a state and a reward, with the goal of maximizing its expected discounted sum of rewards."]}, {'end': 1087.803, 'segs': [{'end': 358.895, 'src': 'embed', 'start': 326.685, 'weight': 0, 'content': [{'end': 332.608, 'text': 'So, formally, the definition of a Markov process is that you have um a finite or potentially infinite set of states,', 'start': 326.685, 'duration': 5.923}, {'end': 337.27, 'text': 'and you have a dynamics model which specifies the probability of the next state given the previous state.', 'start': 332.608, 'duration': 4.662}, {'end': 339.891, 'text': "There's no rewards, there's no actions yet.", 'start': 337.97, 'duration': 1.921}, {'end': 343.593, 'text': 'Um, and if you have a finite set of states, you can just write this down as a matrix.', 'start': 340.351, 'duration': 3.242}, {'end': 350.367, 'text': "the transition matrix that says you're starting in some state, what's the probability distribution over next states that you could reach?", 'start': 344.622, 'duration': 5.745}, {'end': 358.895, 'text': 'So if we go back to the Mars Rover example that we talked about last time, um, in this little Mars Rover example,', 'start': 353.07, 'duration': 5.825}], 'summary': 'Markov process has finite/infinite states, with transition matrix specifying state probabilities.', 'duration': 32.21, 'max_score': 326.685, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is326685.jpg'}, {'end': 693.744, 'src': 'embed', 'start': 665.835, 'weight': 3, 'content': [{'end': 670.099, 'text': 'So this is just specifying that the transition model over how the world works over time.', 'start': 665.835, 'duration': 4.264}, {'end': 671.24, 'text': "Um, and it's just.", 'start': 670.359, 'duration': 0.881}, {'end': 674.103, 'text': "uh, I've written it in matrix notation there to be compact.", 'start': 671.24, 'duration': 2.863}, {'end': 680.268, 'text': "but if it's easier to think about it, it's fine to just think about it in terms of these probability of next states given the previous state.", 'start': 674.103, 'duration': 6.165}, {'end': 686.714, 'text': 'So you can just enumerate those or you can write it in a- in a matrix form if the- if the number of states happens to be finite.', 'start': 681.089, 'duration': 5.625}, {'end': 689.3, 'text': 'So what would this look like?', 'start': 688.379, 'duration': 0.921}, {'end': 693.744, 'text': 'if you wanted to think of what might happen to the agent over time in this case, or what the process might look like?', 'start': 689.3, 'duration': 4.444}], 'summary': 'Specifying transition model using matrix notation or probabilities for finite state systems.', 'duration': 27.909, 'max_score': 665.835, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is665835.jpg'}, {'end': 816.737, 'src': 'embed', 'start': 789.697, 'weight': 2, 'content': [{'end': 795.039, 'text': "And so now, what is a Markov reward process? Um, again, we don't have actions yet, just like before.", 'start': 789.697, 'duration': 5.342}, {'end': 797.68, 'text': 'But now, we also have a reward function.', 'start': 795.839, 'duration': 1.841}, {'end': 800.761, 'text': 'Um, so we still have a dynamics model like before.', 'start': 798.08, 'duration': 2.681}, {'end': 807.404, 'text': "And now we have a reward function that says if you're in a particular state, what is the expected reward you get from being in that state?", 'start': 801.281, 'duration': 6.123}, {'end': 816.737, 'text': 'We can also now have a discount factor which allows us to trade off between- or allows us to think about how much we weight immediate rewards versus future rewards.', 'start': 808.55, 'duration': 8.187}], 'summary': 'Markov reward process models state rewards & dynamics with discount factor.', 'duration': 27.04, 'max_score': 789.697, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is789697.jpg'}, {'end': 882.923, 'src': 'embed', 'start': 856.766, 'weight': 1, 'content': [{'end': 860.93, 'text': 'um, I mentioned last time that rewards for the Markov decision process can either be a function of the state,', 'start': 856.766, 'duration': 4.164}, {'end': 862.492, 'text': 'the state in action or state action next state.', 'start': 860.93, 'duration': 1.562}, {'end': 865.736, 'text': "Right now, we're still in Markov reward processes, so there's no action.", 'start': 862.913, 'duration': 2.823}, {'end': 872.684, 'text': 'So in this case, um, the ways you could define rewards would either be over the immediate state or state at next state.', 'start': 866.817, 'duration': 5.867}, {'end': 882.923, 'text': 'So once we start to think about there being rewards, we can start to think about there being returns and expected returns.', 'start': 877.398, 'duration': 5.525}], 'summary': 'Markov decision process rewards can be based on state, action, or state transition. no action in markov reward processes. rewards defined for immediate state or state at next state. leads to consideration of returns and expected returns.', 'duration': 26.157, 'max_score': 856.766, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is856766.jpg'}, {'end': 1093.565, 'src': 'embed', 'start': 1062.675, 'weight': 4, 'content': [{'end': 1065.457, 'text': 'Does anyone have any questions about discount factors? Yeah.', 'start': 1062.675, 'duration': 2.782}, {'end': 1080.361, 'text': "My question is So the great question was you know the- what we're defining here is that, uh, using a gamma that, uh,", 'start': 1065.477, 'duration': 14.884}, {'end': 1083.542, 'text': 'progresses through this exponential geometric fashion, is that necessary?', 'start': 1080.361, 'duration': 3.181}, {'end': 1087.803, 'text': "Um, it's one nice choice that ends up having very nice mathematical properties.", 'start': 1084.042, 'duration': 3.761}, {'end': 1093.565, 'text': "Um, uh, there- one could try using other approaches, but it's certainly the most common one.", 'start': 1088.143, 'duration': 5.422}], 'summary': 'Discount factors are commonly defined using a gamma progressing exponentially, offering nice mathematical properties.', 'duration': 30.89, 'max_score': 1062.675, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1062675.jpg'}], 'start': 290.131, 'title': 'Markov processes and rewards', 'summary': 'Discusses the concept of markov processes and introduces markov reward processes, including transition dynamics, transition matrix, returns, and discount factors, with practical and mathematical implications.', 'chapters': [{'end': 564.636, 'start': 290.131, 'title': 'Understanding markov processes', 'summary': 'Discusses the concept of markov processes, defining them as stochastic processes evolving over time with transition dynamics and transition matrix, and provides an example of a mars rover scenario.', 'duration': 274.505, 'highlights': ['Markov process defined with transition dynamics and matrix Formally, a Markov process is defined with a dynamics model specifying the probability of the next state given the previous state and can be represented as a transition matrix.', "Example of Mars Rover scenario An example of a Mars Rover scenario is used to illustrate the concept of transition dynamics in a Markov process, where the transition probabilities dictate the rover's movement based on different states.", 'Explanation of transition probabilities in Markov chains The transition probabilities in Markov chains describe the likelihood of staying in the same state or transitioning to the next state, providing insight into the dynamics of the process.']}, {'end': 1087.803, 'start': 564.956, 'title': 'Understanding markov processes and rewards', 'summary': 'Introduces the concept of markov processes and rewards, discussing the transition model, markov reward processes, returns, discount factors, and their mathematical and practical implications.', 'duration': 522.847, 'highlights': ['The chapter introduces the concept of Markov processes and rewards The discussion primarily revolves around the concept of Markov processes and rewards, setting the foundation for further exploration.', 'The transition model is discussed in detail, emphasizing the impact of vector notation and its role in determining the next state based on the current state The transition model and its notation are explained, highlighting the process of determining the next state based on the current state, with specific examples and probabilities.', 'Markov reward processes and their components, including the reward function and discount factor, are explained, with an emphasis on the expected return and value function The explanation delves into Markov reward processes, detailing the reward function, discount factor, expected return, and the value function, providing a comprehensive understanding of their components and implications.', 'The concept of returns and their relationship with the horizon, as well as the distinction between deterministic and stochastic processes, is explored The discussion covers the relationship between returns and the horizon, along with the distinction between deterministic and stochastic processes, offering insights into their fundamental differences and implications.', 'The significance and practical implications of discount factors are detailed, including their mathematical properties and the influence on decision-making A comprehensive explanation of discount factors is provided, highlighting their significance, mathematical properties, and practical implications on decision-making, with examples and considerations for different values of Gamma.']}], 'duration': 797.672, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is290131.jpg', 'highlights': ['Markov process defined with transition dynamics and matrix Formally, a Markov process is defined with a dynamics model specifying the probability of the next state given the previous state and can be represented as a transition matrix.', 'The chapter introduces the concept of Markov processes and rewards The discussion primarily revolves around the concept of Markov processes and rewards, setting the foundation for further exploration.', 'Markov reward processes and their components, including the reward function and discount factor, are explained, with an emphasis on the expected return and value function The explanation delves into Markov reward processes, detailing the reward function, discount factor, expected return, and the value function, providing a comprehensive understanding of their components and implications.', 'The transition model is discussed in detail, emphasizing the impact of vector notation and its role in determining the next state based on the current state The transition model and its notation are explained, highlighting the process of determining the next state based on the current state, with specific examples and probabilities.', 'The significance and practical implications of discount factors are detailed, including their mathematical properties and the influence on decision-making A comprehensive explanation of discount factors is provided, highlighting their significance, mathematical properties, and practical implications on decision-making, with examples and considerations for different values of Gamma.']}, {'end': 1692.533, 'segs': [{'end': 1182.299, 'src': 'embed', 'start': 1155.481, 'weight': 1, 'content': [{'end': 1159.444, 'text': 'Um, and then on time step S7, we get a reward of 10.', 'start': 1155.481, 'duration': 3.963}, {'end': 1164.248, 'text': 'But that has to be weighed down by the discount factor, which here is 1 half.', 'start': 1159.444, 'duration': 4.804}, {'end': 1169.291, 'text': "So it's 1 half to the power of 3.", 'start': 1164.868, 'duration': 4.423}, {'end': 1174.232, 'text': 'And so the sample return for this particular episode is just 1.25.', 'start': 1169.291, 'duration': 4.941}, {'end': 1177.675, 'text': 'And of course, we could define this for any particular, um, episode.', 'start': 1174.232, 'duration': 3.443}, {'end': 1182.299, 'text': "And these episodes generally might go through different states, even if they're starting in the same initial state,", 'start': 1177.755, 'duration': 4.544}], 'summary': 'On time step s7, a reward of 10 is weighed by discount factor 0.5, yielding a sample return of 1.25.', 'duration': 26.818, 'max_score': 1155.481, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1155481.jpg'}, {'end': 1238.883, 'src': 'heatmap', 'start': 1191.226, 'weight': 0.85, 'content': [{'end': 1194.548, 'text': 'And in other cases, um, uh, it might go all the way to the left.', 'start': 1191.226, 'duration': 3.322}, {'end': 1203.796, 'text': 'So if we then think about what the expected value function would be, it would involve averaging over a lot of these.', 'start': 1196.95, 'duration': 6.846}, {'end': 1211.478, 'text': 'And as we average over all of these, um, then we can start to get different rewards for different time steps.', 'start': 1205.293, 'duration': 6.185}, {'end': 1220.145, 'text': 'So how would we compute this?', 'start': 1218.824, 'duration': 1.321}, {'end': 1226.951, 'text': 'Um now, one thing you could do, which is sort of motivated by what I was just showing before, is that you could estimate it by simulation.', 'start': 1220.866, 'duration': 6.085}, {'end': 1238.883, 'text': 'So you could, um uh, just take for say an initial starting state distribution um, which could be just a single starting state or many starting states,', 'start': 1228.352, 'duration': 10.531}], 'summary': 'Discussion on computing expected value function through simulation and averaging over time steps.', 'duration': 47.657, 'max_score': 1191.226, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1191226.jpg'}, {'end': 1316.494, 'src': 'embed', 'start': 1287.15, 'weight': 3, 'content': [{'end': 1292.352, 'text': 'if you want to figure out what the value is of your Markov reward process, um,', 'start': 1287.15, 'duration': 5.202}, {'end': 1294.994, 'text': 'you could just do simulations and that would give you an estimate of the value.', 'start': 1292.352, 'duration': 2.642}, {'end': 1299.716, 'text': 'The nice thing about doing this is this requires no assumption of the Markov structure.', 'start': 1295.874, 'duration': 3.842}, {'end': 1303.399, 'text': "It's not actually using the fact that it's a Markov reward process at all.", 'start': 1299.736, 'duration': 3.663}, {'end': 1307.741, 'text': "It's just a way to estimate sums of return- sums of rewards.", 'start': 1304.019, 'duration': 3.722}, {'end': 1316.494, 'text': "So that's both nice in the sense that, um, if you're using this in a process that you had estimated from some data,", 'start': 1309.813, 'duration': 6.681}], 'summary': 'Simulations provide estimate of markov reward process value without assuming its structure.', 'duration': 29.344, 'max_score': 1287.15, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1287150.jpg'}, {'end': 1387.463, 'src': 'embed', 'start': 1360.313, 'weight': 0, 'content': [{'end': 1367.218, 'text': "So the value function of a Markov reward process is simply the immediate reward the agent gets from the current state it's in,", 'start': 1360.313, 'duration': 6.905}, {'end': 1368.879, 'text': 'plus the discounted sum of future rewards.', 'start': 1367.218, 'duration': 1.661}, {'end': 1381.802, 'text': 'weighed by the discount factor times the- and where we- where we express that discounted sum of future words is we can just express it with v,', 'start': 1370.339, 'duration': 11.463}, {'end': 1382.462, 'text': 'v of s prime.', 'start': 1381.802, 'duration': 0.66}, {'end': 1387.463, 'text': "So we sort of say well, whatever state you're in right now, you're gonna get your immediate reward,", 'start': 1383.282, 'duration': 4.181}], 'summary': 'Value function: immediate reward + discounted future rewards, expressed as v of s prime.', 'duration': 27.15, 'max_score': 1360.313, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1360313.jpg'}, {'end': 1682.468, 'src': 'embed', 'start': 1653.923, 'weight': 2, 'content': [{'end': 1659.427, 'text': 'And in this case we can simply use that to derive an iterative equation,', 'start': 1653.923, 'duration': 5.504}, {'end': 1665.072, 'text': 'where we use the previous value of the state um in order to bootstrap and compute the next value of a state.', 'start': 1659.427, 'duration': 5.645}, {'end': 1666.253, 'text': 'We do that for all states.', 'start': 1665.352, 'duration': 0.901}, {'end': 1671.584, 'text': "And the computational complexity of this is a little bit lower because it's only s squared.", 'start': 1667.762, 'duration': 3.822}, {'end': 1676.346, 'text': "Because you're doing this for each of the states and then you're summing over all the possible next states.", 'start': 1671.604, 'duration': 4.742}, {'end': 1682.468, 'text': 'When I say we do this total convergence, generally what we do in this case is we define a norm.', 'start': 1678.167, 'duration': 4.301}], 'summary': 'Using iterative equations to compute next state values, with computational complexity of s squared.', 'duration': 28.545, 'max_score': 1653.923, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1653923.jpg'}], 'start': 1088.143, 'title': 'Markov reward processes and solving markov processes', 'summary': 'Covers markov reward processes, methods for estimating value function, impact of discount factor, matrix notation, and two methods for computing markov processes with computational complexities of n squared, n cubed, and s squared.', 'chapters': [{'end': 1514.59, 'start': 1088.143, 'title': 'Markov reward processes', 'summary': 'Explains markov reward processes, their mathematical properties, and methods for estimating the value function through simulation and analytical solutions, including the impact of the discount factor on sample returns and the use of matrix notation for value function computation.', 'duration': 426.447, 'highlights': ['The value function of a Markov reward process is the immediate reward plus the discounted sum of future rewards, allowing for analytical solutions and matrix notation computation.', 'Simulation can be used to estimate the value function of a Markov reward process, with the accuracy of empirical averages decreasing on the order of 1 over the square root of n, where n is the number of rollouts.', 'The sample return for a specific episode in a Markov reward process can be calculated using the reward definition, transition model, and the impact of the discount factor, as demonstrated with the Mars rover example.']}, {'end': 1692.533, 'start': 1514.89, 'title': 'Solving markov processes', 'summary': 'Explains two methods for computing markov processes: direct solution involving matrix inversion, with a computational complexity between n squared and n cubed, and iterative dynamic programming with a computational complexity of s squared.', 'duration': 177.643, 'highlights': ['The computational complexity of direct solution involving matrix inversion is generally on the order somewhere between n squared and n cubed, depending on which matrix inversion is used.', 'The computational complexity of iterative dynamic programming is s squared, which is lower compared to the direct solution involving matrix inversion.', 'The iterative dynamic programming involves initializing the value function, performing a Bellman backup, and using the previous value of the state to compute the next value of a state for all states, resulting in a computational complexity of s squared.']}], 'duration': 604.39, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1088143.jpg', 'highlights': ['The value function of a Markov reward process is the immediate reward plus the discounted sum of future rewards, allowing for analytical solutions and matrix notation computation.', 'The sample return for a specific episode in a Markov reward process can be calculated using the reward definition, transition model, and the impact of the discount factor, as demonstrated with the Mars rover example.', 'The computational complexity of iterative dynamic programming is s squared, which is lower compared to the direct solution involving matrix inversion.', 'Simulation can be used to estimate the value function of a Markov reward process, with the accuracy of empirical averages decreasing on the order of 1 over the square root of n, where n is the number of rollouts.']}, {'end': 2549.845, 'segs': [{'end': 1738.885, 'src': 'embed', 'start': 1712.752, 'weight': 1, 'content': [{'end': 1717.893, 'text': 'So here are two different ways to try to compute the value of a Markov, uh, Markov reward process, or three really.', 'start': 1712.752, 'duration': 5.141}, {'end': 1720.514, 'text': 'One is simulation, the second is analytically.', 'start': 1718.253, 'duration': 2.261}, {'end': 1726.295, 'text': 'The analytical- analytic one requires us to have a finite set of states, and the third one is dynamic programming.', 'start': 1720.594, 'duration': 5.701}, {'end': 1730.676, 'text': "We're also right now defining only all of these for when the state space is finite.", 'start': 1726.635, 'duration': 4.041}, {'end': 1733.497, 'text': "um, but we'll talk about when the state space is infinite later on.", 'start': 1730.676, 'duration': 2.821}, {'end': 1738.885, 'text': 'So now we can finally get on to Markov decision processes.', 'start': 1736.704, 'duration': 2.181}], 'summary': 'Two ways to compute markov reward process: simulation and analytic; third method is dynamic programming.', 'duration': 26.133, 'max_score': 1712.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1712752.jpg'}, {'end': 1842.834, 'src': 'embed', 'start': 1816.369, 'weight': 3, 'content': [{'end': 1821.374, 'text': "I think, um, there are a lot of cases where, um, we don't have perfect models of the environment.", 'start': 1816.369, 'duration': 5.005}, {'end': 1823.755, 'text': 'Maybe if we have better models, then things would be deterministic.', 'start': 1821.414, 'duration': 2.341}, {'end': 1828.139, 'text': "Um, and so we're gonna approximate our uncertainty over those models with stochasticity.", 'start': 1823.976, 'duration': 4.163}, {'end': 1832.963, 'text': "So maybe you have a robot and it's a little bit faulty, and so sometimes it gets stuck on carpet and then sometimes it goes forward.", 'start': 1828.399, 'duration': 4.564}, {'end': 1839.751, 'text': 'We could write that down as a stochastic transition matrix where sometimes it stays in the same place and sometimes it advances to the next state.', 'start': 1833.607, 'duration': 6.144}, {'end': 1842.834, 'text': "Or maybe you're on sand or things like that.", 'start': 1841.032, 'duration': 1.802}], 'summary': 'Imperfect environment models lead to stochasticity in robot behavior.', 'duration': 26.465, 'max_score': 1816.369, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1816369.jpg'}, {'end': 2033.395, 'src': 'heatmap', 'start': 1896.677, 'weight': 0.867, 'content': [{'end': 1901.422, 'text': "Otherwise, it'll generally move to the next state over if it's trying to do action A1.", 'start': 1896.677, 'duration': 4.745}, {'end': 1906.327, 'text': "And for action A2, it'll move to the right unless it hits A7, and then it'll stay there.", 'start': 1901.922, 'duration': 4.405}, {'end': 1917.002, 'text': 'So like we said at the beginning of class, um, a Markov decision process policy specifies what action to take in each state.', 'start': 1910.819, 'duration': 6.183}, {'end': 1925.745, 'text': 'And the policies themselves can be deterministic or stochastic, meaning that you could either have a distribution over the next action you might take,', 'start': 1917.962, 'duration': 7.783}, {'end': 1932.388, 'text': "given the state you're in, or you could have a deterministic mapping that says, whenever I'm in this state, I always, you know, do action A1..", 'start': 1925.745, 'duration': 6.643}, {'end': 1941.462, 'text': "Now, in a lot of this class, we'll be thinking about deterministic policies, but later on, when we get into policy search,", 'start': 1935.017, 'duration': 6.445}, {'end': 1943.344, 'text': "we'll talk a lot more about stochastic policies.", 'start': 1941.462, 'duration': 1.882}, {'end': 1951.299, 'text': 'So if you have an MDP plus a policy, then that immediately specifies a Markov reward process.', 'start': 1946.095, 'duration': 5.204}, {'end': 1961.187, 'text': "Because once you have specified the policy, then you can think of that as inducing a Markov reward process, because you're only ever taking.", 'start': 1952.761, 'duration': 8.426}, {'end': 1963.75, 'text': "you've specified your distribution over actions for your state.", 'start': 1961.187, 'duration': 2.563}, {'end': 1971.396, 'text': 'Um, and so then you can think of sort of what is the reward, the expected reward you would get under that policy for any state.', 'start': 1963.77, 'duration': 7.626}, {'end': 1974.979, 'text': 'And similarly, you can define your transition model.', 'start': 1972.176, 'duration': 2.803}, {'end': 1982.613, 'text': 'for a Markov reward process by averaging across your transition models according to the weight at which you would take those different actions.', 'start': 1976.308, 'duration': 6.305}, {'end': 1993.542, 'text': "So the reason why it's useful to think about these connections between Markov decision processes and Markov reward processes is it implies that if you have a fixed policy,", 'start': 1985.075, 'duration': 8.467}, {'end': 2000.067, 'text': 'you can just use all the techniques that we just described for Markov reward processes, namely simulation, analytic, analytic,', 'start': 1993.542, 'duration': 6.525}, {'end': 2004.891, 'text': 'solution or dynamic programming, in order to compute what is the value of a policy?', 'start': 2000.067, 'duration': 4.824}, {'end': 2018.124, 'text': "So, if we go back to sort of the iterative algorithm, um, then it's exactly the same as before, exactly the same as the Markov reward process,", 'start': 2009.338, 'duration': 8.786}, {'end': 2021.026, 'text': "except for now we're indexing our reward by the policy.", 'start': 2018.124, 'duration': 2.902}, {'end': 2024.909, 'text': 'So, in order to learn what is the value of a particular policy,', 'start': 2021.627, 'duration': 3.282}, {'end': 2029.032, 'text': 'we instantiate the reward function by always picking the action that the policy would take.', 'start': 2024.909, 'duration': 4.123}, {'end': 2033.395, 'text': "So in this case, I'm doing it for simplicity for deterministic policy.", 'start': 2029.052, 'duration': 4.343}], 'summary': 'Markov decision process policy specifies actions in each state, deterministic or stochastic policies influence expected rewards and transition models.', 'duration': 136.718, 'max_score': 1896.677, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1896677.jpg'}, {'end': 1941.462, 'src': 'embed', 'start': 1910.819, 'weight': 2, 'content': [{'end': 1917.002, 'text': 'So like we said at the beginning of class, um, a Markov decision process policy specifies what action to take in each state.', 'start': 1910.819, 'duration': 6.183}, {'end': 1925.745, 'text': 'And the policies themselves can be deterministic or stochastic, meaning that you could either have a distribution over the next action you might take,', 'start': 1917.962, 'duration': 7.783}, {'end': 1932.388, 'text': "given the state you're in, or you could have a deterministic mapping that says, whenever I'm in this state, I always, you know, do action A1..", 'start': 1925.745, 'duration': 6.643}, {'end': 1941.462, 'text': "Now, in a lot of this class, we'll be thinking about deterministic policies, but later on, when we get into policy search,", 'start': 1935.017, 'duration': 6.445}], 'summary': 'Markov decision process policies can be deterministic or stochastic, with a focus on deterministic policies in this class.', 'duration': 30.643, 'max_score': 1910.819, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1910819.jpg'}, {'end': 2018.124, 'src': 'embed', 'start': 1985.075, 'weight': 0, 'content': [{'end': 1993.542, 'text': "So the reason why it's useful to think about these connections between Markov decision processes and Markov reward processes is it implies that if you have a fixed policy,", 'start': 1985.075, 'duration': 8.467}, {'end': 2000.067, 'text': 'you can just use all the techniques that we just described for Markov reward processes, namely simulation, analytic, analytic,', 'start': 1993.542, 'duration': 6.525}, {'end': 2004.891, 'text': 'solution or dynamic programming, in order to compute what is the value of a policy?', 'start': 2000.067, 'duration': 4.824}, {'end': 2018.124, 'text': "So, if we go back to sort of the iterative algorithm, um, then it's exactly the same as before, exactly the same as the Markov reward process,", 'start': 2009.338, 'duration': 8.786}], 'summary': 'Connecting markov decision processes and markov reward processes for policy value computation.', 'duration': 33.049, 'max_score': 1985.075, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1985075.jpg'}], 'start': 1697.697, 'title': 'Markov decision processes', 'summary': 'Introduces and discusses markov decision processes, value computation, policies, and information flow, providing a comprehensive understanding of this topic.', 'chapters': [{'end': 1859.786, 'start': 1697.697, 'title': 'Markov decision processes', 'summary': 'Introduces markov decision processes, discussing the computation of value for markov reward processes, defining markov decision processes, and explaining the use of stochasticity to model uncertainty.', 'duration': 162.089, 'highlights': ['Markov decision processes involve dynamics models specified for each action separately and a reward function dependent on both state and action. Markov decision processes involve dynamics models specified for each action separately and a reward function dependent on both state and action.', 'The computation of the value of a Markov reward process can be done through simulation, analytical methods for finite state spaces, and dynamic programming. The computation of the value of a Markov reward process can be done through simulation, analytical methods for finite state spaces, and dynamic programming.', 'Stochasticity is used to model uncertainty in situations where perfect models of the environment are lacking, such as in the case of faulty robots or unpredictable traffic. Stochasticity is used to model uncertainty in situations where perfect models of the environment are lacking, such as in the case of faulty robots or unpredictable traffic.']}, {'end': 2549.845, 'start': 1864.858, 'title': 'Markov decision processes and policies', 'summary': 'Discusses markov decision processes, policies, and value functions, which are essential for computing expected discounted sum of rewards for each state under a policy, and it shows an example of how information flows during the computation.', 'duration': 684.987, 'highlights': ['Markov decision process policy specifies what action to take in each state, and the policies themselves can be deterministic or stochastic.', 'A Markov decision process plus a policy immediately specifies a Markov reward process, allowing computation of the value of a policy.', 'The process of policy evaluation involves updating the values of every single state until they stop changing, and this process is guaranteed to be a contraction due to the discount factor.', 'The eigenvalues of the stochastic matrix in the Markov decision process are always less than or equal to 1, making the matrix invertible as long as the discount factor is less than 1.']}], 'duration': 852.148, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is1697697.jpg', 'highlights': ['Introduces and discusses markov decision processes, value computation, policies, and information flow, providing a comprehensive understanding of this topic.', 'The computation of the value of a Markov reward process can be done through simulation, analytical methods for finite state spaces, and dynamic programming.', 'Markov decision process policy specifies what action to take in each state, and the policies themselves can be deterministic or stochastic.', 'Stochasticity is used to model uncertainty in situations where perfect models of the environment are lacking, such as in the case of faulty robots or unpredictable traffic.']}, {'end': 3036.557, 'segs': [{'end': 2577.105, 'src': 'embed', 'start': 2549.885, 'weight': 2, 'content': [{'end': 2553.285, 'text': 'So now, we can start to talk about Markov decision process control.', 'start': 2549.885, 'duration': 3.4}, {'end': 2559.147, 'text': 'Now, just to note there, so I- I led us through or we just went through policy evaluation in an iterative way.', 'start': 2553.626, 'duration': 5.521}, {'end': 2562.848, 'text': 'You could have also done it analytically or you could have done it with simulation.', 'start': 2559.247, 'duration': 3.601}, {'end': 2568.017, 'text': "but it has a particularly nice analogy now that we're gonna start to think about control.", 'start': 2565.054, 'duration': 2.963}, {'end': 2570.139, 'text': 'So again, what do I mean by control?', 'start': 2568.717, 'duration': 1.422}, {'end': 2574.683, 'text': "Control here is gonna be the fact that ultimately, we don't care about just evaluating policies.", 'start': 2570.319, 'duration': 4.364}, {'end': 2577.105, 'text': 'typically, we want our agent to actually be learning policies.', 'start': 2574.683, 'duration': 2.422}], 'summary': 'Discussing markov decision process control and transitioning from policy evaluation to policy learning.', 'duration': 27.22, 'max_score': 2549.885, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is2549885.jpg'}, {'end': 2617.93, 'src': 'embed', 'start': 2592.645, 'weight': 3, 'content': [{'end': 2599.912, 'text': 'So, um, and the optimal policy for an MDP in an infinite horizon finite state MDP is deterministic.', 'start': 2592.645, 'duration': 7.267}, {'end': 2609.422, 'text': "That's one really good reason why it's sufficient for us to just focus on deterministic policies, finite state MDPs, um, in infinite horizons.", 'start': 2600.913, 'duration': 8.509}, {'end': 2616.029, 'text': "Okay So how do we compute it? Well, first before we do this, let's think about how many policies there might be.", 'start': 2611.368, 'duration': 4.661}, {'end': 2617.93, 'text': 'So there are seven discrete states.', 'start': 2616.569, 'duration': 1.361}], 'summary': 'Deterministic policy for infinite horizon mdp with 7 discrete states', 'duration': 25.285, 'max_score': 2592.645, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is2592645.jpg'}, {'end': 2688.131, 'src': 'embed', 'start': 2653.859, 'weight': 0, 'content': [{'end': 2662.907, 'text': 'in general, if you have S states and A actions and this is the cardinality of those sets how many possible deterministic policies are there?', 'start': 2653.859, 'duration': 9.048}, {'end': 2666.33, 'text': 'Um, and then the second question, which is whether or not these are always unique.', 'start': 2662.927, 'duration': 3.403}, {'end': 2676.775, 'text': 'Does anybody want to take a guess at how many deterministic policies there are in this case?', 'start': 2672.722, 'duration': 4.053}, {'end': 2685.79, 'text': "It's a mapping from states to actions.", 'start': 2680.869, 'duration': 4.921}, {'end': 2688.131, 'text': 'so would it be 2 to the 7th?', 'start': 2685.79, 'duration': 2.341}], 'summary': 'There are 2^7 possible deterministic policies, and uniqueness is uncertain.', 'duration': 34.272, 'max_score': 2653.859, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is2653859.jpg'}, {'end': 3026.749, 'src': 'embed', 'start': 3000.513, 'weight': 1, 'content': [{'end': 3004.755, 'text': 'So the number of deterministic policies we just discussed is A to the S.', 'start': 3000.513, 'duration': 4.242}, {'end': 3008.757, 'text': 'Um, and policy iteration is a technique that is generally better than enumeration.', 'start': 3004.755, 'duration': 4.002}, {'end': 3014.039, 'text': "So what do I mean by enumeration in this context? I mean, there's a finite number of policies.", 'start': 3009.656, 'duration': 4.383}, {'end': 3018.703, 'text': 'You could just evaluate each of them separately and then pick the max.', 'start': 3014.36, 'duration': 4.343}, {'end': 3025.728, 'text': 'So if you have a lot of compute, you might just want to, and this might be better if you really care about wall clock and you have many, many,', 'start': 3020.004, 'duration': 5.724}, {'end': 3026.749, 'text': 'many processors.', 'start': 3025.728, 'duration': 1.021}], 'summary': 'Policy iteration is generally better than enumeration due to finite number of policies and computational advantages.', 'duration': 26.236, 'max_score': 3000.513, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3000513.jpg'}], 'start': 2549.885, 'title': 'Markov decision processes', 'summary': 'Covers markov decision process control and deterministic policies in mdps, emphasizing optimal policy computation for a finite state mdp with seven discrete states and two actions, a1 and a2. it also discusses the number of deterministic policies in an mdp, highlighting that the number of possible deterministic policies is a to the s, and the advantages of policy iteration over enumeration for computing the optimal policy.', 'chapters': [{'end': 2632.393, 'start': 2549.885, 'title': 'Markov decision process control', 'summary': 'Introduces the concept of control in markov decision process, emphasizing the computation of optimal policies for a finite state mdp with seven discrete states and two actions, a1 and a2.', 'duration': 82.508, 'highlights': ['The chapter explains the shift from policy evaluation to control in Markov decision process, focusing on the computation of optimal policies for an MDP with seven discrete states and two actions, A1 and A2.', 'It emphasizes the importance of computing optimal policies and the existence of a unique optimal value function for an MDP in an infinite horizon finite state MDP.', 'The chapter discusses the deterministic nature of the optimal policy for an MDP in an infinite horizon finite state MDP, highlighting the focus on deterministic policies and finite state MDPs.', 'It mentions the consideration of two actions, A1 and A2, for seven discrete states in the context of robor locations within the MDP, while also emphasizing the stochastic nature of the scenarios.']}, {'end': 3036.557, 'start': 2634.954, 'title': 'Deterministic policies in mdps', 'summary': 'Discusses the number of deterministic policies in an mdp, where the number of possible deterministic policies is a to the s, and the optimal policy is not always unique due to ties in the value function, and policy iteration is a better technique than enumeration for computing the optimal policy.', 'duration': 401.603, 'highlights': ['The number of possible deterministic policies is A to the S, where A is the number of actions and S is the number of states. The chapter explains that the number of possible deterministic policies in a Markov Decision Process (MDP) is A to the S, where A represents the number of actions and S represents the number of states.', "The optimal policy is not always unique due to ties in the value function. It's clarified that the optimal policy in an MDP is not always unique due to the possibility of ties in the value function, leading to more than one policy with the same value.", 'Policy iteration is a technique generally better than enumeration for computing the optimal policy. The chapter explains that policy iteration is a technique generally preferred over enumeration for computing the optimal policy in an MDP, even in cases with large state spaces.']}], 'duration': 486.672, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is2549885.jpg', 'highlights': ['The number of possible deterministic policies is A to the S, where A is the number of actions and S is the number of states.', 'Policy iteration is a technique generally better than enumeration for computing the optimal policy.', 'The chapter explains the shift from policy evaluation to control in Markov decision process, focusing on the computation of optimal policies for an MDP with seven discrete states and two actions, A1 and A2.', 'It emphasizes the importance of computing optimal policies and the existence of a unique optimal value function for an MDP in an infinite horizon finite state MDP.']}, {'end': 3326.092, 'segs': [{'end': 3061.895, 'src': 'embed', 'start': 3037.625, 'weight': 0, 'content': [{'end': 3045.628, 'text': "But if you don't have kind of infinite compute, it's generally more computationally efficient if you have to do this serially to do policy iteration.", 'start': 3037.625, 'duration': 8.003}, {'end': 3046.969, 'text': "And so we'll talk about what that is.", 'start': 3045.948, 'duration': 1.021}, {'end': 3053.912, 'text': 'So in policy iteration, what we do is we basically keep track of a guess of what the optimal policy might be.', 'start': 3048.449, 'duration': 5.463}, {'end': 3057.713, 'text': 'We evaluate its value, and then we try to improve it.', 'start': 3054.772, 'duration': 2.941}, {'end': 3061.895, 'text': "If we can't improve it anymore, um, well then we can- then we can halt.", 'start': 3058.513, 'duration': 3.382}], 'summary': 'Policy iteration is a computationally efficient method for evaluating and improving the optimal policy, halting when no further improvements can be made.', 'duration': 24.27, 'max_score': 3037.625, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3037625.jpg'}, {'end': 3183.806, 'src': 'heatmap', 'start': 3090.08, 'weight': 1, 'content': [{'end': 3092.281, 'text': 'we do value function policy.', 'start': 3090.08, 'duration': 2.201}, {'end': 3097.804, 'text': "We evaluate the policy using the same sorts of techniques we just discussed because it's a fixed policy,", 'start': 3092.762, 'duration': 5.042}, {'end': 3102.867, 'text': 'which means we are now basically in a Markov reward um process and then we do policy improvement.', 'start': 3097.804, 'duration': 5.063}, {'end': 3107.169, 'text': 'And so the- really the new thing compared to what we were doing before now is policy improvement.', 'start': 3103.467, 'duration': 3.702}, {'end': 3115.89, 'text': "So in order to define how we could improve a policy, we're gonna define something new, which is the state action value.", 'start': 3110.287, 'duration': 5.603}, {'end': 3118.691, 'text': 'So before we were just talking about state values.', 'start': 3116.67, 'duration': 2.021}, {'end': 3124.734, 'text': 'State values are denoted by V.', 'start': 3118.991, 'duration': 5.743}, {'end': 3131.577, 'text': "We're talking about, like V, Pi of S, which says if you start in state S and you follow policy Pi, what is the expected discounted sum of rewards??", 'start': 3124.734, 'duration': 6.843}, {'end': 3138.731, 'text': "A state action value says well, I'm gonna follow this policy Pi, but not right away.", 'start': 3132.798, 'duration': 5.933}, {'end': 3145.136, 'text': "I'm gonna first take an action A, which might be different than what my policy is telling me to do.", 'start': 3139.472, 'duration': 5.664}, {'end': 3148.799, 'text': "And then later on the next time step, I'm gonna follow policy Pi.", 'start': 3145.756, 'duration': 3.043}, {'end': 3154.503, 'text': "So it just says I'm gonna get my immediate reward from taking this action A that I'm choosing.", 'start': 3150.2, 'duration': 4.303}, {'end': 3157.545, 'text': "And then I'm gonna transition to a new state.", 'start': 3155.744, 'duration': 1.801}, {'end': 3160.708, 'text': 'Again, that depends on my current state and the action I just took.', 'start': 3158.366, 'duration': 2.342}, {'end': 3163.37, 'text': "And from then on, I'm gonna take policy Pi.", 'start': 3161.148, 'duration': 2.222}, {'end': 3168.76, 'text': 'So that defines the Q function.', 'start': 3166.819, 'duration': 1.941}, {'end': 3178.084, 'text': "And what policy improvement does is it says, okay, you've got a policy, you just did policy evaluation, and you got a value of it.", 'start': 3170.621, 'duration': 7.463}, {'end': 3183.806, 'text': 'So policy evaluation just allowed you to compute what was the value of that policy.', 'start': 3179.625, 'duration': 4.181}], 'summary': 'The transcript discusses policy evaluation and improvement in a markov reward process, introducing the state action value and the q function.', 'duration': 93.726, 'max_score': 3090.08, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3090080.jpg'}, {'end': 3178.084, 'src': 'embed', 'start': 3145.756, 'weight': 3, 'content': [{'end': 3148.799, 'text': "And then later on the next time step, I'm gonna follow policy Pi.", 'start': 3145.756, 'duration': 3.043}, {'end': 3154.503, 'text': "So it just says I'm gonna get my immediate reward from taking this action A that I'm choosing.", 'start': 3150.2, 'duration': 4.303}, {'end': 3157.545, 'text': "And then I'm gonna transition to a new state.", 'start': 3155.744, 'duration': 1.801}, {'end': 3160.708, 'text': 'Again, that depends on my current state and the action I just took.', 'start': 3158.366, 'duration': 2.342}, {'end': 3163.37, 'text': "And from then on, I'm gonna take policy Pi.", 'start': 3161.148, 'duration': 2.222}, {'end': 3168.76, 'text': 'So that defines the Q function.', 'start': 3166.819, 'duration': 1.941}, {'end': 3178.084, 'text': "And what policy improvement does is it says, okay, you've got a policy, you just did policy evaluation, and you got a value of it.", 'start': 3170.621, 'duration': 7.463}], 'summary': 'Policy improvement defines the q function and evaluates the policy.', 'duration': 32.328, 'max_score': 3145.756, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3145756.jpg'}, {'end': 3247.069, 'src': 'embed', 'start': 3217.694, 'weight': 1, 'content': [{'end': 3224.277, 'text': "And then we're gonna compute a new policy and this is the improvement step which maximizes this Q.", 'start': 3217.694, 'duration': 6.583}, {'end': 3227.639, 'text': 'So we just do this computation and then we take the max.', 'start': 3224.277, 'duration': 3.362}, {'end': 3247.069, 'text': 'Now by definition, this has to be greater than or equal to Q Pi i of s Pi i of a, right? Because either a is equal to Pi i of a, sorry, Pi i of s.', 'start': 3229.2, 'duration': 17.869}], 'summary': 'Improvement step maximizes q, ensuring q is >= q pi i of s pi i of a.', 'duration': 29.375, 'max_score': 3217.694, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3217694.jpg'}], 'start': 3037.625, 'title': 'Reinforcement learning policy iteration', 'summary': 'Covers policy iteration in reinforcement learning, focusing on computational efficiency, iterative evaluation, and improvement of the optimal policy until convergence, as well as markov reward process and policy improvement to maximize state-action values and ensure convergence to global optima.', 'chapters': [{'end': 3097.804, 'start': 3037.625, 'title': 'Policy iteration in reinforcement learning', 'summary': 'Discusses policy iteration in reinforcement learning, emphasizing the computational efficiency when serial computation is required, and the iterative process of evaluating and improving the optimal policy until it converges, starting from a randomly initialized policy.', 'duration': 60.179, 'highlights': ["In policy iteration, it's more computationally efficient to perform policy evaluation and improvement serially, especially when infinite compute is not available.", 'The process involves keeping track of a guess of the optimal policy, evaluating its value, and iteratively improving it until it converges, starting from a randomly initialized policy.', "Policy iteration involves initializing a policy randomly, evaluating it using value function policy, and iteratively improving it while it's not changing, until convergence."]}, {'end': 3326.092, 'start': 3097.804, 'title': 'Markov reward process and policy improvement', 'summary': 'Discusses the concept of policy improvement and the computation of state-action values to maximize the q function, ensuring convergence to the global optima without susceptibility to getting stuck.', 'duration': 228.288, 'highlights': ['Policy improvement involves computing the state-action value function Q Pi, maximizing it, and deriving a new policy based on this maximization.', 'The Q function, computed for all actions and states, is used to determine a new policy that maximizes Q, ensuring convergence to the global optima.', 'The approach guarantees convergence to the global optima, eliminating susceptibility to getting stuck, unlike some gradient-based reinforcement learning methods.']}], 'duration': 288.467, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3037625.jpg', 'highlights': ['Policy iteration involves initializing a policy randomly, evaluating it using value function policy, and iteratively improving it until it converges.', 'Policy improvement involves computing the state-action value function Q Pi, maximizing it, and deriving a new policy based on this maximization.', "In policy iteration, it's more computationally efficient to perform policy evaluation and improvement serially, especially when infinite compute is not available.", 'The Q function, computed for all actions and states, is used to determine a new policy that maximizes Q, ensuring convergence to the global optima.']}, {'end': 3791.881, 'segs': [{'end': 3349.183, 'src': 'embed', 'start': 3326.172, 'weight': 0, 'content': [{'end': 3333.775, 'text': 'You do this policy evaluation and then you compute the Q function and then you compute the new policy that takes an argmax of the Q function.', 'start': 3326.172, 'duration': 7.603}, {'end': 3336.797, 'text': "So that's how policy improvement works.", 'start': 3334.936, 'duration': 1.861}, {'end': 3339.538, 'text': 'The next critical question that Aris was bringing up is okay.', 'start': 3337.137, 'duration': 2.401}, {'end': 3340.399, 'text': 'why do we do this?', 'start': 3339.538, 'duration': 0.861}, {'end': 3341.279, 'text': 'and is this a good idea?', 'start': 3340.399, 'duration': 0.88}, {'end': 3349.183, 'text': "So, when we look at this, um, let's look through this step a little bit more.", 'start': 3344.621, 'duration': 4.562}], 'summary': 'Policy improvement involves computing q function and new policy to maximize it.', 'duration': 23.011, 'max_score': 3326.172, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3326172.jpg'}, {'end': 3416.896, 'src': 'embed', 'start': 3379.232, 'weight': 2, 'content': [{'end': 3384.625, 'text': "The previous policy that we're using, before.", 'start': 3379.232, 'duration': 5.393}, {'end': 3406.71, 'text': "So what I've done there is, I've said, okay, the max action over the Q has to be at least as good as following your old policy by definition,", 'start': 3397.844, 'duration': 8.866}, {'end': 3410.712, 'text': "because otherwise you could always pick the same policy as before, or else you're gonna pick a better action.", 'start': 3406.71, 'duration': 4.002}, {'end': 3416.896, 'text': 'And this reward function here is just exactly the definition of the value of your old policy.', 'start': 3411.613, 'duration': 5.283}], 'summary': 'The new policy aims to ensure that the maximum action over q is at least as good as the old policy, to avoid picking the same or worse actions.', 'duration': 37.664, 'max_score': 3379.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3379232.jpg'}, {'end': 3472.474, 'src': 'embed', 'start': 3446.682, 'weight': 1, 'content': [{'end': 3452.548, 'text': 'What does this Q function represent? It represents if I take an action and then I follow my old policy from then onwards.', 'start': 3446.682, 'duration': 5.866}, {'end': 3459.034, 'text': "And then I'm picking whatever action is maximizing that quantity for each state.", 'start': 3453.969, 'duration': 5.065}, {'end': 3461.797, 'text': "Okay So I'm gonna do this process for each state.", 'start': 3459.895, 'duration': 1.902}, {'end': 3468.393, 'text': "But then so that's gonna just define a new policy, right?", 'start': 3465.572, 'duration': 2.821}, {'end': 3472.474, 'text': "Like I said, that might be the same or it could be a different policy than the one you've had before.", 'start': 3468.493, 'duration': 3.981}], 'summary': 'The q function represents action maximization for each state in defining a new policy.', 'duration': 25.792, 'max_score': 3446.682, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3446682.jpg'}, {'end': 3602.219, 'src': 'embed', 'start': 3546.537, 'weight': 3, 'content': [{'end': 3552.781, 'text': 'Okay So why do we get a monotonic improvement in the policy value by doing this? So in the policy value.', 'start': 3546.537, 'duration': 6.244}, {'end': 3554.524, 'text': 'So what-?', 'start': 3554.224, 'duration': 0.3}, {'end': 3556.685, 'text': 'first of all, what do I mean by monotonic improvement??', 'start': 3554.524, 'duration': 2.161}, {'end': 3566.747, 'text': 'Um, what I mean is that the value uh, something is monotonic if, um, the new policy is greater than equal to the old policy for all states.', 'start': 3556.705, 'duration': 10.042}, {'end': 3569.528, 'text': 'So it has to either have the same value or be better.', 'start': 3567.627, 'duration': 1.901}, {'end': 3578.73, 'text': 'And my proposition is that the new policy is greater than or equal to the old policy in all states, with strict inequality,', 'start': 3570.268, 'duration': 8.462}, {'end': 3580.01, 'text': 'if the old policy was suboptimal.', 'start': 3578.73, 'duration': 1.28}, {'end': 3589.937, 'text': 'So why does this work? So it works for the following reasons.', 'start': 3585.236, 'duration': 4.701}, {'end': 3592.837, 'text': "Let's go ahead and just like walk through the proof briefly.", 'start': 3590.577, 'duration': 2.26}, {'end': 3602.219, 'text': "Okay So this is what we've said here is that, um, V Pi i of s, that's our old value of our policy.", 'start': 3592.857, 'duration': 9.362}], 'summary': 'Monotonic improvement in policy value ensures new policy is better or equal to old policy for all states', 'duration': 55.682, 'max_score': 3546.537, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3546537.jpg'}], 'start': 3326.172, 'title': 'Policy improvement in q function computation', 'summary': 'Discusses policy improvement through q function computation, emphasizing the importance of ensuring new policy is at least as good as the previous policy and defining a new policy through q function computation. additionally, it explores how a new policy can lead to a guaranteed improvement in value function, resulting in a monotonic increase in policy value over the old policy.', 'chapters': [{'end': 3468.393, 'start': 3326.172, 'title': 'Policy improvement and q function computation', 'summary': 'Discusses policy improvement through q function computation, highlighting the importance of ensuring new policy is at least as good as the previous policy and the process of defining a new policy through the q function computation.', 'duration': 142.221, 'highlights': ['The process involves computing the Q function and deriving a new policy that ensures the new policy is at least as good as the previous policy.', 'The Q function represents the value of taking an action and following the old policy afterwards, defining a new policy based on maximizing this value for each state.', "The max over the Q function must be at least as good as the old policy's value, ensuring improvement in the new policy."]}, {'end': 3791.881, 'start': 3468.493, 'title': 'Policy improvement and monotonic value increase', 'summary': 'Discusses the concept of policy improvement, demonstrating how a new policy can lead to a guaranteed improvement in value function, resulting in a monotonic increase in policy value over the old policy.', 'duration': 323.388, 'highlights': ['By following a new policy and evaluating it for all time steps, a guaranteed improvement in value function can be achieved, leading to a monotonic increase in policy value over the old policy.', 'The new policy is guaranteed to be at least as good as the old policy in terms of the value function, and it is consistently better if the old policy was suboptimal.', 'The proof involves demonstrating that the value of the new policy is by definition at least as good as the previous value function, resulting in a monotonic improvement in policy value.']}], 'duration': 465.709, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3326172.jpg', 'highlights': ['The process involves computing the Q function and deriving a new policy that ensures the new policy is at least as good as the previous policy.', 'The Q function represents the value of taking an action and following the old policy afterwards, defining a new policy based on maximizing this value for each state.', "The max over the Q function must be at least as good as the old policy's value, ensuring improvement in the new policy.", 'By following a new policy and evaluating it for all time steps, a guaranteed improvement in value function can be achieved, leading to a monotonic increase in policy value over the old policy.', 'The new policy is guaranteed to be at least as good as the old policy in terms of the value function, and it is consistently better if the old policy was suboptimal.', 'The proof involves demonstrating that the value of the new policy is by definition at least as good as the previous value function, resulting in a monotonic improvement in policy value.']}, {'end': 4407.931, 'segs': [{'end': 3977.328, 'src': 'heatmap', 'start': 3927.216, 'weight': 0.852, 'content': [{'end': 3931.178, 'text': 'Okay So policy iteration computes, um, the optimal value in a policy in one way.', 'start': 3927.216, 'duration': 3.962}, {'end': 3935.819, 'text': 'The idea in policy iteration is you always have a policy um,', 'start': 3931.418, 'duration': 4.401}, {'end': 3942.062, 'text': 'that is- that you know the value of it for the infinite horizon and then you incrementally try to improve it.', 'start': 3935.819, 'duration': 6.243}, {'end': 3944.443, 'text': 'Value iteration is an alternative approach.', 'start': 3942.642, 'duration': 1.801}, {'end': 3951.008, 'text': "Value iteration instead says, we're gonna think of computing the optimal value if you get to act for a finite number of steps.", 'start': 3945.123, 'duration': 5.885}, {'end': 3955.112, 'text': 'The beginning just one step, and then two steps, and then three steps, et cetera.', 'start': 3951.749, 'duration': 3.363}, {'end': 3958.335, 'text': 'Um, and you just keep iterating to longer and longer.', 'start': 3955.933, 'duration': 2.402}, {'end': 3964.561, 'text': "So that's different, right? Because policy says you always have a policy and you know what its value is, it just might not be very good.", 'start': 3959.076, 'duration': 5.485}, {'end': 3971.447, 'text': "Value iteration says you always know what the optimal value of policy is, but only if you're gonna get to act for, say, k times steps.", 'start': 3965.241, 'duration': 6.206}, {'end': 3977.328, 'text': "So they're just- they're computing different things, um, and they both will converge to the same thing eventually.", 'start': 3972.466, 'duration': 4.862}], 'summary': 'Policy iteration computes optimal value for infinite horizon, while value iteration computes for finite steps. both converge to the same result.', 'duration': 50.112, 'max_score': 3927.216, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3927216.jpg'}, {'end': 4086.658, 'src': 'embed', 'start': 4056.628, 'weight': 4, 'content': [{'end': 4062.915, 'text': 'You start off, you can initialize your value function to zero for all states, and then you loop until you converge.', 'start': 4056.628, 'duration': 6.287}, {'end': 4069.743, 'text': "Um, or if you're doing a finite horizon, which we might not have time to get to today, but, um, uh, then you'd go to that horizon.", 'start': 4063.616, 'duration': 6.127}, {'end': 4073.407, 'text': 'And basically for each state, you do this Bellman backup operator.', 'start': 4070.024, 'duration': 3.383}, {'end': 4086.658, 'text': "So you'd say my value at k plus 1 time steps for that state is if I get to pick the best immediate action plus the discounted sum of future rewards using that old value function I had from the previous time step.", 'start': 4074.068, 'duration': 12.59}], 'summary': 'Initialize value function to zero, iterate for convergence, apply bellman backup operator for future rewards.', 'duration': 30.03, 'max_score': 4056.628, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is4056628.jpg'}, {'end': 4227.124, 'src': 'embed', 'start': 4202.305, 'weight': 0, 'content': [{'end': 4209.928, 'text': 'So, if you wanna- if an operator is a contraction it means that if you apply it to two different things, you can think of these as value functions um,', 'start': 4202.305, 'duration': 7.623}, {'end': 4217.871, 'text': 'then the distance between them shrinks after um, or at least is no bigger after you apply the operator compared to their distance before.', 'start': 4209.928, 'duration': 7.943}, {'end': 4223.422, 'text': "So just to, um, actually I'll save examples for later.", 'start': 4219.54, 'duration': 3.882}, {'end': 4227.124, 'text': 'Feel free to come up to me after class if you want to see an example of this, um, or I can do it on Piazza.', 'start': 4223.442, 'duration': 3.682}], 'summary': 'Operator contraction: distance between values shrinks after application.', 'duration': 24.819, 'max_score': 4202.305, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is4202305.jpg'}, {'end': 4284.582, 'src': 'embed', 'start': 4240.891, 'weight': 1, 'content': [{'end': 4247.833, 'text': 'So the key question of whether or not value iteration will converge is because the Bellman backup is a contraction operator.', 'start': 4240.891, 'duration': 6.942}, {'end': 4252.574, 'text': "And it's a contraction operator as long as Gamma is less than 1.", 'start': 4247.853, 'duration': 4.721}, {'end': 4255.995, 'text': "Which means that if you do- if- let's say, you had two different Bel-,", 'start': 4252.574, 'duration': 3.421}, {'end': 4261.056, 'text': 'two different value functions and then you did the Bellman backup on both of them, then the distance between them would shrink.', 'start': 4255.995, 'duration': 5.061}, {'end': 4269.813, 'text': "So how do we prove this? Um, we prove it, For interest of time, I'll show you the proof again.", 'start': 4263.417, 'duration': 6.396}, {'end': 4273.435, 'text': "I'm happy to go through it, um, uh, or we can go through it in office hours, et cetera.", 'start': 4269.833, 'duration': 3.602}, {'end': 4275.377, 'text': 'Let me just show it kind of briefly.', 'start': 4273.475, 'duration': 1.902}, {'end': 4284.582, 'text': 'Um, so the idea to- to prove that the Bellman backup is a contraction operator is we consider there being two different value functions, K and J.', 'start': 4275.397, 'duration': 9.185}], 'summary': 'Value iteration converges if gamma is less than 1, as bellman backup is a contraction operator.', 'duration': 43.691, 'max_score': 4240.891, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is4240891.jpg'}, {'end': 4393.933, 'src': 'embed', 'start': 4366.755, 'weight': 3, 'content': [{'end': 4375.458, 'text': "So I think a good exercise to do, um, is to then say, given that it's a contraction operator, um, that means it has to converge to a fixed point.", 'start': 4366.755, 'duration': 8.703}, {'end': 4376.839, 'text': 'There has to be a unique solution.', 'start': 4375.598, 'duration': 1.241}, {'end': 4386.27, 'text': "So if you apply the Bellman operator repeatedly, there's a single fixed point that you will go to, which is a single, um, vector of value, uh, values.", 'start': 4378.206, 'duration': 8.064}, {'end': 4393.933, 'text': "It's also good to think about whether the initialization of values impacts anything if you only care about the result after it's converged.", 'start': 4387.55, 'duration': 6.383}], 'summary': 'Contraction operator converges to single unique fixed point, unaffected by initialization.', 'duration': 27.178, 'max_score': 4366.755, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is4366755.jpg'}], 'start': 3795.645, 'title': 'Policy improvement and convergence', 'summary': 'Discusses reaching policy improvement limits, exploring the possibility of reaching a maximum number of iterations for policy change. it also delves into policy iteration, value iteration, and the convergence of bellman backup, highlighting the unique solution and minimal impact of initialization on the converged result.', 'chapters': [{'end': 3847.447, 'start': 3795.645, 'title': 'Reaching policy improvement limits', 'summary': 'Discusses the concept of monotonic improvement in policy and explores the possibility of reaching a maximum number of iterations for policy change.', 'duration': 51.802, 'highlights': ["The concept of monotonic improvement in policy is explored, raising the question of whether the policy can ever change if it doesn't initially change (relevance score: 5)", 'The potential existence of a maximum number of iterations for policy change is considered, prompting further discussion on the limitations of policy iteration (relevance score: 4)']}, {'end': 4238.13, 'start': 3848.703, 'title': 'Policy iteration and value iteration', 'summary': 'Explores policy iteration and value iteration, discussing whether the policy can change, the maximum number of policy iterations, and the differences between policy iteration and value iteration. it also delves into the bellman equation, bellman backup operators, and how value iteration works, as well as the contraction aspect of value iteration.', 'duration': 389.427, 'highlights': ['Value iteration computes the optimal value in a policy by incrementally trying to improve it, while policy iteration always has a policy with known value and incrementally tries to improve it. Difference between value iteration and policy iteration, the approach of computing optimal value, and comparison of incremental improvement.', 'The chapter explains the Bellman equation and Bellman backup operators, commonly used in Markov decision processes and reinforcement learning. Explanation of the Bellman equation, its importance in Markov decision processes, and its application as a backup operator.', 'Value iteration initializes the value function to zero for all states and loops until convergence, using the Bellman backup operator to compute the optimal value for each state. Process of initializing and looping in value iteration, the use of the Bellman backup operator, and the computation of optimal values for states.', 'The concept of contraction in value iteration is introduced, where an operator shrinks the distance between two different value functions, ensuring that their distance does not increase after applying the operator. Introduction to the contraction aspect in value iteration, the impact on distance between value functions, and the purpose of the contraction.']}, {'end': 4407.931, 'start': 4240.891, 'title': 'Convergence of bellman backup', 'summary': 'Discusses the proof that the bellman backup is a contraction operator as long as gamma is less than 1, which ensures convergence to a single fixed point, implying a unique solution and minimal impact of initialization on the converged result.', 'duration': 167.04, 'highlights': ['The Bellman backup is a contraction operator as long as Gamma is less than 1, ensuring the distance between two different value functions shrinks after applying the Bellman backup.', 'The proof involves re-expressing two different value functions after applying the Bellman backup operator and bounding the difference between them by the max norm of the distance, demonstrating that the Bellman backup is a contraction operator.', 'The contraction property implies convergence to a single fixed point and a unique solution, with minimal impact of the initialization of values on the converged result.']}], 'duration': 612.286, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/E3f2Camj0Is/pics/E3f2Camj0Is3795645.jpg', 'highlights': ['The concept of contraction in value iteration is introduced, ensuring that the distance between value functions does not increase after applying the operator.', 'The Bellman backup is a contraction operator as long as Gamma is less than 1, ensuring the distance between two different value functions shrinks after applying the Bellman backup.', 'The proof involves re-expressing two different value functions after applying the Bellman backup operator and bounding the difference between them by the max norm of the distance, demonstrating that the Bellman backup is a contraction operator.', 'The contraction property implies convergence to a single fixed point and a unique solution, with minimal impact of the initialization of values on the converged result.', 'Value iteration initializes the value function to zero for all states and loops until convergence, using the Bellman backup operator to compute the optimal value for each state.']}], 'highlights': ['The planning problem in reinforcement learning involves making decisions when actions may have delayed consequences, requiring the sacrifice of immediate reward to maximize long-term reward.', 'The computational complexity of iterative dynamic programming is s squared, which is lower compared to the direct solution involving matrix inversion.', 'Policy iteration involves initializing a policy randomly, evaluating it using value function policy, and iteratively improving it until it converges.', 'The Q function, computed for all actions and states, is used to determine a new policy that maximizes Q, ensuring convergence to the global optima.', 'The concept of contraction in value iteration is introduced, ensuring that the distance between value functions does not increase after applying the operator.']}