title

DeepMind x UCL RL Lecture Series - Model-free Prediction [5/13]

description

Research Scientist Hado van Hasselt takes a closer look at model-free prediction and its relation to Monte Carlo and temporal difference algorithms.
Slides: https://dpmd.ai/modelfreeprediction
Full video lecture series: https://dpmd.ai/DeepMindxUCL21

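The description mentions Monte Carlo and temporal difference algorithms side by side. As a rough illustration of how the two prediction methods differ, here is a minimal tabular sketch on a toy two-state episodic task; the environment, state names, and step sizes are illustrative assumptions, not taken from the lecture:

```python
import random

# Toy episodic chain used only as a stand-in environment:
# A -> B -> terminal, with a noisy reward (expected value 1) on the final step.
def sample_episode():
    reward = random.choice([0, 2])  # E[reward] = 1
    return [("A", 0), ("B", reward)]  # (state, reward received on leaving it)

def mc_prediction(num_episodes=5000, alpha=0.01, gamma=1.0):
    """Every-visit Monte Carlo: update each state toward the full sampled return."""
    v = {"A": 0.0, "B": 0.0}
    for _ in range(num_episodes):
        episode = sample_episode()
        g = 0.0
        # Walk backwards so g accumulates the return from each state onward.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            v[state] += alpha * (g - v[state])
    return v

def td0_prediction(num_episodes=5000, alpha=0.01, gamma=1.0):
    """TD(0): update toward reward plus the bootstrapped value of the next state."""
    v = {"A": 0.0, "B": 0.0}
    for _ in range(num_episodes):
        episode = sample_episode()
        for i, (state, reward) in enumerate(episode):
            next_v = v[episode[i + 1][0]] if i + 1 < len(episode) else 0.0
            v[state] += alpha * (reward + gamma * next_v - v[state])
    return v
```

With enough episodes both estimates approach the true expected return of 1 for each state; the difference is that Monte Carlo waits for the complete return of an episode, while TD(0) bootstraps on the next state's current estimate after every step.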
detail

{'title': 'DeepMind x UCL RL Lecture Series - Model-free Prediction [5/13]', 'heatmap': [], 'summary': 'Series covers model-free prediction in reinforcement learning, lecture breaks for reflection, introduction to blackjack, monte carlo and temporal difference learning, reinforcement learning methods, comparison of temporal difference and monte carlo learning, and limitations of mixed multi-step approaches in reinforcement learning.', 'chapters': [{'end': 38.252, 'segs': [{'end': 45.358, 'src': 'embed', 'start': 20.077, 'weight': 0, 'content': [{'end': 27.403, 'text': 'The most important chapters for this lecture are chapters five and six, and we will cover some material from these other chapters as well,', 'start': 20.077, 'duration': 7.326}, {'end': 30.465, 'text': 'but some of that will be shared with the subsequent lectures.', 'start': 27.403, 'duration': 3.062}, {'end': 33.408, 'text': 'So this is actually background material for a couple of lectures in a row.', 'start': 30.546, 'duration': 2.862}, {'end': 38.252, 'text': 'We will just not exactly go through these in exactly the same sequence as the book does.', 'start': 34.188, 'duration': 4.064}, {'end': 42.455, 'text': 'This is why we list a fairly large chunk of background material here.', 'start': 38.492, 'duration': 3.963}, {'end': 45.358, 'text': 'Feel free to defer some of that reading until later.', 'start': 43.156, 'duration': 2.202}], 'summary': 'Focus on chapters 5 and 6, with additional material shared in subsequent lectures.', 'duration': 25.281, 'max_score': 20.077, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw20077.jpg'}], 'start': 1.047, 'title': 'Model-free prediction in reinforcement learning', 'summary': 'Covers model-free prediction in reinforcement learning, with a focus on chapters five and six, along with additional material for subsequent lectures, setting the background for multiple upcoming lectures.', 'chapters': [{'end': 
38.252, 'start': 1.047, 'title': 'Model-free prediction in reinforcement learning', 'summary': 'Covers model-free prediction in reinforcement learning, focusing on chapters five and six, with additional material covering subsequent lectures and shared content, setting the background for multiple upcoming lectures.', 'duration': 37.205, 'highlights': ['The most important chapters for this lecture are chapters five and six.', 'The material will be shared with subsequent lectures.', 'This background material sets the stage for multiple upcoming lectures.']}], 'duration': 37.205, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1047.jpg', 'highlights': ['The most important chapters for this lecture are chapters five and six.', 'This background material sets the stage for multiple upcoming lectures.', 'The material will be shared with subsequent lectures.']}, {'end': 1375.326, 'segs': [{'end': 77.042, 'src': 'embed', 'start': 38.492, 'weight': 0, 'content': [{'end': 42.455, 'text': 'This is why we list a fairly large chunk of background material here.', 'start': 38.492, 'duration': 3.963}, {'end': 45.358, 'text': 'Feel free to defer some of that reading until later.', 'start': 43.156, 'duration': 2.202}, {'end': 50.315, 'text': 'In fact, it might help the understanding go through the material not all at once, but to revisit it later.', 'start': 45.398, 'duration': 4.917}, {'end': 54.836, 'text': "Also, don't forget to pause during the lecture, I mean.", 'start': 52.275, 'duration': 2.561}, {'end': 58.397, 'text': 'Sometimes I will ask you a question, ask you to think about something, and of course,', 'start': 55.356, 'duration': 3.041}, {'end': 62.918, 'text': "that's a good occasion to actually pause for a second and actually reflect, maybe write some stuff down.", 'start': 58.397, 'duration': 4.521}, {'end': 65.719, 'text': 'But also, like I said, this is a fairly long lecture,', 'start': 63.498, 'duration': 
2.221}, {'end': 71.861, 'text': 'so feel free to make use of the fact that this is a recording and therefore you can pause and maybe even take a break,', 'start': 65.719, 'duration': 6.142}, {'end': 77.042, 'text': 'or maybe even continue the lecture over more than one day, if that works for you.', 'start': 71.861, 'duration': 5.181}], 'summary': 'Lecture encourages revisiting, pausing, and breaking for better understanding.', 'duration': 38.55, 'max_score': 38.492, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw38492.jpg'}, {'end': 203.164, 'src': 'embed', 'start': 171.976, 'weight': 3, 'content': [{'end': 175.078, 'text': "And at first we'll talk about model-free prediction in this lecture,", 'start': 171.976, 'duration': 3.102}, {'end': 179.762, 'text': 'which is the process to estimate values when you do not have the Markov decision process.', 'start': 175.078, 'duration': 4.684}, {'end': 181.763, 'text': "You don't know what it is, but you can interact with it.", 'start': 179.822, 'duration': 1.941}, {'end': 185.116, 'text': "This, of course, is the case when you're, for instance, in the real world.", 'start': 182.315, 'duration': 2.801}, {'end': 191.039, 'text': 'You could imagine that the world maybe has some sort of a really large Markov decision process underneath it,', 'start': 185.616, 'duration': 5.423}, {'end': 192.979, 'text': "but you don't have immediate access to that.", 'start': 191.039, 'duration': 1.94}, {'end': 194.8, 'text': 'So instead, all that you can do is interact.', 'start': 193.019, 'duration': 1.781}, {'end': 203.164, 'text': 'After model-free prediction, we can talk about model-free control, which is the process to optimize values rather than estimating them.', 'start': 197.761, 'duration': 5.403}], 'summary': 'Model-free prediction and control discussed for Markov decision processes.', 'duration': 31.188, 'max_score': 171.976, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw171976.jpg'}, {'end': 280.242, 'src': 'embed', 'start': 248.79, 'weight': 4, 'content': [{'end': 250.011, 'text': 'more deep reinforcement learning.', 'start': 248.79, 'duration': 1.221}, {'end': 255.053, 'text': 'Finally, we will cover some advanced topics in current research, but only much later.', 'start': 250.951, 'duration': 4.102}, {'end': 258.755, 'text': "Okay, so let's get started.", 'start': 257.613, 'duration': 1.142}, {'end': 265.158, 'text': "Our first topic will be Monte Carlo algorithms, and I'll explain in a moment what that means.", 'start': 259.654, 'duration': 5.504}, {'end': 270.833, 'text': 'So the point here is to use sampling.', 'start': 267.97, 'duration': 2.863}, {'end': 274.917, 'text': "We're going to interact with the world and this will allow us to learn without a model.", 'start': 270.853, 'duration': 4.064}, {'end': 280.242, 'text': "If we're sampling complete episodes in reinforcement learning, we call this Monte Carlo.", 'start': 275.758, 'duration': 4.484}], 'summary': 'Introduction to monte carlo algorithms for reinforcement learning.', 'duration': 31.452, 'max_score': 248.79, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw248790.jpg'}, {'end': 449.79, 'src': 'embed', 'start': 426.089, 'weight': 5, 'content': [{'end': 433.375, 'text': "That's just a small notational note to avoid confusion between this lecture and the earlier lecture on bandits.", 'start': 426.089, 'duration': 7.286}, {'end': 441.177, 'text': "Now we're going to extend slightly to make this more general, and we're going to consider bandits with states.", 'start': 435.748, 'duration': 5.429}, {'end': 447.266, 'text': 'For now, the episodes will still remain to be one step long, as in the bandit case before,', 'start': 442.078, 'duration': 5.188}, {'end': 449.79, 'text': 'and this means that actions do not affect the 
state transitions.', 'start': 447.266, 'duration': 2.524}], 'summary': 'Extending bandits to include states while maintaining one-step episodes.', 'duration': 23.701, 'max_score': 426.089, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw426089.jpg'}, {'end': 547.054, 'src': 'embed', 'start': 518.621, 'weight': 6, 'content': [{'end': 523.823, 'text': "So far we've mostly considered lookup tables where every state has its own entry.", 'start': 518.621, 'duration': 5.202}, {'end': 527.093, 'text': 'or every state action pair has its own entry.', 'start': 525.311, 'duration': 1.782}, {'end': 533.7, 'text': "So think of this as a big table stored in the robot's brain or in the agent's brain.", 'start': 527.493, 'duration': 6.207}, {'end': 538.885, 'text': 'And for every state and action you might see, you just have a separate entry that you might update.', 'start': 534.541, 'duration': 4.344}, {'end': 543.069, 'text': 'But of course this comes with some problems,', 'start': 539.926, 'duration': 3.143}, {'end': 547.054, 'text': 'because there might be way too many states and their actions to actually store this effectively in memory.', 'start': 543.069, 'duration': 3.985}], 'summary': 'Considered lookup tables for state-action pairs, but may have memory storage issues.', 'duration': 28.433, 'max_score': 518.621, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw518621.jpg'}, {'end': 630.783, 'src': 'embed', 'start': 591.94, 'weight': 7, 'content': [{'end': 596.821, 'text': "But even if that's the case, then it could be very large, but of course often it's also not the case.", 'start': 591.94, 'duration': 4.881}, {'end': 601.499, 'text': 'Now you could still have a finite agent state and maybe store these separately,', 'start': 597.697, 'duration': 3.802}, {'end': 605.702, 'text': 'but then you still suffer from this other problem that it might be very slow if 
you have many different agent states.', 'start': 601.499, 'duration': 4.203}, {'end': 611.165, 'text': 'So our solution for those problems is to use function approximation.', 'start': 607.603, 'duration': 3.562}, {'end': 622.812, 'text': "So we write VW or QW where W is now some sort of a parameter vector and we're going to update these parameters instead of updating cells in a big table.", 'start': 611.885, 'duration': 10.927}, {'end': 628.297, 'text': 'So the parameter vector w will be updated using Monte Carlo or Temporal Difference Learning,', 'start': 624.127, 'duration': 4.17}, {'end': 630.783, 'text': "which are two algorithms that we're going to talk about in this lecture.", 'start': 628.297, 'duration': 2.486}], 'summary': 'Using function approximation to update parameter vector w instead of cells in a big table, addressing problems of speed and size in agent states.', 'duration': 38.843, 'max_score': 591.94, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw591940.jpg'}, {'end': 939.509, 'src': 'embed', 'start': 914.673, 'weight': 9, 'content': [{'end': 922.676, 'text': 'which defines a squared distance between our current estimates in this case according to this linear function and the true value function v pi.', 'start': 914.673, 'duration': 8.003}, {'end': 929.178, 'text': "Obviously we don't have v pi, so we're going to replace this with things that we can use, but for now keep this objective in mind.", 'start': 923.276, 'duration': 5.902}, {'end': 939.509, 'text': 'If we could then compute stochastic gradients for this objective, this would converge to a global optimum of this loss function,', 'start': 931.044, 'duration': 8.465}], 'summary': 'Developing a method to compute stochastic gradients for optimizing the loss function.', 'duration': 24.836, 'max_score': 914.673, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw914673.jpg'}, {'end': 
984.741, 'src': 'embed', 'start': 955.478, 'weight': 10, 'content': [{'end': 962.833, 'text': 'It could be that the features are not good enough to be able to accurately predict the value for every state.', 'start': 955.478, 'duration': 7.355}, {'end': 968.795, 'text': 'If we do stochastic gradient descent, the update rule is very simple.', 'start': 965.854, 'duration': 2.941}, {'end': 977.718, 'text': 'We first note that the gradient of our value function with respect to our parameters w is simply the features we see here on the left-hand side.', 'start': 969.556, 'duration': 8.162}, {'end': 984.741, 'text': 'So at time step t, if we see state st, the gradient of our value function on that state will simply be xt.', 'start': 978.499, 'duration': 6.242}], 'summary': 'Stochastic gradient descent uses simple update rule to optimize value function features for state prediction.', 'duration': 29.263, 'max_score': 955.478, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw955478.jpg'}, {'end': 1224.042, 'src': 'embed', 'start': 1198.64, 'weight': 11, 'content': [{'end': 1204.823, 'text': 'But this process will converge under suitable assumptions to the best parameter vector that you can find.', 'start': 1198.64, 'duration': 6.183}, {'end': 1214.658, 'text': 'So again, for linear functions, we basically now are going to extend the action value approach also to linear functions,', 'start': 1208.976, 'duration': 5.682}, {'end': 1218.06, 'text': "where we're going to assume that we have features for each state action pair.", 'start': 1214.658, 'duration': 3.402}, {'end': 1224.042, 'text': 'And then we can just multiply our parameter vector w with these features,', 'start': 1219.4, 'duration': 4.642}], 'summary': 'Extending action value approach to linear functions with features for each state action pair.', 'duration': 25.402, 'max_score': 1198.64, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1198640.jpg'}], 'start': 38.492, 'title': 'Optimizing learning through lecture breaks and reinforcement learning recap', 'summary': 'Emphasizes the benefits of lecture breaks for reflection and note-taking, and introduces reinforcement learning concepts including model-free prediction, monte carlo algorithms, bandits with states, and function approximation. it also discusses challenges of using lookup tables in rl, proposes function approximation, and explains linear function approximation for value function estimation in rl.', 'chapters': [{'end': 77.042, 'start': 38.492, 'title': 'Optimizing learning through lecture breaks', 'summary': 'Emphasizes the benefits of pausing during lectures for reflection and note-taking, and suggests breaking the lecture into multiple sessions for improved understanding and retention.', 'duration': 38.55, 'highlights': ['Pausing during lectures for reflection and note-taking can improve understanding and retention.', 'Breaking the lecture into multiple sessions can aid in better comprehension and retention.', 'Encouraging students to revisit background material for improved understanding.']}, {'end': 518.061, 'start': 77.526, 'title': 'Reinforcement learning recap and model-free prediction', 'summary': 'Recaps reinforcement learning concepts, including model-free prediction, and introduces topics such as monte carlo algorithms, bandits with states, and function approximation.', 'duration': 440.535, 'highlights': ['The chapter explains reinforcement learning as the science of learning how to make decisions through an agent interacting with an environment, involving policies, value functions, and models.', 'The chapter introduces the concept of model-free prediction in reinforcement learning, which involves estimating values without knowledge of the Markov decision process, and discusses the application of this concept in real-world 
scenarios.', 'The chapter discusses Monte Carlo algorithms in reinforcement learning, emphasizing the use of sampling to learn without a model and the specific usage of Monte Carlo sampling to refer to complete episode sampling.', 'The chapter delves into the concept of bandits with states in reinforcement learning, explaining the extension from the normal bandit case to contextual bandits, where the goal is to estimate expected rewards based on both actions and states.', 'The chapter touches on the topic of function approximation in reinforcement learning, indicating a shift in focus before returning to discuss bandits with states and sequential decision processes.']}, {'end': 869.45, 'start': 518.621, 'title': 'Function approximation in rl', 'summary': 'Discusses the challenges of using lookup tables for every state-action pair, introduces the concept of agent state, and proposes using function approximation with parameter vector update instead of updating cells in a big table, aiming to generalize to unseen states and learn value before visiting them.', 'duration': 350.829, 'highlights': ['The challenges of using lookup tables for every state-action pair due to the potential inefficiency of memory storage and slow learning when updating each state independently.', 'Introduction of the concept of agent state and the need for function approximation to address the issues of slow learning and unobservable states, with the aim to generalize to unseen states and update similar states automatically.', 'The proposal to use function approximation with parameter vector update instead of updating cells in a big table to potentially generalize to unseen states and update the values of similar states automatically.']}, {'end': 1375.326, 'start': 871.751, 'title': 'Linear function approximation', 'summary': 'Explains the use of linear function approximation for value function estimation in reinforcement learning, emphasizing the update rules, objective function, and 
stochastic gradient descent, with a focus on minimizing loss and optimizing parameter vector w.', 'duration': 503.575, 'highlights': ['The objective function for predictions in linear function approximation is defined as minimizing the squared distance between current estimates and the true value function v pi.', 'Stochastic gradient descent is used for updating the parameter vector w, where the step size parameter and prediction error term feature prominently in the update rule.', 'The chapter elaborates on the extension of the action value approach to linear functions, highlighting the use of features for each state-action pair and the corresponding stochastic gradient descent update for the weights.']}], 'duration': 1336.834, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw38492.jpg', 'highlights': ['Encouraging students to revisit background material for improved understanding.', 'Breaking the lecture into multiple sessions can aid in better comprehension and retention.', 'Pausing during lectures for reflection and note-taking can improve understanding and retention.', 'The chapter introduces the concept of model-free prediction in reinforcement learning.', 'The chapter discusses Monte Carlo algorithms in reinforcement learning.', 'The chapter delves into the concept of bandits with states in reinforcement learning.', 'The challenges of using lookup tables for every state-action pair due to potential inefficiency of memory storage and slow learning.', 'Introduction of the concept of agent state and the need for function approximation to address the issues of slow learning and unobservable states.', 'The proposal to use function approximation with parameter vector update instead of updating cells in a big table.', 'The objective function for predictions in linear function approximation is defined as minimizing the squared distance between current estimates and the true value function v pi.', 'Stochastic 
gradient descent is used for updating the parameter vector w.', 'The chapter elaborates on the extension of the action value approach to linear functions.']}, {'end': 1823.181, 'segs': [{'end': 1410.713, 'src': 'embed', 'start': 1377.626, 'weight': 0, 'content': [{'end': 1379.326, 'text': 'This example is in the game of Blackjack.', 'start': 1377.626, 'duration': 1.7}, {'end': 1386.148, 'text': 'And Blackjack is a card game in which the goal is to get more points than an opponent called the dealer.', 'start': 1379.767, 'duration': 6.381}, {'end': 1393.95, 'text': "We're going to go first and we're going to try to accumulate as many points as we can, but not more than 21.", 'start': 1387.168, 'duration': 6.782}, {'end': 1397.471, 'text': 'So if you get more than 21 points, you go bust, as they say.', 'start': 1393.95, 'duration': 3.521}, {'end': 1404.768, 'text': 'And so therefore, basically your goal is to get as close to 21 without going beyond it.', 'start': 1398.702, 'duration': 6.066}, {'end': 1410.713, 'text': "To do so, you're going to draw cards and each of these cards is going to be worth a certain amount of points.", 'start': 1405.668, 'duration': 5.045}], 'summary': 'Blackjack is a card game where players aim to get close to 21 points without going over, and the goal is to beat the dealer.', 'duration': 33.087, 'max_score': 1377.626, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1377626.jpg'}, {'end': 1447.894, 'src': 'embed', 'start': 1419.508, 'weight': 4, 'content': [{'end': 1425.709, 'text': 'All of the picture cards, the Jack, Queen and King are worth 10 points and the Ace is a special card.', 'start': 1419.508, 'duration': 6.201}, {'end': 1429.41, 'text': "It's worth either 11 points or you can pick it to be worth one point.", 'start': 1425.789, 'duration': 3.621}, {'end': 1433.911, 'text': 'This is useful for when you draw a card and you go above 21.', 'start': 1429.61, 'duration': 4.301}, 
{'end': 1437.752, 'text': 'If you then had an Ace, you can say, ah, no, my Ace is now no longer worth 11 points.', 'start': 1433.911, 'duration': 3.841}, {'end': 1440.252, 'text': "Now I'm going to make it worth one point instead.", 'start': 1437.812, 'duration': 2.44}, {'end': 1442.353, 'text': "And now you're below 21 again.", 'start': 1440.832, 'duration': 1.521}, {'end': 1447.894, 'text': "We're going to formalize this problem in the following way.", 'start': 1443.513, 'duration': 4.381}], 'summary': 'In blackjack, picture cards are worth 10 points, ace can be 11 or 1 point, allowing players to adjust their score to stay below 21.', 'duration': 28.386, 'max_score': 1419.508, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1419508.jpg'}, {'end': 1580.787, 'src': 'embed', 'start': 1523.159, 'weight': 1, 'content': [{'end': 1526.02, 'text': 'But the state will have changed because now you no longer have a usable ace.', 'start': 1523.159, 'duration': 2.861}, {'end': 1530.692, 'text': "In terms of the action space, there's two different actions you can do.", 'start': 1528.228, 'duration': 2.464}, {'end': 1535.078, 'text': "You can either stick, at which point it's now the dealer's turn and they will resolve.", 'start': 1530.812, 'duration': 4.266}, {'end': 1537.001, 'text': 'This will then terminate your episode.', 'start': 1535.358, 'duration': 1.643}, {'end': 1539.705, 'text': 'Or you can draw, which means you just take another card.', 'start': 1537.622, 'duration': 2.083}, {'end': 1543.23, 'text': 'If you draw, you can draw again in the next step, or you could stick in the next step.', 'start': 1540.105, 'duration': 3.125}, {'end': 1552.689, 'text': "When you stick, the episode always terminates and you get a plus one, if then the dealer doesn't get a higher sum than you,", 'start': 1545.086, 'duration': 7.603}, {'end': 1554.81, 'text': 'or if the dealer goes bust, which is also possible.', 'start': 
1552.689, 'duration': 2.121}, {'end': 1557.171, 'text': 'So if the dealer goes above 21, they lose.', 'start': 1554.83, 'duration': 2.341}, {'end': 1562.233, 'text': "If they don't go above 21, but they have fewer points than you, they also lose and you get plus one.", 'start': 1557.791, 'duration': 4.442}, {'end': 1567.256, 'text': 'If you happen to arrive at exactly the same number, you get zero.', 'start': 1563.054, 'duration': 4.202}, {'end': 1572.938, 'text': 'But if the dealer manages to get more points than you without going above 21, then you get minus one.', 'start': 1568.376, 'duration': 4.562}, {'end': 1580.787, 'text': "If instead you draw, if you go above 21 and you didn't have a usable ace, you cannot avoid this from happening.", 'start': 1574.705, 'duration': 6.082}], 'summary': "In a blackjack game, sticking yields +1 if dealer doesn't exceed 21, drawing can lead to point loss.", 'duration': 57.628, 'max_score': 1523.159, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1523159.jpg'}, {'end': 1665.605, 'src': 'embed', 'start': 1636.585, 'weight': 2, 'content': [{'end': 1640.028, 'text': "So there's some slight partial observability here, but that turns out not to be a big factor.", 'start': 1636.585, 'duration': 3.443}, {'end': 1644.055, 'text': 'Then what we do is we run Monte Carlo learning.', 'start': 1641.994, 'duration': 2.061}, {'end': 1651.598, 'text': "So we're going to generate a whole bunch of episodes and we're going to sample the returns for those episodes for some fixed policy.", 'start': 1644.095, 'duration': 7.503}, {'end': 1660.262, 'text': "And then we're going to generate these plots and I'm going to explain these in a second, which show your value estimates for that policy.", 'start': 1651.618, 'duration': 8.644}, {'end': 1665.605, 'text': 'And then, of course, in later lectures we can talk about oh, how should we then maybe improve our policy to do better?', 'start': 
1661.323, 'duration': 4.282}], 'summary': 'Monte carlo learning generates value estimates for fixed policy.', 'duration': 29.02, 'max_score': 1636.585, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1636585.jpg'}, {'end': 1709.122, 'src': 'embed', 'start': 1684.902, 'weight': 6, 'content': [{'end': 1691.406, 'text': 'So one axis is which card the dealer is showing, which is either an ace or a two, or a three and so on, or a 10,', 'start': 1684.902, 'duration': 6.504}, {'end': 1696.533, 'text': "where we merge all of the picture cards just into a 10 as well, because they're all worth 10..", 'start': 1691.406, 'duration': 5.127}, {'end': 1699.055, 'text': 'And on the other axis, we see the current sum that we have.', 'start': 1696.533, 'duration': 2.522}, {'end': 1703.318, 'text': "It's either 12, 13, and so on, or 21.", 'start': 1699.295, 'duration': 4.023}, {'end': 1707.22, 'text': 'The z-axis, the height, is basically the estimated value.', 'start': 1703.318, 'duration': 3.902}, {'end': 1709.122, 'text': "We see it's always between minus one and one.", 'start': 1707.261, 'duration': 1.861}], 'summary': "Blackjack strategy based on dealer's card and player's current sum, with estimated value between -1 and 1.", 'duration': 24.22, 'max_score': 1684.902, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1684902.jpg'}, {'end': 1801.285, 'src': 'embed', 'start': 1772.806, 'weight': 7, 'content': [{'end': 1776.347, 'text': 'And then the second question is why does it look more bumpy than the plot below it?', 'start': 1772.806, 'duration': 3.541}, {'end': 1782.217, 'text': "So feel free to think about that for a second, and then I'm going to give you my explanation.", 'start': 1778.194, 'duration': 4.023}, {'end': 1785.981, 'text': 'So maybe the first one was maybe somewhat obvious.', 'start': 1783.278, 'duration': 2.703}, {'end': 1790.725, 'text': "So after 
10, 000 episodes, we don't have a lot of data for each of these states yet.", 'start': 1786.241, 'duration': 4.484}, {'end': 1797.03, 'text': 'But after half a million episodes, we have accumulated by now quite a bit of data for each of these states, so our value estimates have improved.', 'start': 1791.145, 'duration': 5.885}, {'end': 1801.285, 'text': 'So maybe the difference between the left and the right was somewhat obvious.', 'start': 1798.223, 'duration': 3.062}], 'summary': 'Value estimates improve after half a million episodes, reducing bumpiness in plots.', 'duration': 28.479, 'max_score': 1772.806, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1772806.jpg'}], 'start': 1377.626, 'title': 'Blackjack game fundamentals', 'summary': 'Introduces the rules and objectives of blackjack, with the goal of accumulating points close to 21 without exceeding it using various cards. it also describes the game state space and action space, including actions and outcomes, and discusses monte carlo learning to estimate values for different policies through episode generation and analysis of data accumulation impact.', 'chapters': [{'end': 1440.252, 'start': 1377.626, 'title': 'Blackjack basics', 'summary': 'Introduces the rules and objectives of the card game blackjack, where the goal is to accumulate points as close to 21 as possible without exceeding it, using number cards, picture cards, and special ace card worth 11 or 1 point.', 'duration': 62.626, 'highlights': ['The objective in Blackjack is to accumulate points as close to 21 without going over, using number cards, picture cards worth 10 points, and the versatile Ace card worth 11 or 1 point.', 'If the accumulated points exceed 21, it results in a bust, making the player lose the round.', 'The special Ace card can be worth either 11 points or 1 point, providing flexibility to adjust the total points to avoid going over 21.']}, {'end': 1636.025, 'start': 
1440.832, 'title': 'Blackjack game state description', 'summary': "Describes the state space and action space of a blackjack game, including the current sum, dealer's card, and usable ace, with the actions of drawing and sticking, and the corresponding outcomes of the game.", 'duration': 195.193, 'highlights': ["The state space of the blackjack game includes the current sum, dealer's card, and usable ace, with the current sum ranging between 12 and 21.", 'The action space consists of two actions: sticking and drawing, with corresponding outcomes of winning, losing, or continuing the game.', "The episode terminates if the player's sum goes above 21 without a usable ace, resulting in a loss with a score of minus one."]}, {'end': 1823.181, 'start': 1636.585, 'title': 'Monte carlo learning and value estimation', 'summary': 'Discusses the process of monte carlo learning to estimate value for different policies by generating episodes, explaining the value estimates through plots, and analyzing the impact of data accumulation on value estimates for different states.', 'duration': 186.596, 'highlights': ['The process of Monte Carlo learning involves generating a large number of episodes to sample returns for a fixed policy, with the chapter mentioning the use of 10,000 and half a million episodes for analysis and comparison.', "Explanation of the value estimation plots, detailing the axes representing the dealer's card and the current sum, and the z-axis representing the estimated value, which ranges between -1 and 1, with intermediate rewards always being zero.", 'Analysis of the bumpy nature of the value estimation plots based on the number of episodes, explaining the impact of data accumulation on value estimates for different states and the rarity of certain states in the deck of cards.']}], 'duration': 445.555, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1377626.jpg', 'highlights': ['The objective in 
Blackjack is to accumulate points as close to 21 without going over, using number cards, picture cards worth 10 points, and the versatile Ace card worth 11 or 1 point.', "The state space of the blackjack game includes the current sum, dealer's card, and usable ace, with the current sum ranging between 12 and 21.", 'The process of Monte Carlo learning involves generating a large number of episodes to sample returns for a fixed policy, with the chapter mentioning the use of 10,000 and half a million episodes for analysis and comparison.', 'If the accumulated points exceed 21, it results in a bust, making the player lose the round.', 'The special Ace card can be worth either 11 points or 1 point, providing flexibility to adjust the total points to avoid going over 21.', 'The action space consists of two actions: sticking and drawing, with corresponding outcomes of winning, losing, or continuing the game.', "Explanation of the value estimation plots, detailing the axes representing the dealer's card and the current sum, and the z-axis representing the estimated value, which ranges between -1 and 1, with intermediate rewards always being zero.", 'Analysis of the bumpy nature of the value estimation plots based on the number of episodes, explaining the impact of data accumulation on value estimates for different states and the rarity of certain states in the deck of cards.', "The episode terminates if the player's sum goes above 21 without a usable ace, resulting in a loss with a score of minus one."]}, {'end': 2755.757, 'segs': [{'end': 2431.315, 'src': 'embed', 'start': 2394.47, 'weight': 0, 'content': [{'end': 2399.853, 'text': 'This algorithm is called SARSA because it uses a state, action, reward, state, and action.', 'start': 2394.47, 'duration': 5.383}, {'end': 2403.083, 'text': 'This name was coined by Rich Sutton.', 'start': 2400.942, 'duration': 2.141}, {'end': 2411.007, 'text': 'Now in terms of property, temporal difference learning is model-free.', 'start': 
2408.245, 'duration': 2.762}, {'end': 2415.649, 'text': "It doesn't require knowledge of the Markov decision process, and it can therefore learn directly from experience.", 'start': 2411.087, 'duration': 4.562}, {'end': 2420.011, 'text': 'And interestingly, it can also learn from incomplete episodes by using this bootstrapping.', 'start': 2416.409, 'duration': 3.602}, {'end': 2426.374, 'text': "This means that if the episode is really long, you don't have to wait until all the way in the end of the episode before you can start learning.", 'start': 2420.631, 'duration': 5.743}, {'end': 2428.375, 'text': 'And this can be quite beneficial.', 'start': 2427.234, 'duration': 1.141}, {'end': 2431.315, 'text': 'because then you can also learn during the episode.', 'start': 2429.474, 'duration': 1.841}], 'summary': 'SARSA algorithm, coined by Rich Sutton, learns directly from incomplete episodes using bootstrapping.', 'duration': 36.845, 'max_score': 2394.47, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw2394470.jpg'}, {'end': 2678.696, 'src': 'embed', 'start': 2655.492, 'weight': 2, 'content': [{'end': 2664.549, 'text': "Well, the Monte Carlo algorithm would basically just look at the total outcome and therefore it would then update all of these states that you've seen so far to take into account this new total outcome.", 'start': 2655.492, 'duration': 9.057}, {'end': 2671.072, 'text': 'It basically just looks at the whole sample and it says, well, when you were leaving the office, you thought it was 30, but it was actually 43.', 'start': 2664.649, 'duration': 6.423}, {'end': 2673.754, 'text': 'When you reached the car, five minutes had passed.', 'start': 2671.072, 'duration': 2.682}, {'end': 2678.696, 'text': 'You thought it was still 35 minutes more, which means our total prediction was 40.', 'start': 2673.774, 'duration': 4.922}], 'summary': 'Monte Carlo algorithm updates states based on actual outcomes, 
improving predictions.', 'duration': 23.204, 'max_score': 2655.492, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw2655492.jpg'}], 'start': 1823.782, 'title': 'Monte carlo and temporal difference learning in reinforcement learning', 'summary': 'Discusses the use of monte carlo learning in value prediction, emphasizing the impact of the number of episodes and high variance in returns. it also introduces temporal difference learning, comparing it with other methods, explaining the sarsa algorithm, and highlighting its model-free nature and ability to learn from incomplete episodes. additionally, it explains the differences between monte carlo and temporal difference learning using a driving home example.', 'chapters': [{'end': 1932.832, 'start': 1823.782, 'title': 'Monte carlo learning in value prediction', 'summary': 'Discusses the use of monte carlo learning in value prediction, highlighting the impact of the number of episodes on learning speed and the challenge of high variance in returns for long episodes.', 'duration': 109.05, 'highlights': ['Monte Carlo algorithms can be used to learn value predictions, with the example of Blackjack illustrating the potential slowness of learning for long episodes.', 'The impact of the number of episodes on learning speed is discussed, emphasizing the potential slowness when episodes are very long.', 'The challenge of high variance in returns for long episodes is highlighted, leading to a consideration of alternative algorithms with fewer downsides.']}, {'end': 2451.109, 'start': 1933.012, 'title': 'Temporal difference learning in reinforcement learning', 'summary': 'Introduces the concept of temporal difference learning in reinforcement learning, comparing it with dynamic programming and monte carlo learning, highlighting its properties and applications, and explaining the sarsa algorithm. 
it discusses the use of value estimates and sampling in temporal difference learning, emphasizing its model-free nature and ability to learn from incomplete episodes.', 'duration': 518.097, 'highlights': ['Temporal difference learning is model-free and can learn directly from experience, even from incomplete episodes, offering the potential for learning during the episode, which is beneficial in cases where the lifetime is considered as one big episode.', 'Temporal difference learning introduces the concept of bootstrapping, using value estimates to update the value estimates, and it shares commonalities with dynamic programming and Monte Carlo learning by involving one-step deep processing and sampling.', 'The SARSA algorithm, which is an extension of temporal difference learning to action values, uses a state-action, reward, state, and action sequence, and it is model-free, allowing learning directly from experience, without requiring knowledge of the Markov decision process.']}, {'end': 2755.757, 'start': 2453.837, 'title': 'Difference between monte carlo and temporal difference learning', 'summary': 'Explains the driving home example to illustrate the differences between monte carlo and temporal difference learning, where monte carlo updates all states based on the total outcome, while temporal difference learning updates state values based on immediate predictions and subsequent revisions.', 'duration': 301.92, 'highlights': ['Monte Carlo algorithm updates all states based on the total outcome, adjusting predictions accordingly, such as updating the prediction from 35 to 38 when reaching the car.', 'Temporal difference learning updates state values immediately based on immediate predictions and subsequent revisions, as seen when revising the prediction from 35 to 40 upon reaching the car and from 35 to 15 upon exiting the highway.']}], 'duration': 931.975, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw1823782.jpg', 'highlights': ['Temporal difference learning introduces bootstrapping and can learn from incomplete episodes.', 'SARSA algorithm is model-free and learns directly from experience.', 'Monte Carlo algorithm updates all states based on the total outcome.']}, {'end': 3588.742, 'segs': [{'end': 2815.748, 'src': 'embed', 'start': 2780.117, 'weight': 0, 'content': [{'end': 2785.62, 'text': "So now we're going to compare these algorithms a little bit more, and we're going to talk about the advantages and disadvantages of each.", 'start': 2780.117, 'duration': 5.503}, {'end': 2790.822, 'text': 'So as I mentioned, temporal difference learning can learn already before knowing the final outcome.', 'start': 2786.7, 'duration': 4.122}, {'end': 2797.485, 'text': 'It can in fact learn online after every step that it has seen, whereas Monte Carlo learning must wait until the end of the episode,', 'start': 2791.362, 'duration': 6.123}, {'end': 2800.226, 'text': 'before the return is known and before it could actually execute this update.', 'start': 2797.485, 'duration': 2.741}, {'end': 2804.188, 'text': 'In addition, temporal difference learning can learn without the final outcome.', 'start': 2801.247, 'duration': 2.941}, {'end': 2807.95, 'text': 'This is useful for when you, for instance, only have incomplete sequences.', 'start': 2804.248, 'duration': 3.702}, {'end': 2813.046, 'text': 'It could be the case that you have a database of experience that you want to learn from somehow,', 'start': 2808.843, 'duration': 4.203}, {'end': 2815.748, 'text': 'but the database is corrupted or is missing some data,', 'start': 2813.046, 'duration': 2.702}], 'summary': 'Comparing temporal difference learning and monte carlo learning, the former can learn online after each step and without the final outcome, whereas the latter must wait until the end of the episode for the return to be 
known.', 'duration': 35.631, 'max_score': 2780.117, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw2780117.jpg'}, {'end': 2884.097, 'src': 'embed', 'start': 2857.489, 'weight': 2, 'content': [{'end': 2865.572, 'text': 'How many steps you want to do in an episode does not matter in terms of the computational complexity on each time step for temporal difference learning.', 'start': 2857.489, 'duration': 8.083}, {'end': 2870.593, 'text': 'So why is that not true for Monte Carlo? Well, Monte Carlo needs to store everything.', 'start': 2866.212, 'duration': 4.381}, {'end': 2872.394, 'text': 'So TD can learn from single transitions,', 'start': 2870.713, 'duration': 1.681}, {'end': 2878.115, 'text': "but Monte Carlo must store all the predictions you've seen in an episode in order to be able to update them at the end of the episode.", 'start': 2872.394, 'duration': 5.721}, {'end': 2884.097, 'text': 'So that means that the memory requirements for Monte Carlo actually grow when the episode becomes longer and longer.', 'start': 2878.556, 'duration': 5.541}], 'summary': 'Temporal difference learning has lower memory requirements than monte carlo as it can learn from single transitions, while monte carlo needs to store all predictions from an episode, causing memory requirements to grow with longer episodes.', 'duration': 26.608, 'max_score': 2857.489, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw2857489.jpg'}, {'end': 2953.82, 'src': 'embed', 'start': 2924.693, 'weight': 3, 'content': [{'end': 2931.357, 'text': 'Now, the temporal difference target does have lower variance because the return might depend on many random actions, transitions and rewards,', 'start': 2924.693, 'duration': 6.664}, {'end': 2935.52, 'text': 'whereas the temporal difference target only depends on one random action, transition and reward.', 'start': 2931.357, 'duration': 4.163}, {'end': 
2940.487, 'text': 'But in some cases temporal difference learning can have irreducible bias.', 'start': 2937.421, 'duration': 3.066}, {'end': 2947.179, 'text': "For instance, the world might be partially observable, and the states that we're plugging into these value estimates might not tell us everything.", 'start': 2941.188, 'duration': 5.991}, {'end': 2953.82, 'text': "That's already a problem for Monte Carlo learning, because it means that the states that we're updating don't have all the information.", 'start': 2948.137, 'duration': 5.683}], 'summary': 'Temporal difference target has lower variance, but may have irreducible bias due to partial observability.', 'duration': 29.127, 'max_score': 2924.693, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw2924693.jpg'}, {'end': 2999.757, 'src': 'embed', 'start': 2975.706, 'weight': 4, 'content': [{'end': 2981.674, 'text': 'Implicitly, Monte Carlo learning, a different way to think about this, would account for all of the latent variables happening along the way.', 'start': 2975.706, 'duration': 5.968}, {'end': 2987.501, 'text': "So, even though you can't observe exactly where you are, the return itself would just take that into account,", 'start': 2981.714, 'duration': 5.787}, {'end': 2992.528, 'text': "because the return itself does depend on all of the environment variables that you can't maybe observe.", 'start': 2987.501, 'duration': 5.027}, {'end': 2999.757, 'text': 'Similarly, but a little bit different, the function used to approximate the values might fit poorly.', 'start': 2994.415, 'duration': 5.342}], 'summary': 'Monte Carlo learning accounts for latent variables in returns and may have poor value approximation.', 'duration': 24.051, 'max_score': 2975.706, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw2975706.jpg'}, {'end': 3438.652, 'src': 'embed', 'start': 3411.018, 'weight': 5, 'content': 
[{'end': 3415.179, 'text': 'You can learn very quickly at the beginning, but maybe the error stabilizes at a higher point.', 'start': 3411.018, 'duration': 4.161}, {'end': 3418.42, 'text': 'But if we compare temporal difference learning to Monte Carlo learning,', 'start': 3415.859, 'duration': 2.561}, {'end': 3425.461, 'text': 'we can see that temporal difference learning allows you to set the step size higher and also has a different trade off.', 'start': 3418.42, 'duration': 7.041}, {'end': 3428.682, 'text': 'And indeed, a lot of these errors are smaller than for Monte Carlo.', 'start': 3425.661, 'duration': 3.021}, {'end': 3434.29, 'text': 'So if we look at, for instance, at the midway point at 50 episodes, we can see that temporal difference.', 'start': 3429.608, 'duration': 4.682}, {'end': 3438.652, 'text': 'learning then prefers a step size of 0.1 if you had to pick one constant step size,', 'start': 3434.29, 'duration': 4.362}], 'summary': 'Temporal difference learning prefers a step size of 0.1 at 50 episodes.', 'duration': 27.634, 'max_score': 3411.018, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw3411018.jpg'}], 'start': 2755.777, 'title': 'Temporal difference and monte carlo learning', 'summary': 'Discusses the differences between temporal difference and monte carlo learning, highlighting advantages and disadvantages such as the ability to learn without the final outcome, computational complexity, memory requirements, bias-variance trade-off, and properties related to function approximation. 
it also explains temporal difference learning using a random walk example with five states and two actions per state, showcasing the effects of different step sizes on the learning speed and error rates.', 'chapters': [{'end': 3028.372, 'start': 2755.777, 'title': 'Comparison of temporal difference and monte carlo learning', 'summary': 'Discusses the differences between temporal difference and monte carlo learning, highlighting advantages and disadvantages such as the ability to learn without the final outcome, computational complexity, memory requirements, bias-variance trade-off, and properties related to function approximation.', 'duration': 272.595, 'highlights': ['Temporal difference learning can learn online after every step, whereas Monte Carlo learning must wait until the end of the episode, resulting in faster learning process for temporal difference learning.', 'Temporal difference learning can learn without the final outcome, making it suitable for incomplete sequences or continuing environments, while Monte Carlo needs full episodes for its updates, indicating the flexibility of temporal difference learning.', 'Temporal difference learning has constant computational complexity on every time step regardless of the number of steps in an episode, unlike Monte Carlo which needs to store all predictions seen in an episode, leading to lower computational complexity for temporal difference learning.', 'Temporal difference learning has lower variance in the target as it depends on fewer random actions, transitions, and rewards compared to the Monte Carlo return, highlighting the variance difference between the two learning methods.', "Monte Carlo learning can account for all latent variables happening along the way, even if they can't be observed, suggesting a different approach to handling incomplete information compared to temporal difference learning.", 'In the tabular case, both Monte Carlo and temporal difference learning will converge to the true value 
estimates, indicating a common convergence property in certain scenarios.']}, {'end': 3588.742, 'start': 3032.315, 'title': 'Temporal difference learning', 'summary': "Explains temporal difference learning using a random walk example with five states and two actions per state, where the true values for each state are derived through dynamic programming and the algorithm's performance is compared with monte carlo learning, showcasing the effects of different step sizes on the learning speed and error rates.", 'duration': 556.427, 'highlights': ['The random walk example features five states with two actions each, and the true values for each state are derived through dynamic programming, with state E having a reward of one when taking the right action and state A having a reward of zero when taking the left action.', 'The temporal difference (TD) algorithm is illustrated through value estimates for each state after a certain number of episodes, showcasing a step size parameter of 0.1 and the convergence of the values to the true values after 100 episodes.', 'Comparing Monte Carlo learning and temporal difference learning, it is observed that temporal difference learning allows for higher step sizes, resulting in smaller errors than Monte Carlo learning at the midway point of 50 episodes.', 'The chapter delves into the properties of temporal difference learning and Monte Carlo learning, showcasing the higher variance and error rates of Monte Carlo learning for any constant step size, emphasizing the advantages of temporal difference learning.']}], 'duration': 832.965, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw2755777.jpg', 'highlights': ['Temporal difference learning can learn online after every step, resulting in faster learning process.', 'Temporal difference learning can learn without the final outcome, making it suitable for incomplete sequences.', 'Temporal difference learning has constant 
computational complexity on every time step, unlike Monte Carlo.', 'Temporal difference learning has lower variance in the target compared to Monte Carlo return.', 'Monte Carlo learning can account for all latent variables happening along the way, suggesting a different approach to handling incomplete information.', 'Temporal difference learning allows for higher step sizes, resulting in smaller errors than Monte Carlo learning.']}, {'end': 4052.363, 'segs': [{'end': 3650.069, 'src': 'embed', 'start': 3590.839, 'weight': 0, 'content': [{'end': 3597.567, 'text': "And then we're going to apply this idea of batch learning to a specific small example in which we only have two states,", 'start': 3590.839, 'duration': 6.728}, {'end': 3600.99, 'text': 'where the purpose is for us to be able to reason through this all the way.', 'start': 3597.567, 'duration': 3.423}, {'end': 3604.474, 'text': "So the two states, let's just call them A and B.", 'start': 3601.931, 'duration': 2.543}, {'end': 3608.599, 'text': "There's not going to be any discounting, and let's say we have eight episodes of experience.", 'start': 3604.474, 'duration': 4.125}, {'end': 3612.061, 'text': 'Here, each line below denotes exactly one episode.', 'start': 3609.4, 'duration': 2.661}, {'end': 3621.745, 'text': 'So one of these episodes starts in a state A and then gets a reward of zero, then proceeds to a state B and then another reward of zero,', 'start': 3612.761, 'duration': 8.984}, {'end': 3622.866, 'text': 'and then the episode terminates.', 'start': 3621.745, 'duration': 1.121}, {'end': 3630.509, 'text': 'The next episode on the next line, we started in state B instead of A, and then we got a reward of one and terminated.', 'start': 3624.586, 'duration': 5.923}, {'end': 3638.962, 'text': 'That happens more often, so six out of these eight episodes are of that form where we start in B and then terminate with a reward of one.', 'start': 3631.297, 'duration': 7.665}, {'end': 3644.025, 'text': 'And 
then we also have one episode that did start in B as well, but it terminated with a reward of zero.', 'start': 3639.542, 'duration': 4.483}, {'end': 3650.069, 'text': 'Now I want you to think about, maybe even without thinking about these algorithms at all.', 'start': 3645.426, 'duration': 4.643}], 'summary': 'Applying batch learning to 2-state example with 8 episodes.', 'duration': 59.23, 'max_score': 3590.839, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw3590839.jpg'}, {'end': 3804.727, 'src': 'embed', 'start': 3775.203, 'weight': 1, 'content': [{'end': 3781.686, 'text': "Well, that's actually a little bit unclear, and I'm going to explain why both these answers are actually in some way reasonable.", 'start': 3775.203, 'duration': 6.483}, {'end': 3791.529, 'text': 'So Monte Carlo learning converges to the best mean square fit for the observed returns.', 'start': 3785.464, 'duration': 6.065}, {'end': 3792.95, 'text': 'So you can write this as follows,', 'start': 3791.609, 'duration': 1.341}, {'end': 3804.727, 'text': 'where we sum over the episodes k from one to big K And then within each episode we look at all the timestamps from 1 to TK, big TK for that episode.', 'start': 3792.95, 'duration': 11.777}], 'summary': 'Monte carlo learning converges to the best mean square fit for observed returns.', 'duration': 29.524, 'max_score': 3775.203, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw3775203.jpg'}, {'end': 3865.972, 'src': 'embed', 'start': 3834.201, 'weight': 3, 'content': [{'end': 3839.883, 'text': 'Instead, temporal difference learning converges to the solution of the max likelihood Markov model, given the data.', 'start': 3834.201, 'duration': 5.682}, {'end': 3841.984, 'text': "That's what we saw on the previous slide.", 'start': 3840.383, 'duration': 1.601}, {'end': 3844.925, 'text': "This is the most likely model given the data that we've 
seen so far.", 'start': 3842.004, 'duration': 2.921}, {'end': 3850.107, 'text': 'And then temporal difference learning, turns out, finds that solution that corresponds to this model.', 'start': 3845.465, 'duration': 4.642}, {'end': 3856.489, 'text': "So, if you agree with that approach, if you say well, that's what you should be estimating and then you should be solving that.", 'start': 3850.667, 'duration': 5.822}, {'end': 3857.71, 'text': "that's what temporal difference learning does.", 'start': 3856.489, 'duration': 1.221}, {'end': 3865.972, 'text': "So this would be the solution of the empirical Markov decision process, assuming that the empirical data that you've seen is actually the true data.", 'start': 3859.127, 'duration': 6.845}], 'summary': 'Temporal difference learning converges to max likelihood markov model solution.', 'duration': 31.771, 'max_score': 3834.201, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw3834201.jpg'}, {'end': 3987.074, 'src': 'embed', 'start': 3947.132, 'weight': 5, 'content': [{'end': 3952.454, 'text': 'If that were true for the problem that we were in, that would be the better estimate, perhaps.', 'start': 3947.132, 'duration': 5.322}, {'end': 3957.376, 'text': 'And indeed learning with Monte Carlo can help in partially observable environments,', 'start': 3953.134, 'duration': 4.242}, {'end': 3962.579, 'text': 'because it makes less of an assumption that the states are useful to construct your value estimates.', 'start': 3957.376, 'duration': 5.203}, {'end': 3965.587, 'text': 'So we mentioned this before,', 'start': 3964.585, 'duration': 1.002}, {'end': 3975.784, 'text': 'so in some sense you can view this example with this two-state batch learning as an example of this difference in terms of how they deal with fully observed versus partially observed environments.', 'start': 3965.587, 'duration': 10.197}, {'end': 3984.673, 'text': 'Important to note is that with finite 
data and also with function approximation, the solutions, even in the limits,', 'start': 3978.01, 'duration': 6.663}, {'end': 3987.074, 'text': 'might differ between temporal difference learning and Monte Carlo.', 'start': 3984.673, 'duration': 2.401}], 'summary': 'Monte Carlo learning can help in partially observable environments, with differences in value estimates compared to fully observed environments.', 'duration': 39.942, 'max_score': 3947.132, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw3947132.jpg'}, {'end': 4037.606, 'src': 'embed', 'start': 4011.659, 'weight': 8, 'content': [{'end': 4017.845, 'text': "And the reason why it's there is that left corresponds here to shallow updates where we just look one step into the future.", 'start': 4011.659, 'duration': 6.186}, {'end': 4020.427, 'text': 'The top versus bottom.', 'start': 4019.066, 'duration': 1.361}, {'end': 4023.971, 'text': 'In the top, we look at mechanisms that look full breadth.', 'start': 4020.748, 'duration': 3.223}, {'end': 4027.374, 'text': 'Of course, in order to do that, you do need to have access to the model.', 'start': 4024.411, 'duration': 2.963}, {'end': 4037.606, 'text': 'So that means that, for instance, if we look both full breadth and full depth, this would give you exhaustive search.', 'start': 4030.939, 'duration': 6.667}], 'summary': 'Left corresponds to shallow updates, while top looks at mechanisms with full breadth and depth.', 'duration': 25.947, 'max_score': 4011.659, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw4011659.jpg'}], 'start': 3590.839, 'title': 'Reinforcement learning methods', 'summary': 'Covers batch learning using a small example with two states (A and B) and eight episodes, explores determining state values in reinforcement learning, and discusses temporal difference learning and Monte Carlo methods and their implications in observable 
and partially observable environments.', 'chapters': [{'end': 3650.069, 'start': 3590.839, 'title': 'Batch learning example', 'summary': 'Introduces the concept of batch learning using a small example with two states (a and b), eight episodes of experience, and a majority of episodes (6 out of 8) terminating with a reward of one.', 'duration': 59.23, 'highlights': ['The example consists of two states (A and B), with six out of eight episodes terminating with a reward of one.', 'The episode experiences are detailed, with a majority (6 out of 8) starting in state B and terminating with a reward of one.', 'The concept of batch learning is applied to the specific small example to facilitate reasoning through the process.']}, {'end': 3832.027, 'start': 3650.069, 'title': 'Determining state values in reinforcement learning', 'summary': 'Explores the process of determining the values for states a and b in a reinforcement learning model, presenting two plausible approaches and discussing the convergence of monte carlo learning to the best mean square fit for observed returns.', 'duration': 181.958, 'highlights': ['Monte Carlo learning converges to the best mean square fit for the observed returns, aiming to minimize the squared error between observed returns and value estimates.', 'The process involves determining the values for states A and B based on observed episodes, with potential values for state B calculated at 0.75 and state A possibly being zero or 0.75, leading to a discussion on the reasonable approaches to determining state values.']}, {'end': 4052.363, 'start': 3834.201, 'title': 'Temporal difference learning and monte carlo', 'summary': 'Discusses how temporal difference learning and monte carlo methods differ in their approach to estimating the value function and their implications in fully observable and partially observable environments, with emphasis on the convergence to the max likelihood markov model and the impact of finite data and function 
approximation.', 'duration': 218.162, 'highlights': ['Temporal difference learning converges to the solution of the max likelihood Markov model, given the data, which is the most likely model given the data seen so far.', 'Monte Carlo can help in partially observable environments, making less of an assumption that the states are useful to construct value estimates.', 'The difference in how temporal difference learning and Monte Carlo deal with fully observed versus partially observed environments is an example of their distinct approaches.', "With finite data and function approximation, the solutions may differ between temporal difference learning and Monte Carlo, highlighting the impact of these factors on the methods' outcomes.", 'The unified view presented compares dynamic programming, shallow updates, full breadth, and full depth mechanisms, emphasizing the trade-offs and computational considerations of each approach.']}], 'duration': 461.524, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw3590839.jpg', 'highlights': ['The example consists of two states (A and B), with six out of eight episodes terminating with a reward of one.', 'Monte Carlo learning converges to the best mean square fit for the observed returns, aiming to minimize the squared error between observed returns and value estimates.', 'The episode experiences are detailed, with a majority (6 out of 8) starting in state B and terminating with a reward of one.', 'Temporal difference learning converges to the solution of the max likelihood Markov model, given the data, which is the most likely model given the data seen so far.', 'The concept of batch learning is applied to the specific small example to facilitate reasoning through the process.', 'Monte Carlo can help in partially observable environments, making less of an assumption that the states are useful to construct value estimates.', 'The difference in how temporal difference learning and 
Monte Carlo deal with fully observed versus partially observed environments is an example of their distinct approaches.', "With finite data and function approximation, the solutions may differ between temporal difference learning and Monte Carlo, highlighting the impact of these factors on the methods' outcomes.", 'The unified view presented compares dynamic programming, shallow updates, full breadth, and full depth mechanisms, emphasizing the trade-offs and computational considerations of each approach.']}, {'end': 5058.475, 'segs': [{'end': 4121.319, 'src': 'embed', 'start': 4094.501, 'weight': 1, 'content': [{'end': 4100.466, 'text': "And in addition to that, which we haven't quite talked about so much yet, the information can propagate quite slowly.", 'start': 4094.501, 'duration': 5.965}, {'end': 4102.169, 'text': "I'll show you an example of this in a moment.", 'start': 4100.487, 'duration': 1.682}, {'end': 4110.613, 'text': "This means that if we see a reward that is quite useful, like it's a surprising reward, temporal difference learning will, by its nature,", 'start': 4102.83, 'duration': 7.783}, {'end': 4113.716, 'text': 'only update the state value immediately in front of it.', 'start': 4110.613, 'duration': 3.103}, {'end': 4121.319, 'text': "If in that episode you never reach that state again, that means that all of the other state values don't learn about this reward.", 'start': 4114.816, 'duration': 6.503}], 'summary': 'Temporal difference learning updates state values only in immediate proximity to rewards, potentially causing slow information propagation.', 'duration': 26.818, 'max_score': 4094.501, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw4094501.jpg'}, {'end': 4156.029, 'src': 'embed', 'start': 4131.211, 'weight': 0, 'content': [{'end': 4138.456, 'text': 'learning has a problem, in the sense that information can propagate quite slowly backwards and therefore the credit assignment 
could be quite slow.', 'start': 4131.211, 'duration': 7.245}, {'end': 4145.381, 'text': 'Now, Monte Carlo learning does propagate information faster.', 'start': 4141.078, 'duration': 4.303}, {'end': 4151.486, 'text': 'As I just said, if you do see a surprising reward, Monte Carlo learning will, at the end of the episode, tell all the previous states about that.', 'start': 4145.461, 'duration': 6.025}, {'end': 4156.029, 'text': 'But of course, the updates are noisier, and it has all of the other properties that we talked about before.', 'start': 4151.986, 'duration': 4.043}], 'summary': 'Monte carlo learning propagates information faster, though updates are noisier.', 'duration': 24.818, 'max_score': 4131.211, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw4131211.jpg'}, {'end': 4408.656, 'src': 'embed', 'start': 4381.786, 'weight': 4, 'content': [{'end': 4388.551, 'text': "Whereas if you would have done a 10-step return, at least you're more likely to more quickly bump into one of these updated state values,", 'start': 4381.786, 'duration': 6.765}, {'end': 4392.514, 'text': 'and then information could start propagating backwards to the beginning, where you start.', 'start': 4388.551, 'duration': 3.963}, {'end': 4401.688, 'text': 'So we can apply this to a random walk just to see and get a bit of more of intuition.', 'start': 4396.983, 'duration': 4.705}, {'end': 4404.952, 'text': 'So we see the same random walk that we talked about before here at the top.', 'start': 4401.848, 'duration': 3.104}, {'end': 4408.656, 'text': "But now let's consider giving it 19 states rather than five.", 'start': 4405.592, 'duration': 3.064}], 'summary': 'Using a 10-step return increases likelihood of encountering updated state values, aiding information propagation in a random walk with 19 states.', 'duration': 26.87, 'max_score': 4381.786, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw4381786.jpg'}, {'end': 4554.713, 'src': 'embed', 'start': 4530.85, 'weight': 5, 'content': [{'end': 4538.116, 'text': 'That is a 512 step temporal difference learning algorithm, which in this case is essentially Monte Carlo,', 'start': 4530.85, 'duration': 7.266}, {'end': 4542.54, 'text': 'because the probability of having an episode that is more than 512 steps long is very small.', 'start': 4538.116, 'duration': 4.424}, {'end': 4551.289, 'text': 'And we can also see that by the fact that the 256-step temporal difference learning algorithm is actually quite similar,', 'start': 4544.782, 'duration': 6.507}, {'end': 4554.713, 'text': 'because both of those are already quite similar to the full Monte Carlo algorithm.', 'start': 4551.289, 'duration': 3.424}], 'summary': '512-step temporal difference learning algorithm approximates monte carlo due to low probability of episodes exceeding 512 steps; 256-step algorithm also similar.', 'duration': 23.863, 'max_score': 4530.85, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw4530850.jpg'}, {'end': 4864.858, 'src': 'embed', 'start': 4838.015, 'weight': 3, 'content': [{'end': 4845.199, 'text': "That was actually referring exactly to this algorithm, where we say there's a more generic algorithm called TD lambda,", 'start': 4838.015, 'duration': 7.184}, {'end': 4848.12, 'text': 'but you can set lambda to zero and then you have your one-step TD algorithm.', 'start': 4845.199, 'duration': 2.921}, {'end': 4857.536, 'text': 'We can compare these to these n-step approaches, and here we plot them side by side for that same multi-step random walk.', 'start': 4850.233, 'duration': 7.303}, {'end': 4859.876, 'text': 'And we see some commonalities.', 'start': 4858.416, 'duration': 1.46}, {'end': 4864.858, 'text': 'First, let me draw your attention to lambda is zero, which is exactly the same 
curve as for n is one.', 'start': 4860.016, 'duration': 4.842}], 'summary': 'Comparison of td lambda and n-step approaches shows commonalities, with lambda=0 giving same result as n=1.', 'duration': 26.843, 'max_score': 4838.015, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw4838015.jpg'}, {'end': 5008.873, 'src': 'embed', 'start': 4978.932, 'weight': 2, 'content': [{'end': 4982.454, 'text': 'But there is a rough correspondence, and this is one way to think about these algorithms.', 'start': 4978.932, 'duration': 3.522}, {'end': 4987.121, 'text': "And now we're going to talk about some benefits.", 'start': 4985.1, 'duration': 2.021}, {'end': 4988.662, 'text': "So we've already alluded to this.", 'start': 4987.161, 'duration': 1.501}, {'end': 4993.265, 'text': 'So multi-step returns have benefits from both temporal difference learning and Monte Carlo.', 'start': 4989.322, 'duration': 3.943}, {'end': 4997.747, 'text': 'And the reason to consider them is that bootstrapping can have issues with bias.', 'start': 4994.385, 'duration': 3.362}, {'end': 5001.889, 'text': 'So one step TD is not always great and Monte Carlo can have issues with the variance.', 'start': 4997.907, 'duration': 3.982}, {'end': 5008.873, 'text': 'And typically we can think about intermediate values as being somewhat good because they trade off bias and variance in an appropriate way.', 'start': 5002.41, 'duration': 6.463}], 'summary': 'Multi-step returns have benefits from both temporal difference learning and monte carlo, addressing issues with bias and variance.', 'duration': 29.941, 'max_score': 4978.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw4978932.jpg'}], 'start': 4053.482, 'title': 'Temporal difference and monte carlo learning', 'summary': 'Compares temporal difference learning and monte carlo learning, addressing trade-offs in credit assignment speed and accuracy, 
step size and error, and the concept of mixed multi-step returns, including the td(lambda) algorithm which demonstrates a trade-off between bias and variance.', 'chapters': [{'end': 4289.985, 'start': 4053.482, 'title': 'Temporal difference vs. monte carlo learning', 'summary': 'Discusses the differences between temporal difference learning and monte carlo learning, highlighting the trade-offs in credit assignment speed and accuracy, as well as the potential for intermediate approaches using n-step returns.', 'duration': 236.503, 'highlights': ['Monte Carlo learning updates all previous states in an episode when a new reward is encountered, allowing for faster information propagation, while temporal difference learning may only update the immediate state, leading to slower credit assignment. (relevance: 5)', 'Temporal difference learning makes shallow updates by taking one step in the world to update a value estimate, whereas Monte Carlo makes deep updates by rolling forward until the end of the episode. (relevance: 4)', 'Intermediate approaches using n-step returns offer a trade-off between temporal difference learning and Monte Carlo learning, allowing for consideration of multiple steps ahead for value estimation and credit assignment. 
(relevance: 3)']}, {'end': 4600.179, 'start': 4290.546, 'title': 'Temporal difference learning and monte carlo learning', 'summary': 'Explores the properties of td learning and monte carlo learning, highlighting the trade-off between step size and error, with the best values falling in the intermediate range.', 'duration': 309.633, 'highlights': ['The parameter plot shows that for a 512-step temporal difference learning algorithm, the preferred step size parameter is much smaller than for a one-step temporal difference learning, and even when well-tuned, the error remains quite high.', 'When considering a 10-step update in SARSA, it propagates information much further back, leading to faster learning and lower error over the first 10 episodes, for a well-tuned step size parameter.', 'The 10-step return in the next episode is more likely to quickly bump into updated state values, allowing information to propagate backwards to the beginning and aiding in faster learning.']}, {'end': 5058.475, 'start': 4600.259, 'title': 'Mixed multi-step returns', 'summary': 'Covers the concept of mixed multi-step returns, specifically discussing the td(lambda) algorithm which interpolates between td and monte carlo methods, demonstrating a trade-off between bias and variance and providing a rough correspondence between lambda and n-step returns.', 'duration': 458.216, 'highlights': ['The TD(lambda) algorithm interpolates between TD and Monte Carlo methods, with the parameter lambda determining the degree of bootstrapping, providing a trade-off between bias and variance (lambda 0.9 roughly corresponds to an n-step of 10).', 'Intermediate values, such as n=5 or n=10, and lambda=0.9, tend to work well in practice, offering a balance between bias and variance in multi-step returns.', 'The chapter explains that multi-step returns combine benefits from both temporal difference learning and Monte Carlo, addressing issues related to bias and variance in bootstrapping methods.']}], 
'duration': 1004.993, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw4053482.jpg', 'highlights': ['Monte Carlo learning updates all previous states in an episode for faster information propagation (relevance: 5)', 'Temporal difference learning makes shallow updates by taking one step to update a value estimate (relevance: 4)', 'Intermediate approaches using n-step returns offer a trade-off between temporal difference and Monte Carlo learning (relevance: 3)', 'The TD(lambda) algorithm provides a trade-off between bias and variance (lambda 0.9 roughly corresponds to an n-step of 10)', 'The 10-step return in SARSA propagates information further back, aiding in faster learning (relevance: 2)', 'The parameter plot shows that for a 512-step temporal difference learning algorithm, the preferred step size parameter is much smaller (relevance: 1)']}, {'end': 6090.827, 'segs': [{'end': 5140.656, 'src': 'embed', 'start': 5111.221, 'weight': 0, 'content': [{'end': 5116.383, 'text': 'And conversely, temporal difference learning can update immediately and is independent of the span of the predictions.', 'start': 5111.221, 'duration': 5.162}, {'end': 5122.922, 'text': 'So before maybe you took this to be an argument that we should be using temporal difference learning.', 'start': 5117.077, 'duration': 5.845}, {'end': 5127.646, 'text': "but here I'm going to make a different argument and we're going to ask can we get the best of both worlds?", 'start': 5122.922, 'duration': 4.724}, {'end': 5134.511, 'text': 'Can we use these mixed multi-step returns or like other flavors of that, including maybe even Monte Carlo,', 'start': 5127.726, 'duration': 6.785}, {'end': 5140.656, 'text': "but with an algorithm that doesn't have computational requirements that grow indefinitely during the episode?", 'start': 5134.511, 'duration': 6.145}], 'summary': 'Temporal difference learning can update immediately, seeking the best of both worlds
with mixed multi-step returns and manageable computational requirements.', 'duration': 29.435, 'max_score': 5111.221, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw5111221.jpg'}, {'end': 5280.798, 'src': 'embed', 'start': 5253.888, 'weight': 4, 'content': [{'end': 5259.952, 'text': "This E is called an eligibility trace and it's defined recursively by taking the previous trace,", 'start': 5253.888, 'duration': 6.064}, {'end': 5267.357, 'text': 'decaying it according to gamma and lambda and then adding our current feature vector or more generally, our gradient xt.', 'start': 5259.952, 'duration': 7.405}, {'end': 5275.243, 'text': 'Note, special case, if lambda is zero, we do indeed get one step td back.', 'start': 5270.359, 'duration': 4.884}, {'end': 5280.798, 'text': "So there's a choice here whether you want to do online TD, which means you update during the episode.", 'start': 5276.175, 'duration': 4.623}], 'summary': 'Eligibility trace is defined by decaying the previous trace according to gamma and lambda, with a special case for lambda=0 for one-step td back.', 'duration': 26.91, 'max_score': 5253.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw5253888.jpg'}, {'end': 5740.907, 'src': 'embed', 'start': 5714.531, 'weight': 1, 'content': [{'end': 5720.875, 'text': 'Of course, coming up with this derivation is maybe a different matter, but you can now follow this step by step and see that this is indeed true.', 'start': 5714.531, 'duration': 6.344}, {'end': 5725.483, 'text': 'So that means that our total update can be written as this,', 'start': 5723.103, 'duration': 2.38}, {'end': 5733.225, 'text': 'where this eligibility trace vector was defined as the summation of the timestep t of our discounted feature vectors.', 'start': 5725.483, 'duration': 7.742}, {'end': 5740.907, 'text': 'This, we can also inspect a little bit more, and we can, for 
instance, pull off the last timestep, which is just xt.', 'start': 5734.385, 'duration': 6.522}], 'summary': 'Derivation process explained with step-by-step analysis and vector manipulation.', 'duration': 26.376, 'max_score': 5714.531, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw5714531.jpg'}, {'end': 6004.58, 'src': 'embed', 'start': 5975.889, 'weight': 3, 'content': [{'end': 5982.745, 'text': 'we had our mixed multi-step return, G lambda. And it turns out, if you write this as a sum of temporal difference errors,', 'start': 5975.889, 'duration': 6.856}, {'end': 5986.307, 'text': 'you can go through that yourself in a similar way that we did the Monte Carlo returns.', 'start': 5982.745, 'duration': 3.562}, {'end': 5994.331, 'text': 'It turns out you get this summation, which now no longer just says your discount factor, but it has a quantity lambda times gamma.', 'start': 5986.387, 'duration': 7.944}, {'end': 5997.437, 'text': "Otherwise, it's exactly the same thing.", 'start': 5995.856, 'duration': 1.581}, {'end': 6004.58, 'text': 'So it turns out we can write these lambda returns as also a summation of one-step temporal difference errors, but with gamma times lambda.', 'start': 5997.517, 'duration': 7.063}], 'summary': 'Lambda returns can be expressed as a sum of temporal difference errors with a quantity lambda times gamma.', 'duration': 28.691, 'max_score': 5975.889, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eaWfWoVUTEw/pics/eaWfWoVUTEw5975889.jpg'}], 'start': 5058.475, 'title': 'Temporal difference learning', 'summary': 'Discusses the limitations of mixed multi-step approaches in reinforcement learning, emphasizing the advantages of temporal difference learning, explains the use of eligibility traces in temporal difference learning, and explores the concept of eligibility trace and temporal difference error for online and offline updating.', 'chapters': [{'end': 5127.646,
'start': 5058.475, 'title': 'Temporal difference learning', 'summary': 'Discusses the limitations of mixed multi-step approaches in reinforcement learning, highlighting the trade-offs between temporal span and computational updates, emphasizing the advantages of temporal difference learning.', 'duration': 69.171, 'highlights': ['Temporal difference learning can update immediately and is independent of the span of the predictions.', 'Mixed multi-step approaches require waiting until the end of the episode before knowing the full return, posing computational challenges and limitations.', 'The chapter explores the trade-offs in statistical properties of updates and the computational construction of full returns in reinforcement learning.']}, {'end': 5679.189, 'start': 5127.726, 'title': 'Temporal difference learning with eligibility traces', 'summary': 'The chapter explains how to use eligibility traces in temporal difference learning, demonstrating how it enables updates of all past states with a single update, independent of the temporal span of predictions, and extends to function approximation.', 'duration': 551.463, 'highlights': ['The use of eligibility traces allows updates of all past states with a single update, independent of the temporal span of predictions, and extends to function approximation.', 'Monte Carlo learning updates all states in an episode at once, typically, as it requires waiting for the end of the episode to know the return.', 'The update itself is simple: it takes an alpha step size parameter times the one-step TD error, utilizing a vector called an eligibility trace, defined recursively by decaying according to gamma and lambda and adding the current feature vector or gradient.', 'The chapter demonstrates how to rewrite the total Monte Carlo error as a sum of temporal difference errors, and use a generic property of double sums to simplify the update process.']}, {'end': 6090.827, 'start': 5679.349, 'title': 'Eligibility trace and temporal
difference error', 'summary': 'The chapter explores the concept of eligibility trace and temporal difference error, demonstrating how it is used for monte carlo updates and mixed multi-step returns, offering the potential for online updating and equivalent algorithms for offline and online updating.', 'duration': 411.478, 'highlights': ['The eligibility trace vector is defined as the summation over timesteps up to t of the discounted feature vectors, aiding in the evaluation of the total update (Quantifiable: definition of eligibility trace vector).', 'The algorithm allows the computation of the sum incrementally as we go, thus avoiding growing memory requirements and enabling updates during the episode (Quantifiable: memory efficiency and capability for in-episode updates).', 'The chapter introduces mixed multi-step returns with a parameter lambda, which leads to the adjustment of eligibility trace to have a decay of gamma times lambda, enabling the implementation of the algorithm and application of updates
online (Quantifiable: introduction of mixed multi-step returns and the adjustment of eligibility trace).', 'The update itself is simple: it takes an alpha step size parameter times the one-step TD error, utilizing a vector called an eligibility trace, defined recursively by decaying according to gamma and lambda and adding the current feature vector or gradient.']}], 'highlights': ['Monte Carlo learning updates all previous states in an episode for faster information propagation (relevance: 5)', 'Temporal difference learning makes shallow updates by taking one step to update a value estimate (relevance: 4)', 'The objective in Blackjack is to accumulate points as close to 21 as possible without going over, using number cards, picture cards worth 10 points, and the versatile Ace card worth 11 or 1 point (relevance: 3)', 'The chapter introduces mixed multi-step returns with a parameter lambda, which leads to the adjustment of eligibility trace to have a decay of gamma times lambda, enabling the implementation of the algorithm and application of updates online (relevance: 2)', 'The parameter plot shows that for a 512-step temporal difference learning algorithm, the preferred step size parameter is much smaller (relevance: 1)']}
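The batch A/B example summarized above (two states, with six of eight episodes ending in a reward of one) is small enough to reproduce. The sketch below is an illustrative reconstruction, not code from the lecture: the episode encoding, the function name `batch_estimates`, and the choice gamma = 1 are assumptions, and batch TD is emulated by value iteration on the maximum-likelihood Markov model fitted to the data.

```python
from collections import defaultdict

def batch_estimates(episodes):
    """Batch Monte Carlo vs batch TD on a fixed set of episodes (gamma = 1).

    Each episode is a list of (state, reward) pairs, where the reward is the
    one received on leaving that state. Batch MC fits each state's value to
    the mean of its observed returns; batch TD converges to the value function
    of the maximum-likelihood Markov model, computed here by iterating that
    model's Bellman equation.
    """
    returns = defaultdict(list)   # observed returns from each state (for MC)
    trans = defaultdict(list)     # observed (reward, next state) pairs (for TD)
    for ep in episodes:
        g = sum(r for _, r in ep)           # full return of the episode
        for i, (s, r) in enumerate(ep):
            returns[s].append(g)
            g -= r                          # return from the next state onward
            nxt = ep[i + 1][0] if i + 1 < len(ep) else None
            trans[s].append((r, nxt))
    mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}
    td = {s: 0.0 for s in trans}
    for _ in range(100):  # value iteration on the empirical model
        td = {s: sum(r + (td[n] if n is not None else 0.0) for r, n in ts) / len(ts)
              for s, ts in trans.items()}
    return mc, td

# One episode A -> B with rewards 0, 0; six episodes B with reward 1;
# one episode B with reward 0.
episodes = [[('A', 0.0), ('B', 0.0)]] + [[('B', 1.0)]] * 6 + [[('B', 0.0)]]
mc, td = batch_estimates(episodes)
# mc == {'A': 0.0, 'B': 0.75} and td == {'A': 0.75, 'B': 0.75}
```

With this hypothetical reconstruction of the episodes, the two methods disagree on state A exactly as described: Monte Carlo gives the best mean-square fit to the observed returns, while TD credits A with B's estimated value under the fitted model.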
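The correspondence described above between the lambda-return and n-step returns (lambda = 0 matching n = 1, lambda = 1 matching Monte Carlo, lambda = 0.9 roughly matching n near 10) can be sketched as an explicit weighted mixture. This is a minimal forward-view illustration under stated assumptions: the helper names and toy episode are invented, `rewards[t]` is the reward received after leaving the state at step t, and the episode terminates after the last reward.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Truncated n-step return: n rewards, then bootstrap on the value estimate."""
    T = len(rewards)
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        g += discount * rewards[k]
        discount *= gamma
    if t + n < T:                 # only bootstrap if the episode has not ended
        g += discount * values[t + n]
    return g

def lambda_return(rewards, values, t, gamma, lam):
    """Forward-view lambda-return: (1 - lam) * sum_n lam^(n - 1) * G_t^(n).

    The geometric weight remaining after the last truncated return collapses
    onto the full Monte Carlo return, so lam = 1 recovers Monte Carlo and
    lam = 0 recovers the one-step TD target.
    """
    T = len(rewards)
    g, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):
        g += weight * n_step_return(rewards, values, t, n, gamma)
        weight *= lam
    tail = weight / (1.0 - lam) if lam < 1.0 else 1.0   # lam^(T - t - 1)
    return g + tail * n_step_return(rewards, values, t, T - t, gamma)
```

For example, with rewards [0, 0, 1], value estimates [0.1, 0.2, 0.3], and gamma 0.9, lambda = 0 gives the one-step target 0 + 0.9 * 0.2 = 0.18 and lambda = 1 gives the full Monte Carlo return 0.81, matching the two extremes of the curves described for the random walk.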
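The backward view quoted in the summaries — decay the eligibility trace by gamma times lambda, add the current feature vector, then move the weights by the step size times the one-step TD error times the trace — can be sketched for linear value estimates v(s) = w . x(s). The function name, episode encoding, and zero terminal value are assumptions of this sketch, not the lecture's notation.

```python
def td_lambda_episode(features, rewards, w, alpha, gamma, lam):
    """Run one episode of online TD(lambda) with linear function approximation.

    features[t] is the feature vector x_t of the state visited at step t, and
    rewards[t] is the reward that follows; the episode terminates after the
    last reward, so the final bootstrap value is taken to be zero.
    """
    e = [0.0] * len(w)                                   # eligibility trace
    for t, r in enumerate(rewards):
        v_t = sum(wi * xi for wi, xi in zip(w, features[t]))
        v_next = (sum(wi * xi for wi, xi in zip(w, features[t + 1]))
                  if t + 1 < len(features) else 0.0)
        delta = r + gamma * v_next - v_t                 # one-step TD error
        e = [gamma * lam * ei + xi for ei, xi in zip(e, features[t])]  # decay, add x_t
        w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]          # update all traced states
    return w
```

With one-hot (tabular) features, lam = 0 reduces to one-step TD and updates only the state just visited, while lam = 1 with gamma = 1 lets a surprising final reward update every state visited earlier in the episode — the faster credit propagation discussed above — all with memory that does not grow during the episode.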
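The final point — that lambda-returns can be written as a summation of one-step temporal difference errors discounted by gamma times lambda rather than by gamma alone — can be checked numerically, provided the value estimates are held fixed across the episode (as in the offline or batch setting). The names and toy numbers below are illustrative.

```python
def td_errors(rewards, values, gamma):
    """One-step TD errors delta_t = r_t + gamma * v(S_{t+1}) - v(S_t),
    with the terminal value taken to be zero."""
    T = len(rewards)
    return [rewards[t] + gamma * (values[t + 1] if t + 1 < T else 0.0) - values[t]
            for t in range(T)]

def lambda_error(rewards, values, t, gamma, lam):
    """Sum of TD errors weighted by (gamma * lam)^k: equals the lambda-return
    error G_t^lambda - v(S_t) when the values are held fixed."""
    deltas = td_errors(rewards, values, gamma)
    return sum((gamma * lam) ** k * d for k, d in enumerate(deltas[t:]))
```

With lam = 1 the sum telescopes to the Monte Carlo error G_t - v(S_t), which is the rewrite of the total Monte Carlo error mentioned above, and with lam = 0 it collapses to the single one-step TD error delta_t.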