title
Reinforcement Learning 1: Introduction to Reinforcement Learning

description
Hado Van Hasselt, Research Scientist, shares an introduction to reinforcement learning as part of the Advanced Deep Learning & Reinforcement Learning Lectures.

detail
{'title': 'Reinforcement Learning 1: Introduction to Reinforcement Learning', 'heatmap': [{'end': 4090.97, 'start': 4027.203, 'weight': 1}], 'summary': "Provides an introduction to reinforcement learning, covering its ties with deep learning, historical development, core concepts, interdisciplinary nature, agent's policy and state, value functions, bellman equation, concepts, and applications, basics, and integration of learning and planning.", 'chapters': [{'end': 176.263, 'segs': [{'end': 38.306, 'src': 'embed', 'start': 5.293, 'weight': 0, 'content': [{'end': 11.579, 'text': 'Welcome This is going to be the first lecture in the reinforcement learning track of this course.', 'start': 5.293, 'duration': 6.286}, {'end': 19.827, 'text': 'Now, as story will have explained, there are more or less two separate tracks in this course.', 'start': 14.362, 'duration': 5.465}, {'end': 24.212, 'text': 'Of course, overlap between the deep learning side and the reinforcement learning side.', 'start': 19.847, 'duration': 4.365}, {'end': 26.114, 'text': 'Let me just turn this off in case.', 'start': 24.512, 'duration': 1.602}, {'end': 38.306, 'text': 'But they can also be viewed more or less separately, and some of the things that we will be talking about will tie into the deep learning side.', 'start': 29.178, 'duration': 9.128}], 'summary': 'First lecture in reinforcement learning track with overlap in deep learning side.', 'duration': 33.013, 'max_score': 5.293, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU5293.jpg'}, {'end': 150.209, 'src': 'embed', 'start': 109.048, 'weight': 1, 'content': [{'end': 112.489, 'text': 'Schedule wise, most of the reinforcement learning lectures are scheduled at this time.', 'start': 109.048, 'duration': 3.441}, {'end': 113.509, 'text': 'Not all of them.', 'start': 112.969, 'duration': 0.54}, {'end': 116.51, 'text': "There's a few exceptions, which you can see in the schedule on 
Moodle.", 'start': 113.529, 'duration': 2.981}, {'end': 125.692, 'text': 'Of course, the schedule is what we currently believe it will remain, but feel free to keep checking it in case things change.', 'start': 118.63, 'duration': 7.062}, {'end': 128.793, 'text': 'Or just come to all lectures and they in our room is anything.', 'start': 126.732, 'duration': 2.061}, {'end': 130.795, 'text': 'So check Moodle for updates.', 'start': 129.794, 'duration': 1.001}, {'end': 131.977, 'text': 'Also use Moodle for questions.', 'start': 130.856, 'duration': 1.121}, {'end': 133.778, 'text': "We'll try to be responsive there.", 'start': 132.017, 'duration': 1.761}, {'end': 137.16, 'text': 'As you will know, grading is through assignments.', 'start': 135.379, 'duration': 1.781}, {'end': 146.586, 'text': 'And the background material for specifically this reinforcement learning side of the course will be the new edition of the Sutton and Barto book.', 'start': 138.381, 'duration': 8.205}, {'end': 150.209, 'text': 'A full draft can be found online.', 'start': 147.507, 'duration': 2.702}], 'summary': 'Most reinforcement learning lectures are scheduled at this time, with exceptions listed on Moodle. 
Grading is through assignments and the course material is the new edition of the Sutton and Barto book.', 'duration': 41.161, 'max_score': 109.048, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU109048.jpg'}], 'start': 5.293, 'title': 'Introduction to reinforcement learning', 'summary': 'Introduces the reinforcement learning track, covering high-level concepts and its ties with deep learning, with a focus on using deep learning methods at some points during the course.', 'chapters': [{'end': 176.263, 'start': 5.293, 'title': 'Introduction to reinforcement learning', 'summary': 'Introduces the reinforcement learning track of the course, covering its high-level concepts and its ties with deep learning, with a focus on using deep learning methods at some points during the course.', 'duration': 170.97, 'highlights': ['The reinforcement learning track of the course covers high-level concepts and its ties with deep learning, with a focus on using deep learning methods at some points during the course. The lecture will take a high-level view and cover a lot of the concepts of reinforcement learning, and will use deep learning methods and techniques at some points during this course.', 'The background material for the reinforcement learning side of the course will be the new edition of the Sutton and Barto book, with specific focus on chapters one and three.', 'The schedule for reinforcement learning lectures, grading through assignments, and recommended reading materials are provided on Moodle for updates. The reinforcement learning lectures are scheduled at specific times, and grading is through assignments. 
The background material for the course will be the new edition of the Sutton and Barto book, with specific focus on chapters one and three. Students are encouraged to check Moodle for updates and use it for questions.']}], 'duration': 170.97, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU5293.jpg', 'highlights': ['The reinforcement learning track covers high-level concepts and its ties with deep learning, with a focus on using deep learning methods.', 'The background material for the reinforcement learning side of the course will be the new edition of the Sutton and Barto book, with specific focus on chapters one and three.', 'The schedule for reinforcement learning lectures, grading through assignments, and recommended reading materials are provided on Moodle for updates.']}, {'end': 676.506, 'segs': [{'end': 207.572, 'src': 'embed', 'start': 176.263, 'weight': 0, 'content': [{'end': 181.944, 'text': 'which gives you a high level overview the way Rich thinks about these things and also talks about many of these concepts,', 'start': 176.263, 'duration': 5.681}, {'end': 190.326, 'text': 'but also gives you large historical view on how these things got developed, which ideas came from where, and also how these ideas changed over time.', 'start': 181.944, 'duration': 8.382}, {'end': 193.767, 'text': "Because if you get everything from this course, you'll have a certain view,", 'start': 190.946, 'duration': 2.821}, {'end': 199.368, 'text': 'but you might not realize that things may have been perceived quite differently in the past and some people might still perceive quite differently right now.', 'start': 193.767, 'duration': 5.601}, {'end': 207.572, 'text': "So I'll mostly give my view, of course, but I'll try to keep as close as possible to the book.", 'start': 201.57, 'duration': 6.002}], 'summary': 'Transcript provides historical overview and different perspectives on concepts discussed.', 'duration': 
31.309, 'max_score': 176.263, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU176263.jpg'}, {'end': 245.138, 'src': 'embed', 'start': 215.928, 'weight': 3, 'content': [{'end': 218.25, 'text': "I'll start by talking just about what reinforcement learning is.", 'start': 215.928, 'duration': 2.322}, {'end': 224.095, 'text': "Many of you will have a rough or detailed idea of this already, but it's good to be on the same page.", 'start': 218.63, 'duration': 5.465}, {'end': 227.298, 'text': "I'll talk about the core concepts of a reinforcement learning system.", 'start': 224.576, 'duration': 2.722}, {'end': 230.041, 'text': 'One of these concepts is an agent,', 'start': 228.6, 'duration': 1.441}, {'end': 236.487, 'text': "and then I'll talk about what are the components of such an agent and I'll talk a little bit about what are challenges in reinforcement learning.", 'start': 230.041, 'duration': 6.446}, {'end': 240.991, 'text': 'so what are research topics or things to think about within the research field of reinforcement learning?', 'start': 236.487, 'duration': 4.504}, {'end': 245.138, 'text': "But of course, it's good to start with defining what it is.", 'start': 242.363, 'duration': 2.775}], 'summary': 'Discussion on core concepts of reinforcement learning and its challenges', 'duration': 29.21, 'max_score': 215.928, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU215928.jpg'}, {'end': 589.686, 'src': 'embed', 'start': 564.514, 'weight': 4, 'content': [{'end': 570.815, 'text': "but all of a sudden it finds itself in a terrain that it has never seen before and also wasn't present when people built the robot or when the robot was learning.", 'start': 564.514, 'duration': 6.301}, {'end': 575.917, 'text': 'Then you want the robot to learn online, and you want it to maybe adapt quickly.', 'start': 571.516, 'duration': 4.401}, {'end': 583.083, 'text': 'And 
reinforcement learning as a field seeks to provide algorithms that can handle both these cases.', 'start': 577.701, 'duration': 5.382}, {'end': 589.686, 'text': "Sometimes they're not clearly differentiated and sometimes people don't clearly specify which goal they're after, but it's good to keep this in mind.", 'start': 583.563, 'duration': 6.123}], 'summary': 'Reinforcement learning addresses adapting to new terrains and online learning.', 'duration': 25.172, 'max_score': 564.514, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU564514.jpg'}, {'end': 626.233, 'src': 'embed', 'start': 599.75, 'weight': 1, 'content': [{'end': 605.092, 'text': "It's about that a little bit, but it's also about being able to learn online to adapt even while you're doing it.", 'start': 599.75, 'duration': 5.342}, {'end': 606.996, 'text': 'And this is fair game.', 'start': 606.055, 'duration': 0.941}, {'end': 608.717, 'text': 'We do that as well.', 'start': 608.017, 'duration': 0.7}, {'end': 611.98, 'text': 'When we enter a new situation, we do adapt still.', 'start': 609.298, 'duration': 2.682}, {'end': 615.203, 'text': "We don't have to just lean on what we've learned in the past.", 'start': 612.12, 'duration': 3.083}, {'end': 626.233, 'text': 'So another way to phrase what reinforcement learning is, is it is the science of learning to make decisions from interaction.', 'start': 620.207, 'duration': 6.026}], 'summary': 'Reinforcement learning involves learning to make decisions from interaction.', 'duration': 26.483, 'max_score': 599.75, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU599750.jpg'}], 'start': 176.263, 'title': 'Reinforcement learning', 'summary': "Provides a high-level overview of Rich's perspective on reinforcement learning, covering the historical development of concepts, varying perceptions over time, and an outline of core concepts, components, and 
challenges. It also discusses the evolution from automating physical solutions with machines to automating mental solutions, and then to the next step of having machines learn to solve problems themselves through reinforcement learning, involving active interaction, sequential interactions, and goal-directed learning.", 'chapters': [{'end': 249.473, 'start': 176.263, 'title': 'Reinforcement learning overview', 'summary': "Provides a high-level overview of Rich's perspective on reinforcement learning, covering the historical development of concepts, the varying perceptions over time, and an outline of the core concepts, components, and challenges in reinforcement learning.", 'duration': 73.21, 'highlights': ["The chapter covers the historical development and varying perceptions of concepts in reinforcement learning, providing a high-level overview of Rich's perspective.", 'It outlines the core concepts, components, and challenges in reinforcement learning, emphasizing the importance of understanding these fundamental aspects.', 'The chapter emphasizes the need to be on the same page regarding the core concepts of reinforcement learning, including the definition and components of an agent.']}, {'end': 676.506, 'start': 250.454, 'title': 'Reinforcement learning: the science of decision making', 'summary': 'Discusses the evolution from automating physical solutions with machines to automating mental solutions, and then to the next step of having machines learn to solve problems themselves through reinforcement learning, which involves active interaction, sequential interactions, and goal-directed learning.', 'duration': 426.052, 'highlights': ['Reinforcement learning involves active interaction and sequential interactions with the environment. 
Reinforcement learning differs from other types of learning as it is active, involving interactions with the environment that are often sequential and can depend on earlier ones.', 'Reinforcement learning aims to handle both finding previously unknown solutions and learning quickly in new environments. Reinforcement learning seeks to provide algorithms that can handle finding previously unknown solutions and learning quickly in new environments, emphasizing the importance of adaptation and online learning.', 'Reinforcement learning is the science of learning to make decisions from interaction, considering time, long-term consequences, actively gathering experience, predicting the future, and dealing with uncertainty. Reinforcement learning involves learning to make decisions from interaction, requiring consideration of time, long-term consequences, actively gathering experience, predicting the future, and dealing with inherent or created uncertainty.']}], 'duration': 500.243, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU176263.jpg', 'highlights': ["The chapter provides a high-level overview of Rich's perspective on reinforcement learning, covering the historical development of concepts and varying perceptions over time.", 'Reinforcement learning involves active interaction and sequential interactions with the environment, emphasizing the importance of adaptation and online learning.', 'Reinforcement learning is the science of learning to make decisions from interaction, considering time, long-term consequences, actively gathering experience, predicting the future, and dealing with uncertainty.', 'The chapter outlines the core concepts, components, and challenges in reinforcement learning, emphasizing the importance of understanding these fundamental aspects.', 'Reinforcement learning aims to handle both finding previously unknown solutions and learning quickly in new environments.']}, {'end': 1579.958, 
'segs': [{'end': 751.49, 'src': 'embed', 'start': 718.498, 'weight': 0, 'content': [{'end': 719.678, 'text': 'And if so, we should probably add them.', 'start': 718.498, 'duration': 1.18}, {'end': 730.842, 'text': "So there's a lot of related disciplines and reinforcement learning has been studied in one form or another, many times, and in many forms.", 'start': 722.418, 'duration': 8.424}, {'end': 734.484, 'text': 'This is a slide that I borrowed from Dave Silver.', 'start': 732.163, 'duration': 2.321}, {'end': 738.545, 'text': 'Where he noted a few of these disciplines.', 'start': 736.705, 'duration': 1.84}, {'end': 741.487, 'text': 'there might be others, and these might not even be.', 'start': 738.545, 'duration': 2.942}, {'end': 745.569, 'text': 'these not, might not be the only, might not be the best examples, although some of them are pretty persuasive.', 'start': 741.487, 'duration': 4.082}, {'end': 751.49, 'text': 'And the disciplines that he pointed out were at the top computer science.', 'start': 747.008, 'duration': 4.482}], 'summary': 'Reinforcement learning has been studied in many related disciplines, including computer science.', 'duration': 32.992, 'max_score': 718.498, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU718498.jpg'}, {'end': 1006.68, 'src': 'embed', 'start': 978.65, 'weight': 3, 'content': [{'end': 982.594, 'text': "It doesn't give you a label or an action that you should have done.", 'start': 978.65, 'duration': 3.944}, {'end': 985.256, 'text': 'It just tells you, I like this this much.', 'start': 982.714, 'duration': 2.542}, {'end': 988.68, 'text': "But I'll go into more detail.", 'start': 987.298, 'duration': 1.382}, {'end': 997.07, 'text': 'So characteristics of reinforcement learning, and specifically, how does it differ from other machine learning paradigms,', 'start': 991.704, 'duration': 5.366}, {'end': 1000.013, 'text': "include that there's no strict supervision on 
your reward signal.", 'start': 997.07, 'duration': 2.943}, {'end': 1002.055, 'text': 'And also that the feedback can be delayed.', 'start': 1000.574, 'duration': 1.481}, {'end': 1006.68, 'text': 'So sometimes you take an action and this action much later leads to reward.', 'start': 1002.316, 'duration': 4.364}], 'summary': 'Reinforcement learning lacks strict supervision, allows delayed feedback, and action-reward time gap.', 'duration': 28.03, 'max_score': 978.65, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU978650.jpg'}, {'end': 1069.357, 'src': 'embed', 'start': 1040.472, 'weight': 1, 'content': [{'end': 1049.017, 'text': 'Some concrete examples to maybe help you think about these things include to fly a helicopter or to manage an investment portfolio,', 'start': 1040.472, 'duration': 8.545}, {'end': 1055.402, 'text': 'or to control a power station, make a robot walk or play video or board games.', 'start': 1049.017, 'duration': 6.385}, {'end': 1060.625, 'text': 'And these are actual examples where reinforcement learning has been, or versions of reinforcement learning have been applied.', 'start': 1055.702, 'duration': 4.923}, {'end': 1069.357, 'text': "And maybe it's good to note that these are reinforcement learning problems, because they are sequential decision problems,", 'start': 1063.976, 'duration': 5.381}], 'summary': 'Reinforcement learning applied to tasks like flying a helicopter, managing an investment portfolio, controlling a power station, making a robot walk, and playing games.', 'duration': 28.885, 'max_score': 1040.472, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU1040472.jpg'}, {'end': 1287.03, 'src': 'embed', 'start': 1261.712, 'weight': 2, 'content': [{'end': 1266.918, 'text': "Now the agent's job is to maximize the cumulative reward, not the instantaneous reward, but the reward over time.", 'start': 1261.712, 'duration': 
5.206}, {'end': 1269.06, 'text': 'And we will call this the return.', 'start': 1267.598, 'duration': 1.462}, {'end': 1272.922, 'text': 'Now this thing trails off into the end there.', 'start': 1271.021, 'duration': 1.901}, {'end': 1274.343, 'text': "I didn't specify when it stops.", 'start': 1272.982, 'duration': 1.361}, {'end': 1280.506, 'text': "The easiest way to think about it is that there's always a time somewhere in the future where it stops so that this thing is well defined and finite.", 'start': 1274.783, 'duration': 5.723}, {'end': 1284.468, 'text': "A little while later I'll talk about when that doesn't happen, when you have a continuing problem,", 'start': 1281.266, 'duration': 3.202}, {'end': 1287.03, 'text': 'and then you can still define a return that is well defined.', 'start': 1284.468, 'duration': 2.562}], 'summary': "Agent's goal is to maximize cumulative reward over time, defining a well-defined and finite return.", 'duration': 25.318, 'max_score': 1261.712, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU1261712.jpg'}, {'end': 1373.578, 'src': 'embed', 'start': 1346.131, 'weight': 4, 'content': [{'end': 1349.252, 'text': "I haven't been able to find any counter examples myself, but maybe you do.", 'start': 1346.131, 'duration': 3.121}, {'end': 1355.275, 'text': "Yeah No, that's a very good question.", 'start': 1351.414, 'duration': 3.861}, {'end': 1361.419, 'text': "Sorry, we use the word reward, but we basically mean it's just a real values.", 'start': 1355.315, 'duration': 6.104}, {'end': 1367.642, 'text': 'Reinforcements signal, and sometimes we talk about negative rewards as being penalty.', 'start': 1363.239, 'duration': 4.403}, {'end': 1371.877, 'text': 'Especially this is especially common in psychology and neuroscience.', 'start': 1368.775, 'duration': 3.102}, {'end': 1373.578, 'text': 'Uh, in the.', 'start': 1372.678, 'duration': 0.9}], 'summary': 'Discussion on rewards and 
penalties in reinforcement signaling.', 'duration': 27.447, 'max_score': 1346.131, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU1346131.jpg'}], 'start': 676.526, 'title': 'Reinforcement learning', 'summary': 'Delves into the interdisciplinary nature of reinforcement learning, its distinctions from other machine learning paradigms, and its basic concepts. It emphasizes its connections to various fields, unique characteristics, and applications, offering insights into its potential scope and relevance.', 'chapters': [{'end': 878.95, 'start': 676.526, 'title': 'Artificial intelligence and reinforcement learning', 'summary': 'Discusses the interdisciplinary nature of reinforcement learning, highlighting its connections to computer science, neuroscience, psychology, engineering, mathematics, and economics, emphasizing the potential scope and relevance of the topic.', 'duration': 202.424, 'highlights': ["Reinforcement learning's connections to computer science, neuroscience, and psychology: Reinforcement learning is linked to computer science, neuroscience, and psychology due to resemblances in brain mechanisms, behavior modeling, and decision-making processes.", 'Interdisciplinary nature of reinforcement learning: Reinforcement learning spans across various fields including computer science, neuroscience, psychology, engineering, mathematics, and economics, indicating its multi-disciplinary relevance.', 'Potential scope for reinforcement learning: There is a huge potential scope for reinforcement learning as decisions manifest in numerous areas, indicating its wide-ranging applicability.']}, {'end': 1199.697, 'start': 879.691, 'title': 'Machine learning paradigms: supervised, unsupervised, and reinforcement learning', 'summary': 'Discusses the distinctions between supervised, unsupervised, and reinforcement learning, emphasizing the unique characteristics of reinforcement learning, such as the reinforcement 
signal, delayed feedback, and the impact of time and sequentiality on decision-making, with examples of applications including controlling a power station, managing an investment portfolio, and playing video or board games.', 'duration': 320.006, 'highlights': ['Reinforcement learning is distinct from supervised and unsupervised learning due to the reinforcement signal, delayed feedback, and the impact of time and sequentiality on decision-making. Reinforcement learning is characterized by the reinforcement signal, delayed feedback, and the impact of time and sequentiality on decision-making, setting it apart from supervised and unsupervised learning.', 'Examples of applications for reinforcement learning include controlling a power station, managing an investment portfolio, and playing video or board games. Reinforcement learning has been applied in controlling a power station, managing an investment portfolio, and playing video or board games, showcasing its diverse applications.', "Reinforcement learning is a framework for thinking about sequential decision problems, with a set of algorithms specific to the field, but it's possible to work on reinforcement learning problems without using those specific algorithms. Reinforcement learning serves as a framework for sequential decision problems, and while there are specific algorithms associated with the field, it's possible to work on reinforcement learning problems without using those specific algorithms."]}, {'end': 1579.958, 'start': 1199.837, 'title': 'Reinforcement learning basics', 'summary': "Introduces the basic concepts of reinforcement learning, including the agent's observation, reward, and action, the goal of maximizing cumulative reward, and the concept of return. 
It also discusses the generalizability of the reward hypothesis and the desirability of transitions and states in reinforcement learning.", 'duration': 380.121, 'highlights': ["The agent's job is to maximize the cumulative reward, not the instantaneous reward, but the reward over time. And we will call this the return. The primary goal of the agent is to maximize the cumulative reward over time, which is referred to as the return.", 'The reward is just a real-valued reinforcement signal, and sometimes we talk about negative rewards as being penalties. This is especially common in psychology and neuroscience. In reinforcement learning, negative rewards are considered penalties, commonly observed in psychology and neuroscience.', 'The returns and the values can be defined recursively. The return at the time step t is basically just the one step reward and then the return from there. Returns and values can be defined recursively, where the return at a time step is the sum of the one-step reward and the return from the next time step.']}], 'duration': 903.432, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU676526.jpg', 'highlights': ['Reinforcement learning spans across various fields including computer science, neuroscience, psychology, engineering, mathematics, and economics, indicating its multi-disciplinary relevance.', 'Reinforcement learning has been applied in controlling a power station, managing an investment portfolio, and playing video or board games, showcasing its diverse applications.', 'The primary goal of the agent is to maximize the cumulative reward over time, which is referred to as the return.', 'Reinforcement learning is characterized by the reinforcement signal, delayed feedback, and the impact of time and sequentiality on decision-making, setting it apart from supervised and unsupervised learning.', 'In reinforcement learning, negative rewards are considered penalties, commonly observed in psychology and 
neuroscience.']}, {'end': 2587.36, 'segs': [{'end': 1625.978, 'src': 'embed', 'start': 1581.099, 'weight': 0, 'content': [{'end': 1583.46, 'text': "And in, say, playing a game, you might block an opponent's move.", 'start': 1581.099, 'duration': 2.361}, {'end': 1588.662, 'text': 'Rather than going for the win, you first prevent the loss, which might then later give you higher probability of winning.', 'start': 1583.5, 'duration': 5.162}, {'end': 1595.666, 'text': 'And in any of these cases, the mapping from states to action will call a policy.', 'start': 1590.663, 'duration': 5.003}, {'end': 1601.148, 'text': 'So you can think of this as just being a function that maps each state into an action.', 'start': 1597.947, 'duration': 3.201}, {'end': 1609.067, 'text': "It's also possible to condition the value on actions.", 'start': 1605.444, 'duration': 3.623}, {'end': 1613.43, 'text': 'So instead of just conditioning on the state, you can condition on the state and action pair.', 'start': 1609.527, 'duration': 3.903}, {'end': 1615.931, 'text': 'The definition is very similar to the state value.', 'start': 1614.13, 'duration': 1.801}, {'end': 1619.494, 'text': "There's a slight difference in notation for historical reasons.", 'start': 1617.392, 'duration': 2.102}, {'end': 1620.635, 'text': 'This is called a Q function.', 'start': 1619.534, 'duration': 1.101}, {'end': 1625.978, 'text': 'So for states, we use V and for state action pairs, we use Q.', 'start': 1621.415, 'duration': 4.563}], 'summary': 'In games, prioritizing preventing loss increases probability of winning, through state-action mapping known as policy.', 'duration': 44.879, 'max_score': 1581.099, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU1581099.jpg'}, {'end': 1712.038, 'src': 'embed', 'start': 1687.028, 'weight': 3, 'content': [{'end': 1693.53, 'text': 'So, as I said, a policy is mapping from states to actions or the other way to say that the 
actions depend on some state of the agent.', 'start': 1687.028, 'duration': 6.502}, {'end': 1699.751, 'text': 'both the agent and the environment might have an internal state, or typically actually do have an internal state.', 'start': 1694.648, 'duration': 5.103}, {'end': 1705.034, 'text': 'In the simplest case, there might only be one state, and both environment and agent are always in the same state.', 'start': 1700.491, 'duration': 4.543}, {'end': 1712.038, 'text': "And we'll cover that quite extensively in the next lecture, because it turns out you can already meaningfully talk about some concepts,", 'start': 1706.075, 'duration': 5.963}], 'summary': 'A policy maps states to actions; both agent and environment have internal states.', 'duration': 25.01, 'max_score': 1687.028, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU1687028.jpg'}, {'end': 2156.038, 'src': 'embed', 'start': 2128.225, 'weight': 4, 'content': [{'end': 2131.246, 'text': "The agent action depends on the state, so it's important to have this agent state.", 'start': 2128.225, 'duration': 3.021}, {'end': 2135.488, 'text': 'And a simple example is if you just make your observation the agent state.', 'start': 2131.746, 'duration': 3.742}, {'end': 2142.589, 'text': 'More generally, we can think of the agent state as something that updates over time.', 'start': 2137.605, 'duration': 4.984}, {'end': 2149.113, 'text': 'You have your previous agent state, you have an action, a reward, and an observation, and then you construct a new agent state.', 'start': 2143.029, 'duration': 6.084}, {'end': 2153.056, 'text': 'Note that, for instance, building up your full history is of this form.', 'start': 2150.294, 'duration': 2.762}, {'end': 2154.497, 'text': 'You just append things.', 'start': 2153.456, 'duration': 1.041}, {'end': 2156.038, 'text': "But there's other things you could do.", 'start': 2155.037, 'duration': 1.001}], 'summary': "Agent's state 
depends on previous state, action, reward, and observation, updating over time.", 'duration': 27.813, 'max_score': 2128.225, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU2128225.jpg'}, {'end': 2448.445, 'src': 'embed', 'start': 2418.755, 'weight': 5, 'content': [{'end': 2426.062, 'text': 'So I said many of these things already, but to summarize, to deal with partial observability, the agent can construct a suitable state representation.', 'start': 2418.755, 'duration': 7.307}, {'end': 2433.99, 'text': 'And examples of these include, as I said before, you could just have your observation be the agent state,', 'start': 2427.143, 'duration': 6.847}, {'end': 2435.592, 'text': 'but this might not be enough in certain cases.', 'start': 2433.99, 'duration': 1.602}, {'end': 2441.258, 'text': 'You could have the complete history as your agent state, but this might be too large, might be hard to compute with this full history.', 'start': 2436.213, 'duration': 5.045}, {'end': 2448.445, 'text': 'Or you might as a partial version of the one I showed before you could have some incrementally updated states,', 'start': 2442.432, 'duration': 6.013}], 'summary': 'To handle partial observability, the agent can construct a suitable state representation, such as using the agent state or incrementally updated states.', 'duration': 29.69, 'max_score': 2418.755, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU2418755.jpg'}, {'end': 2541.077, 'src': 'embed', 'start': 2512.773, 'weight': 1, 'content': [{'end': 2517.297, 'text': "So the policy just defines the agent's behavior, and it's a map from the agent states to an action.", 'start': 2512.773, 'duration': 4.524}, {'end': 2519.178, 'text': "There's two main cases.", 'start': 2518.077, 'duration': 1.101}, {'end': 2523.361, 'text': "One is the deterministic policy, where we'll just write it as a function that outputs an 
action.", 'start': 2519.879, 'duration': 3.482}, {'end': 2525.003, 'text': 'State goes in, action goes out.', 'start': 2523.782, 'duration': 1.221}, {'end': 2528.786, 'text': "But there's also the important use case of a stochastic policy.", 'start': 2526.344, 'duration': 2.442}, {'end': 2535.732, 'text': "where the action is basically, there's a probability of selecting each action in each state.", 'start': 2529.847, 'duration': 5.885}, {'end': 2541.077, 'text': 'Typically we will not be too careful in differentiating these.', 'start': 2537.154, 'duration': 3.923}], 'summary': 'Policy maps agent states to actions, including deterministic and stochastic cases.', 'duration': 28.304, 'max_score': 2512.773, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU2512773.jpg'}], 'start': 1581.099, 'title': "Agent's policy and state", 'summary': 'Explores the concept of policy as a mapping from states to actions, introduces the q function for conditioning on state-action pairs, and discusses agent state construction, update functions, and policies as deterministic or stochastic maps from agent states to actions.', 'chapters': [{'end': 1743.913, 'start': 1581.099, 'title': "Agent's policy and state", 'summary': 'Discusses the concept of policy as a mapping from states to actions, introduces the q function for conditioning on state-action pairs, and explains the concept of agent state and environment state, emphasizing the potential presence of infinitely many states.', 'duration': 162.814, 'highlights': ['The chapter discusses the concept of policy as a mapping from states to actions. Policy is defined as a function that maps each state into an action, emphasizing the relationship between states and actions in decision-making.', 'Introduces the Q function for conditioning on state-action pairs. 
The Q function allows for conditioning the value on both state and action pair, denoted as Q for state-action pairs and V for states.', 'Explains the concept of agent state and environment state, emphasizing the potential presence of infinitely many states. Both the agent and the environment may have an internal state, which can be finite or infinitely many, introducing the complexity of dealing with various states and their impact on decision-making.']}, {'end': 2587.36, 'start': 1745.7, 'title': 'Understanding agent state and policies', 'summary': 'Discusses the concept of agent state, including its construction from history, update functions, and its importance in dealing with partial observability, also covering the definition of policies as deterministic or stochastic maps from agent states to actions, with a focus on discrete actions.', 'duration': 841.66, 'highlights': ['The agent state is a function of the history, and the actions depend on it, typically smaller than the environment state, and can be constructed from the observation or incrementally updated states, possibly using recurrent neural networks. The agent state is a function of the history, and the actions depend on it, typically smaller than the environment state, and can be constructed from the observation or incrementally updated states. This can be implemented using deep learning techniques like recurrent neural networks.', 'Partial observability can be dealt with by constructing a suitable state representation, which could include using the observation as the agent state, the complete history, or incrementally updated states, with the possibility of using recurrent neural networks to implement the state update function. 
Partial observability can be dealt with by constructing a suitable state representation, which could include using the observation as the agent state, the complete history, or incrementally updated states, possibly using recurrent neural networks to implement the state update function.', 'Policies can be deterministic or stochastic, with the former directly mapping agent states to actions and the latter involving a probability distribution over actions for each state, with a primary focus on discrete actions.']}], 'duration': 1006.261, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU1581099.jpg', 'highlights': ['Introduces the Q function for conditioning on state-action pairs. The Q function allows for conditioning the value on both state and action pair, denoted as Q for state-action pairs and V for states.', 'Policies can be deterministic or stochastic, with the former directly mapping agent states to actions and the latter involving a probability distribution over actions for each state, with a primary focus on discrete actions.', 'The chapter discusses the concept of policy as a mapping from states to actions. Policy is defined as a function that maps each state into an action, emphasizing the relationship between states and actions in decision-making.', 'Explains the concept of agent state and environment state, emphasizing the potential presence of infinitely many states. 
Both the agent and the environment may have an internal state, which can be finite or infinitely many, introducing the complexity of dealing with various states and their impact on decision-making.', 'The agent state is a function of the history, and the actions depend on it, typically smaller than the environment state, and can be constructed from the observation or incrementally updated states, possibly using recurrent neural networks.', 'Partial observability can be dealt with by constructing a suitable state representation, which could include using the observation as the agent state, the complete history, or incrementally updated states, with the possibility of using recurrent neural networks to implement the state update function.']}, {'end': 3064.097, 'segs': [{'end': 2682.832, 'src': 'embed', 'start': 2649.856, 'weight': 4, 'content': [{'end': 2653.738, 'text': "Basically, you're down weighing or discounting, and this is why it's called a discount factor.", 'start': 2649.856, 'duration': 3.882}, {'end': 2656.66, 'text': 'The future rewards in favor of the immediate ones.', 'start': 2654.319, 'duration': 2.341}, {'end': 2662.815, 'text': 'If you think of the maze example that I said earlier, where you get a zero reward on each step, but then you get, say,', 'start': 2657.83, 'duration': 4.985}, {'end': 2664.918, 'text': 'a reward of plus one when you exit the maze.', 'start': 2662.815, 'duration': 2.103}, {'end': 2669.502, 'text': "If you don't have discounting, the agent basically has no incentive to exit the maze quickly.", 'start': 2665.478, 'duration': 4.024}, {'end': 2675.749, 'text': 'It would just be happy if it exits the maze in like in some time into the future.', 'start': 2670.447, 'duration': 5.302}, {'end': 2682.832, 'text': "But when you have discounting, all of a sudden the trade off starts to differ and it'll favor to be as quick as possible,", 'start': 2676.37, 'duration': 6.462}], 'summary': 'Discount factor favors immediate rewards over 
future ones in maze example.', 'duration': 32.976, 'max_score': 2649.856, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU2649856.jpg'}, {'end': 2797.008, 'src': 'embed', 'start': 2685.553, 'weight': 0, 'content': [{'end': 2691.316, 'text': 'If it takes fewer steps to go to the exit, it will have discounted this future return less.', 'start': 2685.553, 'duration': 5.763}, {'end': 2702.067, 'text': 'So the value depends on the policy, as I said, and it can then be used to evaluate the desirability of states, one state versus the other.', 'start': 2694.741, 'duration': 7.326}, {'end': 2705.489, 'text': 'And therefore, it can also be used to select between actions.', 'start': 2703.388, 'duration': 2.101}, {'end': 2707.591, 'text': 'You could say plan one step ahead.', 'start': 2705.709, 'duration': 1.882}, {'end': 2708.912, 'text': 'You could use your value.', 'start': 2707.971, 'duration': 0.941}, {'end': 2713.935, 'text': "It's actually more convenient in that case although I didn't put it on the slide to use these action values,", 'start': 2710.173, 'duration': 3.762}, {'end': 2717.118, 'text': 'because those immediately give you access to the value of each action.', 'start': 2713.935, 'duration': 3.183}, {'end': 2720.029, 'text': 'This is just the definition of the value.', 'start': 2718.288, 'duration': 1.741}, {'end': 2726.715, 'text': "Of course, we're going to approximate these things later in our agents because we don't have access basically to the true value typically.", 'start': 2720.55, 'duration': 6.165}, {'end': 2731.499, 'text': "Oh, there's a plus sign missing there on the top.", 'start': 2728.496, 'duration': 3.003}, {'end': 2738.664, 'text': 'It should have been reward plus one plus the discounted future return t plus one.', 'start': 2731.939, 'duration': 6.725}, {'end': 2741.687, 'text': "I'll fix that before the slides go on to Moodle.", 'start': 2739.225, 'duration': 2.462}, {'end': 
2748.933, 'text': "And I said this before for the undiscounted case, but now I'm saying it again for the discounted case.", 'start': 2744.409, 'duration': 4.524}, {'end': 2750.315, 'text': 'The return has a recursive form.', 'start': 2748.973, 'duration': 1.342}, {'end': 2755.279, 'text': "It's a one step reward plus the remaining return, but now discounted once.", 'start': 2751.115, 'duration': 4.164}, {'end': 2761.505, 'text': 'And that means that the value also has a recursive form, because we can just write down the value as the expectation of this return.', 'start': 2756.72, 'duration': 4.785}, {'end': 2768.791, 'text': 'But then it turns out because the expectation can be put inside as well over this t plus one.', 'start': 2762.205, 'duration': 6.586}, {'end': 2770.753, 'text': 'this is equivalent to just putting the value there again.', 'start': 2768.791, 'duration': 1.962}, {'end': 2776.032, 'text': 'And this is a very important recursive relationship that we will heavily exploit throughout the course.', 'start': 2772.148, 'duration': 3.884}, {'end': 2783.519, 'text': "And notation wise, note that I'm writing down a as being sampled from the policy.", 'start': 2777.693, 'duration': 5.826}, {'end': 2785.461, 'text': 'So this is basically assuming stochastic policies.', 'start': 2783.559, 'duration': 1.902}, {'end': 2788.184, 'text': 'But like I said, deterministic can be viewed as a special case of that.', 'start': 2785.501, 'duration': 2.683}, {'end': 2791.367, 'text': 'And then the equation is known as the Bellman equation.', 'start': 2789.525, 'duration': 1.842}, {'end': 2797.008, 'text': 'By Richard Bellman from 1957.', 'start': 2792.668, 'duration': 4.34}], 'summary': 'The value depends on the policy and is used to evaluate states and select actions; its recursive form is the Bellman equation, due to Richard Bellman (1957).', 'duration': 111.455, 'max_score': 2685.553, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU2685553.jpg'}, {'end': 
2861.411, 'src': 'embed', 'start': 2822.219, 'weight': 1, 'content': [{'end': 2827.004, 'text': 'If there is a limited number of states and a limited number of actions, this is just a system of equations that you could solve.', 'start': 2822.219, 'duration': 4.785}, {'end': 2831.209, 'text': 'And thereby you can get the optimal values and the optimal policy.', 'start': 2827.785, 'duration': 3.424}, {'end': 2838.196, 'text': "In order to do that, you need to be able to compute this expectation, and that's something that we'll cover later as well.", 'start': 2833.651, 'duration': 4.545}, {'end': 2840.659, 'text': "And you'll use dynamic programming techniques to solve this.", 'start': 2838.316, 'duration': 2.343}, {'end': 2861.411, 'text': "Yeah, so it's basically the top line there which is missing the plus, unfortunately, but it's basically based on the recurrence of the return,", 'start': 2852.506, 'duration': 8.905}], 'summary': 'With a limited number of states and actions, the Bellman equation is a system of equations that dynamic programming techniques can solve for the optimal values and policy.', 'duration': 39.192, 'max_score': 2822.219, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU2822219.jpg'}], 'start': 2587.36, 'title': 'Value functions and Bellman equation', 'summary': 'Explains the concept of value functions, discount factors, and their impact on rewards, and discusses the Bellman equation, recursive relationships, stochastic and deterministic policies, and solving for optimal values using dynamic programming techniques.', 'chapters': [{'end': 2726.715, 'start': 2587.36, 'title': 'Value functions and discount factors', 'summary': 'Explains the concept of value functions, introducing the discount factor and its impact on immediate and long-term rewards, and the use of value functions in evaluating states and selecting actions.', 'duration': 139.355, 'highlights': ["The chapter introduces the concept of value functions and explains 
the impact of discount factor on rewards, emphasizing its role in favoring immediate rewards over long-term rewards. The introduction of the discount factor is highlighted, as it impacts the trade-off between immediate and long-term rewards. The explanation of how a discount factor favors immediate rewards and its impact on the agent's behavior is emphasized.", 'The chapter discusses the use of value functions to evaluate the desirability of states and to select between actions, emphasizing their role in planning and decision-making. The discussion on the use of value functions in evaluating states and selecting actions is highlighted, emphasizing their role in decision-making and planning.', 'The chapter mentions the future approximation of value functions due to limited access to true values, indicating a plan for future exploration and understanding of value function approximation in agents. The mention of future approximation of value functions in agents due to limited access to true values is highlighted, indicating a plan for future exploration and understanding of value function approximation.']}, {'end': 3064.097, 'start': 2728.496, 'title': 'Bellman equation and recursive relationships', 'summary': 'Discusses the bellman equation, recursive relationships in value and return, stochastic and deterministic policies, and solving for optimal values and policies using dynamic programming techniques.', 'duration': 335.601, 'highlights': ['The return has a recursive form, consisting of a one-step reward plus the remaining return, discounted once. The return is recursively defined as a one-step reward plus the remaining return, discounted once, forming a recursive relationship.', 'The value also has a recursive form, being equivalent to the expectation of the return, which heavily exploits the recursive relationship throughout the course. 
The value has a recursive form as the expectation of the return, which exploits the recursive relationship extensively.', 'The Bellman equation, assuming stochastic policies, is a key notation and forms a system of equations for solving optimal values and policies. The Bellman equation, assuming stochastic policies, forms a system of equations for solving optimal values and policies.', 'The process involves solving a system of equations using dynamic programming techniques to obtain optimal values and policies, considering limited states and actions. Solving a system of equations using dynamic programming techniques yields optimal values and policies, considering limited states and actions.']}], 'duration': 476.737, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU2587360.jpg', 'highlights': ['The Bellman equation, assuming stochastic policies, forms a system of equations for solving optimal values and policies.', 'Solving a system of equations using dynamic programming techniques yields optimal values and policies, considering limited states and actions.', 'The value has a recursive form as the expectation of the return, which exploits the recursive relationship extensively.', 'The return is recursively defined as a one-step reward plus the remaining return, discounted once, forming a recursive relationship.', 'The chapter introduces the concept of value functions and explains the impact of discount factor on rewards, emphasizing its role in favoring immediate rewards over long-term rewards.', 'The chapter discusses the use of value functions to evaluate the desirability of states and to select between actions, emphasizing their role in decision-making and planning.', 'The mention of future approximation of value functions in agents due to limited access to true values is highlighted, indicating a plan for future exploration and understanding of value function approximation.']}, {'end': 4045.294, 'segs': 
[{'end': 3128.357, 'src': 'embed', 'start': 3064.097, 'weight': 0, 'content': [{'end': 3067.599, 'text': "or you could just do all of them at the same time and you'll still get to the optimal solution.", 'start': 3064.097, 'duration': 3.502}, {'end': 3089.685, 'text': "It's a very good question.", 'start': 3087.623, 'duration': 2.062}, {'end': 3093.849, 'text': "So the question is: here, we're approximating expected,", 'start': 3089.745, 'duration': 4.104}, {'end': 3099.456, 'text': 'you know, cumulative rewards, or expected returns, but sometimes you care more about the whole distribution of returns.', 'start': 3094.793, 'duration': 4.663}, {'end': 3101.437, 'text': 'And this is definitely true.', 'start': 3100.496, 'duration': 0.941}, {'end': 3103.918, 'text': "And it's actually.", 'start': 3103.018, 'duration': 0.9}, {'end': 3106.92, 'text': "It hadn't been studied that much.", 'start': 3103.938, 'duration': 2.982}, {'end': 3114.385, 'text': "so there's been quite a bit of work on things like safe reinforcement learning, where people, for instance, want to optimize the expected return,", 'start': 3106.92, 'duration': 7.465}, {'end': 3117.687, 'text': 'but conditional on not ever having a return that is lower than a certain thing.', 'start': 3114.385, 'duration': 3.302}, {'end': 3128.357, 'text': 'But recently, and by that I mean last year, a paper was published on distributional reinforcement learning,', 'start': 3118.807, 'duration': 9.55}], 'summary': 'Expected returns ignore the whole distribution of returns; safe reinforcement learning constrains low returns, and distributional reinforcement learning was introduced recently.', 'duration': 64.26, 'max_score': 3064.097, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU3064097.jpg'}, {'end': 3185.941, 'src': 'embed', 'start': 3146.934, 'weight': 3, 'content': [{'end': 3152.258, 'text': 'You can help it to steer your decision away from risky situations, or sometimes you actually want to be.', 'start': 3146.934, 'duration': 5.324}, 
{'end': 3159.343, 'text': 'So that that would be called risk averse, say, in economics lingo, or you could be more risk seeking, which could also sometimes be useful.', 'start': 3153.479, 'duration': 5.864}, {'end': 3162.286, 'text': 'Depending on what you want to achieve.', 'start': 3160.845, 'duration': 1.441}, {'end': 3163.487, 'text': 'So, yes, very good question.', 'start': 3162.606, 'duration': 0.881}, {'end': 3165.388, 'text': "That's very current research.", 'start': 3163.927, 'duration': 1.461}, {'end': 3185.941, 'text': 'Yeah We think of it as your kind of marginalizing.', 'start': 3170.472, 'duration': 15.469}], 'summary': 'Understanding risk preferences can influence decision-making in economics and research.', 'duration': 39.007, 'max_score': 3146.934, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU3146934.jpg'}, {'end': 3274.186, 'src': 'embed', 'start': 3219.064, 'weight': 5, 'content': [{'end': 3225.265, 'text': "But in this case, we're not interested in a fixed distribution over actions, a fixed policy, but instead we're choosing to maximise over it.", 'start': 3219.064, 'duration': 6.201}, {'end': 3228.086, 'text': "But yes, it's otherwise very similar.", 'start': 3226.226, 'duration': 1.86}, {'end': 3255.218, 'text': "Yeah In practice, in reinforcement learning, there's lots of problems where you have continuous Yes, so that's two parts of that question.", 'start': 3228.787, 'duration': 26.431}, {'end': 3259.7, 'text': 'One is how to deal with continuous domains, for instance, continuous time.', 'start': 3255.318, 'duration': 4.382}, {'end': 3267.803, 'text': "And the other one is how to deal with approximations, because even if you don't have continuous time, the state space, for instance,", 'start': 3261.5, 'duration': 6.303}, {'end': 3270.504, 'text': 'might be huge and that also might require you to approximate.', 'start': 3267.803, 'duration': 2.701}, {'end': 3274.186, 'text': "So 
approximations are going to be very central in this course, and we're going to bump into them all the time.", 'start': 3270.524, 'duration': 3.662}], 'summary': 'Reinforcement learning deals with continuous domains and approximations, which are central in the course.', 'duration': 55.122, 'max_score': 3219.064, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU3219064.jpg'}, {'end': 3395.882, 'src': 'embed', 'start': 3363.881, 'weight': 6, 'content': [{'end': 3366.082, 'text': 'No, I actually already gave an example.', 'start': 3363.881, 'duration': 2.201}, {'end': 3371.847, 'text': "So sometimes people set up an environment in which these probabilities change over time, which means it's already not Markov.", 'start': 3366.122, 'duration': 5.725}, {'end': 3373.969, 'text': 'We would call that a non stationary environment.', 'start': 3371.887, 'duration': 2.082}, {'end': 3379.253, 'text': "In that case, you could still I mean, there's always ways to kind of work your way around that, which is a bit.", 'start': 3374.67, 'duration': 4.583}, {'end': 3386.017, 'text': 'peculiar and mathematical in some sense, you could say the way it changes might itself be a function of something.', 'start': 3380.074, 'duration': 5.943}, {'end': 3391.32, 'text': "So if you take that into account, maybe the whole thing becomes Markov again, but it's usually complex, so you don't want to care.", 'start': 3386.077, 'duration': 5.243}, {'end': 3395.882, 'text': "So it's often much simpler to say just it changes over time and then it wouldn't be Markov.", 'start': 3391.86, 'duration': 4.022}], 'summary': 'In non-stationary environments, probabilities change over time, making it not markov, but adjustments can be made to make it markov again.', 'duration': 32.001, 'max_score': 3363.881, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU3363881.jpg'}, {'end': 3565.095, 'src': 'embed', 
'start': 3506.027, 'weight': 2, 'content': [{'end': 3510.048, 'text': "Sometimes, though, it's very tricky to set up these events easily.", 'start': 3506.027, 'duration': 4.021}, {'end': 3516.349, 'text': 'So this is why, for instance in safe reinforcement learning people more typically they still model, say, the expected money,', 'start': 3510.528, 'duration': 5.821}, {'end': 3520.01, 'text': "but then they just add the condition that they don't want to drop below something.", 'start': 3516.349, 'duration': 3.661}, {'end': 3524.911, 'text': "It might be possible to phrase the problem differently where it's more weighted, differently.", 'start': 3520.33, 'duration': 4.581}, {'end': 3530.632, 'text': "where certain negative rewards are weighted more heavily, say, and that's the reward that the learning system gets.", 'start': 3524.911, 'duration': 5.721}, {'end': 3535.093, 'text': "But sometimes it's harder to do that than just to solve it with certain constraints in place.", 'start': 3530.992, 'duration': 4.101}, {'end': 3537.454, 'text': 'Very good questions.', 'start': 3536.854, 'duration': 0.6}, {'end': 3546.121, 'text': "One high level thing that I wanted to say here is a lot of these things that I've shown right now are basically just definitions.", 'start': 3538.836, 'duration': 7.285}, {'end': 3549.724, 'text': 'For instance, the return and the value.', 'start': 3546.742, 'duration': 2.982}, {'end': 3559.831, 'text': "they're defined in a certain way, and this way they're defined might depend on the indefinite future into the, into the infinite future, essentially.", 'start': 3549.724, 'duration': 10.107}, {'end': 3563.494, 'text': "Which means that you don't have access to these things in practice.", 'start': 3561.112, 'duration': 2.382}, {'end': 3565.095, 'text': 'This is just a definition.', 'start': 3563.955, 'duration': 1.14}], 'summary': 'In reinforcement learning, setting up events can be challenging, with potential for weighted negative rewards and 
indefinite future dependencies.', 'duration': 59.068, 'max_score': 3506.027, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU3506027.jpg'}, {'end': 3656.256, 'src': 'embed', 'start': 3627.981, 'weight': 10, 'content': [{'end': 3634.287, 'text': 'one reason to approximate them is your state space might be too big to actually model these things exactly to even fit it in memory.', 'start': 3627.981, 'duration': 6.306}, {'end': 3638.932, 'text': 'So then you might want to generalize across that, as you would typically also do with neural networks in deep learning.', 'start': 3634.688, 'duration': 4.244}, {'end': 3647.689, 'text': 'Another reason to approximate is just that you might not have access to the model needed to compute these expectations.', 'start': 3641.023, 'duration': 6.666}, {'end': 3656.256, 'text': 'So you might need to sample, which means you will end up with approximations which will get better when you sample more and more and more,', 'start': 3648.089, 'duration': 8.167}], 'summary': 'Approximation is necessary for large state spaces and when access to the model is limited, improving with more sampling.', 'duration': 28.275, 'max_score': 3627.981, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU3627981.jpg'}, {'end': 3834.725, 'src': 'embed', 'start': 3805.285, 'weight': 11, 'content': [{'end': 3810.909, 'text': 'Whatever we do, when we do get an accurate value function, we can use that to behave optimally.', 'start': 3805.285, 'duration': 5.624}, {'end': 3816.252, 'text': 'I said accurate here, with which I mean basically an exact optimal value function, and you can behave optimally.', 'start': 3811.109, 'duration': 5.143}, {'end': 3821.976, 'text': 'More generally, with suitable approximations, we can behave well, even in interactively big domains.', 'start': 3816.893, 'duration': 5.083}, {'end': 3828.701, 'text': "We lose the 
optimality in that case, because we're learning, we're approximating, there's no way you can get the actual optimal policy,", 'start': 3822.436, 'duration': 6.265}, {'end': 3834.725, 'text': "but turns out in practice you don't actually care that much, because good performance is also already very useful.", 'start': 3828.701, 'duration': 6.024}], 'summary': 'With suitable approximations, we can behave well even in big domains, although we may lose optimality.', 'duration': 29.44, 'max_score': 3805.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU3805285.jpg'}, {'end': 4031.546, 'src': 'embed', 'start': 4004.608, 'weight': 12, 'content': [{'end': 4008.431, 'text': 'In other cases, like a robot walking around, say, through a corridor.', 'start': 4004.608, 'duration': 3.823}, {'end': 4015.196, 'text': 'this is much trickier and you might not have access to the true model and it might be very hard to learn true model.', 'start': 4008.431, 'duration': 6.765}, {'end': 4017.778, 'text': "so it's very dependent on the domain whether it makes sense.", 'start': 4015.196, 'duration': 2.582}, {'end': 4021.901, 'text': 'this is why i basically put down the model as being an optional part of your agent.', 'start': 4017.778, 'duration': 4.123}, {'end': 4025.043, 'text': "many reinforcement learning agents don't have a model component.", 'start': 4021.901, 'duration': 3.142}, {'end': 4025.564, 'text': 'some of them do.', 'start': 4025.043, 'duration': 0.521}, {'end': 4031.546, 'text': "um, there's also in between versions where we might have something that looks a lot like a model,", 'start': 4027.203, 'duration': 4.343}], 'summary': 'Reinforcement learning agents may not require a true model, depending on the domain, with some agents not having a model component.', 'duration': 26.938, 'max_score': 4004.608, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU4004608.jpg'}], 'start': 3064.097, 'title': 'Reinforcement learning concepts and applications', 'summary': 'Covers distributional reinforcement learning, emphasizing recent advancements and potential benefits, while also delving into concepts such as risk aversion, marginalization, continuous domains, approximations, non-stationary environments, defining returns and rewards, and the challenges of modeling distributions. additionally, it discusses value functions, including their approximation and implications for behavior, as well as the role of models in reinforcement learning for predicting environment dynamics and planning.', 'chapters': [{'end': 3146.574, 'start': 3064.097, 'title': 'Distributional reinforcement learning', 'summary': 'Discusses the importance of modeling the distribution of returns in reinforcement learning, highlighting the recent advancements in distributional reinforcement learning and its potential benefits in optimizing the whole distribution of returns.', 'duration': 82.477, 'highlights': ['Modeling the distribution of returns in reinforcement learning is a relatively new area of study, with recent advancements in distributional reinforcement learning.', 'Optimizing the whole distribution of returns is an important consideration in reinforcement learning, with potential benefits in understanding the entire range of possible outcomes.', 'Safe reinforcement learning has been extensively studied, focusing on optimizing the expected return while imposing constraints on the minimum acceptable return.']}, {'end': 3584.33, 'start': 3146.934, 'title': 'Reinforcement learning concepts', 'summary': 'Delves into the concepts of risk aversion, marginalization, continuous domains, approximations, non-stationary environments, defining returns and rewards, and the challenges of modeling distributions in reinforcement learning.', 'duration': 437.396, 'highlights': 
['The chapter discusses risk aversion and risk seeking in decision-making, highlighting the relevance of these concepts in economics and their applications in achieving specific goals. Discussion of risk aversion and risk seeking, their relevance in decision-making, and applications in achieving specific goals.', 'The concept of marginalization and its relation to maximizing in reinforcement learning is explained, emphasizing its significance in eliminating dependence on policies. Explanation of marginalization and its relation to maximizing in reinforcement learning, and its significance in eliminating dependence on policies.', 'The challenges of dealing with continuous domains and approximations in reinforcement learning are addressed, emphasizing their central role in the learning process. Challenges of dealing with continuous domains and approximations, and their central role in the learning process.', 'The concept of non-stationary environments is discussed, highlighting the complexities and potential workarounds in dealing with such environments in reinforcement learning. Discussion of non-stationary environments, their complexities, and potential workarounds.', 'The definition of returns and rewards in reinforcement learning is explored, emphasizing the variations in defining rewards and their impact on the learning process. Exploration of the definition of returns and rewards, variations in defining rewards, and their impact on the learning process.', 'The challenges of modeling distributions in reinforcement learning are outlined, discussing the event-based approach and the complexities of defining events easily. Challenges of modeling distributions, the event-based approach, and the complexities of defining events.', 'The chapter emphasizes the impact of indefinite future definitions on learning in reinforcement learning, highlighting the challenges of accessing and learning from rewards and returns in practice. 
Emphasis on the impact of indefinite future definitions on learning, and the challenges of accessing and learning from rewards and returns in practice.']}, {'end': 4045.294, 'start': 3584.91, 'title': 'Value functions and model learning', 'summary': 'Discusses the definition and approximation of value functions, including reasons for approximation, such as large state space and lack of access to the model, and the implications of approximating value functions for behavior. it also touches on the concept of models in reinforcement learning, including predictions of environment dynamics, the role of models in planning, and their relevance in different domains.', 'duration': 460.384, 'highlights': ['Approximating value functions due to large state space and lack of access to the model The discussion emphasizes the need to approximate value functions in scenarios where the state space is too large to model exactly and when access to the model needed to compute expectations is limited.', 'Implications of approximating value functions on behavior The accurate value function enables optimal behavior, while suitable approximations allow for good performance, even in large domains, albeit losing optimality.', 'The role of models in reinforcement learning and planning Models are discussed as predictions of environment dynamics and their significance in planning, especially in domains where the exact model is available, such as the game of Go.']}], 'duration': 981.197, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU3064097.jpg', 'highlights': ['Recent advancements in distributional reinforcement learning', 'Optimizing the whole distribution of returns in reinforcement learning', 'Safe reinforcement learning with constraints on the minimum acceptable return', 'Risk aversion and risk seeking in decision-making', 'Significance of marginalization in reinforcement learning', 'Challenges of dealing with continuous domains 
and approximations', 'Complexities and potential workarounds in non-stationary environments', 'Variations in defining rewards and their impact on the learning process', 'Challenges of modeling distributions in reinforcement learning', 'Impact of indefinite future definitions on learning', 'Need to approximate value functions in large state space scenarios', 'Implications of approximating value functions on behavior', 'Role of models in reinforcement learning and planning']}, {'end': 5353.854, 'segs': [{'end': 4391.14, 'src': 'embed', 'start': 4366.899, 'weight': 2, 'content': [{'end': 4372.883, 'text': "whenever you have an explicit representation of your policy and value and you're learning both then I'll just call it an actor critic system,", 'start': 4366.899, 'duration': 5.984}, {'end': 4373.543, 'text': 'for simplicity.', 'start': 4372.883, 'duration': 0.66}, {'end': 4377.286, 'text': 'Where the policy is the actor and the value function is the critic.', 'start': 4375.124, 'duration': 2.162}, {'end': 4386.038, 'text': "Separately, there's this distinction between having a model-free agent and a model-based agent.", 'start': 4381.437, 'duration': 4.601}, {'end': 4391.14, 'text': 'Basically, each of these from the previous slide could also have a model.', 'start': 4386.839, 'duration': 4.301}], 'summary': 'Actor-critic system combines policy and value learning. 
distinction between model-free and model-based agents.', 'duration': 24.241, 'max_score': 4366.899, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU4366899.jpg'}, {'end': 4675.79, 'src': 'embed', 'start': 4643.883, 'weight': 1, 'content': [{'end': 4645.846, 'text': 'Um, a learned model is also a prediction.', 'start': 4643.883, 'duration': 1.963}, {'end': 4647.248, 'text': "It's a prediction of the dynamics.", 'start': 4646.066, 'duration': 1.182}, {'end': 4652.3, 'text': 'Control means to optimise the future.', 'start': 4649.519, 'duration': 2.781}, {'end': 4657.863, 'text': 'And this difference is also clear when we talked about these definitions of these value functions,', 'start': 4653.421, 'duration': 4.442}, {'end': 4661.044, 'text': 'where one value function was defined for a given policy.', 'start': 4657.863, 'duration': 3.181}, {'end': 4665.986, 'text': 'So, this would be a prediction problem where we have a policy and we just want to predict how good that policy is.', 'start': 4661.484, 'duration': 4.502}, {'end': 4669.307, 'text': 'And the other value function was defined as the optimal value function.', 'start': 4666.706, 'duration': 2.601}, {'end': 4675.79, 'text': 'So, for any policy, what would be the optimal thing to do? 
That would be the control problem, finding the optimal policy.', 'start': 4669.567, 'duration': 6.223}], 'summary': 'Learned model predicts dynamics and optimizes future policies for control.', 'duration': 31.907, 'max_score': 4643.883, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU4643883.jpg'}, {'end': 5031.48, 'src': 'embed', 'start': 4996.625, 'weight': 3, 'content': [{'end': 4999.046, 'text': "or maybe you can't think about because you don't know enough about the problem.", 'start': 4996.625, 'duration': 2.421}, {'end': 5002.949, 'text': "So it's just something to keep in mind.", 'start': 5001.808, 'duration': 1.141}, {'end': 5009.892, 'text': "So here's an example of how that then looks for Atari.", 'start': 5005.95, 'duration': 3.942}, {'end': 5013.653, 'text': 'So, as I said, there was one system that basically learned these Atari games.', 'start': 5010.052, 'duration': 3.601}, {'end': 5017.855, 'text': 'That system assumed that the rules of the game are unknown.', 'start': 5014.213, 'duration': 3.642}, {'end': 5023.537, 'text': 'So there was no known model of the environment.', 'start': 5018.435, 'duration': 5.102}, {'end': 5031.48, 'text': 'And then the system would learn by playing, just playing the game and then directly learn from the interaction.', 'start': 5024.737, 'duration': 6.743}], 'summary': 'Atari system learned games without known rules or model, by playing and interacting.', 'duration': 34.855, 'max_score': 4996.625, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU4996625.jpg'}, {'end': 5325.271, 'src': 'embed', 'start': 5298.434, 'weight': 0, 'content': [{'end': 5305.881, 'text': 'exploration finds more information and exploitation exploits the information you have right now to maximize the reward as best as you can right now.', 'start': 5298.434, 'duration': 7.447}, {'end': 5311.928, 'text': "It's important to do both, and 
it's a fundamental problem that doesn't naturally occur in supervised learning.", 'start': 5307.323, 'duration': 4.605}, {'end': 5317.373, 'text': 'And in fact, we can already look at this without considering sequentiality and without considering states,', 'start': 5312.868, 'duration': 4.505}, {'end': 5318.855, 'text': "and that's what we'll do in the next lecture.", 'start': 5317.373, 'duration': 1.482}, {'end': 5325.271, 'text': 'Simple examples include if you want to say find a good restaurant, you could go to your favorite restaurant right now.', 'start': 5319.87, 'duration': 5.401}], 'summary': 'Balancing exploration and exploitation is crucial for maximizing rewards in reinforcement learning.', 'duration': 26.837, 'max_score': 5298.434, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU5298434.jpg'}], 'start': 4045.294, 'title': 'Reinforcement learning basics', 'summary': 'Explains reinforcement learning, its application in training an agent to play atari games, challenges of balancing exploration and exploitation, and discovering a good policy while maximizing reward.', 'chapters': [{'end': 4276.131, 'start': 4045.294, 'title': 'Generative models in deep learning', 'summary': 'Explores the concept of generative models, emphasizing the importance of sample models over expected state models, using a simple maze example to illustrate the policy mapping, optimal policy, value function, and the impact of partial models on planning and solutions.', 'duration': 230.837, 'highlights': ['Generative models, also known as sample models or stochastic models, are essential in deep learning for providing a sample next state and enabling the construction of full trajectories. 
These models, often referred to as generative models in deep learning, facilitate the generation of sample next states and the construction of full trajectories.', 'Expected state models are limited as they cannot incorporate the expected state into the model again, making sample models more practical for real-world problems. The limitations of expected state models make them impractical for real-world problems, as they cannot incorporate the expected state into the model again.', 'The chapter presents a simple maze example to demonstrate the concept of policy mapping, optimal policy, and the value function, showcasing the deterministic nature of the optimal policy for quickly exiting the maze. A simple maze example is used to illustrate policy mapping, optimal policy, and the value function, highlighting the deterministic nature of the optimal policy for swiftly exiting the maze.', 'The impact of partial models on planning and solutions is discussed, demonstrating how approximate models can still lead to finding optimal solutions despite not being fully correct in all states. The discussion of partial models emphasizes how approximate models can still yield optimal solutions, despite not being entirely accurate in all states.']}, {'end': 4502.53, 'start': 4280.936, 'title': 'Agent components and classifications', 'summary': 'Discusses the different components of an agent, such as value functions, policies, and models, and their classifications into actor-critic, model-based, and model-free agents, with an emphasis on the high-level view and challenges in reinforcement learning.', 'duration': 221.594, 'highlights': ['The chapter discusses the different components of an agent, such as value functions, policies, and models, and their classifications into actor-critic, model-based, and model-free agents. 
The chapter covers the various components of an agent, including value functions, policies, and models, and explains their categorization into actor-critic, model-based, and model-free agents.', 'The emphasis is on the high-level view and challenges in reinforcement learning. The chapter provides a high-level overview of reinforcement learning and highlights some of the challenges associated with it.', 'The terminology actor-critic is used when an agent has an explicit policy and a value function. The concept of actor-critic is explained, where an agent with an explicit policy and a value function is referred to as an actor-critic system.']}, {'end': 4996.625, 'start': 4503.171, 'title': 'Rl: learning, planning, prediction, and control', 'summary': 'Discusses the distinction between learning and planning in reinforcement learning, the importance of prediction and control, the challenges of applying deep reinforcement learning, and the considerations when choosing function classes like neural networks or linear functions.', 'duration': 493.454, 'highlights': ['The distinction between learning and planning in reinforcement learning, and the importance of prediction and control The chapter discusses the difference between learning and planning in reinforcement learning, emphasizing the importance of prediction and control in evaluating and optimizing future outcomes.', 'Challenges of applying deep reinforcement learning and the considerations when choosing function classes The challenges of applying deep reinforcement learning are highlighted, including the violation of assumptions made in typical supervised learning, and the considerations when choosing function classes like neural networks or linear functions.', 'The distinction between prediction and control in reinforcement learning The chapter explains the distinction between prediction and control in reinforcement learning, emphasizing the importance of evaluating future outcomes versus optimizing future 
outcomes.']}, {'end': 5353.854, 'start': 4996.625, 'title': 'Reinforcement learning basics', 'summary': 'Explains the concept of reinforcement learning, highlighting its application in training an agent to play atari games by learning from interaction, the challenges of balancing exploration and exploitation, and the importance of discovering a good policy from new experiences while maximizing reward.', 'duration': 357.229, 'highlights': ['The chapter focuses on training an agent to play Atari games by learning from interaction. It describes a system that learned Atari games by directly learning from interaction, where the joystick defines the action and the system receives rewards and observations as pixels.', 'The dilemma between exploration and exploitation is a fundamental problem in reinforcement learning. The chapter explains the balance between exploration, which finds more information, and exploitation, which exploits existing information to maximize reward, and the challenges in achieving this balance.', 'The importance of discovering a good policy from new experiences while maximizing reward is emphasized. 
It discusses the significance of actively searching for new data and the need to balance exploration and exploitation to achieve optimal performance in reinforcement learning.']}], 'duration': 1308.56, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU4045294.jpg', 'highlights': ['The chapter explains the balance between exploration and exploitation, finding more information and exploiting existing information to maximize reward.', 'The chapter discusses the difference between learning and planning in reinforcement learning, emphasizing the importance of prediction and control in evaluating and optimizing future outcomes.', 'The chapter covers the various components of an agent, including value functions, policies, and models, and explains their categorization into actor-critic, model-based, and model-free agents.', 'The chapter focuses on training an agent to play Atari games by learning from interaction, describing a system that learned Atari games by directly learning from interaction.']}, {'end': 6187.767, 'segs': [{'end': 5489.708, 'src': 'embed', 'start': 5461.179, 'weight': 1, 'content': [{'end': 5464.22, 'text': 'which basically, together with the reward, defines what the goal is.', 'start': 5461.179, 'duration': 3.041}, {'end': 5468.641, 'text': 'So the goal here is not just to find high reward, but also to do it reasonably quickly.', 'start': 5465.001, 'duration': 3.64}, {'end': 5472.042, 'text': 'Because future rewards are discounted.', 'start': 5470.342, 'duration': 1.7}, {'end': 5477.504, 'text': 'Here, under B, the value function is given.', 'start': 5473.283, 'duration': 4.221}, {'end': 5482.324, 'text': "It's a state value function for the uniform random policy.", 'start': 5478.342, 'duration': 3.982}, {'end': 5486.866, 'text': 'And what we see is that actually the most preferred state that you can possibly be in is state A.', 'start': 5482.504, 'duration': 4.362}, {'end': 5489.708, 'text': 
"Because you'll always reliably get that reward of 10.", 'start': 5486.866, 'duration': 2.842}], 'summary': 'The goal is to find high rewards quickly, with state A being the most preferred for its reliable reward of 10.', 'duration': 28.529, 'max_score': 5461.179, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU5461179.jpg'}, {'end': 5626.813, 'src': 'embed', 'start': 5595.845, 'weight': 2, 'content': [{'end': 5599.027, 'text': 'The optimal value function is now all strictly positive.', 'start': 5595.845, 'duration': 3.182}, {'end': 5605.25, 'text': 'For the simple reason that the policy here can choose never to bump into a wall.', 'start': 5599.967, 'duration': 5.283}, {'end': 5607.851, 'text': 'So there are no negative rewards for the optimal policy.', 'start': 5605.79, 'duration': 2.061}, {'end': 5610.312, 'text': 'It just avoids doing that altogether.', 'start': 5608.751, 'duration': 1.561}, {'end': 5613.734, 'text': 'And therefore it can go and collect these positive rewards.', 'start': 5611.053, 'duration': 2.681}, {'end': 5618.056, 'text': 'And now notice as well that the value of state A is much higher than ten.', 'start': 5614.354, 'duration': 3.702}, {'end': 5626.813, 'text': 'Because it can get the immediate reward of ten, but then a couple of steps later, it can again get a reward of ten and so on and so on.', 'start': 5618.836, 'duration': 7.977}], 'summary': 'The optimal value function is all positive, avoiding wall collisions and collecting positive rewards, leading to state A having a much higher value than ten.', 'duration': 30.968, 'max_score': 5595.845, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU5595845.jpg'}, {'end': 5801.519, 'src': 'embed', 'start': 5771.319, 'weight': 0, 'content': [{'end': 5777.721, 'text': 'There will be something called policy gradient methods, which is a family of algorithms that allow you to learn 
the policy directly,', 'start': 5771.319, 'duration': 6.402}, {'end': 5778.461, 'text': "which we'll talk about.", 'start': 5777.721, 'duration': 0.74}, {'end': 5783.963, 'text': "And we'll talk about challenges of deep reinforcement learning, how to set up like a complete agent.", 'start': 5779.602, 'duration': 4.361}, {'end': 5801.519, 'text': "How do you combine these things and how to integrate learning and planning? Are there any questions before we wrap up? Yeah? I don't know.", 'start': 5784.063, 'duration': 17.456}], 'summary': 'Policy gradient methods enable direct policy learning in deep reinforcement learning.', 'duration': 30.2, 'max_score': 5771.319, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU5771319.jpg'}, {'end': 5998.988, 'src': 'embed', 'start': 5971.521, 'weight': 4, 'content': [{'end': 5980.866, 'text': 'And based on the body of the agents and the environments, the agent has learned to go forward, but also in interesting ways.', 'start': 5971.521, 'duration': 9.345}, {'end': 5991.145, 'text': 'Specifically, note that nobody put any information in here on how to move or how to walk.', 'start': 5984.463, 'duration': 6.682}, {'end': 5997.847, 'text': "There wasn't anything pre-coded in terms of how do you move your joints, which means you can also apply this to different bodies.", 'start': 5991.145, 'duration': 6.702}, {'end': 5998.988, 'text': 'Same learning algorithm.', 'start': 5997.847, 'duration': 1.141}], 'summary': 'The agent learned to move forward without pre-coded instructions, applicable to various bodies.', 'duration': 27.467, 'max_score': 5971.521, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU5971521.jpg'}, {'end': 6055.342, 'src': 'embed', 'start': 6018.931, 'weight': 3, 'content': [{'end': 6025.215, 'text': "There's a general principle here that when you do set up a reinforcement learning system and you 
have to define the reward function,", 'start': 6018.931, 'duration': 6.284}, {'end': 6028.738, 'text': "it's typically good to define exactly what you want.", 'start': 6025.215, 'duration': 3.523}, {'end': 6038.785, 'text': 'As you can tell, sometimes you might get slightly unexpected solutions.', 'start': 6030.979, 'duration': 7.806}, {'end': 6043.008, 'text': 'And not quite optimal.', 'start': 6042.247, 'duration': 0.761}, {'end': 6055.342, 'text': 'So, does anybody know the reason why this agent was making these weird movements? So it might be balance.', 'start': 6046.475, 'duration': 8.867}], 'summary': 'Defining a clear reward function is crucial for optimal reinforcement learning outcomes.', 'duration': 36.411, 'max_score': 6018.931, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU6018931.jpg'}], 'start': 5353.854, 'title': 'Reinforcement learning and integration of learning and planning', 'summary': 'Provides an overview of reinforcement learning, covering core principles, learning algorithms, and challenges. It also discusses the integration of learning and planning, emphasizing the role of reward functions in achieving desired outcomes.', 'chapters': [{'end': 5783.963, 'start': 5353.854, 'title': 'Reinforcement learning overview', 'summary': 'The chapter introduces a maze example illustrating the concept of value function and discount factor, and discusses the core principles and learning algorithms of reinforcement learning, including topics such as bandit problems, Markov decision processes, model-free prediction and control, policy gradient methods, and challenges of deep reinforcement learning.', 'duration': 430.109, 'highlights': ['The value function is conditional on both the uniformly random policy and the discount factor, defining the goal to find high rewards reasonably quickly. 
The value function is conditional on both the uniformly random policy and the discount factor, which defines the goal to find high rewards reasonably quickly.', 'The optimal value function is all strictly positive as the policy can choose never to bump into a wall, leading to no negative rewards for the optimal policy.', 'The chapter outlines the core principles and learning algorithms of reinforcement learning, including topics such as bandit problems, Markov decision processes, model-free prediction and control, policy gradient methods, and challenges of deep reinforcement learning.']}, {'end': 6187.767, 'start': 5784.063, 'title': 'Integrating learning and planning', 'summary': 'The chapter discusses the integration of learning and planning, emphasizing the importance of defining the reward function in reinforcement learning to achieve desired outcomes, with examples of a learning system and the impact of reward functions on movement and behavior.', 'duration': 403.704, 'highlights': ['The concept of defining the reward function in reinforcement learning is emphasized, highlighting the importance of aligning the reward with the desired outcome.', "A learning system's ability to achieve locomotion without pre-coded information on movement or walking is highlighted, showcasing the adaptability of the learning algorithm.", 'The impact of reward functions on movement and behavior, including unexpected solutions and the optimization of given tasks, is discussed in the context of reinforcement learning. 
']}], 'duration': 833.913, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ISk80iLhdfU/pics/ISk80iLhdfU5353854.jpg', 'highlights': ['The chapter outlines the core principles and learning algorithms of reinforcement learning, including topics such as bandit problems, Markov decision processes, model-free prediction and control, policy gradient methods, and challenges of deep reinforcement learning.', 'The value function is conditional on both the uniformly random policy and the discount factor, defining the goal to find high rewards reasonably quickly.', 'The optimal value function is all strictly positive as the policy can choose never to bump into a wall, leading to no negative rewards for the optimal policy.', 'The concept of defining the reward function in reinforcement learning is emphasized, highlighting the importance of aligning the reward with the desired outcome.', "A learning system's ability to achieve locomotion without pre-coded information on movement or walking is highlighted, showcasing the adaptability of the learning algorithm.", 'The impact of reward functions on movement and behavior, including unexpected solutions and the optimization of given tasks, is discussed in the context of reinforcement learning.']}], 'highlights': ['The reinforcement learning track covers high-level concepts and its ties with deep learning, with a focus on using deep learning methods.', 'Reinforcement learning involves active interaction and sequential interactions with the environment, emphasizing the importance of adaptation and online learning.', 'Reinforcement learning spans across various fields including computer science, neuroscience, psychology, engineering, mathematics, and economics, indicating its multi-disciplinary relevance.', 'Introduces the Q function for conditioning on state-action pairs. 
The Q function conditions the value on a state-action pair; Q denotes values of state-action pairs and V denotes values of states.', 'The Bellman equation, assuming stochastic policies, forms a system of equations for solving optimal values and policies.', 'Recent advancements in distributional reinforcement learning', 'The chapter explains the balance between exploration and exploitation, finding more information and exploiting existing information to maximize reward.', 'The chapter outlines the core principles and learning algorithms of reinforcement learning, including topics such as bandit problems, Markov decision processes, model-free prediction and control, policy gradient methods, and challenges of deep reinforcement learning.']}