title
DeepMind x UCL RL Lecture Series - Introduction to Reinforcement Learning [1/13]
description
Research Scientist Hado van Hasselt introduces the reinforcement learning course and explains how reinforcement learning relates to AI.
Slides: https://dpmd.ai/introslides
Full video lecture series: https://dpmd.ai/DeepMindxUCL21
detail
{'title': 'DeepMind x UCL RL Lecture Series - Introduction to Reinforcement Learning [1/13]', 'heatmap': [{'end': 1997.242, 'start': 1830.866, 'weight': 1}, {'end': 2107.35, 'start': 2049.52, 'weight': 0.713}, {'end': 2485.356, 'start': 2424.641, 'weight': 0.715}, {'end': 3346.91, 'start': 3287.005, 'weight': 0.744}], 'summary': "The introduction to the reinforcement learning course features lectures by hado van hasselt, exploring ai concepts, alan turing's proposal, active learning, dqn role, markov property, importance of policies and value predictions, and deep reinforcement learning using atari and grid world examples.", 'chapters': [{'end': 94.926, 'segs': [{'end': 67.221, 'src': 'embed', 'start': 21.401, 'weight': 0, 'content': [{'end': 25.884, 'text': "so instead of talking from a lecture hall, I'm now talking to you from my home.", 'start': 21.401, 'duration': 4.483}, {'end': 30.339, 'text': 'The topic of the course, as mentioned, is reinforcement learning.', 'start': 27.818, 'duration': 2.521}, {'end': 34.019, 'text': 'I will explain what that means, what those words mean, reinforcement learning,', 'start': 30.399, 'duration': 3.62}, {'end': 43.541, 'text': "and we'll go into some depth in multiple lectures to explain different concepts and different algorithms that we can build.", 'start': 34.019, 'duration': 9.522}, {'end': 46.121, 'text': "I'm not teaching this course by myself.", 'start': 44.821, 'duration': 1.3}, {'end': 50.802, 'text': 'Some of the lectures will be taught by Diana Borsa and some will be taught by Matteo Hessel.', 'start': 46.721, 'duration': 4.081}, {'end': 56.083, 'text': 'And today will be about introducing reinforcement learning.', 'start': 52.122, 'duration': 3.961}, {'end': 63.699, 'text': "There's also a really good book on this topic by Rich Sutton and Andy Barto, which I highly recommend.", 'start': 56.974, 'duration': 6.725}, {'end': 67.221, 'text': 'And this is also going to be used as background material for this course.', 'start': 63.859, 'duration': 3.362}], 'summary': 'Reinforcement learning course will cover multiple lectures, taught by different instructors, using a recommended book as background material.', 'duration': 45.82, 'max_score': 21.401, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc21401.jpg'}], 'start': 1.169, 'title': 'Reinforcement learning course', 'summary': 'Introduces a course on reinforcement learning taught by hado van hasselt, with contributions from diana borsa and matteo hessel, focusing on explaining the concepts and algorithms, and referencing a recommended book by rich sutton and andy barto.', 'chapters': [{'end': 94.926, 'start': 1.169, 'title': 'Reinforcement learning: introduction course', 'summary': 'Introduces a course on reinforcement learning taught by hado van hasselt, with contributions from diana borsa and matteo hessel, focusing on explaining the concepts and algorithms, and referencing a recommended book by rich sutton and andy barto.', 'duration': 93.757, 'highlights': ['The course is on reinforcement learning and will cover different concepts and algorithms. The course will delve into multiple lectures to explain different concepts and algorithms.', 'The lectures will be taught by Hado van Hasselt, Diana Borsa, and Matteo Hessel. The course will be taught by multiple instructors, including Hado van Hasselt, Diana Borsa, and Matteo Hessel.', 'The course references a recommended book by Rich Sutton and Andy Barto. 
The course recommends a book by Rich Sutton and Andy Barto as background material.']}], 'duration': 93.757, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc1169.jpg', 'highlights': ['The course will cover different concepts and algorithms in reinforcement learning.', 'The lectures will be taught by Hado van Hasselt, Diana Borsa, and Matteo Hessel.', 'The course recommends a book by Rich Sutton and Andy Barto as background material.']}, {'end': 481.161, 'segs': [{'end': 221.095, 'src': 'embed', 'start': 191.941, 'weight': 1, 'content': [{'end': 195.302, 'text': 'And in addition to that, also coming up with new things that we could solve with machines.', 'start': 191.941, 'duration': 3.361}, {'end': 199.743, 'text': "So even things that we weren't doing before, we could now make machines that could do those things for us.", 'start': 195.742, 'duration': 4.001}, {'end': 209.965, 'text': 'Of course, this led to huge productivity increase worldwide and it also fed into a new stage, you could argue, comes after this,', 'start': 201.038, 'duration': 8.927}, {'end': 211.887, 'text': 'which you could call the digital revolution.', 'start': 209.965, 'duration': 1.922}, {'end': 218.332, 'text': 'And one way to interpret this is to say the digital revolution was all about automating repeated mental solutions.', 'start': 212.988, 'duration': 5.344}, {'end': 221.095, 'text': 'So a classic example here would be a calculator.', 'start': 219.113, 'duration': 1.982}], 'summary': 'Advancements in automation led to a significant increase in productivity, fueling the digital revolution.', 'duration': 29.154, 'max_score': 191.941, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc191941.jpg'}, {'end': 333.259, 'src': 'embed', 'start': 301.748, 'weight': 0, 'content': [{'end': 304.67, 'text': 'In addition, this requires you to autonomously make decisions.', 'start': 301.748, 'duration': 2.922}, {'end': 309.914, 'text': "So I'm putting these terms basically up front and center.", 'start': 304.83, 'duration': 5.084}, {'end': 313.216, 'text': "So there's learning and autonomy and decisions.", 'start': 310.434, 'duration': 2.782}, {'end': 317.466, 'text': 'And these are all quite central to this generic problem of trying to find solutions.', 'start': 313.356, 'duration': 4.11}, {'end': 322.33, 'text': "Of course, we're not the first to talk about artificial intelligence.", 'start': 319.648, 'duration': 2.682}, {'end': 327.254, 'text': 'This has been a topic of investigation for many decades now.', 'start': 323.071, 'duration': 4.183}, {'end': 333.259, 'text': "And there's this wonderful paper by Alan Turing from 1950 called Computing Machinery and Intelligence.", 'start': 327.975, 'duration': 5.284}], 'summary': 'Autonomous decision-making is central to finding ai solutions, a topic under investigation for decades, notably by alan turing in 1950.', 'duration': 31.511, 'max_score': 301.748, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc301748.jpg'}, {'end': 473.297, 'src': 'embed', 'start': 445.491, 'weight': 2, 'content': [{'end': 450.514, 'text': "And maybe it's easier to actually write a program that can itself learn in the same way.", 'start': 445.491, 'duration': 5.023}, {'end': 453.636, 'text': 'maybe that we do, or maybe in a similar way, or maybe in a slightly different way.', 'start': 450.514, 'duration': 3.122}, {'end': 463.331, 'text': 
'but it can learn by interacting with the world, by, in his words, subjecting itself to education,', 'start': 454.346, 'duration': 8.985}, {'end': 466.213, 'text': 'maybe to find similar solutions as the adult mind has.', 'start': 463.331, 'duration': 2.882}, {'end': 468.694, 'text': "And he's conjecturing that maybe this is easier.", 'start': 467.053, 'duration': 1.641}, {'end': 473.297, 'text': "Now, this is a really interesting thought, and it's really interesting to think about this a little bit.", 'start': 469.995, 'duration': 3.302}], 'summary': 'Exploring the possibility of a program learning by interacting with the world and finding solutions like the adult mind.', 'duration': 27.806, 'max_score': 445.491, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc445491.jpg'}], 'start': 94.926, 'title': "Reinforcement learning, ai, and alan turing's proposal", 'summary': "Discusses the concepts of reinforcement learning and artificial intelligence, their relation, and the evolution from the industrial revolution to the digital revolution, emphasizing the potential of machines to find solutions themselves, leading to autonomy and decision-making. it also explores alan turing's proposal to simulate the child mind, suggesting that it might be easier to write a program that can itself learn.", 'chapters': [{'end': 327.254, 'start': 94.926, 'title': 'Reinforcement learning and artificial intelligence', 'summary': 'Discusses the concepts of reinforcement learning and artificial intelligence, their relation, and the evolution from the industrial revolution to the digital revolution, emphasizing the potential of machines to find solutions themselves, leading to autonomy and decision-making.', 'duration': 232.328, 'highlights': ['The chapter explains the concept of reinforcement learning and its relation to artificial intelligence, emphasizing the potential of machines to find solutions themselves, leading to autonomy and decision-making. Reinforcement learning is discussed in relation to artificial intelligence, highlighting the potential of machines to find solutions themselves, leading to autonomy and decision-making.', 'The chapter explores the evolution from the Industrial Revolution to the digital revolution, emphasizing the potential of machines to find solutions themselves, leading to autonomy and decision-making. The evolution from the Industrial Revolution to the digital revolution is explored, emphasizing the potential of machines to find solutions themselves, leading to autonomy and decision-making.', 'The chapter emphasizes the potential of machines to find solutions themselves, leading to autonomy and decision-making. The potential of machines to find solutions themselves, leading to autonomy and decision-making, is emphasized.']}, {'end': 481.161, 'start': 327.975, 'title': 'Alan turing on simulating the child mind', 'summary': "Discusses alan turing's proposal to consider simulating the child mind instead of the adult mind, conjecturing that writing a program to simulate the adult mind might be quite complicated and it might be easier to write a program that can itself learn.", 'duration': 153.186, 'highlights': ['Alan Turing proposes simulating the child mind instead of the adult mind, conjecturing that it might be easier to write a program that can itself learn. 
Turing suggests that instead of simulating the adult mind, it might be easier to simulate the child mind and subject it to education, as it could potentially lead to obtaining the adult brain.', 'Turing conjectures that writing a program to simulate the adult mind might be quite complicated due to the complexity of human experience and learning. Turing suggests that the complexity of human experience and learning might make it tricky to write a program that simulates the adult mind, as individuals learn a lot of rules, pattern matching, and skills throughout their lives.', "Turing's idea revolves around the concept of a program that can learn by interacting with the world, subjecting itself to education, and finding similar solutions as the adult mind. Turing proposes the idea of a program that can learn by interacting with the world and subjecting itself to education, aiming to find similar solutions as the adult mind, which he conjectures might be an easier approach."]}], 'duration': 386.235, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc94926.jpg', 'highlights': ["The chapter discusses reinforcement learning and its relation to artificial intelligence, emphasizing machines' potential for autonomy and decision-making.", 'It explores the evolution from the Industrial Revolution to the digital revolution, highlighting the potential of machines for autonomy and decision-making.', 'Alan Turing proposes simulating the child mind, suggesting it might be easier to write a program that can itself learn.']}, {'end': 1659.187, 'segs': [{'end': 573.969, 'src': 'embed', 'start': 542.321, 'weight': 0, 'content': [{'end': 544.902, 'text': 'So this brings us to this question what is reinforcement learning?', 'start': 542.321, 'duration': 2.581}, {'end': 552.345, 'text': 'And this is related to this experience that Alan Turing was also talking about,', 'start': 545.862, 'duration': 6.483}, {'end': 555.846, 'text': 'because we know that people and animals learn by interacting with our environment.', 'start': 552.345, 'duration': 3.501}, {'end': 560.067, 'text': "And this differs from certain other types of learning, and that's good to appreciate.", 'start': 557.226, 'duration': 2.841}, {'end': 562.268, 'text': "First of all, it's active rather than passive.", 'start': 560.327, 'duration': 1.941}, {'end': 565.145, 'text': "And we'll get back to that extensively in the next lecture.", 'start': 562.864, 'duration': 2.281}, {'end': 573.969, 'text': 'What this means is that you are subjected to some data or experience, if you will, but the experience is not fully out of your control.', 'start': 565.905, 'duration': 8.064}], 'summary': "Reinforcement learning involves active learning through interaction with the environment, as discussed in alan turing's work.", 'duration': 31.648, 'max_score': 542.321, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc542321.jpg'}, {'end': 832.359, 'src': 'embed', 'start': 810.106, 'weight': 1, 'content': [{'end': 820.952, 'text': 'So the main purpose of this course is then to go basically inside that agent and figure out how we could build learning algorithms that can help that agent learn to interact better.', 'start': 810.106, 'duration': 10.846}, {'end': 826.655, 'text': 'And what does better mean here? 
Well, the agent is going to try to optimize some reward signal.', 'start': 821.092, 'duration': 5.563}, {'end': 829.677, 'text': "This is how we're going to specify the goal.", 'start': 827.996, 'duration': 1.681}, {'end': 832.359, 'text': 'And the goal is not to optimize the immediate reward.', 'start': 830.257, 'duration': 2.102}], 'summary': 'Course aims to develop learning algorithms to help agents optimize reward signals for improved interactions.', 'duration': 22.253, 'max_score': 810.106, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc810106.jpg'}, {'end': 1122.884, 'src': 'embed', 'start': 1099.135, 'weight': 2, 'content': [{'end': 1108.559, 'text': 'And I put that on the slide because sometimes people conflate the current set of algorithms that we have in reinforcement learning to solve these type of problems with the field of reinforcement learning.', 'start': 1099.135, 'duration': 9.424}, {'end': 1113.08, 'text': "But it's good to separate that out and to appreciate that there is a reinforcement learning problem.", 'start': 1109.099, 'duration': 3.981}, {'end': 1117.782, 'text': "And then there's a set of current solutions that people have considered to solve these problems.", 'start': 1113.661, 'duration': 4.121}, {'end': 1122.884, 'text': 'And that set of solutions might be under a lot of development.', 'start': 1119.043, 'duration': 3.841}], 'summary': 'Reinforcement learning has evolving solutions for current problems.', 'duration': 23.749, 'max_score': 1099.135, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc1099135.jpg'}, {'end': 1400.727, 'src': 'embed', 'start': 1376.552, 'weight': 3, 'content': [{'end': 1382.657, 'text': "And I'm going to say that, basically, reinforcement learning is the science and framework of learning to make decisions from interaction.", 'start': 1376.552, 'duration': 6.105}, {'end': 1387.582, 'text': 'So reinforcement learning is not a set of algorithms, also not a set of problems.', 'start': 1383.718, 'duration': 3.864}, {'end': 1391.726, 'text': 'Sometimes, in shorthand, we say reinforcement learning when referring to the algorithms,', 'start': 1388.183, 'duration': 3.543}, {'end': 1398.186, 'text': "but maybe it's better to say reinforcement learning problems Or reinforcement learning algorithms,", 'start': 1391.726, 'duration': 6.46}, {'end': 1400.727, 'text': 'if you want to specify those two different parts of it.', 'start': 1398.186, 'duration': 2.541}], 'summary': 'Reinforcement learning is the science of decision-making from interaction, encompassing algorithms and problems.', 'duration': 24.175, 'max_score': 1376.552, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc1376552.jpg'}], 'start': 481.161, 'title': 'Reinforcement learning', 'summary': 'Introduces the concept of reinforcement learning, emphasizing its active, goal-directed nature and ability to learn without examples. it explores applications in fields like flying helicopters and managing investment portfolios, highlighting the goal of optimizing cumulative rewards. 
additionally, it discusses the distinction between reinforcement learning problems and current algorithms, emphasizing the reasons to learn and how reinforcement learning provides algorithms for finding solutions and adapting online.', 'chapters': [{'end': 807.834, 'start': 481.161, 'title': 'Understanding reinforcement learning', 'summary': 'Introduces the concept of artificial intelligence, with a focus on reinforcement learning, emphasizing the active, goal-directed, and sequential nature of interactions, as well as the ability to learn without examples, all within the framework of an agent interacting with an environment.', 'duration': 326.673, 'highlights': ['Artificial intelligence aims to learn, make decisions, and achieve goals, potentially through programs that can learn rather than those with fixed capabilities, emphasizing the active, goal-directed, and sequential nature of interactions. The primary goal of artificial intelligence is to learn, make decisions, and achieve goals, possibly through programs that can learn. This emphasizes the active, goal-directed, and sequential nature of interactions, setting the stage for understanding reinforcement learning.', 'Reinforcement learning involves active, goal-directed interactions where the agent learns without examples, optimizing a reward signal to achieve satisfying outcomes. Reinforcement learning encompasses active, goal-directed interactions where the agent learns without examples, optimizing a reward signal to achieve satisfying outcomes, highlighting the autonomous nature of learning within this framework.', "The interaction loop, where the agent interacts with the environment, forms the basis of reinforcement learning, with actions and observations defining the agent's interface and sensory motor stream. The interaction loop, where the agent interacts with the environment, forms the basis of reinforcement learning, with actions and observations defining the agent's interface and sensory motor stream, providing a foundational understanding of the reinforcement learning framework."]}, {'end': 1099.095, 'start': 810.106, 'title': 'Reinforcement learning: goals and applications', 'summary': 'Explores the concept of reinforcement learning, emphasizing the goal of optimizing cumulative rewards and its application in various fields such as flying helicopters, managing investment portfolios, and playing games.', 'duration': 288.989, 'highlights': ['The chapter emphasizes the goal of optimizing cumulative rewards to help the agent learn to interact better. The learning algorithms aim to help the agent learn to optimize the sum of rewards into the future, rather than just focusing on immediate rewards.', 'The concept of the reward signal as a preference function over observations or sequences of observations is discussed. The reward signal is explained as a way for the agent to observe the world, feel happier or less happy about what it sees, and optimize its behavior to achieve more rewards.', 'Concrete examples of reinforcement learning problems are provided, such as flying helicopters, managing investment portfolios, and playing games. 
Various applications of reinforcement learning are presented, including flying helicopters, managing investment portfolios, controlling power stations, making robots walk, and playing video or board games.']}, {'end': 1355.813, 'start': 1099.135, 'title': 'Reinforcement learning: goals and solutions', 'summary': 'Discusses the distinction between reinforcement learning problems and current algorithms, highlighting the two distinct reasons to learn in such problems: finding solutions and adapting online, and how reinforcement learning can provide algorithms for both cases.', 'duration': 256.678, 'highlights': ['Reinforcement learning problems vs current algorithms The distinction between reinforcement learning problems and the current set of algorithms is emphasized, encouraging flexible thinking about solutions.', 'Two distinct reasons to learn in reinforcement learning problems The chapter outlines the two distinct reasons to learn in reinforcement learning problems: finding solutions and adapting online to deal with unforeseen circumstances.', 'Adapting online to deal with unforeseen circumstances The need for systems to adapt online to handle unforeseen circumstances, such as environmental changes or wear and tear, is highlighted as an important aspect of reinforcement learning.']}, {'end': 1659.187, 'start': 1355.813, 'title': 'Understanding reinforcement learning', 'summary': 'Explains reinforcement learning as the science and framework of learning to make decisions from interaction, highlighting its unique properties such as the need to consider time and consequences of actions, and the use of learning systems to find solutions without prior knowledge of the environment.', 'duration': 303.374, 'highlights': ['The chapter defines reinforcement learning as the science and framework of learning to make decisions from interaction, emphasizing its unique properties like the need to consider time and consequences of actions and the use of learning systems to find solutions without prior knowledge of the environment.', 'Reinforcement learning requires actively gathering experience as actions will change the data that is observed, and it involves considering future steps farther into the future, making it a challenging and interesting subject.', 'The examples of Atari games demonstrate how reinforcement learning can enable an agent to learn and play different games without prior knowledge or strategy, solely based on observations, motor controls, and a reward signal.']}], 'duration': 1178.026, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc481161.jpg', 'highlights': ['Reinforcement learning involves active, goal-directed interactions where the agent learns without examples, optimizing a reward signal to achieve satisfying outcomes.', 'The chapter emphasizes the goal of optimizing cumulative rewards to help the agent learn to interact better.', 'The distinction between reinforcement learning problems and the current set of algorithms is emphasized, encouraging flexible thinking about solutions.', 'The chapter defines reinforcement learning as the science and framework of learning to make decisions from interaction, emphasizing its unique properties like the need to consider time and consequences of actions and the use of learning systems to find solutions without prior knowledge of the environment.']}, {'end': 2035.641, 'segs': [{'end': 1687.266, 'src': 'embed', 'start': 1659.187, 'weight': 0, 'content': [{'end': 1661.428, 'text': 'or 
that it was controlling one of these boxes in this example.', 'start': 1659.187, 'duration': 2.241}, {'end': 1664.99, 'text': 'And that is the benefit of having a generic learning algorithm.', 'start': 1662.789, 'duration': 2.201}, {'end': 1669.014, 'text': 'In this case, this algorithm is called DQN, and we will discuss it later in the course as well.', 'start': 1665.632, 'duration': 3.382}, {'end': 1676.099, 'text': "Okay, so now I'll go back to the slides.", 'start': 1673.157, 'duration': 2.942}, {'end': 1680.582, 'text': "So now I've given you a couple of examples.", 'start': 1679.041, 'duration': 1.541}, {'end': 1685.565, 'text': "I've shown you these Atari games and now is a good time to start formalizing things a little bit more concretely,", 'start': 1680.622, 'duration': 4.943}, {'end': 1687.266, 'text': "so that we know a little bit more what's happening.", 'start': 1685.565, 'duration': 1.701}], 'summary': 'Generic learning algorithm dqn discussed in the course.', 'duration': 28.079, 'max_score': 1659.187, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc1659187.jpg'}, {'end': 1738.136, 'src': 'embed', 'start': 1698.671, 'weight': 1, 'content': [{'end': 1699.711, 'text': "What's inside that agent??", 'start': 1698.671, 'duration': 1.04}, {'end': 1700.331, 'text': 'How could this work?', 'start': 1699.731, 'duration': 0.6}, {'end': 1708.733, 'text': "So we're going to go back to this interaction loop and we're going to introduce a little bit of notation where we basically say that every time,", 'start': 1701.471, 'duration': 7.262}, {'end': 1713.514, 'text': 'step T, we receive some observation OT and some reward RT.', 'start': 1708.733, 'duration': 4.781}, {'end': 1717.976, 'text': 'As I mentioned, the reward could also be thought of as being inside the agent.', 'start': 1714.355, 'duration': 3.621}, {'end': 1719.496, 'text': "Maybe it's some function of the observations.", 'start': 1717.996, 'duration': 1.5}, {'end': 1722.781, 'text': 'Or you could think of this as coming with the observations from the environment.', 'start': 1720.219, 'duration': 2.562}, {'end': 1724.923, 'text': 'And then the agent executes some action.', 'start': 1723.422, 'duration': 1.501}, {'end': 1729.708, 'text': 'So the action can be based on this observation OT in terms of our sequence of interactions.', 'start': 1725.384, 'duration': 4.324}, {'end': 1734.152, 'text': 'And then the environment receives that action and emits a new observation.', 'start': 1730.609, 'duration': 3.543}, {'end': 1738.136, 'text': 'Or we could think of the agent as pulling in a new observation and the next reward.', 'start': 1734.773, 'duration': 3.363}], 'summary': 'Introduction to agent-environment interaction loop with observations and rewards.', 'duration': 39.465, 'max_score': 1698.671, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc1698671.jpg'}, {'end': 1997.242, 'src': 'heatmap', 'start': 1830.866, 'weight': 1, 'content': [{'end': 1836.149, 'text': 'Note that the return is only about the future, right? 
So this is at some time set T.', 'start': 1830.866, 'duration': 5.283}, {'end': 1840.672, 'text': 'So this is useful to determine which action to take because your actions cannot influence the past.', 'start': 1836.149, 'duration': 4.523}, {'end': 1841.993, 'text': 'They can only influence the future.', 'start': 1840.752, 'duration': 1.241}, {'end': 1848.016, 'text': 'So, when we define the return, the return is defined as all of the future rewards summed together,', 'start': 1842.573, 'duration': 5.443}, {'end': 1850.958, 'text': "but the past rewards are in the past and we can't change them anymore.", 'start': 1848.016, 'duration': 2.942}, {'end': 1858.362, 'text': "Then we can't maybe always hope to optimize the return itself.", 'start': 1853.499, 'duration': 4.863}, {'end': 1863.145, 'text': "So instead we're going to define the expected return, which we'll call a value.", 'start': 1858.562, 'duration': 4.583}, {'end': 1868.489, 'text': 'So the value at time S would simply be the expectation of the return.', 'start': 1863.826, 'duration': 4.663}, {'end': 1875.353, 'text': "So that's the sum of the rewards going into the future, conditioned on the fact that you're in that state S.", 'start': 1868.509, 'duration': 6.844}, {'end': 1880.277, 'text': "I haven't defined what a state is yet, but for simplicity, you could now think of this as just being your observation,", 'start': 1875.353, 'duration': 4.924}, {'end': 1881.658, 'text': "but I'll talk more about that in a moment.", 'start': 1880.277, 'duration': 1.381}, {'end': 1887.618, 'text': 'So this value does depend on the actions the agent takes.', 'start': 1884.414, 'duration': 3.204}, {'end': 1890.881, 'text': 'And I will also make that a little bit more clear in the notation later on.', 'start': 1887.798, 'duration': 3.083}, {'end': 1896.387, 'text': "So it's good to know that the expectation depends on the dynamics of the world, but also the policy that the agent is following.", 'start': 1891.482, 'duration': 4.905}, {'end': 1898.95, 'text': 'And then the goal is to maximize the value.', 'start': 1896.948, 'duration': 2.002}, {'end': 1902.394, 'text': 'So we want to pick actions such that this value becomes large.', 'start': 1898.97, 'duration': 3.424}, {'end': 1911.455, 'text': "So one way to think about that is that rewards and values together define the utility of states and actions, and there's no supervised feedback.", 'start': 1904.332, 'duration': 7.123}, {'end': 1914.737, 'text': "So we're not saying this action is correct, that action is wrong.", 'start': 1911.475, 'duration': 3.262}, {'end': 1919.319, 'text': "Instead, we're saying this sequence of actions has this value, that sequence of actions has that value.", 'start': 1914.857, 'duration': 4.462}, {'end': 1922.201, 'text': 'And then maybe pick the one that has the highest value.', 'start': 1920.059, 'duration': 2.142}, {'end': 1930.224, 'text': 'Conveniently, and this is used in many algorithms, the returns and the values can be defined recursively.', 'start': 1923.821, 'duration': 6.403}, {'end': 1937.025, 'text': 'So the return at time step t can be thought of as simply the first reward plus the return from that time step t plus one.', 'start': 1930.665, 'duration': 6.36}, {'end': 1940.768, 'text': 'Similarly, the value can be defined recursively.', 'start': 1938.947, 'duration': 1.821}, {'end': 1952.379, 'text': 'So the value at some state s is the expected first reward you get after being in that state and then the value of the state you expect to be in after 
being in that state.', 'start': 1941.129, 'duration': 11.25}, {'end': 1962.627, 'text': 'So the goal is maximizing value by taking actions And actions might have long-term consequences.', 'start': 1956.543, 'duration': 6.084}, {'end': 1968.613, 'text': 'So this is captured in this value function because the value is defined as the expected return, where the return sums the rewards into the future.', 'start': 1962.667, 'duration': 5.946}, {'end': 1974.439, 'text': 'And one way to think about this is that actual rewards associated with certain actions can be delayed.', 'start': 1969.574, 'duration': 4.865}, {'end': 1980.929, 'text': 'What I mean with that is, you might pick an action that might have consequences later on that are important to keep in mind,', 'start': 1975.305, 'duration': 5.624}, {'end': 1984.812, 'text': 'but that do not show up immediately in the reward that you get immediately after taking that action.', 'start': 1980.929, 'duration': 3.883}, {'end': 1989.596, 'text': "This also means that sometimes it's better to sacrifice immediate reward to gain more long-term reward.", 'start': 1985.553, 'duration': 4.043}, {'end': 1991.277, 'text': "And I'll talk more about that in the next lecture.", 'start': 1989.616, 'duration': 1.661}, {'end': 1997.242, 'text': 'So some examples of this might be one that I mentioned before today.', 'start': 1993.619, 'duration': 3.623}], 'summary': 'Maximize value by taking actions with long-term consequences to influence the future rewards, as rewards and values define the utility of states and actions.', 'duration': 166.376, 'max_score': 1830.866, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc1830866.jpg'}, {'end': 1962.627, 'src': 'embed', 'start': 1941.129, 'weight': 3, 'content': [{'end': 1952.379, 'text': 'So the value at some state s is the expected first reward you get after being in that state and then the value of the state you expect to be in after being in that state.', 'start': 1941.129, 'duration': 11.25}, {'end': 1962.627, 'text': 'So the goal is maximizing value by taking actions And actions might have long-term consequences.', 'start': 1956.543, 'duration': 6.084}], 'summary': 'Maximize value by taking actions with long-term consequences.', 'duration': 21.498, 'max_score': 1941.129, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc1941129.jpg'}, {'end': 2010.953, 'src': 'embed', 'start': 1980.929, 'weight': 4, 'content': [{'end': 1984.812, 'text': 'but that do not show up immediately in the reward that you get immediately after taking that action.', 'start': 1980.929, 'duration': 3.883}, {'end': 1989.596, 'text': "This also means that sometimes it's better to sacrifice immediate reward to gain more long-term reward.", 'start': 1985.553, 'duration': 4.043}, {'end': 1991.277, 'text': "And I'll talk more about that in the next lecture.", 'start': 1989.616, 'duration': 1.661}, {'end': 1997.242, 'text': 'So some examples of this might be one that I mentioned before today.', 'start': 1993.619, 'duration': 3.623}, {'end': 2003.887, 'text': 'Refueling a helicopter might be an important action to take, even if it takes you slightly farther away from where you want to go.', 'start': 1998.162, 'duration': 5.725}, {'end': 2010.953, 'text': 'So this could be formalized in such a way that the rewards for that are low or even negative for the act of refueling,', 'start': 2004.451, 'duration': 6.502}], 'summary': 'Delayed rewards 
may be better for long-term gain. refueling may have low or negative immediate rewards.', 'duration': 30.024, 'max_score': 1980.929, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc1980929.jpg'}, {'end': 2046.945, 'src': 'embed', 'start': 2018.136, 'weight': 5, 'content': [{'end': 2022.238, 'text': 'Or to pick the last example, learning a new skill might be something that is costly and time consuming.', 'start': 2018.136, 'duration': 4.102}, {'end': 2024.639, 'text': 'at first might not be hugely enjoyable,', 'start': 2022.238, 'duration': 2.401}, {'end': 2033.062, 'text': 'but maybe in the long term it will yield you more benefits and therefore you learn this new skill to maximize your value rather than the instantaneous reward.', 'start': 2024.639, 'duration': 8.423}, {'end': 2035.641, 'text': "For instance, maybe that's why you're following this course.", 'start': 2033.84, 'duration': 1.801}, {'end': 2043.444, 'text': 'Just in terms of terminology, we call a mapping from states to actions a policy.', 'start': 2039.322, 'duration': 4.122}, {'end': 2046.945, 'text': 'This is just shorthand in some sense for an action selection policy.', 'start': 2043.804, 'duration': 3.141}], 'summary': 'Learning a new skill may be costly and time-consuming but yields long-term benefits, maximizing value.', 'duration': 28.809, 'max_score': 2018.136, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2018136.jpg'}], 'start': 1659.187, 'title': 'Reinforcement learning basics and maximizing value', 'summary': 'Covers the fundamentals of reinforcement learning, including the role of dqn and the interaction loop, as well as the concept of maximizing cumulative rewards and the importance of value maximization for states and actions in achieving long-term goals.', 'chapters': [{'end': 1759.98, 'start': 1659.187, 'title': 'Understanding reinforcement learning basics', 'summary': 'Introduces the concept of reinforcement learning, explaining the role of the learning algorithm dqn and the interaction loop involving observations, rewards, and actions.', 'duration': 100.793, 'highlights': ["The chapter discusses the use of the DQN algorithm in reinforcement learning, offering a glimpse into the concept's future exploration in the course.", 'Reinforcement learning is formalized through an interaction loop, where at each time step T, an observation OT and a reward RT are received, leading to the execution of an action by the agent.', 'The interaction loop in reinforcement learning involves the agent receiving observations, executing actions, and the environment emitting new observations and rewards, forming a sequence of interactions.']}, {'end': 2035.641, 'start': 1761.4, 'title': 'Maximizing value in reinforcement learning', 'summary': 'Introduces the concept of maximizing cumulative rewards in reinforcement learning, defining the return as the sum of rewards into the future and emphasizing the importance of maximizing the value of states and actions to achieve long-term goals.', 'duration': 274.241, 'highlights': ['The return is defined as the sum of rewards into the future, with actions influencing future outcomes, and the goal is to maximize the value of states and actions. 
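Illustrative aside (not from the video itself): the return and value discussed in the transcript above can be sketched in a few lines of Python. The function name, the example reward sequence, and the discount factor gamma (which the lecture only introduces later, to weigh near-term against long-term rewards) are assumptions added here for illustration, not the lecture's own code.

def returns_from_rewards(rewards, gamma=1.0):
    """Compute the return G_t for every time step t, given a list of future rewards [R_1, R_2, ...].

    Uses the recursion mentioned in the lecture: G_t = R_{t+1} + gamma * G_{t+1};
    gamma=1.0 gives the plain undiscounted sum of future rewards.
    """
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G   # recursive definition of the return
        returns[t] = G
    return returns

# Example: sacrificing immediate reward (-1 now) for a delayed reward (+10 two steps later).
print(returns_from_rewards([-1.0, 0.0, 10.0], gamma=0.9))  # G_0 = -1 + 0.9*0 + 0.9**2 * 10 = 7.1

The value of a state is then the expectation of such a return over many interactions starting from that state, under whatever policy the agent follows.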
The return is the accumulation of rewards over time, emphasizing the influence of actions on future outcomes, and the goal is to maximize the value of states and actions.', 'Actions may have delayed consequences, requiring sacrifice of immediate reward to gain long-term benefits, such as refueling a helicopter or learning a new skill. Actions can result in delayed consequences, necessitating the sacrifice of immediate rewards for long-term gains, exemplified by refueling a helicopter or learning a new skill.', 'Learning a new skill may initially be costly and time-consuming but can yield long-term benefits, emphasizing the importance of maximizing value over instantaneous reward. The initial cost and time investment in learning a new skill can result in long-term benefits, highlighting the importance of prioritizing value over immediate rewards.']}], 'duration': 376.454, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc1659187.jpg', 'highlights': ["The chapter discusses the use of the DQN algorithm in reinforcement learning, offering a glimpse into the concept's future exploration in the course.", 'Reinforcement learning is formalized through an interaction loop, where at each time step T, an observation OT and a reward RT are received, leading to the execution of an action by the agent.', 'The interaction loop in reinforcement learning involves the agent receiving observations, executing actions, and the environment emitting new observations and rewards, forming a sequence of interactions.', 'The return is defined as the sum of rewards into the future, with actions influencing future outcomes, and the goal is to maximize the value of states and actions.', 'Actions may have delayed consequences, requiring sacrifice of immediate reward to gain long-term benefits, such as refueling a helicopter or learning a new skill.', 'Learning a new skill may initially be costly and time-consuming but can yield long-term benefits, emphasizing the importance of maximizing value over instantaneous reward.']}, {'end': 2470.939, 'segs': [{'end': 2081.754, 'src': 'embed', 'start': 2058.428, 'weight': 2, 'content': [{'end': 2065.594, 'text': 'We have the letter V to denote the value function of states, and we have the letter Q to denote the value function of states and actions.', 'start': 2058.428, 'duration': 7.166}, {'end': 2072.543, 'text': 'And this is simply defined as the expected return conditioned on being in that state and then taking that action A.', 'start': 2066.315, 'duration': 6.228}, {'end': 2078.949, 'text': "So, instead of considering some sort of a policy which immediately could pick a different action in state S, we're saying no, no,", 'start': 2072.543, 'duration': 6.406}, {'end': 2081.754, 'text': "we're in state S and we're considering taking this first action A.", 'start': 2078.949, 'duration': 2.805}], 'summary': 'V represents value function of states, q represents value function of states and actions, defined as expected return conditioned on state and action.', 'duration': 23.326, 'max_score': 2058.428, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2058428.jpg'}, {'end': 2122.173, 'src': 'heatmap', 'start': 2049.52, 'weight': 1, 'content': [{'end': 2054.304, 'text': "It's also possible to define values on not just states, but on actions.", 'start': 2049.52, 'duration': 4.784}, {'end': 2058.388, 'text': 'These are typically denoted with the letter Q for historical 
reasons.', 'start': 2054.525, 'duration': 3.863}, {'end': 2065.594, 'text': 'We have the letter V to denote the value function of states, and we have the letter Q to denote the value function of states and actions.', 'start': 2058.428, 'duration': 7.166}, {'end': 2072.543, 'text': 'And this is simply defined as the expected return conditioned on being in that state and then taking that action A.', 'start': 2066.315, 'duration': 6.228}, {'end': 2078.949, 'text': "So, instead of considering some sort of a policy which immediately could pick a different action in state S, we're saying no, no,", 'start': 2072.543, 'duration': 6.406}, {'end': 2081.754, 'text': "we're in state S and we're considering taking this first action A.", 'start': 2078.949, 'duration': 2.805}, {'end': 2086.579, 'text': 'Now this total expectation will then of course still depend on the future actions that you take.', 'start': 2082.677, 'duration': 3.902}, {'end': 2090.341, 'text': 'So this still depends on some policy that we have to define for the future actions.', 'start': 2086.978, 'duration': 3.363}, {'end': 2094.023, 'text': "But we're just pinning down the first action and conditioning the expectation on that.", 'start': 2090.781, 'duration': 3.242}, {'end': 2099.586, 'text': "We'll talk much more in depth about this in lectures three, four, five, and six.", 'start': 2094.583, 'duration': 5.003}, {'end': 2107.35, 'text': 'So now we can basically summarize the course concepts before we continue.', 'start': 2103.848, 'duration': 3.502}, {'end': 2113.509, 'text': 'So we said that the reinforcement learning formalism includes an environment, which basically defines the dynamics of the problem.', 'start': 2108.226, 'duration': 5.283}, {'end': 2116.67, 'text': 'It includes a reward signal, which specifies the goal.', 'start': 2114.189, 'duration': 2.481}, {'end': 2122.173, 'text': "And sometimes this is taken to be part of the environment, but it's good to basically list it separately.", 'start': 2117.971, 'duration': 4.202}], 'summary': 'Reinforcement learning involves defining values for states and actions, denoted by v and q, and conditioning the expected return on taking a specific action in a state. 
this will be further discussed in lectures three, four, five, and six.', 'duration': 72.653, 'max_score': 2049.52, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2049520.jpg'}, {'end': 2286.642, 'src': 'embed', 'start': 2251.513, 'weight': 3, 'content': [{'end': 2252.894, 'text': 'But there might be other things in the state as well.', 'start': 2251.513, 'duration': 1.381}, {'end': 2254.315, 'text': 'There might be some memory in the state.', 'start': 2252.914, 'duration': 1.401}, {'end': 2256.657, 'text': 'There might be learned components in the state.', 'start': 2254.335, 'duration': 2.322}, {'end': 2260.7, 'text': 'Everything that you take along with you from one time to the next, we could call the agent state.', 'start': 2257.077, 'duration': 3.623}, {'end': 2270.167, 'text': 'We can also talk about the environment states, which is the other side of that coin.', 'start': 2265.523, 'duration': 4.644}, {'end': 2274.972, 'text': 'In many cases, the environment will have some really complicated internal state.', 'start': 2271.348, 'duration': 3.624}, {'end': 2280.076, 'text': 'For instance in the example where the agent is a robot and the environment is a real world,', 'start': 2275.012, 'duration': 5.064}, {'end': 2286.642, 'text': 'then the state of the environment is basically just the state of all of the physical quantities of the world, all of the atoms,', 'start': 2280.076, 'duration': 6.566}], 'summary': 'The agent state may include memory and learned components, while the environment state can involve complicated internal state, such as the physical quantities of the world.', 'duration': 35.129, 'max_score': 2251.513, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2251513.jpg'}, {'end': 2485.356, 'src': 'embed', 'start': 2454.711, 'weight': 0, 'content': [{'end': 2460.354, 'text': 'And this has been used to formulate essentially the reinforcement training problem and also precursors to this.', 'start': 2454.711, 'duration': 5.643}, {'end': 2462.855, 'text': 'And importantly,', 'start': 2461.795, 'duration': 1.06}, {'end': 2470.939, 'text': 'a Markov decision process is essentially a very useful mathematical framework that allows us to reason about algorithms that can be used to solve these decision problems.', 'start': 2462.855, 'duration': 8.084}, {'end': 2485.356, 'text': "The Markov property itself states that a process is Markovian or a state is Markovian for this process if the probability of a reward and a subsequent state doesn't change if we add more history.", 'start': 2471.82, 'duration': 13.536}], 'summary': 'Markov decision process is a useful framework for reinforcement training and decision problems.', 'duration': 30.645, 'max_score': 2454.711, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2454711.jpg'}], 'start': 2039.322, 'title': 'Rl concepts', 'summary': 'Discusses policy and value functions, emphasizing q as the expected return and introduces reinforcement learning concepts including agent states, environment states, and the markov property in the context of a markov decision process.', 'chapters': [{'end': 2094.023, 'start': 2039.322, 'title': 'Value and policy functions in rl', 'summary': 'Discusses the terminology of policy, value functions for states and actions, and the definition of q as the expected return conditioned on a state and action. 
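Illustrative aside (not from the video itself): a minimal Monte Carlo reading of the state value v(s) and action value q(s, a) described above. The two-state toy problem, the action names, and the stochastic policy pi below are invented for illustration; the only point is that q pins down the first action and then follows pi, while v follows pi from the start.

import random

def step(state, action):
    """Invented toy dynamics: 'go' ends the episode with reward 1; 'stay' gives 0 and
    occasionally ends the episode anyway. Returns (next_state, reward, done)."""
    if action == 'go':
        return 1, 1.0, True
    return 0, 0.0, random.random() < 0.1

def pi(state):
    """A fixed stochastic policy pi(a | s): 50/50 between the two actions."""
    return 'go' if random.random() < 0.5 else 'stay'

def rollout(state, first_action=None, gamma=0.9):
    """Sample one discounted return; pinning first_action gives a sample for q(s, a)."""
    G, discount, done = 0.0, 1.0, False
    action = first_action if first_action is not None else pi(state)
    while not done:
        state, reward, done = step(state, action)
        G += discount * reward
        discount *= gamma
        action = pi(state)
    return G

v0 = sum(rollout(0) for _ in range(5000)) / 5000                        # estimate of v(0) under pi
q0_go = sum(rollout(0, first_action='go') for _ in range(5000)) / 5000  # estimate of q(0, 'go')
print(v0, q0_go)  # q(0, 'go') is exactly 1.0 in this toy; v(0) is lower because 'stay' can end with no reward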
it emphasizes the conditioning of the expectation on the first action and its dependence on future actions.', 'duration': 54.701, 'highlights': ['The letter Q denotes the value function of states and actions, defined as the expected return conditioned on being in a state and taking an action A.', 'The chapter explains that instead of considering a policy that immediately picks a different action in state S, it focuses on conditioning the expectation on the first action and its dependence on future actions.', 'The letter V is used to denote the value function of states, and Q for the value function of states and actions, providing a clear distinction between the two.']}, {'end': 2470.939, 'start': 2094.583, 'title': 'Reinforcement learning concepts', 'summary': 'Introduces key concepts of reinforcement learning, including the formalism, components of an agent, agent states, environment states, full observability, and the markov property in the context of a markov decision process.', 'duration': 376.356, 'highlights': ['The reinforcement learning formalism includes an environment, a reward signal, and an agent, with the course focusing on the components of the agent, such as agent states, policy, and value function estimates. The reinforcement learning formalism consists of an environment, a reward signal, and an agent, with a focus on the components of the agent, including agent states, policy, and value function estimates.', 'The agent state encompasses everything the agent carries from one time step to the next, including memory and learned components, while the environment state represents the state of all physical quantities of the world, often invisible and too large to process. The agent state comprises everything carried from one time step to the next, including memory and learned components, while the environment state encompasses all physical quantities of the world, often invisible and too large to process.', 'The history of the agent includes observations, actions, and rewards, serving as the basis for constructing the agent state, and in the fully observable case, the agent state can be solely based on the observation of the full environment state. The history of the agent includes observations, actions, and rewards, serving as the basis for constructing the agent state, and in the fully observable case, the agent state can be solely based on the observation of the full environment state.', 'The concept of the Markov property and Markov decision processes form a crucial mathematical framework for reasoning about algorithms used in solving decision problems within reinforcement training. 
The Markov property and Markov decision processes constitute a crucial mathematical framework for reasoning about algorithms used in solving decision problems within reinforcement learning.']}], 'duration': 431.617, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2039322.jpg', 'highlights': ['The Markov property and Markov decision processes form a crucial mathematical framework for reasoning about algorithms used in solving decision problems within reinforcement learning.', 'The reinforcement learning formalism includes an environment, a reward signal, and an agent, with the course focusing on the components of the agent, such as agent states, policy, and value function estimates.', 'The letter Q denotes the value function of states and actions, defined as the expected return conditioned on being in a state and taking an action A.', 'The agent state encompasses everything the agent carries from one time step to the next, including memory and learned components, while the environment state represents the state of all physical quantities of the world, often invisible and too large to process.']}, {'end': 3176.72, 'segs': [{'end': 2496.383, 'src': 'embed', 'start': 2471.82, 'weight': 0, 'content': [{'end': 2485.356, 'text': "The Markov property itself states that a process is Markovian or a state is Markovian for this process if the probability of a reward and a subsequent state doesn't change if we add more history.", 'start': 2471.82, 'duration': 13.536}, {'end': 2487.677, 'text': "That's what the equation on the slide means.", 'start': 2485.976, 'duration': 1.701}, {'end': 2491, 'text': 'So we can see the probability of a reward and a state.', 'start': 2488.458, 'duration': 2.542}, {'end': 2496.383, 'text': 'You should interpret this as the probability of those occurring at time step t plus one.', 'start': 2492.741, 'duration': 3.642}], 'summary': 'Markov property: reward and state probability unchanged with more history.', 'duration': 24.563, 'max_score': 2471.82, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2471820.jpg'}, {'end': 2561.21, 'src': 'embed', 'start': 2519.202, 'weight': 1, 'content': [{'end': 2521.164, 'text': "It just means that adding more history doesn't help.", 'start': 2519.202, 'duration': 1.962}, {'end': 2529.491, 'text': 'For instance, if your observations are particularly uninformative, then adding more uninformative observations might not help.', 'start': 2521.644, 'duration': 7.847}, {'end': 2534.095, 'text': "So that might lead to a Markovian state, but that doesn't mean that you can observe the full environment state.", 'start': 2529.551, 'duration': 4.544}, {'end': 2538.656, 'text': "However, if you can observe the full environment state, then you're also Markovian.", 'start': 2534.974, 'duration': 3.682}, {'end': 2544.38, 'text': 'So once the state is known, the history might be thrown away if you have this Markov property.', 'start': 2540.297, 'duration': 4.083}, {'end': 2550.243, 'text': 'And of course, this sounds very useful because the state itself might be a lot smaller than the full history.', 'start': 2544.46, 'duration': 5.783}, {'end': 2559.149, 'text': 'So, as an example, the full agent and environment state is Markov, but it might be really really large because, as I mentioned,', 'start': 2551.364, 'duration': 7.785}, {'end': 2561.21, 'text': 'the environment state might be humongous.', 'start': 2559.149, 'duration': 
2.061}], 'summary': 'Markov property means the state is known, history thrown away. environment state might be humongous.', 'duration': 42.008, 'max_score': 2519.202, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2519202.jpg'}, {'end': 2670.86, 'src': 'embed', 'start': 2640.422, 'weight': 4, 'content': [{'end': 2643.743, 'text': 'So in this case, the observations are not assumed to be Markovian.', 'start': 2640.422, 'duration': 3.321}, {'end': 2646.124, 'text': "And I'll give you an example or a couple of examples.", 'start': 2643.983, 'duration': 2.141}, {'end': 2647.065, 'text': 'So, for instance,', 'start': 2646.524, 'duration': 0.541}, {'end': 2657.529, 'text': 'a robot with a camera which is not told this absolute location would not have Markovian observations because at some point it might be staring at a wall and it might not be able to tell where it is.', 'start': 2647.065, 'duration': 10.464}, {'end': 2660.211, 'text': "It might not be able to tell what's behind it or behind the wall.", 'start': 2657.549, 'duration': 2.662}, {'end': 2665.035, 'text': 'It can maybe just see the wall, and then this observation will not be Markovian,', 'start': 2660.892, 'duration': 4.143}, {'end': 2670.86, 'text': "because the probability of something happening might depend on things that it has seen before, but it doesn't see right now.", 'start': 2665.035, 'duration': 5.825}], 'summary': 'Non-markovian observations in robotics illustrated by a scenario with a robot and a camera.', 'duration': 30.438, 'max_score': 2640.422, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2640422.jpg'}, {'end': 3150.487, 'src': 'embed', 'start': 3122.86, 'weight': 2, 'content': [{'end': 3128.821, 'text': "what I'm doing there is basically trying to construct a suitable state representation to deal with the partial observability in the maze.", 'start': 3122.86, 'duration': 5.961}, {'end': 3133.582, 'text': 'And as examples, I mentioned using just the observation might not be enough.', 'start': 3130.341, 'duration': 3.241}, {'end': 3135.622, 'text': 'Using the full history might be too large.', 'start': 3134.142, 'duration': 1.48}, {'end': 3138.102, 'text': 'But generically you can think of some update function.', 'start': 3136.121, 'duration': 1.981}, {'end': 3141.963, 'text': "And then the question is, how do we pick that update function? And that's actually what we were doing just now.", 'start': 3138.482, 'duration': 3.481}, {'end': 3150.487, 'text': 'Like we were trying to hand pick a function U that updates the state in such a way to take into account the stream of observations.', 'start': 3142.103, 'duration': 8.384}], 'summary': 'Constructing state representation for partial observability in maze', 'duration': 27.627, 'max_score': 3122.86, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3122860.jpg'}], 'start': 2471.82, 'title': 'Markov property in processes and partial observability', 'summary': 'Discusses the markov property in processes, emphasizing the condition for a process to be markovian and the significance of suitable state representations, providing practical examples to illustrate these concepts. 
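Illustrative aside (not from the video itself): the agent-state update function mentioned above, S_{t+1} = u(S_t, A_t, R_{t+1}, O_{t+1}), can be hand-picked in many ways. One simple choice, sketched below with invented names, keeps only a short window of recent observations, actions, and rewards, which is more informative than the latest observation alone (the robot staring at a wall) yet far smaller than the full history.

def update_state(state, action, reward, observation, window=4):
    """One hand-picked update u: S_{t+1} = u(S_t, A_t, R_{t+1}, O_{t+1}).

    Append the newest transition and keep only the last `window` of them,
    so the agent state stays bounded instead of growing with the full history.
    """
    return (state + ((action, reward, observation),))[-window:]

state = ()  # empty agent state before any interaction
for obs in ['wall', 'wall', 'door', 'wall']:   # a made-up observation stream
    state = update_state(state, action='forward', reward=0.0, observation=obs)
print(state)  # recent context helps distinguish situations where the current view is just 'wall'

Such a state is not guaranteed to be Markovian; as the transcript notes, aiming for a state that supports good policies and good value predictions is often more practical than insisting on full Markovianness.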
it also addresses the impact of uninformative observations on achieving a markovian state.', 'chapters': [{'end': 2592.399, 'start': 2471.82, 'title': 'Markov property in processes', 'summary': "Discusses the markov property, stating that a process is markovian if the probability of a reward and a subsequent state doesn't change with more history, leading to a smaller state representation, but adding more uninformative observations might not help achieve a markovian state.", 'duration': 120.579, 'highlights': ["The Markov property states that a process is Markovian if the probability of a reward and a subsequent state doesn't change with more history, leading to a smaller state representation.", 'If observations are uninformative, adding more uninformative observations might not help achieve a Markovian state.', 'The state might be a lot smaller than the full history, and if the full environment state is observable, then it is also Markovian.']}, {'end': 3176.72, 'start': 2593.717, 'title': 'Markov property and partial observability', 'summary': 'Discusses the markov property, partial observability, and constructing a markovian agent state, highlighting examples of observations not being markovian and the importance of selecting suitable state representations.', 'duration': 583.003, 'highlights': ['The chapter discusses the Markov property, partial observability, and constructing a Markovian agent state The chapter delves into the concepts of the Markov property, partial observability, and the construction of a Markovian agent state to address partial observability.', 'Examples of observations not being Markovian Examples are given, such as a robot with a camera and a poker playing agent, to illustrate non-Markovian observations, indicating that the probability of events can depend on past observations not currently visible.', 'Importance of selecting suitable state representations The importance of selecting suitable state representations is emphasized, with mention of using the full history being too large, the need for a state update function to compress information, and the challenge of constructing a Markovian agent state in the context of partial observability.']}], 'duration': 704.9, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc2471820.jpg', 'highlights': ["The Markov property states that a process is Markovian if the probability of a reward and a subsequent state doesn't change with more history, leading to a smaller state representation.", 'The state might be a lot smaller than the full history, and if the full environment state is observable, then it is also Markovian.', 'Importance of selecting suitable state representations The importance of selecting suitable state representations is emphasized, with mention of using the full history being too large, the need for a state update function to compress information, and the challenge of constructing a Markovian agent state in the context of partial observability.', 'If observations are uninformative, adding more uninformative observations might not help achieve a Markovian state.', 'Examples of observations not being Markovian Examples are given, such as a robot with a camera and a poker playing agent, to illustrate non-Markovian observations, indicating that the probability of events can depend on past observations not currently visible.']}, {'end': 3655.28, 'segs': [{'end': 3212.201, 'src': 'embed', 'start': 3178.925, 'weight': 0, 'content': [{'end': 3183.588, 'text': 
"But it's good to note that constructing a full Markovian agent state might not be feasible.", 'start': 3178.925, 'duration': 4.663}, {'end': 3189.072, 'text': 'Like your observation might be really complicated and it might be really hard to construct a full Markovian agent state.', 'start': 3183.748, 'duration': 5.324}, {'end': 3194.855, 'text': "And so instead of trying to always shoot for complete Markovianness, maybe that's not necessary.", 'start': 3189.712, 'duration': 5.143}, {'end': 3198.578, 'text': "Maybe it's more important that we allow good policies and good value predictions.", 'start': 3195.396, 'duration': 3.182}, {'end': 3199.739, 'text': "And sometimes that's easier.", 'start': 3198.638, 'duration': 1.101}, {'end': 3205.879, 'text': 'Sometimes going for optimal is really, really hard, but going for very good is substantially easier.', 'start': 3200.518, 'duration': 5.361}, {'end': 3210.84, 'text': "And that's something more generally that we'll keep in mind when we want to deal with messy, big real world problems,", 'start': 3205.899, 'duration': 4.941}, {'end': 3212.201, 'text': 'where optimality might be out of reach.', 'start': 3210.84, 'duration': 1.361}], 'summary': 'Constructing a full markovian agent state may not be feasible, focus on good policies and value predictions instead.', 'duration': 33.276, 'max_score': 3178.925, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3178925.jpg'}, {'end': 3285.224, 'src': 'embed', 'start': 3255.638, 'weight': 3, 'content': [{'end': 3259.521, 'text': 'pi means the probability of an action given a state.', 'start': 3255.638, 'duration': 3.883}, {'end': 3263.624, 'text': 'Pi is just conventional notation for policies.', 'start': 3261.002, 'duration': 2.622}, {'end': 3268.787, 'text': 'We often use pi to denote a policy and stochastic policy is in some sense a more general case.', 'start': 3263.684, 'duration': 5.103}, {'end': 3271.469, 'text': 'So typically we consider this a probability distribution over actions.', 'start': 3268.807, 'duration': 2.662}, {'end': 3274.659, 'text': "And that's basically it in terms of policies.", 'start': 3273.038, 'duration': 1.621}, {'end': 3279.601, 'text': "Of course, we're going to say a lot more about how to optimize these policies, how to represent them, how to optimize them, and so on.", 'start': 3274.699, 'duration': 4.902}, {'end': 3285.224, 'text': 'But in terms of definitions, all that you need to remember is that pi denotes the probability of an action given a state.', 'start': 3280.001, 'duration': 5.223}], 'summary': 'Pi represents the probability of an action given a state in policies.', 'duration': 29.586, 'max_score': 3255.638, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3255638.jpg'}, {'end': 3346.91, 'src': 'heatmap', 'start': 3287.005, 'weight': 0.744, 'content': [{'end': 3289.766, 'text': 'And then we can move on to value functions and value estimates.', 'start': 3287.005, 'duration': 2.761}, {'end': 3298.139, 'text': 'And what I have here on the slide is a version of the value function as I defined it earlier.', 'start': 3292.515, 'duration': 5.624}, {'end': 3300.881, 'text': 'And I want to mention a couple of things about this.', 'start': 3298.5, 'duration': 2.381}, {'end': 3304.524, 'text': "First of all, it's good to appreciate that this is the definition of the value.", 'start': 3301.121, 'duration': 3.403}, {'end': 3307.126, 'text': "Later, we'll talk 
about how to approximate that.", 'start': 3305.164, 'duration': 1.962}, {'end': 3308.727, 'text': 'This is just defining it.', 'start': 3307.566, 'duration': 1.161}, {'end': 3313.07, 'text': "And I've extended it in two different ways from the previous definition that I had.", 'start': 3309.488, 'duration': 3.582}, {'end': 3317.393, 'text': 'First, I made it very explicit now that the value function depends on the policy.', 'start': 3313.631, 'duration': 3.762}, {'end': 3319.875, 'text': 'And the way to reason about this.', 'start': 3318.254, 'duration': 1.621}, {'end': 3329.561, 'text': 'if I have this conditioning on pi means that I could write this long form to say that every action at subsequent time steps is selected according to this policy,', 'start': 3319.875, 'duration': 9.686}, {'end': 3329.781, 'text': 'pi.', 'start': 3329.561, 'duration': 0.22}, {'end': 3333.283, 'text': "So know that we're not conditioning on a sequence of actions.", 'start': 3330.842, 'duration': 2.441}, {'end': 3340.708, 'text': "No, we're conditioning on a function that is allowed to look at the states that we encounter and then pick an action, which is slightly different.", 'start': 3333.924, 'duration': 6.784}, {'end': 3346.91, 'text': "The other thing that we've done now on this slide is introduce a discount factor.", 'start': 3343.069, 'duration': 3.841}], 'summary': 'Introduction to value functions and estimates in reinforcement learning.', 'duration': 59.905, 'max_score': 3287.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3287005.jpg'}, {'end': 3424.871, 'src': 'embed', 'start': 3390.436, 'weight': 4, 'content': [{'end': 3394.117, 'text': 'then any policy that eventually reaches the goals gets a value of one.', 'start': 3390.436, 'duration': 3.681}, {'end': 3396.938, 'text': "So then we can't distinguish between getting there quickly.", 'start': 3394.617, 'duration': 2.321}, {'end': 3405.524, 'text': "So sometimes discount factors are used to define goals in the sense of oh, maybe it's better to look at the near-term rewards a little bit more,", 'start': 3397.796, 'duration': 7.728}, {'end': 3407.185, 'text': "unless it's a long-term reward.", 'start': 3405.524, 'duration': 1.661}, {'end': 3411.169, 'text': 'So this allows us to trade off the importance of immediate versus long-term rewards.', 'start': 3407.265, 'duration': 3.904}, {'end': 3413.972, 'text': 'So to look at the extremes, to make it a bit more concrete.', 'start': 3411.87, 'duration': 2.102}, {'end': 3415.733, 'text': 'you can consider a discount factor of zero.', 'start': 3413.972, 'duration': 1.761}, {'end': 3424.871, 'text': "If you plug that into the definition of the value as it's written on the slide there you see that then the value function just becomes the immediate reward.", 'start': 3416.764, 'duration': 8.107}], 'summary': 'Policy evaluation assigns value of 1 to achieving goals, discount factors balance immediate vs long-term rewards.', 'duration': 34.435, 'max_score': 3390.436, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3390436.jpg'}, {'end': 3508.708, 'src': 'embed', 'start': 3482.184, 'weight': 5, 'content': [{'end': 3486.827, 'text': 'And we can now do that because the value function can now be used to evaluate the desirability of states.', 'start': 3482.184, 'duration': 4.643}, {'end': 3490.329, 'text': 'And also we can compare different policies on the same state.', 'start': 3487.407, 
'duration': 2.922}, {'end': 3495.751, 'text': 'We can say one value might have a different, sorry, one policy might have a higher value than a different policy.', 'start': 3490.509, 'duration': 5.242}, {'end': 3498.693, 'text': 'And then we can maybe talk about the desirability of different policies.', 'start': 3495.791, 'duration': 2.902}, {'end': 3503.926, 'text': 'And ultimately, we can also then use this to select between actions.', 'start': 3500.805, 'duration': 3.121}, {'end': 3508.708, 'text': "So we could do, so note here we've defined a value function as a function of a policy.", 'start': 3504.026, 'duration': 4.682}], 'summary': 'The value function can evaluate states and policies, aiding in policy comparison and action selection.', 'duration': 26.524, 'max_score': 3482.184, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3482184.jpg'}], 'start': 3178.925, 'title': 'Importance of good policies and value predictions', 'summary': 'Emphasizes prioritizing good policies and value predictions in constructing a full markovian agent state, rather than achieving complete markovianness, particularly for complex real-world problems. it also covers the definitions of policy, value function, and model in reinforcement learning, introducing the concept of policy as a mapping from agent state to actions and the recursive forms of value function and returns.', 'chapters': [{'end': 3212.201, 'start': 3178.925, 'title': 'Importance of good policies and value predictions', 'summary': 'Discusses the challenges of constructing a full markovian agent state and emphasizes the importance of prioritizing good policies and value predictions over achieving complete markovianness, particularly when dealing with complex real-world problems.', 'duration': 33.276, 'highlights': ["It might be really hard to construct a full Markovian agent state, and instead of always aiming for complete Markovianness, it's more important to prioritize good policies and value predictions.", 'Going for very good policies and value predictions is substantially easier than aiming for optimal, especially in messy, big real world problems where optimality might be unattainable.']}, {'end': 3655.28, 'start': 3215.141, 'title': 'Reinforcement learning: policy, value function and model', 'summary': 'Covers the definitions of policy, value function, and model in reinforcement learning, highlighting the concept of policy as a mapping from agent state to actions, the introduction of discount factor in value function, and the recursive forms of value function and returns.', 'duration': 440.139, 'highlights': ['The policy is defined as a mapping from agent state to actions, with stochastic policies denoted by pi as the probability of an action given a state.', 'The introduction of a discount factor in the value function allows for trade-offs between immediate and long-term rewards, with extreme cases represented by discount factors of 0 and 1.', 'The value function is recursive and can be used to evaluate the desirability of states, compare different policies, and select between actions, ultimately leading to the improvement of policies based on value estimates.']}], 'duration': 476.355, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3178925.jpg', 'highlights': ['Prioritize good policies and value predictions over complete Markovianness for complex real-world problems.', 'Constructing a full Markovian agent state can be 
challenging; prioritize good policies and value predictions.', 'In messy, big real-world problems, aiming for very good policies and value predictions is substantially easier than aiming for optimal.', 'Policy is defined as a mapping from agent state to actions, with stochastic policies denoted by pi as the probability of an action given a state.', 'Introduction of a discount factor in the value function allows for trade-offs between immediate and long-term rewards.', 'The value function is recursive and can be used to evaluate the desirability of states, compare different policies, and select between actions.']}, {'end': 4631.834, 'segs': [{'end': 3700.974, 'src': 'embed', 'start': 3675.08, 'weight': 0, 'content': [{'end': 3679.342, 'text': 'And the goal of this would be that if you have an accurate value function, then we can behave optimally.', 'start': 3675.08, 'duration': 4.262}, {'end': 3683.725, 'text': 'I mean, if we have a fully accurate value function, because then you can just look at the value function.', 'start': 3679.683, 'duration': 4.042}, {'end': 3689.148, 'text': 'We could define a similar equation that we had on the previous slide for state action values rather than just for state values.', 'start': 3683.765, 'duration': 5.383}, {'end': 3694.871, 'text': 'And then the optimal policy could just be picking the optimal action according to those values.', 'start': 3689.768, 'duration': 5.103}, {'end': 3698.953, 'text': 'So if we have a fully accurate value function, we can use that to construct an optimal policy.', 'start': 3695.311, 'duration': 3.642}, {'end': 3700.974, 'text': 'This is why these value functions are important.', 'start': 3699.353, 'duration': 1.621}], 'summary': 'Accurate value function leads to optimal behavior and policy.', 'duration': 25.894, 'max_score': 3675.08, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3675080.jpg'}, {'end': 3737.997, 'src': 'embed', 'start': 3709.484, 'weight': 1, 'content': [{'end': 3711.185, 'text': 'even in intractably large domains.', 'start': 3709.484, 'duration': 1.701}, {'end': 3721.051, 'text': "And this is kind of the promise for these approximations, that we don't need to find the precise optimal value in many cases.", 'start': 3712.765, 'duration': 8.286}, {'end': 3726.694, 'text': 'It might be good enough to get close, and then the resulting policies might also perform very well.', 'start': 3721.271, 'duration': 5.423}, {'end': 3733.532, 'text': 'Okay, so the final component inside the agent would be a potential model.', 'start': 3730.089, 'duration': 3.443}, {'end': 3737.997, 'text': 'This is an optional component, similar to how the value functions are optional, although they are very common.', 'start': 3733.973, 'duration': 4.024}], 'summary': 'Approximations can be sufficient, resulting in effective policies.
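The segments above define the value as an expectation of discounted rewards under a policy and note that, with accurate state-action values, acting optimally can just mean picking the highest-valued action. A small sketch of both ideas, assuming a toy reward sequence and a dictionary of Q-values; the numbers and names are illustrative:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...

    With gamma = 0 only the immediate reward counts; with gamma close to 1,
    long-term rewards matter almost as much as immediate ones.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def greedy_action(q_values, state, actions):
    """Pick the action with the highest estimated state-action value."""
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

# Usage with made-up numbers:
print(discounted_return([-1, -1, -1, 0], gamma=0.9))   # approximately -2.71
q = {("s0", "left"): -3.0, ("s0", "right"): -1.5}
print(greedy_action(q, "s0", ["left", "right"]))       # 'right'
```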
optional potential model, similar to value functions.', 'duration': 28.513, 'max_score': 3709.484, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3709484.jpg'}, {'end': 4012.425, 'src': 'embed', 'start': 3985.782, 'weight': 3, 'content': [{'end': 3990.286, 'text': "So if you're one step away from the goal, the value will just be minus one for that optimal policy.", 'start': 3985.782, 'duration': 4.504}, {'end': 3992.228, 'text': "If you're two steps away, it will be minus two and so on.", 'start': 3990.346, 'duration': 1.882}, {'end': 4001.415, 'text': 'This is a model and specifically this is an inaccurate model because note that all of a sudden a part of the maze went missing.', 'start': 3995.35, 'duration': 6.065}, {'end': 4006.44, 'text': 'So in this case, the numbers inside the squares are the rewards.', 'start': 4002.356, 'duration': 4.084}, {'end': 4010.904, 'text': "So these are modeled as just, oh, we've learned the reward is basically minus one everywhere.", 'start': 4006.88, 'duration': 4.024}, {'end': 4012.425, 'text': 'Maybe this is very quick and easy to learn.', 'start': 4010.964, 'duration': 1.461}], 'summary': 'Inaccurate model with rewards as minus one, quick to learn', 'duration': 26.643, 'max_score': 3985.782, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3985782.jpg'}, {'end': 4083.488, 'src': 'embed', 'start': 4041.839, 'weight': 4, 'content': [{'end': 4045.082, 'text': "But it's just an example to say, oh, yeah, your model doesn't have to be perfect.", 'start': 4041.839, 'duration': 3.243}, {'end': 4046.483, 'text': 'If you learn it, it could be imperfect.', 'start': 4045.122, 'duration': 1.361}, {'end': 4048.885, 'text': 'The same, of course, holds for the policy and value function.', 'start': 4046.543, 'duration': 2.342}, {'end': 4049.966, 'text': 'These could also be imperfect.', 'start': 4048.905, 'duration': 1.061}, {'end': 4059.834, 'text': "Okay Now, finally, before we reach the end of this lecture, I'm going to talk about some different agent categories.", 'start': 4053.068, 'duration': 6.766}, {'end': 4064.81, 'text': 'And in particular, this is basically categorization.', 'start': 4061.887, 'duration': 2.923}, {'end': 4069.875, 'text': "It's good to have this terminology in mind, which refers to which part of the agent are used or not used.", 'start': 4064.85, 'duration': 5.025}, {'end': 4075.32, 'text': 'And a value-based agent is a very common version of an agent.', 'start': 4070.716, 'duration': 4.604}, {'end': 4080.725, 'text': "And in this agent, we'll learn a value function, but there's not explicitly a policy separately.", 'start': 4075.941, 'duration': 4.784}, {'end': 4083.488, 'text': 'Instead, the policy is based on the value function.', 'start': 4081.105, 'duration': 2.383}], 'summary': 'Lecture covers imperfect models, policies, value functions & agent categorization.', 'duration': 41.649, 'max_score': 4041.839, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc4041839.jpg'}, {'end': 4307.047, 'src': 'embed', 'start': 4278.151, 'weight': 6, 'content': [{'end': 4281.693, 'text': "In addition, there's this interesting question that I encourage you to ponder a little bit,", 'start': 4278.151, 'duration': 3.542}, {'end': 4291.991, 'text': 'which is that this is something that Rich Sutton often says that, in one way or the other, prediction is maybe very good form of knowledge 
and,', 'start': 4281.693, 'duration': 10.298}, {'end': 4295.555, 'text': 'in particular, if we could predict everything.', 'start': 4291.991, 'duration': 3.564}, {'end': 4301.962, 'text': "it's unclear that we need additional types of knowledge, and i want you to ponder that and think about whether you agree with this or not.", 'start': 4295.555, 'duration': 6.407}, {'end': 4307.047, 'text': 'so if you could predict everything, is there anything else that we need?', 'start': 4301.962, 'duration': 5.085}], 'summary': 'Prediction as a form of knowledge, pondering the necessity of additional knowledge if everything could be predicted.', 'duration': 28.896, 'max_score': 4278.151, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc4278151.jpg'}, {'end': 4470.189, 'src': 'embed', 'start': 4443.697, 'weight': 7, 'content': [{'end': 4447.599, 'text': 'But we can think of planning more generally as some sort of an internal computation process.', 'start': 4443.697, 'duration': 3.902}, {'end': 4453.701, 'text': 'So then learning refers to absorbing new experiences from this interaction loop.', 'start': 4448.099, 'duration': 5.602}, {'end': 4457.703, 'text': "And planning is something that sits internally inside the agent's head.", 'start': 4454.262, 'duration': 3.441}, {'end': 4459.764, 'text': "It's a purely computational process.", 'start': 4458.043, 'duration': 1.721}, {'end': 4470.189, 'text': 'And indeed I personally like to define planning as any computational process that helps you improve your policies or predictions or other things inside the agent without looking at new experience.', 'start': 4460.344, 'duration': 9.845}], 'summary': 'Planning is an internal computational process that improves policies or predictions without new experience.', 'duration': 26.492, 'max_score': 4443.697, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc4443697.jpg'}, {'end': 4501.441, 'src': 'embed', 'start': 4475.744, 'weight': 2, 'content': [{'end': 4483.41, 'text': "And planning is the part that does the additional compute that maybe turns a model that you've learned into a new policy.", 'start': 4475.744, 'duration': 7.666}, {'end': 4491.848, 'text': "It's important also to know that all of these components that we've talked about so far can be represented as functions.", 'start': 4487.321, 'duration': 4.527}, {'end': 4496.574, 'text': 'We could have policies that map states to actions or to probabilities over actions,', 'start': 4492.148, 'duration': 4.426}, {'end': 4501.441, 'text': 'value functions that map states to expected rewards or indeed also to probabilities of these.', 'start': 4496.574, 'duration': 4.867}], 'summary': 'Planning involves additional compute for model to new policy. 
components represented as functions.', 'duration': 25.697, 'max_score': 4475.744, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc4475744.jpg'}, {'end': 4556.124, 'src': 'embed', 'start': 4528.979, 'weight': 8, 'content': [{'end': 4532.94, 'text': 'And the field of researching how to train neural networks is called deep learning.', 'start': 4528.979, 'duration': 3.961}, {'end': 4538.441, 'text': 'And indeed, in reinforcement learning, we can use these deep learning techniques to learn each of these functions.', 'start': 4534, 'duration': 4.441}, {'end': 4540.502, 'text': 'And this has been done with great success.', 'start': 4538.922, 'duration': 1.58}, {'end': 4548.059, 'text': 'It is good to take a little bit of care when we do so because we do often violate assumptions from say supervised learning.', 'start': 4542.096, 'duration': 5.963}, {'end': 4553.742, 'text': 'For instance, the data coming at us might be correlated because for instance, think of a robot operating in a room.', 'start': 4548.179, 'duration': 5.563}, {'end': 4556.124, 'text': 'It might spend some substantial time in that room.', 'start': 4554.163, 'duration': 1.961}], 'summary': 'Deep learning techniques are used in reinforcement learning with great success, but require caution due to potential violations of assumptions from supervised learning.', 'duration': 27.145, 'max_score': 4528.979, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc4528979.jpg'}], 'start': 3659.031, 'title': 'Reinforcement learning fundamentals', 'summary': 'Covers approximating value functions and models, agent components and categorization, and the role of prediction and planning in reinforcement learning. it emphasizes the importance of accurate value functions, agent categorization, and the potential of prediction and planning processes in reinforcement learning.', 'chapters': [{'end': 3874.546, 'start': 3659.031, 'title': 'Approximating value functions and models in reinforcement learning', 'summary': 'Discusses the importance of approximating value functions and models in reinforcement learning, highlighting the significance of accurate value functions for optimal behavior, the promise of suitable approximations in large domains, and the role of models in predicting future states and rewards for extracting good policies.', 'duration': 215.515, 'highlights': ['Accurate value functions are crucial for optimal behavior, allowing the construction of an optimal policy based on the state and action values. The chapter emphasizes the importance of accurate value functions in enabling optimal behavior through the construction of an optimal policy based on the state and action values.', 'Suitable approximations of value functions can still lead to effective behavior in large domains, showcasing the promise of close approximations and their impact on resulting policies. The chapter underscores the potential of suitable approximations in enabling effective behavior in large domains, highlighting that close approximations can lead to resulting policies performing very well.', 'Models play a key role in predicting future states and rewards, requiring additional computation for extracting a good policy and offering various design choices. 
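The segments above note that the policy, value function, and model can all be viewed as functions (and that in deep reinforcement learning those functions are often represented by neural networks). A sketch of those signatures in plain Python, just to make the types concrete; the type aliases and toy implementations are illustrative assumptions:

```python
from typing import Callable, Dict, Tuple

State = str
Action = str

# Policy: maps an agent state to a probability distribution over actions.
Policy = Callable[[State], Dict[Action, float]]

# Value function: maps a state to an expected discounted return under some policy.
ValueFn = Callable[[State], float]

# Model: maps a state and action to a predicted next state and reward.
Model = Callable[[State, Action], Tuple[State, float]]

def uniform_policy(state: State) -> Dict[Action, float]:
    # A stochastic policy pi(a | s): here simply uniform over two actions.
    return {"left": 0.5, "right": 0.5}

def zero_value(state: State) -> float:
    return 0.0

def toy_model(state: State, action: Action) -> Tuple[State, float]:
    # Predicts a next state and reward; a learned model may well be inaccurate.
    return state, -1.0

policy: Policy = uniform_policy
value: ValueFn = zero_value
model: Model = toy_model
print(policy("s0"), value("s0"), model("s0", "left"))
```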
The discussion emphasizes the role of models in predicting future states and rewards, highlighting the need for additional computation to extract a good policy and the existence of various design choices for models.']}, {'end': 4276.911, 'start': 3877.849, 'title': 'Agent components and categorization', 'summary': 'Discusses agent components, including an example of a maze with a reward function of minus one per time step, optimal policy, value function, and inaccurate model. it also covers categorization of agent types such as value-based, policy-based, and actor critic, and sub problems of reinforcement learning - prediction and control.', 'duration': 399.062, 'highlights': ['Optimal Policy and Value Function The optimal policy in the maze aims to minimize the number of minus ones per time step, while the value function increments as the agent moves away from the goal.', 'Inaccurate Model and Imperfect Components The example illustrates the possibility of an inaccurate model due to incomplete learning of the environment, emphasizing the imperfection tolerance of the model, policy, and value function in reinforcement learning.', 'Agent Categorization The lecture introduces categorization of agents including value-based, policy-based, and actor critic, explaining their differences in learning value functions and policies as well as the notion of model-free and model-based agents.', 'Prediction and Control The sub problems of reinforcement learning are presented as prediction and control, where prediction involves evaluating the future through learning value functions, while control focuses on optimizing the future by finding the best policies.']}, {'end': 4631.834, 'start': 4278.151, 'title': 'The role of prediction and planning in reinforcement learning', 'summary': 'Explores the importance of prediction and planning in reinforcement learning, emphasizing the potential of prediction as a form of knowledge and distinguishing between learning and planning processes, while also highlighting the use of deep learning in reinforcement learning.', 'duration': 353.683, 'highlights': ['The potential of prediction as a form of knowledge is emphasized, suggesting that if one could predict everything about the world, additional types of knowledge may not be necessary. Importance of prediction as a form of knowledge, implications of being able to predict everything about the world.', 'The distinction between learning and planning in reinforcement learning is discussed, with learning involving the absorption of new experiences and planning involving internal computational processes to improve policies or predictions. Differentiation between learning and planning processes, the role of learning in absorbing new experiences and planning in internal computational processes.', 'The use of deep learning techniques in reinforcement learning is noted, highlighting the success of utilizing neural networks for learning functions, while also cautioning about potential violations of assumptions from supervised learning. 
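The maze example discussed above uses a reward of minus one per time step, so under the optimal policy the value of a cell is just minus the number of steps to the goal. A small sketch that computes those values with a breadth-first search over a toy grid; the grid layout is made up, not the maze from the slides:

```python
from collections import deque

def optimal_values(free_cells, goal):
    """Value of each cell under the optimal policy when every step costs -1.

    The value equals minus the shortest-path distance to the goal: cells one
    step away get -1, two steps away get -2, and so on.
    """
    values = {goal: 0}
    frontier = deque([goal])
    while frontier:
        x, y = frontier.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in free_cells and nxt not in values:
                values[nxt] = values[(x, y)] - 1
                frontier.append(nxt)
    return values

# A tiny 3x3 grid with the goal in the top-right corner.
cells = {(x, y) for x in range(3) for y in range(3)}
print(optimal_values(cells, goal=(2, 2)))
```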
Utilization of deep learning techniques in reinforcement learning, success of neural networks in learning functions, caution about potential violations of assumptions from supervised learning.']}], 'duration': 972.803, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc3659031.jpg', 'highlights': ['Accurate value functions are crucial for optimal behavior, enabling the construction of an optimal policy based on state and action values.', 'Suitable approximations of value functions can lead to effective behavior in large domains, showcasing the promise of close approximations and their impact on resulting policies.', 'Models play a key role in predicting future states and rewards, requiring additional computation for extracting a good policy and offering various design choices.', 'The optimal policy in the maze aims to minimize the number of minus ones per time step, while the value function increments as the agent moves away from the goal.', 'The example illustrates the possibility of an inaccurate model due to incomplete learning of the environment, emphasizing the imperfection tolerance of the model, policy, and value function in reinforcement learning.', 'The lecture introduces categorization of agents including value-based, policy-based, and actor critic, explaining their differences in learning value functions and policies as well as the notion of model-free and model-based agents.', 'The potential of prediction as a form of knowledge is emphasized, suggesting that if one could predict everything about the world, additional types of knowledge may not be necessary.', 'The distinction between learning and planning in reinforcement learning is discussed, with learning involving the absorption of new experiences and planning involving internal computational processes to improve policies or predictions.', 'The use of deep learning techniques in reinforcement learning is noted, highlighting the success of utilizing neural networks for learning functions, while also cautioning about potential violations of assumptions from supervised learning.']}, {'end': 5389.862, 'segs': [{'end': 4683.105, 'src': 'embed', 'start': 4653.481, 'weight': 0, 'content': [{'end': 4657.483, 'text': "Let's make it a little bit more specific now, what was happening in the Atari game that I showed you.", 'start': 4653.481, 'duration': 4.002}, {'end': 4662.145, 'text': 'So you can think of the observations as the pixels, as I mentioned at that point in time as well.', 'start': 4658.143, 'duration': 4.002}, {'end': 4666.707, 'text': 'The output is the action, which is the joystick controls, and the input is the reward.', 'start': 4662.785, 'duration': 3.922}, {'end': 4671.83, 'text': 'Here on the slide, it actually shows the score, but the actual reward was the difference in score on every time step.', 'start': 4666.787, 'duration': 5.043}, {'end': 4677.243, 'text': 'Note that the rules of the game are unknown and you learn directly from interactive gameplay.', 'start': 4672.682, 'duration': 4.561}, {'end': 4683.105, 'text': 'So you pick actions on the joystick, you see pixels and scores, and this is a well-defined reinforcement learning problem.', 'start': 4677.543, 'duration': 5.562}], 'summary': 'Atari game: observations as pixels, joystick controls, rewards, learning from interactive gameplay.', 'duration': 29.624, 'max_score': 4653.481, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc4653481.jpg'}, {'end':
4736.951, 'src': 'embed', 'start': 4713.769, 'weight': 1, 'content': [{'end': 4720.73, 'text': 'And we do this because we can often learn something from these smaller problems that we can then apply to these much harder to understand difficult,', 'start': 4713.769, 'duration': 6.961}, {'end': 4721.33, 'text': 'big problems.', 'start': 4720.73, 'duration': 0.6}, {'end': 4728.446, 'text': "So, in this specific example, which is from the Sutton and Barto book, it's basically a grid world without any walls,", 'start': 4722.222, 'duration': 6.224}, {'end': 4736.951, 'text': 'although there might be walls at the edges, essentially, but not any walls inside the 5x5 grid.', 'start': 4728.446, 'duration': 8.505}], 'summary': 'Learn from smaller problems to apply to bigger ones, e.g. 5x5 grid world', 'duration': 23.182, 'max_score': 4713.769, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc4713769.jpg'}, {'end': 5019.994, 'src': 'embed', 'start': 4994.593, 'weight': 2, 'content': [{'end': 5003.16, 'text': "going around state B and then going all the way to A would take so long that it's actually more beneficial to jump into state B,", 'start': 4994.593, 'duration': 8.567}, {'end': 5007.404, 'text': 'which will transition you to B prime and then from there on go to state A and then loop indefinitely.', 'start': 5003.16, 'duration': 4.244}, {'end': 5009.425, 'text': 'So this is quite subtle.', 'start': 5008.505, 'duration': 0.92}, {'end': 5013.929, 'text': "I wouldn't have been able to tell you, just from looking at the problem, that this would be the optimal policy,", 'start': 5009.465, 'duration': 4.464}, {'end': 5019.994, 'text': 'but fortunately we have learning and planning algorithms that can sort that out for us and they can find this optimal solution without us having to find it.', 'start': 5013.929, 'duration': 6.065}], 'summary': 'Optimal route: B -> B prime -> A, looping indefinitely', 'duration': 25.401, 'max_score': 4994.593, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc4994593.jpg'}, {'end': 5129.805, 'src': 'embed', 'start': 5103.918, 'weight': 3, 'content': [{'end': 5110.287, 'text': 'You may have heard of an algorithm called Q-learning, or I mentioned earlier in this lecture, an algorithm called DQN.', 'start': 5103.918, 'duration': 6.369}, {'end': 5113.631, 'text': 'DQN is short for Deep Q Network.', 'start': 5110.687, 'duration': 2.944}, {'end': 5118.799, 'text': 'Q, as I mentioned, is often used to refer to state action values.', 'start': 5114.417, 'duration': 4.382}, {'end': 5122.441, 'text': 'Q-learning is an algorithm that can learn state action values.', 'start': 5119.38, 'duration': 3.061}, {'end': 5129.805, 'text': 'And then the DQN algorithm is an algorithm that uses Q-learning in combination with deep neural networks to learn these Atari games.', 'start': 5123.022, 'duration': 6.783}], 'summary': 'DQN and Q-learning are algorithms for learning state action values, with DQN using deep neural networks to learn Atari games.', 'duration': 25.887, 'max_score': 5103.918, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc5103918.jpg'}, {'end': 5179.818, 'src': 'embed', 'start': 5143.776, 'weight': 4, 'content': [{'end': 5149.242, 'text': 'And these are methods that can be used to learn policies directly without necessarily using a value function.', 'start': 5143.776, 'duration': 5.466}, {'end': 5157.272, 'text':
'But we also discuss actor-critic algorithms in which you have both an explicit policy network or function, and you have an explicit value function.', 'start': 5149.703, 'duration': 7.569}, {'end': 5161.206, 'text': 'And this brings us also to deep reinforcement learning, because, as I mentioned,', 'start': 5158.464, 'duration': 2.742}, {'end': 5164.748, 'text': 'these functions are often represented these days with deep neural networks.', 'start': 5161.206, 'duration': 3.542}, {'end': 5165.748, 'text': "That's not the only choice.", 'start': 5164.788, 'duration': 0.96}, {'end': 5167.91, 'text': 'They could also be linear or they could be something else.', 'start': 5165.888, 'duration': 2.022}, {'end': 5172.853, 'text': "But it's a popular choice for a reason, and it works really well.", 'start': 5168.65, 'duration': 4.203}, {'end': 5174.894, 'text': "And we'll discuss that at some length later in this course.", 'start': 5172.893, 'duration': 2.001}, {'end': 5179.818, 'text': 'And also we will talk about how to integrate learning and planning.', 'start': 5177.316, 'duration': 2.502}], 'summary': 'Discusses actor-critic algorithms and deep reinforcement learning using deep neural networks for policy representation.', 'duration': 36.042, 'max_score': 5143.776, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc5143776.jpg'}], 'start': 4632.294, 'title': 'Reinforcement learning', 'summary': 'Explores deep reinforcement learning using atari and grid world examples, core principles and algorithms including q-learning, dqn, policy gradient methods, and actor-critic algorithms, emphasizing learning by interaction and the integration of learning and planning.', 'chapters': [{'end': 5013.929, 'start': 4632.294, 'title': 'Deep reinforcement learning: atari and grid world', 'summary': 'Explores the intersection of deep learning and deep reinforcement learning, using atari and grid world examples to illustrate the complexity and optimization of value functions and policies in reinforcement learning problems.', 'duration': 381.635, 'highlights': ['The Atari example demonstrates learning directly from interactive gameplay through observations as pixels, actions as joystick controls, and rewards as the difference in score on every time step. The Atari game exemplifies learning from interactive gameplay using pixels as observations, joystick controls as actions, and score differences as rewards.', 'The grid world example showcases the complexity of value functions and optimal policies, revealing the desirability of states and the subtleties in determining the optimal strategy. The grid world example illustrates the intricacies of determining optimal value functions and policies, highlighting the desirability of states and the subtleties in choosing the best strategy.', 'The optimal policy in the grid world problem involves transitioning to state A, which results in lucrative rewards, emphasizing the importance of long-term profitability over short-term gains. 
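Q-learning, mentioned in the segments above as an algorithm that learns state-action values (and which DQN combines with deep neural networks), can be written in a few lines in the tabular case. A minimal sketch of the standard update rule; the states, actions, and parameter values are illustrative:

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q

# Usage on a single made-up transition.
q = defaultdict(float)
actions = ["left", "right"]
q_learning_update(q, "s0", "right", reward=-1.0, next_state="s1", actions=actions)
print(q[("s0", "right")])   # -0.1 after one update from zero-initialised values
```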
The optimal policy in the grid world problem prioritizes transitioning to state A, showcasing the significance of long-term profitability over immediate rewards.']}, {'end': 5389.862, 'start': 5013.929, 'title': 'Reinforcement learning principles and algorithms', 'summary': 'Discusses the core principles and algorithms of reinforcement learning, including q-learning, dqn, policy gradient methods, actor-critic algorithms, and deep reinforcement learning, with a focus on learning by interaction and understanding the integration of learning and planning.', 'duration': 375.933, 'highlights': ['The importance of understanding core principles and learning algorithms The focus is on understanding core principles and learning algorithms, as the current state-of-the-art algorithms will change, and understanding the core principles can enable understanding and even inventing new algorithms.', 'Introduction to Q-learning and DQN Q-learning is an algorithm for learning state action values, and DQN is an algorithm that uses Q-learning with deep neural networks to learn Atari games.', 'Discussion of policy gradient methods and actor-critic algorithms Policy gradient methods are used to learn policies directly without necessarily using a value function, and actor-critic algorithms involve both an explicit policy network or function and an explicit value function.', 'Deep reinforcement learning and integration of learning and planning Deep reinforcement learning involves representing functions with deep neural networks, and the chapter also covers the integration of learning and planning for an agent, which involves both internal computation process and learning from new experiences.', 'Example of a reinforcement learning problem - learning to control a body An example is provided where an algorithm learns to control a body to produce forward motion, with the reward being proportional to the speed of motion, demonstrating the ability to traverse complex terrains and different body types through learning.']}], 'duration': 757.568, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TCCjZe0y4Qc/pics/TCCjZe0y4Qc4632294.jpg', 'highlights': ['The Atari example demonstrates learning directly from interactive gameplay through observations as pixels, actions as joystick controls, and rewards as the difference in score on every time step.', 'The grid world example showcases the complexity of value functions and optimal policies, revealing the desirability of states and the subtleties in determining the optimal strategy.', 'The optimal policy in the grid world problem involves transitioning to state A, which results in lucrative rewards, emphasizing the importance of long-term profitability over short-term gains.', 'Introduction to Q-learning and DQN Q-learning is an algorithm for learning state action values, and DQN is an algorithm that uses Q-learning with deep neural networks to learn Atari games.', 'Discussion of policy gradient methods and actor-critic algorithms Policy gradient methods are used to learn policies directly without necessarily using a value function, and actor-critic algorithms involve both an explicit policy network or function and an explicit value function.', 'Deep reinforcement learning involves representing functions with deep neural networks, and the chapter also covers the integration of learning and planning for an agent, which involves both internal computation process and learning from new experiences.']}], 'highlights': ['The course will cover different 
concepts and algorithms in reinforcement learning.', 'The lectures will be taught by Hado van Hasselt, Diana Borsa, and Matteo Hessel.', 'The course recommends a book by Rich Sutton and Andy Barto as background material.', 'Reinforcement learning involves active, goal-directed interactions where the agent learns without examples, optimizing a reward signal to achieve satisfying outcomes.', "The chapter discusses reinforcement learning and its relation to artificial intelligence, emphasizing machines' potential for autonomy and decision-making.", 'The concept of the Markov property and Markov decision processes forms a crucial mathematical framework for reasoning about algorithms used in solving decision problems within reinforcement learning.', "The Markov property states that a process is Markovian if the probability of a reward and a subsequent state doesn't change with more history, leading to a smaller state representation.", 'Prioritize good policies and value predictions over complete Markovianness for complex real-world problems.', 'Accurate value functions are crucial for optimal behavior, enabling the construction of an optimal policy based on state and action values.', 'The Atari example demonstrates learning directly from interactive gameplay through observations as pixels, actions as joystick controls, and rewards as the difference in score on every time step.']}
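Finally, the interaction loop that the whole lecture revolves around (observe, act, receive a reward, update the agent) can be sketched as follows. The environment and agent interfaces here are hypothetical placeholders, not from the lecture or any particular library:

```python
import random

class ToyEnvironment:
    """A made-up environment: the observation is a number the agent should match."""
    def reset(self):
        self.target = random.choice([0, 1])
        return self.target                      # initial observation
    def step(self, action):
        reward = 1.0 if action == self.target else -1.0
        observation = random.choice([0, 1])
        self.target = observation
        return observation, reward

class ToyAgent:
    """A trivial agent: its policy is to repeat the last observation."""
    def act(self, observation):
        return observation
    def update(self, observation, action, reward):
        pass                                     # a real agent would learn here

env, agent = ToyEnvironment(), ToyAgent()
obs = env.reset()
total_reward = 0.0
for t in range(10):
    action = agent.act(obs)                      # agent picks an action
    obs, reward = env.step(action)               # environment returns observation and reward
    agent.update(obs, action, reward)            # agent absorbs the experience (learning)
    total_reward += reward
print("return:", total_reward)
```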