title
MIT 6.S191 (2019): Deep Reinforcement Learning

description
MIT Introduction to Deep Learning 6.S191: Lecture 5
Deep Reinforcement Learning
Lecturer: Alexander Amini
January 2019
For all lectures, slides and lab materials: http://introtodeeplearning.com

detail
{'title': 'MIT 6.S191 (2019): Deep Reinforcement Learning', 'heatmap': [{'end': 998.292, 'start': 909.792, 'weight': 0.881}, {'end': 1346.751, 'start': 1291.971, 'weight': 0.72}, {'end': 1514.194, 'start': 1373.127, 'weight': 0.77}, {'end': 1895.166, 'start': 1833.12, 'weight': 0.977}, {'end': 1977.319, 'start': 1938.552, 'weight': 0.713}, {'end': 2083.92, 'start': 2044.929, 'weight': 0.786}], 'summary': "Delves into deep reinforcement learning, alphago's impact, q and policy function learning, understanding q functions in atari breakout game, reinforcement learning algorithms, and the advancements in alphago, including alphazero's superior performance in chess, shogi, and go.", 'chapters': [{'end': 47.16, 'segs': [{'end': 47.16, 'src': 'embed', 'start': 22.914, 'weight': 0, 'content': [{'end': 47.16, 'text': 'reinforcement learning provides us with a set of mathematical tools and methods for teaching agents how to actually go from perceiving the world which is the way we usually talk about deep learning or machine learning problems in the context of computer vision perception to actually go beyond this perception to actually acting in the world and figuring out how to optimally act in that world.', 'start': 22.914, 'duration': 24.246}], 'summary': 'Reinforcement learning teaches agents to act in the world beyond perception.', 'duration': 24.246, 'max_score': 22.914, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA22914.jpg'}], 'start': 3.854, 'title': 'Deep reinforcement learning', 'summary': 'Delves into deep reinforcement learning, an integration of deep learning and reinforcement learning, offering mathematical tools and techniques for guiding agents to act optimally in their environment.', 'chapters': [{'end': 47.16, 'start': 3.854, 'title': 'Deep reinforcement learning', 'summary': 'Discusses deep reinforcement learning, a combination of disciplines between deep learning and reinforcement learning, providing mathematical tools and methods for teaching agents to act optimally in the world.', 'duration': 43.306, 'highlights': ['Reinforcement learning combines deep learning and reinforcement learning disciplines to teach agents to act optimally in the world.', 'It provides mathematical tools and methods for teaching agents to go beyond perceiving the world to actually acting in the world.']}], 'duration': 43.306, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA3854.jpg', 'highlights': ['Reinforcement learning combines deep learning and reinforcement learning disciplines to teach agents to act optimally in the world.', 'It provides mathematical tools and methods for teaching agents to go beyond perceiving the world to actually acting in the world.']}, {'end': 647.579, 'segs': [{'end': 118.975, 'src': 'embed', 'start': 73.798, 'weight': 0, 'content': [{'end': 78.34, 'text': 'Beating a professional player at Go is a long-standing challenge of artificial intelligence.', 'start': 73.798, 'duration': 4.542}, {'end': 84.989, 'text': "Everything we've ever tried in AI just falls over when you try the game of Go.", 'start': 81.866, 'duration': 3.123}, {'end': 89.954, 'text': 'The number of possible configurations of the board is more than the number of atoms in the universe.', 'start': 85.009, 'duration': 4.945}, {'end': 92.896, 'text': 'AlphaGo found a way to learn how to play Go.', 'start': 90.855, 'duration': 2.041}, {'end': 96.8, 'text': "So far, AlphaGo has beaten every 
challenge we've given it.", 'start': 93.597, 'duration': 3.203}, {'end': 102.646, 'text': "But we won't know its true strength until we play somebody who is at the top of the world, like Lee Sedol.", 'start': 97.301, 'duration': 5.345}, {'end': 107.193, 'text': 'A match like no other is about to get underway in South Korea.', 'start': 103.632, 'duration': 3.561}, {'end': 109.613, 'text': 'Lee Sedol is to Go what Roger Federer is to tennis.', 'start': 107.233, 'duration': 2.38}, {'end': 113.494, 'text': 'Just the very thought of a machine playing a human is inherently intriguing.', 'start': 109.653, 'duration': 3.841}, {'end': 114.594, 'text': 'The place is a madhouse.', 'start': 113.534, 'duration': 1.06}, {'end': 117.614, 'text': 'Welcome to the DeepMind Challenge.', 'start': 114.614, 'duration': 3}, {'end': 118.975, 'text': 'The full world is watching.', 'start': 117.634, 'duration': 1.341}], 'summary': 'AlphaGo, an AI, beat every challenge in Go, ready to face top player Lee Sedol in South Korea.', 'duration': 45.177, 'max_score': 73.798, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA73798.jpg'}, {'end': 174.362, 'src': 'embed', 'start': 143.658, 'weight': 1, 'content': [{'end': 147.481, 'text': 'These ideas that are driving AlphaGo are going to drive our future.', 'start': 143.658, 'duration': 3.823}, {'end': 148.522, 'text': 'This is it, folks.', 'start': 147.721, 'duration': 0.801}, {'end': 157.186, 'text': "So for those of you interested, that's actually a movie that came out about a year or two ago.", 'start': 152.202, 'duration': 4.984}, {'end': 159.308, 'text': "And it's available on Netflix now.", 'start': 157.827, 'duration': 1.481}, {'end': 165.274, 'text': "It's a rather dramatic depiction of the true story of AlphaGo facing Lee Sedol.", 'start': 159.328, 'duration': 5.946}, {'end': 168.096, 'text': "But it's an incredibly powerful story at the same time,", 'start': 165.874, 'duration': 2.222}, {'end': 174.362, 'text': 'because it really shows the impact that this algorithm had on the world and the press that it received as a result.', 'start': 168.096, 'duration': 6.266}], 'summary': "AlphaGo's impact is depicted in a powerful movie available on Netflix.", 'duration': 30.704, 'max_score': 143.658, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA143658.jpg'}, {'end': 305.977, 'src': 'embed', 'start': 274.059, 'weight': 4, 'content': [{'end': 278.843, 'text': 'And the goal of reinforcement learning as opposed to supervised or unsupervised learning,', 'start': 274.059, 'duration': 4.784}, {'end': 284.508, 'text': 'is to actually maximize the future rewards that it could see in any future time step.', 'start': 278.843, 'duration': 5.665}, {'end': 289.912, 'text': 'So to act optimally in this environment such that it can maximize all future rewards that it sees.', 'start': 284.528, 'duration': 5.384}, {'end': 295.713, 'text': 'Going back to the apple example, if I show it this image of an apple,', 'start': 292.011, 'duration': 3.702}, {'end': 305.977, 'text': "the agent might now respond in a reinforcement learning setting by saying I should eat that thing because I've seen in the past that it helps me get nutrition and it helps keep me alive.", 'start': 295.713, 'duration': 10.264}], 'summary': 'Reinforcement learning aims to maximize future rewards for acting optimally in an environment.', 'duration': 31.918, 'max_score': 274.059, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA274059.jpg'}, {'end': 384.904, 'src': 'embed', 'start': 357.55, 'weight': 6, 'content': [{'end': 364.154, 'text': 'So the idea of reinforcement learning deals with, the central component of reinforcement learning deals with an agent.', 'start': 357.55, 'duration': 6.604}, {'end': 368.652, 'text': "So an agent, for example, is like a drone that's making a delivery.", 'start': 365.049, 'duration': 3.603}, {'end': 373.195, 'text': "It could be also Super Mario that's trying to navigate a video game.", 'start': 369.292, 'duration': 3.903}, {'end': 375.997, 'text': 'The algorithm is the agent.', 'start': 374.516, 'duration': 1.481}, {'end': 378.579, 'text': 'And in real life, you are the agent.', 'start': 376.538, 'duration': 2.041}, {'end': 384.904, 'text': "So you're trying to build an algorithm or a machine learning model that models that agent.", 'start': 379.46, 'duration': 5.444}], 'summary': 'Reinforcement learning involves agents like drones or video game characters navigating environments to achieve goals.', 'duration': 27.354, 'max_score': 357.55, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA357550.jpg'}, {'end': 571.322, 'src': 'embed', 'start': 547.102, 'weight': 5, 'content': [{'end': 555.589, 'text': 'So, building on this concept of rewards, we can define this notion of a total reward that the agent obtains at any given time,', 'start': 547.102, 'duration': 8.487}, {'end': 561.634, 'text': 'the total future reward, rather as just the sum of all rewards from that time step into the future.', 'start': 555.589, 'duration': 6.045}, {'end': 571.322, 'text': "So, for example, if we're starting at time t, we're looking at the reward that it obtains at time t, plus the reward that it obtains at time t plus 1,", 'start': 562.635, 'duration': 8.687}], 'summary': 'Total reward is the sum of all rewards obtained from a specific time step into the future.', 'duration': 24.22, 'max_score': 547.102, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA547102.jpg'}], 'start': 48.888, 'title': 'Alphago and reinforcement learning', 'summary': 'Introduces the game-changing story of alphago, showcasing its victory over the top go player and its impact on surpassing human intelligence. 
additionally, it defines reinforcement learning, compares it with other types of learning, and explains key concepts such as rewards and total future rewards.', 'chapters': [{'end': 189.772, 'start': 48.888, 'title': 'Alphago: a game-changing story', 'summary': "Introduces the power of alphago with a dramatic video showing its victory over the world's top go player, highlighting its ability to surpass human intelligence and the impactful influence of the algorithm on the world.", 'duration': 140.884, 'highlights': ["AlphaGo's victory over the top Go player demonstrates its ability to surpass human intelligence and its impactful influence on the world.", "The dramatic depiction of AlphaGo's true story in a movie on Netflix showcases the algorithm's impact and the press it received.", 'The complexity of Go is highlighted by the statement that the number of possible configurations of the board is more than the number of atoms in the universe.', 'The match between AlphaGo and Lee Sedol is likened to a historic event, emphasizing the intrigue of a machine playing a human in a game like Go.']}, {'end': 647.579, 'start': 190.712, 'title': 'Reinforcement learning paradigm', 'summary': 'Introduces the comparison of reinforcement learning with supervised and unsupervised learning, defines key concepts of reinforcement learning, and explains the concept of rewards and total future rewards.', 'duration': 456.867, 'highlights': ['Reinforcement learning compared with supervised and unsupervised learning', 'Definition of key concepts in reinforcement learning', 'Explanation of rewards and total future rewards']}], 'duration': 598.691, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA48888.jpg', 'highlights': ["AlphaGo's victory over the top Go player demonstrates its ability to surpass human intelligence and its impactful influence on the world.", "The dramatic depiction of AlphaGo's true story in a movie on Netflix showcases the algorithm's impact and the press it received.", 'The complexity of Go is highlighted by the statement that the number of possible configurations of the board is more than the number of atoms in the universe.', 'The match between AlphaGo and Lee Sedol is likened to a historic event, emphasizing the intrigue of a machine playing a human in a game like Go.', 'Reinforcement learning compared with supervised and unsupervised learning', 'Explanation of rewards and total future rewards', 'Definition of key concepts in reinforcement learning']}, {'end': 1039.925, 'segs': [{'end': 779.142, 'src': 'embed', 'start': 748.376, 'weight': 1, 'content': [{'end': 755.641, 'text': 'And we want that Q function to represent the expected total discounted reward that it could obtain in the future.', 'start': 748.376, 'duration': 7.265}, {'end': 761.704, 'text': "So now to give an example of this, let's assume, let's go back to the self-driving car example.", 'start': 757.222, 'duration': 4.482}, {'end': 766.948, 'text': "You're placing your self-driving car on a position on the road.", 'start': 763.826, 'duration': 3.122}, {'end': 768.729, 'text': "That's your state.", 'start': 768.008, 'duration': 0.721}, {'end': 779.142, 'text': 'And you want to know, for any given action, what is the total amount of reward, future reward, that that car can achieve by executing that action.', 'start': 769.975, 'duration': 9.167}], 'summary': 'Q function represents future rewards for a self-driving car in different positions.', 'duration': 30.766, 'max_score': 
748.376, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA748376.jpg'}, {'end': 885.576, 'src': 'embed', 'start': 820.76, 'weight': 0, 'content': [{'end': 826.566, 'text': 'Okay so, the key part of this problem in reinforcement learning is actually learning this function.', 'start': 820.76, 'duration': 5.806}, {'end': 827.667, 'text': 'This is the hard thing.', 'start': 826.786, 'duration': 0.881}, {'end': 829.869, 'text': 'So we want to learn this Q value function.', 'start': 827.707, 'duration': 2.162}, {'end': 834.754, 'text': 'So, given a state and given an input, how can we compute that expected return of reward?', 'start': 829.909, 'duration': 4.845}, {'end': 841.75, 'text': "But ultimately, what we need to actually act in the environment is a new function that I haven't defined yet.", 'start': 836.305, 'duration': 5.445}, {'end': 843.312, 'text': "And that's called the policy function.", 'start': 841.77, 'duration': 1.542}, {'end': 846.555, 'text': "So here, we're calling pi of s the policy.", 'start': 843.752, 'duration': 2.803}, {'end': 850.418, 'text': 'And here, the policy only takes as input just the state.', 'start': 847.195, 'duration': 3.223}, {'end': 853.681, 'text': "So it doesn't care about the action that the agent takes.", 'start': 850.959, 'duration': 2.722}, {'end': 858.126, 'text': 'In fact, it wants to output the desired action given any state.', 'start': 854.422, 'duration': 3.704}, {'end': 860.587, 'text': 'So the agent obtains some state.', 'start': 858.706, 'duration': 1.881}, {'end': 861.727, 'text': 'It perceives the world.', 'start': 860.767, 'duration': 0.96}, {'end': 867.689, 'text': 'And ultimately, you want your policy to output the optimal action to take given that state.', 'start': 862.227, 'duration': 5.462}, {'end': 870.21, 'text': "That's ultimately the goal of reinforcement learning.", 'start': 868.11, 'duration': 2.1}, {'end': 873.291, 'text': 'You want to see a state and then know how to act in that state.', 'start': 870.23, 'duration': 3.061}, {'end': 885.576, 'text': 'Now, the question I want to pose here is assuming we can learn this Q function, is there a way that we can now create or infer our policy function?', 'start': 874.592, 'duration': 10.984}], 'summary': 'In reinforcement learning, the goal is to learn the q value function and create an optimal policy for action selection.', 'duration': 64.816, 'max_score': 820.76, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA820760.jpg'}, {'end': 949.583, 'src': 'embed', 'start': 909.792, 'weight': 4, 'content': [{'end': 918.435, 'text': "So what we're going to define the policy function here as is just the arg max over all possible actions of that Q value function.", 'start': 909.792, 'duration': 8.643}, {'end': 927.653, 'text': "So what that means just one more time is that we're going to plug in all possible actions given the state into the Q value,", 'start': 919.929, 'duration': 7.724}, {'end': 934.876, 'text': "find the action that results in the highest possible total return and rewards, and that's going to be the action that we take at that given state.", 'start': 927.653, 'duration': 7.223}, {'end': 940.639, 'text': 'In deep reinforcement learning, there are two main ways that we can try to learn policy functions.', 'start': 935.457, 'duration': 5.182}, {'end': 947.483, 'text': 'The first way is actually by, like I was alluding to before, trying to first learn the Q value 
function.', 'start': 941.46, 'duration': 6.023}, {'end': 949.583, 'text': "So that's on the left-hand side.", 'start': 948.362, 'duration': 1.221}], 'summary': 'Defining policy function as the arg max of q value function for finding the action with the highest possible total return and rewards in deep reinforcement learning.', 'duration': 39.791, 'max_score': 909.792, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA909792.jpg'}, {'end': 998.292, 'src': 'heatmap', 'start': 909.792, 'weight': 0.881, 'content': [{'end': 918.435, 'text': "So what we're going to define the policy function here as is just the arg max over all possible actions of that Q value function.", 'start': 909.792, 'duration': 8.643}, {'end': 927.653, 'text': "So what that means just one more time is that we're going to plug in all possible actions given the state into the Q value,", 'start': 919.929, 'duration': 7.724}, {'end': 934.876, 'text': "find the action that results in the highest possible total return and rewards, and that's going to be the action that we take at that given state.", 'start': 927.653, 'duration': 7.223}, {'end': 940.639, 'text': 'In deep reinforcement learning, there are two main ways that we can try to learn policy functions.', 'start': 935.457, 'duration': 5.182}, {'end': 947.483, 'text': 'The first way is actually by, like I was alluding to before, trying to first learn the Q value function.', 'start': 941.46, 'duration': 6.023}, {'end': 949.583, 'text': "So that's on the left-hand side.", 'start': 948.362, 'duration': 1.221}, {'end': 959.488, 'text': 'So we try and learn this Q function that goes from states and actions, and then use that to infer a deterministic signal of which action to take,', 'start': 950.003, 'duration': 9.485}, {'end': 962.15, 'text': "given the state that we're currently in, using this argmax function.", 'start': 959.488, 'duration': 2.662}, {'end': 965.647, 'text': "And that's like what we just saw.", 'start': 964.366, 'duration': 1.281}, {'end': 972.049, 'text': "Another alternative approach that we'll discuss later in the class is using what's called as policy learning.", 'start': 966.527, 'duration': 5.522}, {'end': 976.731, 'text': "And here, we don't care about explicitly modeling the Q function.", 'start': 972.729, 'duration': 4.002}, {'end': 983.294, 'text': 'But instead, we want to just have our output of our model be the policy that the agent should take.', 'start': 977.551, 'duration': 5.743}, {'end': 989.796, 'text': "So here, the model that we're creating is not taking as input both the state and the action.", 'start': 983.334, 'duration': 6.462}, {'end': 998.292, 'text': "It's only taking as input the state, and it's predicting a probability distribution, which is pi, over all possible actions.", 'start': 989.836, 'duration': 8.456}], 'summary': 'In deep reinforcement learning, policy functions can be learned by learning the q value function or using policy learning to predict a probability distribution for actions based on the state.', 'duration': 88.5, 'max_score': 909.792, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA909792.jpg'}, {'end': 1039.925, 'src': 'embed', 'start': 999.053, 'weight': 6, 'content': [{'end': 1001.175, 'text': 'Probability distributions sum up to 1.', 'start': 999.053, 'duration': 2.122}, {'end': 1002.216, 'text': 'They have some nice properties.', 'start': 1001.175, 'duration': 1.041}, {'end': 1008.563, 'text': 'And 
then what we can do is we can actually just sample an action from that probability distribution in order to act in that state.', 'start': 1002.296, 'duration': 6.267}, {'end': 1014.724, 'text': 'So like I said, these are two different approaches for reinforcement learning, two main approaches.', 'start': 1010.441, 'duration': 4.283}, {'end': 1016.426, 'text': 'In the first part of the class,', 'start': 1015.525, 'duration': 0.901}, {'end': 1022.651, 'text': "we'll focus on value learning and then we'll come back to policy learning as a more general framework and more powerful framework.", 'start': 1016.426, 'duration': 6.225}, {'end': 1025.973, 'text': "We'll see that actually, this is what AlphaGo uses.", 'start': 1022.671, 'duration': 3.302}, {'end': 1030.958, 'text': "policy learning is what AlphaGo uses, and that's kind of what we'll end on and touch on how that works.", 'start': 1025.973, 'duration': 4.985}, {'end': 1039.925, 'text': "So before we get there, let's keep going digger, let's keep going deeper into the Q function.", 'start': 1033.839, 'duration': 6.086}], 'summary': 'Probability distributions sum to 1, key in reinforcement learning.', 'duration': 40.872, 'max_score': 999.053, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA999053.jpg'}], 'start': 647.879, 'title': 'Q and policy function learning', 'summary': 'Delves into learning q function and policy function in reinforcement learning, emphasizing the process of learning, the role of policy function, and the focus on q value function and policy learning.', 'chapters': [{'end': 861.727, 'start': 647.879, 'title': 'Learning q function in reinforcement learning', 'summary': 'Discusses the concept of q function in reinforcement learning, emphasizing the process of learning the function and the role of the policy function in determining the desired action based on the state.', 'duration': 213.848, 'highlights': ['Learning the Q function is the key challenge in reinforcement learning, as it involves computing the expected total discounted reward for a given state and action.', 'The Q function represents the expected future reward an agent can achieve by executing a specific action in a given state, which is crucial for decision-making in reinforcement learning.', "The policy function, denoted as pi of s, determines the desired action based on the state, playing a vital role in guiding the agent's behavior in the environment."]}, {'end': 1039.925, 'start': 862.227, 'title': 'Learning policy functions in reinforcement learning', 'summary': 'Introduces the concept of learning policy functions in reinforcement learning, focusing on two main approaches: learning the q value function and policy learning, with the former being the initial focus before delving into the more general and powerful framework of the latter.', 'duration': 177.698, 'highlights': ['The ultimate goal of reinforcement learning is to output the optimal action given a state, achieved by learning the Q function to infer the policy function.', 'The policy function is defined as the arg max over all possible actions of the Q value function, where the action with the highest Q value is the one taken at a given state.', 'In deep reinforcement learning, two main approaches to learn policy functions are discussed: learning the Q value function to infer a deterministic signal for action selection, and policy learning, which focuses on creating the policy that the agent should take without explicitly modeling the Q 
function.', 'Policy learning, which involves predicting a probability distribution over all possible actions based on the state and sampling an action from that distribution to act in that state, is highlighted as a more general and powerful framework, used by AlphaGo.', 'The class will initially focus on value learning before delving into policy learning, with the latter being identified as a more general framework and more powerful, as utilized by AlphaGo.']}], 'duration': 392.046, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA647879.jpg', 'highlights': ['Learning the Q function is crucial in reinforcement learning, involving computing expected total discounted reward for a state and action.', 'The Q function represents the expected future reward for executing a specific action in a state, crucial for decision-making.', "The policy function, denoted as pi of s, determines the desired action based on the state, guiding the agent's behavior.", 'The ultimate goal in reinforcement learning is to output the optimal action given a state, achieved by learning the Q function.', 'The policy function is defined as the arg max over all possible actions of the Q value function, guiding action selection.', 'In deep reinforcement learning, two main approaches to learn policy functions are discussed: learning the Q value function and policy learning.', 'Policy learning involves predicting a probability distribution over all possible actions based on the state and sampling an action from that distribution.', 'The class will initially focus on value learning before delving into policy learning, identified as a more general and powerful framework.']}, {'end': 1702.271, 'segs': [{'end': 1115.369, 'src': 'embed', 'start': 1041.161, 'weight': 0, 'content': [{'end': 1043.383, 'text': "So here's an example game that we'll consider.", 'start': 1041.161, 'duration': 2.222}, {'end': 1045.204, 'text': 'This is the Atari Breakout game.', 'start': 1043.643, 'duration': 1.561}, {'end': 1047.806, 'text': "And the way it works is you're this agent.", 'start': 1045.964, 'duration': 1.842}, {'end': 1049.828, 'text': "You're this little paddle on the bottom.", 'start': 1047.946, 'duration': 1.882}, {'end': 1054.972, 'text': 'And you can either choose to move left or right in the world at any given frame.', 'start': 1050.769, 'duration': 4.203}, {'end': 1063.019, 'text': "And there is this ball also in the world that's coming either towards you or away from you.", 'start': 1057.414, 'duration': 5.605}, {'end': 1067.422, 'text': 'Your job as the agent is to move your paddle left and right to hit that ball.', 'start': 1063.259, 'duration': 4.163}, {'end': 1074.092, 'text': 'and reflect it so that you can try to knock off a lot of these blocks on the top part of the screen.', 'start': 1068.328, 'duration': 5.764}, {'end': 1077.594, 'text': 'Every time you hit a block on the top part of the screen, you get a reward.', 'start': 1074.772, 'duration': 2.822}, {'end': 1079.895, 'text': "If you don't hit a block, you don't get a reward.", 'start': 1078.114, 'duration': 1.781}, {'end': 1085.019, 'text': 'And if that ball passes your paddle without you hitting it, you lose the game.', 'start': 1080.456, 'duration': 4.563}, {'end': 1095.686, 'text': 'So your goal is to keep hitting that ball back onto the top of the board and breaking off as many of these colored blocks as possible,', 'start': 1086.62, 'duration': 9.066}, {'end': 1097.567, 'text': 'each time getting a brand 
new reward.', 'start': 1095.686, 'duration': 1.881}, {'end': 1108.084, 'text': 'And the point I want to make here is that understanding Q functions or understanding optimal Q values is actually a really tough problem.', 'start': 1099.278, 'duration': 8.806}, {'end': 1115.369, 'text': 'And if I show you two possible example states and actions that an agent could take.', 'start': 1108.785, 'duration': 6.584}], 'summary': 'Atari breakout game: paddle reflects ball to hit blocks, earn rewards, avoid losing game.', 'duration': 74.208, 'max_score': 1041.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1041161.jpg'}, {'end': 1231.241, 'src': 'embed', 'start': 1198.034, 'weight': 4, 'content': [{'end': 1202.236, 'text': "Even though it's not going directly up and down, it is targeting the center more than the side.", 'start': 1198.034, 'duration': 4.202}, {'end': 1212.586, 'text': "Now I want to show you an alternative policy Now it's B that's explicitly trying to hit the side of the paddle every time.", 'start': 1204.737, 'duration': 7.849}, {'end': 1213.927, 'text': 'No matter where the ball is.', 'start': 1212.787, 'duration': 1.14}, {'end': 1219.631, 'text': "it's trying to move away from the ball and then come back towards it, so it hits the side and just barely hits the ball,", 'start': 1213.927, 'duration': 5.704}, {'end': 1223.154, 'text': 'so it can send it ricocheting off into the corner of the screen.', 'start': 1219.631, 'duration': 3.523}, {'end': 1231.241, 'text': "And what you're going to see is it's going to basically be trying to create these gaps in the corner of the screen, so on both left and right side,", 'start': 1223.874, 'duration': 7.367}], 'summary': 'The alternative policy b aims to hit the side of the paddle to create gaps in the corner of the screen.', 'duration': 33.207, 'max_score': 1198.034, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1198034.jpg'}, {'end': 1293.433, 'src': 'embed', 'start': 1265.112, 'weight': 3, 'content': [{'end': 1269.415, 'text': 'And for me, I would have assumed that the safest action to take was actually A.', 'start': 1265.112, 'duration': 4.303}, {'end': 1276.64, 'text': 'But through reinforcement learning, we can learn more optimal actions than what might be immediately apparent to human operators.', 'start': 1269.415, 'duration': 7.225}, {'end': 1289.929, 'text': "So now let's bring this back to the context of deep learning and find out how we can use deep learning to actually model Q functions and estimate Q functions using training data.", 'start': 1278.939, 'duration': 10.99}, {'end': 1293.433, 'text': 'And we can do this in one of two ways.', 'start': 1291.971, 'duration': 1.462}], 'summary': 'Reinforcement learning improves optimal actions in deep learning models.', 'duration': 28.321, 'max_score': 1265.112, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1265112.jpg'}, {'end': 1346.751, 'src': 'heatmap', 'start': 1291.971, 'weight': 0.72, 'content': [{'end': 1293.433, 'text': 'And we can do this in one of two ways.', 'start': 1291.971, 'duration': 1.462}, {'end': 1298.537, 'text': "So the primary way or the primary model that's used is called a deep Q network.", 'start': 1293.693, 'duration': 4.844}, {'end': 1302.081, 'text': "And this is essentially, like I said, a model that's trying to estimate a Q function.", 'start': 1298.918, 'duration': 3.163}, 
{'end': 1310.95, 'text': "So in this first model that I'm showing here, it takes as input a state and a possible action that you could execute at that state.", 'start': 1304.087, 'duration': 6.863}, {'end': 1313.671, 'text': 'And the output is just the Q value.', 'start': 1311.85, 'duration': 1.821}, {'end': 1315.471, 'text': "It's just a scalar output.", 'start': 1314.051, 'duration': 1.42}, {'end': 1324.695, 'text': 'And the neural network is basically predicting what is the estimated expected total reward that it can obtain given the state and this action.', 'start': 1315.952, 'duration': 8.743}, {'end': 1331.217, 'text': 'Then you want to train this network using mean squared error to produce the right answer given a lot of training data.', 'start': 1325.195, 'duration': 6.022}, {'end': 1332.778, 'text': "At a high level, that's what's going on.", 'start': 1331.497, 'duration': 1.281}, {'end': 1340.566, 'text': 'The problem with this approach is that if we want to use our policy now and we want our agent to act in this world,', 'start': 1334.261, 'duration': 6.305}, {'end': 1346.751, 'text': 'we have to feed through the network a whole bunch of different actions at every time step to find the optimal Q value.', 'start': 1340.566, 'duration': 6.185}], 'summary': 'Deep q network estimates q function to optimize actions in training data.', 'duration': 54.78, 'max_score': 1291.971, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1291971.jpg'}, {'end': 1514.194, 'src': 'heatmap', 'start': 1373.127, 'weight': 0.77, 'content': [{'end': 1373.687, 'text': "That's not great,", 'start': 1373.127, 'duration': 0.56}, {'end': 1382.855, 'text': "because it requires executing this network in a forward pass a total number of times that's equal to the total number of actions that the agent could take at that step.", 'start': 1373.687, 'duration': 9.168}, {'end': 1389.866, 'text': 'Another alternative is slightly re-parameterizing this problem, still learning the Q value.', 'start': 1383.942, 'duration': 5.924}, {'end': 1396.83, 'text': 'but now we input just the state and the network intrinsically will compute the Q value for each of the possible actions.', 'start': 1389.866, 'duration': 6.964}, {'end': 1405.295, 'text': 'And since your action space is fixed in reinforcement learning in a lot of cases your output of the network is also fixed,', 'start': 1397.49, 'duration': 7.805}, {'end': 1414.367, 'text': 'which means that at each time you input a state and the network is basically outputting n numbers, where n is the dimensionality of your action space,', 'start': 1405.295, 'duration': 9.072}, {'end': 1418.171, 'text': 'where each output corresponds to the Q value of executing that action.', 'start': 1414.367, 'duration': 3.804}, {'end': 1425.966, 'text': 'Now, this is great because it means if we want to take an action given a state, we simply feed in our state to the network.', 'start': 1419.141, 'duration': 6.825}, {'end': 1428.388, 'text': 'It gives us back all these Q values.', 'start': 1426.727, 'duration': 1.661}, {'end': 1434.312, 'text': 'We pick the maximum Q value, and we use the action associated to that maximum Q value.', 'start': 1429.088, 'duration': 5.224}, {'end': 1442.238, 'text': 'In both of these cases, however, we can actually train using mean squared error.', 'start': 1437.294, 'duration': 4.944}, {'end': 1446.381, 'text': "It's a fancy version of mean squared error that I'll just quickly walk through.", 'start': 1442.618, 
'duration': 3.763}, {'end': 1450.445, 'text': 'So the right side is the predicted Q value.', 'start': 1447.423, 'duration': 3.022}, {'end': 1452.607, 'text': "That's actually the output of the neural network.", 'start': 1450.685, 'duration': 1.922}, {'end': 1458.191, 'text': 'Just to reiterate, this takes as input the state and the action, and this is what the network predicts.', 'start': 1453.367, 'duration': 4.824}, {'end': 1467.477, 'text': 'You then want to minimize the error of that predicted Q value compared to the true or the target Q value, which, in this case,', 'start': 1459.452, 'duration': 8.025}, {'end': 1468.358, 'text': 'is on the right-hand side.', 'start': 1467.477, 'duration': 0.881}, {'end': 1474.452, 'text': 'So the target Q value is what you actually observed when you took that action.', 'start': 1469.929, 'duration': 4.523}, {'end': 1478.734, 'text': 'So when the agent takes an action, it gets a reward that you can just record.', 'start': 1474.792, 'duration': 3.942}, {'end': 1479.675, 'text': 'You store it in memory.', 'start': 1478.774, 'duration': 0.901}, {'end': 1485.258, 'text': 'And you can also record the discounted reward that it receives in every action after that.', 'start': 1480.595, 'duration': 4.663}, {'end': 1487.685, 'text': "So that's the target return.", 'start': 1486.325, 'duration': 1.36}, {'end': 1491.407, 'text': "That's what you know the agent obtained.", 'start': 1487.906, 'duration': 3.501}, {'end': 1493.967, 'text': "That's the reward that they obtained given that action.", 'start': 1491.467, 'duration': 2.5}, {'end': 1499.729, 'text': 'And you can use that to now have a regression problem over the predicted Q values.', 'start': 1494.688, 'duration': 5.041}, {'end': 1504.211, 'text': "And basically, over time, using back propagation, it's just a normal feedforward network.", 'start': 1500.189, 'duration': 4.022}, {'end': 1506.632, 'text': 'We can train this loss function.', 'start': 1504.711, 'duration': 1.921}, {'end': 1514.194, 'text': 'train this network according to this loss function to make our predicted Q value as close as possible to our desired or target Q values.', 'start': 1506.632, 'duration': 7.562}], 'summary': 'Reinforcement learning involves training a network to compute q values for actions, using mean squared error for training.', 'duration': 141.067, 'max_score': 1373.127, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1373127.jpg'}], 'start': 1041.161, 'title': 'Understanding q functions in atari breakout game and deep q network', 'summary': 'Delves into the challenges and importance of understanding q functions in the context of the atari breakout game, alongside explaining the usage of deep q networks in reinforcement learning, demonstrating flexibility and impact through examples in atari games.', 'chapters': [{'end': 1264.391, 'start': 1041.161, 'title': 'Understanding q functions in atari breakout game', 'summary': "Discusses the atari breakout game, where the agent's goal is to hit the ball to knock off colored blocks, showcasing the challenges of understanding q functions and the importance of selecting optimal state-action pairs.", 'duration': 223.23, 'highlights': ["The agent's goal is to hit the ball to knock off colored blocks, with successful hits resulting in rewards.", 'Understanding Q functions and optimal Q values is a challenging problem in the context of the Atari Breakout game.', 'Comparing two possible state-action pairs, it is highlighted 
that in a slightly stochastic setting, an alternative policy (B) of hitting the side of the paddle every time can lead to faster success by creating gaps for the ball to get stuck and knock off multiple blocks with one action.', 'The alternative policy (B) of hitting the side of the paddle every time leads to faster success in knocking off colored blocks compared to the limited approach of policy A.', 'The chapter emphasizes the importance of selecting optimal state-action pairs by showcasing the difference in effectiveness between the two policies (A and B) in the context of the Atari Breakout game.']}, {'end': 1702.271, 'start': 1265.112, 'title': 'Deep q network and reinforcement learning', 'summary': 'Explains how deep q networks are used to estimate q functions in reinforcement learning, demonstrating the flexibility and challenges of the algorithm through examples in atari games, highlighting the impact of observable world and sparse rewards.', 'duration': 437.159, 'highlights': ['Deep Q network can estimate Q functions and optimize actions beyond human intuition', 'Training deep Q network using mean squared error to compute Q values', 'Flexibility and performance of deep Q networks in Atari games']}], 'duration': 661.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1041161.jpg', 'highlights': ["The agent's goal is to hit the ball to knock off colored blocks, with successful hits resulting in rewards.", 'Understanding Q functions and optimal Q values is a challenging problem in the context of the Atari Breakout game.', 'The chapter emphasizes the importance of selecting optimal state-action pairs by showcasing the difference in effectiveness between the two policies (A and B) in the context of the Atari Breakout game.', 'Deep Q network can estimate Q functions and optimize actions beyond human intuition', 'The alternative policy (B) of hitting the side of the paddle every time leads to faster success in knocking off colored blocks compared to the limited approach of policy A.']}, {'end': 2182.736, 'segs': [{'end': 1727.531, 'src': 'embed', 'start': 1702.932, 'weight': 1, 'content': [{'end': 1709.174, 'text': 'We have an optimal solution at any given state, given what we see, of where to move the paddle.', 'start': 1702.932, 'duration': 6.242}, {'end': 1712.489, 'text': 'That kind of summarizes our topic of Q learning.', 'start': 1709.807, 'duration': 2.682}, {'end': 1716.953, 'text': 'And I want to end on some of the downsides of Q learning.', 'start': 1713.51, 'duration': 3.443}, {'end': 1723.699, 'text': 'It surpasses human level performance on a lot of simpler tasks, but it also has trouble dealing with complexity.', 'start': 1717.234, 'duration': 6.465}, {'end': 1727.531, 'text': "It also can't handle action spaces which are continuous.", 'start': 1724.608, 'duration': 2.923}], 'summary': 'Q learning surpasses human performance on simpler tasks but struggles with complexity and continuous action spaces.', 'duration': 24.599, 'max_score': 1702.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1702932.jpg'}, {'end': 1791.218, 'src': 'embed', 'start': 1750.443, 'weight': 2, 'content': [{'end': 1752.864, 'text': 'There are tricks that you can get around this with,', 'start': 1750.443, 'duration': 2.421}, {'end': 1759.248, 'text': 'because you can just discretize your action space into very small bins and try to learn the Q value for each bin.', 'start': 1752.864, 
'duration': 6.384}, {'end': 1765.612, 'text': 'But of course, the question is, well, how small do you want to make this? The smaller you make these bins, the harder learning becomes.', 'start': 1760.009, 'duration': 5.603}, {'end': 1773.918, 'text': 'And just at its core, the vanilla Q-learning algorithm that I presented here is not well suited for continuous action spaces.', 'start': 1766.193, 'duration': 7.725}, {'end': 1785.097, 'text': "And on another level, they're not flexible to handle stochastic policies because we're basically sampling from this argmax function.", 'start': 1775.655, 'duration': 9.442}, {'end': 1791.218, 'text': 'We have our Q function and we just take the argmax to compute the best action that we can execute at any given time.', 'start': 1785.877, 'duration': 5.341}], 'summary': 'Q-learning struggles with continuous action spaces and stochastic policies, as discretizing action space into small bins makes learning harder.', 'duration': 40.775, 'max_score': 1750.443, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1750443.jpg'}, {'end': 1898.669, 'src': 'heatmap', 'start': 1817.417, 'weight': 0, 'content': [{'end': 1820.638, 'text': 'bending the outputs, but this is actually a really big problem in practice.', 'start': 1817.417, 'duration': 3.221}, {'end': 1823.059, 'text': 'And to overcome this,', 'start': 1822.099, 'duration': 0.96}, {'end': 1830.602, 'text': "we're going to consider a new class of reinforcement learning models called policy gradient models for training these algorithms.", 'start': 1823.059, 'duration': 7.543}, {'end': 1839.104, 'text': "So policy gradients is a slightly different twist on Q-learning, but at the foundation, it's actually very different.", 'start': 1833.12, 'duration': 5.984}, {'end': 1842.127, 'text': "So let's recall Q-learning.", 'start': 1840.425, 'duration': 1.702}, {'end': 1848.932, 'text': 'So the deep Q network takes as input the states and predicts a Q value for each possible action on the right-hand side.', 'start': 1842.187, 'duration': 6.745}, {'end': 1853.649, 'text': "Now, in policy gradients, we're going to do something slightly different.", 'start': 1850.768, 'duration': 2.881}, {'end': 1860.751, 'text': "We're going to take as input the state, but now we're going to output a probability distribution over all possible actions.", 'start': 1854.369, 'duration': 6.382}, {'end': 1865.033, 'text': "Again, we're still considering the case of discrete action spaces,", 'start': 1861.632, 'duration': 3.401}, {'end': 1871.495, 'text': "but we'll see how we can easily extend this in ways that we couldn't do with Q-learning to continuous action spaces as well.", 'start': 1865.033, 'duration': 6.462}, {'end': 1874.676, 'text': "Let's stick with discrete action spaces for now, just for simplicity.", 'start': 1872.035, 'duration': 2.641}, {'end': 1888.058, 'text': 'So here, pi of alpha, sorry, pi of ai for all i is just the probability that you should execute action i given the state that you see as an input.', 'start': 1875.957, 'duration': 12.101}, {'end': 1895.166, 'text': 'And since this is a probability distribution, it means that all of these outputs have to add up to 1.', 'start': 1890.2, 'duration': 4.966}, {'end': 1898.669, 'text': 'We can do this using a softmax activation function in deep neural networks.', 'start': 1895.166, 'duration': 3.503}], 'summary': 'Introducing policy gradient models for reinforcement learning with focus on outputting a probability 
distribution over possible actions.', 'duration': 81.252, 'max_score': 1817.417, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1817417.jpg'}, {'end': 1995.168, 'src': 'heatmap', 'start': 1938.552, 'weight': 6, 'content': [{'end': 1941.115, 'text': 'What is the probability that that action is A1?', 'start': 1938.552, 'duration': 2.563}, {'end': 1943.058, 'text': 'What is the probability that that action is A2?', 'start': 1941.195, 'duration': 1.863}, {'end': 1944.76, 'text': 'And just execute the correct action.', 'start': 1943.398, 'duration': 1.362}, {'end': 1949.221, 'text': "So in some sense, it's skipping a step from Q learning.", 'start': 1946.499, 'duration': 2.722}, {'end': 1954.304, 'text': 'In Q learning, you learn the Q function, use the Q function to infer your policy.', 'start': 1949.281, 'duration': 5.023}, {'end': 1957.225, 'text': 'In policy learning, you just learn your policy directly.', 'start': 1954.844, 'duration': 2.381}, {'end': 1971.936, 'text': 'So how do we train policy gradient learning? Essentially, the way it works is we run a policy for a long time.', 'start': 1963.709, 'duration': 8.227}, {'end': 1977.319, 'text': 'Before we even start training, we run multiple episodes or multiple rollouts of that policy.', 'start': 1972.236, 'duration': 5.083}, {'end': 1983.903, 'text': 'A rollout is basically from start to end of a training session.', 'start': 1978.059, 'duration': 5.844}, {'end': 1995.168, 'text': 'So we can define a rollout as basically from time 0 to time t, where t is the end of some definition episode in that game.', 'start': 1983.943, 'duration': 11.225}], 'summary': 'Policy gradient learning trains policy directly, running multiple episodes before training to define rollouts.', 'duration': 31.459, 'max_score': 1938.552, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1938552.jpg'}, {'end': 2090.263, 'src': 'heatmap', 'start': 2044.929, 'weight': 4, 'content': [{'end': 2049.172, 'text': 'So we do a rollout for an episode given our policy.', 'start': 2044.929, 'duration': 4.243}, {'end': 2055.438, 'text': 'So policy is defined by the neural network, parameterized by the parameters theta.', 'start': 2049.373, 'duration': 6.065}, {'end': 2059.681, 'text': 'We sample a bunch of episodes from that policy.', 'start': 2056.978, 'duration': 2.703}, {'end': 2064.585, 'text': 'And each episode is basically just a collection of state, action, and reward pairs.', 'start': 2060.482, 'duration': 4.103}, {'end': 2066.347, 'text': 'So we record all of those into memory.', 'start': 2064.606, 'duration': 1.741}, {'end': 2072.572, 'text': "And then when we're ready to begin training, all we do is compute this gradient right here.", 'start': 2067.487, 'duration': 5.085}, {'end': 2083.92, 'text': 'And that gradient is the log likelihood of seeing a particular action given the state multiplied by the expected reward of that action.', 'start': 2073.996, 'duration': 9.924}, {'end': 2086.46, 'text': 'Sorry, excuse me.', 'start': 2085.841, 'duration': 0.619}, {'end': 2090.263, 'text': "That's the expected discounted reward of that action at that time.", 'start': 2086.561, 'duration': 3.702}], 'summary': 'Neural network policy rollout, computes gradient using state, action, reward pairs.', 'duration': 45.334, 'max_score': 2044.929, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA2044929.jpg'}, {'end': 
2141.307, 'src': 'embed', 'start': 2115.504, 'weight': 5, 'content': [{'end': 2120.107, 'text': "And that's just defined by we do our rollout on the top, and we won the game at the end.", 'start': 2115.504, 'duration': 4.603}, {'end': 2125.23, 'text': 'So all of these policies, all of these actions should be enforced or reinforced.', 'start': 2120.147, 'duration': 5.083}, {'end': 2132.564, 'text': 'And in this case, on the second line, we did another episode, which resulted in a loss.', 'start': 2126.152, 'duration': 6.412}, {'end': 2136.683, 'text': 'All of these policies should be discouraged in the future.', 'start': 2133.641, 'duration': 3.042}, {'end': 2141.307, 'text': 'So when things result in positive rewards, we multiply.', 'start': 2137.404, 'duration': 3.903}], 'summary': 'Rollout led to winning game, but second episode resulted in loss. positive rewards should be reinforced, negative discouraged.', 'duration': 25.803, 'max_score': 2115.504, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA2115504.jpg'}], 'start': 1702.932, 'title': 'Reinforcement learning algorithms', 'summary': 'Discusses the limitations of q-learning and introduces policy gradient models as a solution, which directly learns the policy by outputting a probability distribution over all possible actions. it also explains the policy gradient algorithm, which aims to increase the probability of actions leading to high rewards and decrease the probability of actions leading to low rewards by computing the log likelihood of actions given the state multiplied by the expected discounted reward, guiding the network to update its parameters accordingly.', 'chapters': [{'end': 2017.837, 'start': 1702.932, 'title': 'Q-learning and policy gradients', 'summary': 'Discusses the limitations of q-learning, particularly in handling continuous action spaces and stochastic policies, and introduces policy gradient models as a solution, which directly learns the policy by outputting a probability distribution over all possible actions.', 'duration': 314.905, 'highlights': ['Q-learning surpasses human level performance on simpler tasks but struggles with complexity and continuous action spaces', 'Discretizing action space in Q-learning leads to increased learning difficulty', 'Q-learning is inflexible in handling stochastic policies due to its reliance on argmax function', 'Introduction of policy gradients as a solution for training reinforcement learning algorithms', 'Training process of policy gradient learning involves running multiple episodes or rollouts of the policy']}, {'end': 2182.736, 'start': 2019.209, 'title': 'Policy gradient algorithm', 'summary': 'Explains the policy gradient algorithm, which aims to increase the probability of actions leading to high rewards and decrease the probability of actions leading to low rewards by computing the log likelihood of actions given the state multiplied by the expected discounted reward, guiding the network to update its parameters accordingly.', 'duration': 163.527, 'highlights': ['The algorithm aims to increase the probability of actions leading to high rewards and decrease the probability of actions leading to low rewards by computing the log likelihood of actions given the state multiplied by the expected discounted reward.', 'Positive rewards result in reinforcing the actions, while negative or zero rewards result in discouraging the actions in the future.', "The algorithm utilizes the neural network's policy, 
parameterized by the parameters theta, to sample episodes and compute the gradient to update the network."]}], 'duration': 479.804, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA1702932.jpg', 'highlights': ['Introduction of policy gradients as a solution for training reinforcement learning algorithms', 'Q-learning surpasses human level performance on simpler tasks but struggles with complexity and continuous action spaces', 'Discretizing action space in Q-learning leads to increased learning difficulty', 'Q-learning is inflexible in handling stochastic policies due to its reliance on argmax function', 'The algorithm aims to increase the probability of actions leading to high rewards and decrease the probability of actions leading to low rewards by computing the log likelihood of actions given the state multiplied by the expected discounted reward', 'Positive rewards result in reinforcing the actions, while negative or zero rewards result in discouraging the actions in the future', 'Training process of policy gradient learning involves running multiple episodes or rollouts of the policy', "The algorithm utilizes the neural network's policy, parameterized by the parameters theta, to sample episodes and compute the gradient to update the network"]}, {'end': 2690.278, 'segs': [{'end': 2235.278, 'src': 'embed', 'start': 2208.433, 'weight': 1, 'content': [{'end': 2216.054, 'text': "I think at the core, for those of you who aren't familiar with the game of Go, it's an incredibly complex game with a massive state space.", 'start': 2208.433, 'duration': 7.621}, {'end': 2218.835, 'text': 'There are more states than there are atoms in the universe.', 'start': 2216.515, 'duration': 2.32}, {'end': 2222.496, 'text': "And that's in the full version of the game where it's a 19 by 19 game.", 'start': 2219.315, 'duration': 3.181}, {'end': 2227.837, 'text': "And the idea of Go is that it's a two-player game, black and white.", 'start': 2224.116, 'duration': 3.721}, {'end': 2235.278, 'text': 'And the motivation or the goal is that you want to get more board territory than your opponent.', 'start': 2228.357, 'duration': 6.921}], 'summary': 'Go is a complex game with more states than atoms, played on a 19x19 board with the goal of gaining more territory than the opponent.', 'duration': 26.845, 'max_score': 2208.433, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA2208433.jpg'}, {'end': 2388.944, 'src': 'embed', 'start': 2366.007, 'weight': 3, 'content': [{'end': 2376.334, 'text': "So you're going to basically make two copies of this network and play one network against itself and use now reinforcement learning to achieve superhuman performance.", 'start': 2366.007, 'duration': 10.327}, {'end': 2380.717, 'text': "So now, since it's playing against itself and it's not receiving human input,", 'start': 2376.354, 'duration': 4.363}, {'end': 2385.02, 'text': "it's able to discover new possible actions that the human may not have thought of.", 'start': 2380.717, 'duration': 4.303}, {'end': 2388.944, 'text': 'that may result in even higher reward than before.', 'start': 2386.001, 'duration': 2.943}], 'summary': 'Using reinforcement learning, two copies of a network play each other to achieve superhuman performance.', 'duration': 22.937, 'max_score': 2366.007, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA2366007.jpg'}, {'end': 2560.64, 'src': 
'embed', 'start': 2533.538, 'weight': 0, 'content': [{'end': 2544.789, 'text': 'So they show it being able to outperform top model-based approaches and top human players in games of chess, Shogi and AlphaGo.', 'start': 2533.538, 'duration': 11.251}, {'end': 2550.133, 'text': "So it's actually able to surpass the performance of AlphaGo in just 40 hours.", 'start': 2545.489, 'duration': 4.644}, {'end': 2556.878, 'text': "And now this network, AlphaZero, it's called AlphaZero because it requires no prior knowledge of human players.", 'start': 2550.453, 'duration': 6.425}, {'end': 2560.64, 'text': "It's learned entirely using reinforcement learning and self-play.", 'start': 2557.378, 'duration': 3.262}], 'summary': 'Alphazero surpasses alphago in 40 hours, no prior human knowledge required.', 'duration': 27.102, 'max_score': 2533.538, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA2533538.jpg'}, {'end': 2680.799, 'src': 'embed', 'start': 2635.544, 'weight': 2, 'content': [{'end': 2643.866, 'text': 'And now AlphaZero is being used as almost a learning mechanism for top human performers on these new policies that can improve humans even more.', 'start': 2635.544, 'duration': 8.322}, {'end': 2652.508, 'text': 'I think this is a really powerful technique because it shows reinforcement learning being used not just to pit humans against machines,', 'start': 2645.846, 'duration': 6.662}, {'end': 2660.573, 'text': 'but also as a teaching mechanism for humans to discover new ways to execute optimal policies in some of these games.', 'start': 2653.41, 'duration': 7.163}, {'end': 2662.173, 'text': 'And even going beyond games.', 'start': 2660.973, 'duration': 1.2}, {'end': 2663.354, 'text': 'the ultimate goal, of course,', 'start': 2662.173, 'duration': 1.181}, {'end': 2672.797, 'text': 'is to create reinforcement learning agents that can act in the real world robotic reinforcement learning agents not just in simulated board games,', 'start': 2663.354, 'duration': 9.443}, {'end': 2676.899, 'text': 'but in the real world with humans, and help us learn in this world as well.', 'start': 2672.797, 'duration': 4.102}, {'end': 2680.799, 'text': "So that's all for reinforcement learning.", 'start': 2678.314, 'duration': 2.485}], 'summary': 'Alphazero used to teach humans optimal policies in games and beyond, aiming for real-world robotic reinforcement learning agents.', 'duration': 45.255, 'max_score': 2635.544, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA2635544.jpg'}], 'start': 2182.776, 'title': 'Advancements in alphago', 'summary': "Covers the use of policy gradient learning, alphago algorithm, and the subsequent release of alphazero, which outperformed model-based approaches and human players in chess, shogi, and go, surpassing alphago's performance in just 40 hours.", 'chapters': [{'end': 2478.28, 'start': 2182.776, 'title': 'Policy gradient learning in alphago', 'summary': 'Explains the use of policy gradient learning and the alphago algorithm to improve performance in the complex game of go, with reference to leveraging supervised learning, self-play, and value function networks.', 'duration': 295.504, 'highlights': ['The game of Go has a massive state space with more states than there are atoms in the universe in the full version of the game, and the goal is to get more board territory than the opponent.', "AlphaGo algorithm initially initializes the network by training it on human 
experts' gameplay data through supervised learning to imitate human gameplay.", 'The algorithm progresses to using self-play and reinforcement learning to surpass human performance by discovering new possible actions resulting in higher rewards.', 'The use of a value network in AlphaGo provides an understanding of desirable board states and informs the focus on specific game states to improve decision-making.']}, {'end': 2690.278, 'start': 2478.28, 'title': 'Alphazero: advancements in reinforcement learning', 'summary': 'Discusses the groundbreaking development of alphago in 2016, which defeated the top human player at go, and the subsequent release of alphazero, a model that outperformed top model-based approaches and human players in chess, shogi, and go, surpassing the performance of alphago in just 40 hours.', 'duration': 211.998, 'highlights': ['AlphaZero outperformed top model-based approaches and human players in chess, Shogi, and Go, surpassing the performance of AlphaGo in just 40 hours, and requires no prior knowledge of human players.', "The evolution of rewards over time in AlphaZero's training process shows that it starts to discover new, more advanced ways of starting the game, creating new policies that humans never considered.", 'AlphaZero is being used as a learning mechanism for top human performers to discover new policies, showing reinforcement learning as a teaching mechanism for humans to execute optimal policies in games and potentially in the real world.', 'The ultimate goal is to create reinforcement learning agents that can act in the real world, not just in simulated board games, but with humans, and help us learn in the real world as well.']}], 'duration': 507.502, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i6Mi2_QM3rA/pics/i6Mi2_QM3rA2182776.jpg', 'highlights': ['AlphaZero outperformed top model-based approaches and human players in chess, Shogi, and Go, surpassing the performance of AlphaGo in just 40 hours, and requires no prior knowledge of human players.', 'The game of Go has a massive state space with more states than there are atoms in the universe in the full version of the game, and the goal is to get more board territory than the opponent.', 'The ultimate goal is to create reinforcement learning agents that can act in the real world, not just in simulated board games, but with humans, and help us learn in the real world as well.', 'The algorithm progresses to using self-play and reinforcement learning to surpass human performance by discovering new possible actions resulting in higher rewards.', "The evolution of rewards over time in AlphaZero's training process shows that it starts to discover new, more advanced ways of starting the game, creating new policies that humans never considered."]}], 'highlights': ['AlphaZero outperformed top model-based approaches and human players in chess, Shogi, and Go, surpassing the performance of AlphaGo in just 40 hours, and requires no prior knowledge of human players.', "AlphaGo's victory over the top Go player demonstrates its ability to surpass human intelligence and its impactful influence on the world.", 'The ultimate goal is to create reinforcement learning agents that can act in the real world, not just in simulated board games, but with humans, and help us learn in the real world as well.', 'Reinforcement learning combines deep learning and reinforcement learning disciplines to teach agents to act optimally in the world.', 'Learning the Q function is crucial in 
reinforcement learning, involving computing expected total discounted reward for a state and action.', 'The Q function represents the expected future reward for executing a specific action in a state, crucial for decision-making.', "The policy function, denoted as pi of s, determines the desired action based on the state, guiding the agent's behavior.", "The agent's goal is to hit the ball to knock off colored blocks, with successful hits resulting in rewards.", 'Introduction of policy gradients as a solution for training reinforcement learning algorithms', 'Understanding Q functions and optimal Q values is a challenging problem in the context of the Atari Breakout game.']}
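
code sketches
The detail above walks through several constructions from the lecture. The short Python sketches below illustrate them; they are written for this summary rather than taken from the course materials, and every name, constant, and toy value in them is an assumption.

The lecture defines the total future reward from time t as the sum of the rewards collected from t onward, and the Q function as the expected total discounted reward for a given state and action. A minimal sketch of computing that discounted return (gamma and the example rewards are made up):

# Discounted return: R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.99):
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# Example: rewards observed from time t until the end of an episode.
print(discounted_return([1.0, 0.0, 1.0, 1.0], gamma=0.9))  # 1 + 0 + 0.81 + 0.729 = 2.539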
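
The detail also notes that once a Q function is learned, the policy can be inferred as the argmax over all possible actions, pi(s) = argmax_a Q(s, a). A sketch of that greedy rule over a hypothetical vector of Q-values (the values and action labels are made up):

import numpy as np

# Hypothetical Q-values for one Breakout state, one entry per action:
# [move left, stay, move right].
q_values = np.array([0.2, 0.5, 1.3])

def greedy_policy(q_values):
    # pi(s) = argmax_a Q(s, a): pick the action with the highest predicted return.
    return int(np.argmax(q_values))

print(greedy_policy(q_values))  # 2, i.e. "move right"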
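
For the deep Q network, the summary describes a network that takes the state and outputs one Q value per action, trained with mean squared error between the predicted Q value of the action taken and a target return recorded from experience (the reward observed for that action plus the discounted rewards that followed). A sketch of that per-transition loss, assuming a generic q_network(state) callable that returns a vector of Q-values (the fake network and numbers are placeholders):

import numpy as np

def q_loss(q_network, state, action, target_return):
    # Mean squared error between the predicted Q(s, a) for the action that was
    # taken and the target return actually observed after taking it.
    predicted = q_network(state)[action]
    return (predicted - target_return) ** 2

# Toy check with a fake network that returns fixed Q-values for any state;
# suppose action 1 was taken and the recorded discounted return was 1.0.
fake_q_network = lambda state: np.array([0.1, 0.4, 0.3])
print(q_loss(fake_q_network, state=None, action=1, target_return=1.0))  # (0.4 - 1.0)**2 = 0.36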
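
Among the listed downsides of Q-learning, the summary mentions that continuous action spaces can only be handled by discretizing the actions into small bins and learning a Q value per bin, which becomes harder as the bins shrink. A sketch of that workaround for a one-dimensional action such as a steering angle (the range and bin count are arbitrary):

import numpy as np

# Discretize a continuous 1-D action in [-1, 1] into a fixed number of bins;
# a Q network would then output one Q value per bin.
n_bins = 11
bin_centers = np.linspace(-1.0, 1.0, n_bins)

def action_to_bin(action):
    # Index of the bin whose center is closest to the continuous action.
    return int(np.argmin(np.abs(bin_centers - action)))

def bin_to_action(index):
    return float(bin_centers[index])

print(action_to_bin(0.13), bin_to_action(action_to_bin(0.13)))  # 6 0.2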
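
Finally, for policy gradients the summary describes a network that outputs a softmax probability distribution over actions, samples an action from it, and after each rollout scales the gradient of the log-likelihood of every action taken by the discounted return that followed it, so actions that led to wins are reinforced and actions that led to losses are discouraged. A minimal REINFORCE-style sketch with a linear softmax policy (the architecture, learning rate, and toy episode are assumptions, not the lecture's actual model):

import numpy as np

rng = np.random.default_rng(0)
n_actions, state_dim = 3, 4
W = 0.01 * rng.standard_normal((n_actions, state_dim))  # toy linear policy: logits = W @ state

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def sample_action(W, state):
    # pi(a | s): a probability distribution over actions, sampled to act in that state.
    probs = softmax(W @ state)
    return int(rng.choice(n_actions, p=probs))

def reinforce_update(W, episode, gamma=0.99, lr=0.01):
    # episode: list of (state, action, reward) tuples recorded during one rollout.
    rewards = [r for _, _, r in episode]
    for t, (state, action, _) in enumerate(episode):
        # Discounted return from time t onward.
        G = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
        probs = softmax(W @ state)
        # Gradient of log pi(a_t | s_t) with respect to W for a linear softmax policy.
        grad_logits = -probs
        grad_logits[action] += 1.0
        # Gradient ascent: reinforce actions in proportion to the return that followed them.
        W = W + lr * G * np.outer(grad_logits, state)
    return W

# One made-up two-step episode ending in a win (+1 reward on the final step).
state_0, state_1 = rng.standard_normal(state_dim), rng.standard_normal(state_dim)
episode = [(state_0, sample_action(W, state_0), 0.0),
           (state_1, sample_action(W, state_1), 1.0)]
W = reinforce_update(W, episode)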