title

MIT 6.S191 (2021): Reinforcement Learning

description

MIT Introduction to Deep Learning 6.S191: Lecture 5
Deep Reinforcement Learning
Lecturer: Alexander Amini
January 2021
For all lectures, slides, and lab materials: http://introtodeeplearning.com
Lecture Outline
0:00 - Introduction
3:17 - Classes of learning problems
6:19 - Definitions
12:33 - The Q function
16:14 - Deeper into the Q function
20:49 - Deep Q Networks
26:28 - Atari results and limitations
29:53 - Policy learning algorithms
33:11 - Discrete vs continuous actions
37:22 - Training policy gradients
44:50 - RL in real life
46:02 - VISTA simulator
47:44 - AlphaGo and AlphaZero and MuZero
55:22 - Summary
Subscribe to stay up to date with new deep learning lectures at MIT, or follow us @MITDeepLearning on Twitter and Instagram to stay fully-connected!!
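The outline above covers Q-learning (12:33–26:28): the lecture defines a target return r + γ·max_a′ Q(s′, a′) and trains a deep Q-network with a mean-squared-error "Q loss" against its predicted Q(s, a). As a rough illustrative sketch of that idea — the function names and numbers below are hypothetical, not code from the course materials:

```python
# Minimal sketch of the Q-learning target and Q loss described in the
# lecture; names and values here are illustrative assumptions.

def q_target(reward, next_q_values, gamma=0.99, terminal=False):
    """Bellman target for one transition: r + gamma * max_a' Q(s', a')."""
    if terminal:
        return reward  # no future reward after a terminal state
    return reward + gamma * max(next_q_values)

def q_loss(predicted_q, target_q):
    """Squared error between the network's prediction and the target."""
    return (predicted_q - target_q) ** 2

# Example: reward 1.0 now; the network's Q estimates for the three
# discrete actions available in the next state are [0.5, 2.0, 1.0].
target = q_target(1.0, [0.5, 2.0, 1.0], gamma=0.9)  # 1.0 + 0.9 * 2.0 = 2.8
loss = q_loss(predicted_q=2.5, target_q=target)     # (2.5 - 2.8)**2 ≈ 0.09
```

In practice `next_q_values` would come from a network that maps a state to a vector of Q values, one per discrete action, as discussed in the Deep Q Networks segment.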

detail

{'title': 'MIT 6.S191 (2021): Reinforcement Learning', 'heatmap': [{'end': 1479.731, 'start': 1405.071, 'weight': 1}], 'summary': 'Covers reinforcement learning and deep learning fusion, its applications in starcraft, strategies, q-learning limitations, handling continuous action spaces, policy gradient training, and deep reinforcement learning applications, including mastering go and game optimization.', 'chapters': [{'end': 143.107, 'segs': [{'end': 74.006, 'src': 'embed', 'start': 50.586, 'weight': 0, 'content': [{'end': 60.873, 'text': "the way we've seen it has been really confined to fixed data sets, the way we kind of either collect or can obtain online, for example.", 'start': 50.586, 'duration': 10.287}, {'end': 70.402, 'text': 'In reinforcement learning, though, deep learning is placed in some environment and is actually able to explore and interact with that environment.', 'start': 61.753, 'duration': 8.649}, {'end': 74.006, 'text': "And it's able to learn how to best accomplish its goal.", 'start': 70.903, 'duration': 3.103}], 'summary': 'Reinforcement learning allows deep learning to explore and interact in an environment to accomplish goals.', 'duration': 23.42, 'max_score': 50.586, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ50586.jpg'}, {'end': 117.625, 'src': 'embed', 'start': 96.914, 'weight': 2, 'content': [{'end': 106.257, 'text': "And it's this really connection between the real world and deep learning, the virtual world, that makes this particularly exciting to me.", 'start': 96.914, 'duration': 9.343}, {'end': 112.219, 'text': "And I hope this video that I'm going to show you next really conveys that as well.", 'start': 106.777, 'duration': 5.442}, {'end': 117.625, 'text': 'StarCraft has imperfect information and is played in real time.', 'start': 114.383, 'duration': 3.242}], 'summary': 'Exciting connection between real world and deep learning in starcraft, a game played in real 
time.', 'duration': 20.711, 'max_score': 96.914, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ96914.jpg'}], 'start': 4.242, 'title': 'Reinforcement learning and deep learning', 'summary': 'Explores the fusion of reinforcement learning and deep learning to produce high-performing agents, with instances in robotics, autonomous vehicles, and strategic decision-making, demonstrated by achievements in starcraft gaming.', 'chapters': [{'end': 143.107, 'start': 4.242, 'title': 'Reinforcement learning and deep learning', 'summary': 'Discusses the combination of reinforcement learning and deep learning to create powerful agents capable of achieving superhuman performance, with applications in robotics, self-driving cars, and strategic planning, exemplified by starcraft gameplay.', 'duration': 138.865, 'highlights': ['Reinforcement learning and deep learning are combined to create agents capable of achieving superhuman performance, impacting fields like robotics, self-driving cars, and strategic planning, as exemplified by StarCraft gameplay.', 'Reinforcement learning allows deep learning to interact and explore environments, learning to achieve goals without human supervision, making it extremely powerful and flexible.', 'StarCraft gameplay exemplifies the need for long-term planning and the ability to choose actions from millions of possibilities, showcasing the impact of reinforcement learning and deep learning in strategic planning.', 'The combination of reinforcement learning and deep learning revolutionizes the world of gameplay and strategic planning, with applications in robotics, self-driving cars, and virtual environments.', 'The chapter highlights the connection between the real world and deep learning, particularly exciting due to its impact in fields like robotics, self-driving cars, and strategic planning.']}], 'duration': 138.865, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ4242.jpg', 'highlights': ['Reinforcement learning and deep learning create high-performing agents impacting robotics, self-driving cars, and strategic planning.', 'Reinforcement learning allows deep learning to learn and achieve goals without human supervision, making it powerful and flexible.', 'StarCraft gameplay showcases the impact of reinforcement learning and deep learning in strategic planning.', 'The combination of reinforcement learning and deep learning revolutionizes gameplay and strategic planning.', 'The connection between the real world and deep learning is particularly exciting due to its impact in robotics, self-driving cars, and strategic planning.']}, {'end': 1209.094, 'segs': [{'end': 208.28, 'src': 'embed', 'start': 175.748, 'weight': 0, 'content': [{'end': 184.409, 'text': 'So, in fact, this is an example of how deep learning was used to compete against humans, professionally trained game players,', 'start': 175.748, 'duration': 8.661}, {'end': 187.89, 'text': 'and was actually trained to not only compete against them,', 'start': 184.409, 'duration': 3.481}, {'end': 195.732, 'text': 'but it was able to achieve remarkably superhuman performance beating this professional Starcraft player five games to zero.', 'start': 187.89, 'duration': 7.842}, {'end': 208.28, 'text': "So let's start by taking a step back and really seeing how reinforcement learning fits with respect to all of the other types of learning problems that we have seen so far in this class.", 'start': 197.352, 'duration': 10.928}], 'summary': 'Deep learning achieved superhuman performance, beating a professional starcraft player 5-0.', 'duration': 32.532, 'max_score': 175.748, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ175748.jpg'}, {'end': 343.379, 'src': 'embed', 'start': 316.755, 'weight': 1, 'content': [{'end': 321.76, 'text':
'Now the goal of RL is very different than the goal of supervised learning and the goal of unsupervised learning.', 'start': 316.755, 'duration': 5.005}, {'end': 329.869, 'text': 'The goal of RL is to maximize the reward or the future reward of that agent in that environment over many time steps.', 'start': 322.241, 'duration': 7.628}, {'end': 337.695, 'text': 'So again, going back to the Apple example, what the analog would be would be that the agent should learn that it should eat this thing,', 'start': 330.409, 'duration': 7.286}, {'end': 340.357, 'text': 'because it knows that it will keep you alive.', 'start': 337.695, 'duration': 2.662}, {'end': 343.379, 'text': 'it will make you healthier and you need food to survive.', 'start': 340.357, 'duration': 3.022}], 'summary': "Reinforcement learning aims to maximize agent's reward in an environment over many time steps.", 'duration': 26.624, 'max_score': 316.755, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ316755.jpg'}, {'end': 800.394, 'src': 'embed', 'start': 772.123, 'weight': 2, 'content': [{'end': 780.667, 'text': 'So remember, the total reward, R of t, measures the discounted sum of rewards obtained since time t.', 'start': 772.123, 'duration': 8.544}, {'end': 784.856, 'text': 'So now the Q function, is very related to that.', 'start': 780.667, 'duration': 4.189}, {'end': 793.286, 'text': 'The Q function is a function that takes as input the current state that the agent is in and the action that the agent takes in that state.', 'start': 785.317, 'duration': 7.969}, {'end': 800.394, 'text': 'And then it returns the expected total future reward that the agent can receive after that point.', 'start': 793.786, 'duration': 6.608}], 'summary': 'Q function measures expected future reward based on state and action.', 'duration': 28.271, 'max_score': 772.123, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ772123.jpg'}, {'end': 1174.9, 'src': 'embed', 'start': 1142.751, 'weight': 3, 'content': [{'end': 1143.632, 'text': 'It actually does pretty well.', 'start': 1142.751, 'duration': 0.881}, {'end': 1147.635, 'text': 'It breaks out a lot of the colored blocks in this game.', 'start': 1143.652, 'duration': 3.983}, {'end': 1150.877, 'text': "But let's take a look also at option B.", 'start': 1148.275, 'duration': 2.602}, {'end': 1152.838, 'text': 'Option B actually does something really interesting.', 'start': 1150.877, 'duration': 1.961}, {'end': 1156.621, 'text': 'It really likes to hit the ball at the corner of the paddle.', 'start': 1152.938, 'duration': 3.683}, {'end': 1164.647, 'text': 'It does this just so the ball can ricochet off at an extreme angle and break off colors in the corner of the screen.', 'start': 1157.562, 'duration': 7.085}, {'end': 1167.812, 'text': 'Now, this is actually.', 'start': 1165.549, 'duration': 2.263}, {'end': 1174.9, 'text': 'it does this to the extreme actually, because even in the case where the ball is coming right towards it, it will move out of the way,', 'start': 1167.812, 'duration': 7.088}], 'summary': 'Option b excels in breaking colored blocks by hitting the ball at extreme angles and moving out of the way when necessary.', 'duration': 32.149, 'max_score': 1142.751, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1142751.jpg'}], 'start': 143.728, 'title': 'Reinforcement learning in starcraft', 'summary': "Delves into deep learning's application in training an ai for superhuman performance in starcraft, reinforcement learning basics, and understanding rewards and q function with a focus on maximizing future rewards and the impact of immediate and delayed rewards.", 'chapters': [{'end': 293.4, 'start': 143.728, 'title': 'Deep learning in starcraft', 'summary': 'Discusses the use 
of deep learning in training an ai to compete and achieve superhuman performance against a professional starcraft player, showcasing the application of reinforcement learning and comparing it with supervised and unsupervised learning.', 'duration': 149.672, 'highlights': ['The AI achieved remarkably superhuman performance, beating the professional Starcraft player five games to zero, demonstrating the effectiveness of deep learning in competitive gaming.', 'The chapter explains the application of reinforcement learning in training an AI to compete against professionally trained game players and achieve superhuman performance, highlighting the significance of deep learning in this context.', 'It contrasts reinforcement learning with supervised and unsupervised learning, providing a comprehensive overview of the different types of learning problems and their applications in various domains.']}, {'end': 576.459, 'start': 295, 'title': 'Reinforcement learning basics', 'summary': 'Explains the fundamentals of reinforcement learning, including the goal of maximizing future rewards, key vocabulary such as agent, environment, actions, observations, states, and rewards, and the interaction between the agent and the environment for learning representations and plans.', 'duration': 281.459, 'highlights': ['Reinforcement learning goal is to maximize future rewards over many time steps', 'Key vocabulary in reinforcement learning includes agent, environment, actions, observations, states, and rewards', 'Actions can be discrete or continuous, including movements in a video game or GPS coordinates', 'States are the immediate situations in which the agent finds itself, including observations like image feeds', "Rewards provide feedback to measure the success or failure of an agent's actions"]}, {'end': 1053.629, 'start': 577.558, 'title': 'Understanding rewards and q function in reinforcement learning', 'summary': "Discusses the concept of rewards in reinforcement learning, 
emphasizing the impact of immediate and delayed rewards, the calculation of total reward, the use of discounting factor to prioritize short-term rewards, and the significance of the q function in determining the expected total future reward for an agent's action.", 'duration': 476.071, 'highlights': ['The Q function is defined to return the expected total future reward based on the current state and action of the agent.', 'The chapter explains the use of a discounting factor to prioritize short-term rewards over long-term rewards in reinforcement learning.', 'The concept of total reward as the sum of all rewards obtained after a certain time is detailed, with the introduction of the Q function to determine the expected return based on the current state and action of the agent.']}, {'end': 1209.094, 'start': 1054.33, 'title': 'Understanding q function in reinforcement learning', 'summary': 'Explains the concept of q function, its challenges in determining values for state-action pairs, and compares the expected rewards for two different state-action pairs in a game scenario, showcasing the impact of different strategies.', 'duration': 154.764, 'highlights': ['Option B strategy leads to higher accumulated rewards by hitting the ball at extreme angles, resulting in breaking off a ton of colored blocks and accumulating a ton of rewards.', 'Option A strategy is relatively conservative, achieving moderate success in hitting off a lot of the breakout pieces towards the center of the game.', 'Q function helps in determining the expected total return in a given state-action pair, and it can be challenging to intuitively guess the Q value for a given state-action pair.']}], 'duration': 1065.366, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ143728.jpg', 'highlights': ['The AI achieved remarkably superhuman performance, beating the professional Starcraft player five games to zero, demonstrating the effectiveness of 
deep learning in competitive gaming.', 'Reinforcement learning goal is to maximize future rewards over many time steps', 'The Q function is defined to return the expected total future reward based on the current state and action of the agent.', 'Option B strategy leads to higher accumulated rewards by hitting the ball at extreme angles, resulting in breaking off a ton of colored blocks and accumulating a ton of rewards.']}, {'end': 1669.917, 'segs': [{'end': 1249.545, 'src': 'embed', 'start': 1209.534, 'weight': 0, 'content': [{'end': 1218.77, 'text': "And this is a very great policy to learn because it's able to beat the game much, much faster than option A and with much less effort as well.", 'start': 1209.534, 'duration': 9.236}, {'end': 1225.709, 'text': 'So the answer to the question which state-action pair has a higher Q value in this case is option B.', 'start': 1220.205, 'duration': 5.504}, {'end': 1230.072, 'text': "But that's a relatively unintuitive option, at least for me when I first saw this problem,", 'start': 1225.709, 'duration': 4.363}, {'end': 1237.517, 'text': "because I would have expected that playing things I mean not moving out of the way of the ball when it's coming right towards you would be a better action.", 'start': 1230.072, 'duration': 7.445}, {'end': 1245.542, 'text': 'But this agent actually has learned to move away from the ball just so it can come back and hit it and really attack at extreme angles.', 'start': 1237.937, 'duration': 7.605}, {'end': 1249.545, 'text': "That's a very interesting observation that this agent has made through learning.", 'start': 1245.802, 'duration': 3.743}], 'summary': 'Option b beats game faster with less effort, surprising observation by agent.', 'duration': 40.011, 'max_score': 1209.534, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1209534.jpg'}, {'end': 1296.586, 'src': 'embed', 'start': 1272.387, 'weight': 3, 'content': [{'end': 1280.814,
'text': "So one thing we could do is have a deep neural network that gets inputs of both its state and the desired action that it's considering to make in that state.", 'start': 1272.387, 'duration': 8.427}, {'end': 1286.478, 'text': 'Then the network would be trained to predict the Q value for that given state-action pair.', 'start': 1282.255, 'duration': 4.223}, {'end': 1287.479, 'text': "That's just a single number.", 'start': 1286.498, 'duration': 0.981}, {'end': 1293.144, 'text': 'The problem with this is that it can be rather inefficient to actually run forward in time,', 'start': 1288.42, 'duration': 4.724}, {'end': 1296.586, 'text': 'because, if you remember how we compute the policy for this model.', 'start': 1293.144, 'duration': 3.442}], 'summary': 'Use deep neural network to predict q values for state-action pairs, but can be inefficient.', 'duration': 24.199, 'max_score': 1272.387, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1272387.jpg'}, {'end': 1336.366, 'src': 'embed', 'start': 1311.395, 'weight': 4, 'content': [{'end': 1319.28, 'text': 'This means that we basically have to run this network many times for each time step, just to compute what is the optimal action.', 'start': 1311.395, 'duration': 7.885}, {'end': 1320.641, 'text': 'And this can be rather inefficient.', 'start': 1319.42, 'duration': 1.221}, {'end': 1324.882, 'text': 'Instead, what we can do, which is very equivalent to this idea,', 'start': 1321.221, 'duration': 3.661}, {'end': 1331.224, 'text': "but just formulated slightly differently is that it's often much more convenient to output all of the Q values at once.", 'start': 1324.882, 'duration': 6.342}, {'end': 1336.366, 'text': 'So you input the state here and you output basically a vector of Q values.', 'start': 1332.025, 'duration': 4.341}], 'summary': 'Running the network multiple times for each time step can be inefficient; outputting all q values at once is often more
convenient.', 'duration': 24.971, 'max_score': 1311.395, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1311395.jpg'}, {'end': 1479.731, 'src': 'heatmap', 'start': 1405.071, 'weight': 1, 'content': [{'end': 1410.214, 'text': 'Well, this would mean that the target return would be maximized.', 'start': 1405.071, 'duration': 5.143}, {'end': 1418.24, 'text': 'And what we can do in this case is we can actually use this exact target return to serve as our ground truth, our data set in some sense,', 'start': 1410.315, 'duration': 7.925}, {'end': 1421.302, 'text': 'in order to actually train this agent, to train this DeepQ network.', 'start': 1418.24, 'duration': 3.062}, {'end': 1427.139, 'text': "Now what that looks like is first we'll formulate our expected return.", 'start': 1423.035, 'duration': 4.104}, {'end': 1433.564, 'text': 'if we were to take all of the best actions the initial reward r plus the action that we select,', 'start': 1427.139, 'duration': 6.425}, {'end': 1437.247, 'text': 'that maximizes the expected return for the next future state.', 'start': 1433.564, 'duration': 3.683}, {'end': 1439.729, 'text': 'And then we apply that discounting factor, gamma.', 'start': 1437.788, 'duration': 1.941}, {'end': 1441.31, 'text': 'So this is our target.', 'start': 1440.41, 'duration': 0.9}, {'end': 1446.571, 'text': "This is our Q value that we're going to try and optimize towards.", 'start': 1441.81, 'duration': 4.761}, {'end': 1452.212, 'text': "It's like what we're trying to match, right? 
That's what we want our prediction to match.", 'start': 1446.631, 'duration': 5.581}, {'end': 1455.893, 'text': 'But now we should ask ourselves what does our network predict?', 'start': 1452.572, 'duration': 3.321}, {'end': 1457.413, 'text': 'Well, our network is predicting.', 'start': 1455.933, 'duration': 1.48}, {'end': 1463.495, 'text': 'like we can see in this network, the network is predicting the Q value for a given state action pair.', 'start': 1457.413, 'duration': 6.082}, {'end': 1473.145, 'text': 'We can use these two pieces of information both our predicted Q value and our target Q value to train and create this what we call Q loss.', 'start': 1465.478, 'duration': 7.667}, {'end': 1479.731, 'text': 'This is a essentially a mean squared error formulation between our target and our predicted Q values.', 'start': 1473.706, 'duration': 6.025}], 'summary': 'Using target return to train deepq network for maximized results.', 'duration': 74.66, 'max_score': 1405.071, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1405071.jpg'}, {'end': 1433.564, 'src': 'embed', 'start': 1410.315, 'weight': 5, 'content': [{'end': 1418.24, 'text': 'And what we can do in this case is we can actually use this exact target return to serve as our ground truth, our data set in some sense,', 'start': 1410.315, 'duration': 7.925}, {'end': 1421.302, 'text': 'in order to actually train this agent, to train this DeepQ network.', 'start': 1418.24, 'duration': 3.062}, {'end': 1427.139, 'text': "Now what that looks like is first we'll formulate our expected return.", 'start': 1423.035, 'duration': 4.104}, {'end': 1433.564, 'text': 'if we were to take all of the best actions the initial reward r plus the action that we select,', 'start': 1427.139, 'duration': 6.425}], 'summary': 'Using target return as ground truth to train deepq network.', 'duration': 23.249, 'max_score': 1410.315, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1410315.jpg'}], 'start': 1209.534, 'title': 'Reinforcement learning strategies and deep q learning', 'summary': 'Discusses reinforcement learning strategies, highlighting the identification of option b with a higher q value and introduces deep q learning, showcasing its remarkable performance in atari games.', 'chapters': [{'end': 1249.545, 'start': 1209.534, 'title': 'Reinforcement learning strategies', 'summary': "Discusses a reinforcement learning agent's ability to learn an effective policy, leading to faster game completion with less effort. option b is identified as having a higher q value, despite being initially counterintuitive due to the agent's ability to learn to move away from the ball to achieve extreme angles for effective attack.", 'duration': 40.011, 'highlights': ["The agent's policy can beat the game much faster than option A with much less effort as well.", 'Option B has a higher Q value in this case, indicating its effectiveness in the given scenario.', "The agent's ability to learn to move away from the ball to achieve extreme angles for effective attack is a counterintuitive yet effective strategy."]}, {'end': 1669.917, 'start': 1250.585, 'title': 'Deep q learning', 'summary': 'Introduces the concept of deep q learning, detailing how deep neural networks can be used to model the q value function, and how the training process involves predicting and optimizing q values to infer the optimal policy, ultimately achieving remarkable performance in atari games.', 'duration': 419.332, 'highlights': ['Deep neural networks can be used to model the Q value function, with the network trained to predict the Q value for a given state-action pair, but running forward in time can be inefficient, requiring multiple evaluations for each time step.', 'Outputting a vector of Q values for all possible actions at once is often more convenient, allowing for the inference of 
the optimal set of actions given the current state.', 'The training process for the DeepQ network involves formulating the expected return using the initial reward and the action that maximizes the expected return for the next future state, and then applying a discounting factor.']}], 'duration': 460.383, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1209534.jpg', 'highlights': ['Deep Q learning outperforms option A with less effort and faster game completion', "Option B's higher Q value indicates its effectiveness in the given scenario", "Agent's ability to learn counterintuitive yet effective strategies for extreme angles", 'Deep neural networks model the Q value function for state-action pairs', 'Outputting a vector of Q values for all possible actions is more convenient', 'Training process for DeepQ network involves formulating expected return using initial reward and action']}, {'end': 1983.037, 'segs': [{'end': 1702.392, 'src': 'embed', 'start': 1671.938, 'weight': 0, 'content': [{'end': 1676.699, 'text': 'So, despite all of the advantages of this approach the simplicity,', 'start': 1671.938, 'duration': 4.761}, {'end': 1687.043, 'text': "the cleanness and how elegant the solution is I think it's above all that the ability for this solution to learn superhuman policies,", 'start': 1676.699, 'duration': 10.344}, {'end': 1691.004, 'text': 'policies that can beat humans even on some relatively simple tasks.', 'start': 1687.043, 'duration': 3.961}, {'end': 1693.325, 'text': 'there are some very important downsides to Q-learning.', 'start': 1691.004, 'duration': 2.321}, {'end': 1702.392, 'text': 'So the first of which is The simplistic model that we learned about today, this model can only handle action spaces which are discrete.', 'start': 1693.985, 'duration': 8.407}], 'summary': 'Q-learning has advantages in simplicity and elegance, but its simplistic model can only handle discrete action spaces.', 
'duration': 30.454, 'max_score': 1671.938, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1671938.jpg'}, {'end': 1895.709, 'src': 'embed', 'start': 1871.41, 'weight': 1, 'content': [{'end': 1881.338, 'text': "Now, policy learning, the key idea of policy learning, is to instead of predicting the Q values, we're going to directly optimize the policy, pi of s.", 'start': 1871.41, 'duration': 9.928}, {'end': 1888.784, 'text': 'So this is the policy distribution directly governing how we should act given a current state that we find ourselves in.', 'start': 1882.279, 'duration': 6.505}, {'end': 1895.709, 'text': 'So the output here is for us to give us the desired action in a much more direct way.', 'start': 1889.224, 'duration': 6.485}], 'summary': 'Policy learning directly optimizes the policy to give desired actions based on the current state.', 'duration': 24.299, 'max_score': 1871.41, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1871410.jpg'}, {'end': 1990.877, 'src': 'embed', 'start': 1964.395, 'weight': 2, 'content': [{'end': 1972.625, 'text': 'Now, what are some of the advantages of this type of formulation? 
First of all, over Q-learning like we saw before.', 'start': 1964.395, 'duration': 8.23}, {'end': 1976.77, 'text': "Besides the fact that it's just a much more direct way to get what we want.", 'start': 1973.246, 'duration': 3.524}, {'end': 1983.037, 'text': "instead of optimizing a Q function and then using the Q function to create our policy, now we're going to directly optimize the policy.", 'start': 1976.77, 'duration': 6.267}, {'end': 1990.877, 'text': 'Beyond that though, there is one very important advantage of this formulation, and that is that it can handle continuous action spaces.', 'start': 1984.11, 'duration': 6.767}], 'summary': 'Advantages of direct policy optimization over q-learning, able to handle continuous action spaces.', 'duration': 26.482, 'max_score': 1964.395, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1964395.jpg'}], 'start': 1671.938, 'title': 'Q-learning limitations and policy learning overview', 'summary': "Discusses q-learning's advantages and limitations, particularly in handling continuous and large action spaces. 
it also introduces policy learning as a more direct optimization method using neural networks, emphasizing its advantages over q-learning.', 'chapters': [{'end': 1736.81, 'start': 1671.938, 'title': 'Limitations of q-learning', 'summary': 'Discusses the advantages of q-learning, such as its simplicity and ability to learn superhuman policies, but highlights its downsides, particularly its limitation in handling continuous action spaces and inability to handle large action spaces effectively.', 'duration': 64.872, 'highlights': ["Q-learning's advantage lies in its ability to learn superhuman policies that can beat humans on relatively simple tasks, despite its simplicity and elegance.", 'The major downside of Q-learning is its limitation in handling continuous action spaces, making it ineffective for tasks such as predicting steering wheel angles for autonomous vehicles.', 'Another important downside of Q-learning is its inability to handle large action spaces effectively; it works well only when the action space is small.']}, {'end': 1983.037, 'start': 1737.35, 'title': 'Policy learning overview', 'summary': 'Introduces policy learning as a more direct way to optimize the policy, directly learning the policy using a neural network to govern actions, presenting a subtle but important difference from q learning, and explaining the advantages over q-learning.', 'duration': 245.687, 'highlights': ['Policy learning directly optimizes the policy, pi of s, governing actions based on a probability distribution, providing a more direct way to obtain the desired action.', "Policy learning presents a subtle but important difference from Q learning, as it aims to directly learn the policy using a neural network, in contrast to Q learning's focus on learning Q values and inferring the best action.", 'Policy learning offers advantages over Q-learning by providing a much more direct way to obtain the desired action and directly optimizing the policy, instead of optimizing a Q function
and then using the Q function to create the policy.']}], 'duration': 311.099, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1671938.jpg', 'highlights': ["Q-learning's advantage lies in its ability to learn superhuman policies on relatively simple tasks.", 'Policy learning directly optimizes the policy, pi of s, based on a probability distribution.', 'Policy learning offers advantages over Q-learning by providing a much more direct way to obtain the desired action.']}, {'end': 2542.813, 'segs': [{'end': 2031.645, 'src': 'embed', 'start': 2002.069, 'weight': 0, 'content': [{'end': 2004.212, 'text': "There's a finite number of actions here that can be taken.", 'start': 2002.069, 'duration': 2.143}, {'end': 2012.598, 'text': 'For example, This is showing, our action space here is representing the direction that I should move.', 'start': 2005.193, 'duration': 7.405}, {'end': 2021.841, 'text': 'But instead, a continuous action space would tell us not just the direction, but how fast, for example, as a real number that I should move.', 'start': 2012.818, 'duration': 9.023}, {'end': 2026.923, 'text': 'Questions like that that are infinite in the number of possible answers.', 'start': 2022.361, 'duration': 4.562}, {'end': 2031.645, 'text': 'this could be one meter per second to the left, half a meter per second to the left, or any numeric velocity.', 'start': 2026.923, 'duration': 4.722}], 'summary': 'Finite vs continuous action space in decision making with numerical examples.', 'duration': 29.576, 'max_score': 2002.069, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2002069.jpg'}, {'end': 2069.125, 'src': 'embed', 'start': 2046.496, 'weight': 3, 'content': [{'end': 2054.299, 'text': 'But now, when we plot this as a probability distribution, we can also visualize this as a continuous action space.', 'start': 2046.496, 'duration': 7.803}, {'end': 2061.362, 
'text': 'And simply, we can visualize this using something like a Gaussian distribution in this case, but it could take many different forms.', 'start': 2054.619, 'duration': 6.743}, {'end': 2066.083, 'text': 'You can choose the type of distribution that fits best with your problem set.', 'start': 2061.462, 'duration': 4.621}, {'end': 2069.125, 'text': 'Gaussian is a popular choice here because of its simplicity.', 'start': 2066.344, 'duration': 2.781}], 'summary': 'Visualize probability distribution using gaussian for continuous action space', 'duration': 22.629, 'max_score': 2046.496, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2046496.jpg'}, {'end': 2278.77, 'src': 'embed', 'start': 2237.912, 'weight': 4, 'content': [{'end': 2242.115, 'text': 'So we can indeed sample from it, which is a very nice confirmation property.', 'start': 2237.912, 'duration': 4.203}, {'end': 2250.095, 'text': "Okay, great, so let's take a look now of how the policy gradients algorithm works in a concrete example.", 'start': 2244.011, 'duration': 6.084}, {'end': 2256.819, 'text': "Let's start by revisiting this whole learning loop of reinforcement learning again that we saw in the very beginning of this lecture.", 'start': 2250.715, 'duration': 6.104}, {'end': 2269.607, 'text': "And let's think of how we can use the policy gradient algorithm that we have introduced to actually train an autonomous vehicle using this trial and error policy gradient method.", 'start': 2256.839, 'duration': 12.768}, {'end': 2278.77, 'text': 'So with this case study of autonomous vehicles or self-driving cars, what are all of these components?
So the agent would be our vehicle.', 'start': 2271.428, 'duration': 7.342}], 'summary': 'Policy gradients algorithm for training autonomous vehicles through trial and error.', 'duration': 40.858, 'max_score': 2237.912, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2237912.jpg'}], 'start': 1984.11, 'title': 'Continuous action spaces', 'summary': 'Discusses the advantage of formulating continuous action spaces, handling infinite possible answers, and modeling using a gaussian distribution, with a case study on training an autonomous vehicle using the policy gradient algorithm.', 'chapters': [{'end': 2045.11, 'start': 1984.11, 'title': 'Continuous action spaces', 'summary': 'Discusses the advantage of formulating continuous action spaces, which can handle infinite possible answers, compared to discrete action spaces demonstrated in the atari breakout game.', 'duration': 61, 'highlights': ['Continuous action spaces can handle infinite possible answers, such as specifying not just the direction but also the speed as a real number, allowing for a variety of numeric velocities (e.g., one meter per second to the left, half a meter per second to the left).', 'The formulation of continuous action spaces contrasts with discrete action spaces, which have a finite number of actions (e.g., moving left, moving right, or staying in the center) and are limited in the number of possible answers.', 'Continuous action spaces also indicate the direction through plus or minus signs, where a positive value suggests movement to the right and a negative value implies movement to the left.']}, {'end': 2542.813, 'start': 2046.496, 'title': 'Continuous action spaces in reinforcement learning', 'summary': 'Discusses modeling continuous action spaces using a gaussian distribution, the application of policy gradient method, and a case study on training an autonomous vehicle using the policy gradient algorithm.', 'duration': 496.317, 
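The Gaussian parameterization of a continuous action space described above can be made concrete. This is a hedged toy sketch, not the lecture's code: the mean and standard deviation below stand in for the two outputs of a policy network for one state.

```python
import math
import random

# Stand-ins for the two outputs of a continuous-action policy network:
# the mean and standard deviation of a Gaussian over, say, velocity.
mu, sigma = -0.8, 0.5  # on average, 0.8 m/s to the left (negative = left)

# Acting: sample a real-valued action from N(mu, sigma^2).
# The sign gives the direction, the magnitude gives the speed.
action = random.gauss(mu, sigma)

def gaussian_log_prob(a: float, mu: float, sigma: float) -> float:
    """log pi(a|s) for a Gaussian policy (used later by policy gradients)."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (a - mu) ** 2 / (2 * sigma ** 2))

log_prob = gaussian_log_prob(action, mu, sigma)
```

Because the density integrates to one, this is a valid probability distribution that we can sample from, which is the "nice confirmation property" the lecture mentions; any other parametric density could be substituted if it fits the problem better.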
'highlights': ['The chapter discusses modeling continuous action spaces using a Gaussian distribution', 'Application of policy gradient method', 'Case study on training an autonomous vehicle using the policy gradient algorithm']}], 'duration': 558.703, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ1984110.jpg', 'highlights': ['Continuous action spaces can handle infinite possible answers, such as specifying direction and speed as real numbers.', 'Formulation of continuous action spaces contrasts with discrete action spaces, which have a finite number of actions.', 'Continuous action spaces indicate direction through plus or minus signs, where positive value suggests movement to the right.', 'Chapter discusses modeling continuous action spaces using a Gaussian distribution.', 'Application of policy gradient method.', 'Case study on training an autonomous vehicle using the policy gradient algorithm.']}, {'end': 2841.609, 'segs': [{'end': 2567.888, 'src': 'embed', 'start': 2543.494, 'weight': 3, 'content': [{'end': 2555.221, 'text': 'Now the remaining question is how we can actually update our policy on every training iteration to decrease the probability of bad events and increase the probability of these good events or these good actions.', 'start': 2543.494, 'duration': 11.727}, {'end': 2555.721, 'text': "let's call them.", 'start': 2555.221, 'duration': 0.5}, {'end': 2560.484, 'text': 'So that really focuses and narrows us into points four and five in this training algorithm.', 'start': 2555.901, 'duration': 4.583}, {'end': 2567.888, 'text': "How can we do this learning process of decreasing these probabilities when it's bad and increasing the probabilities when they're good?", 'start': 2561.417, 'duration': 6.471}], 'summary': 'Update policy to decrease bad events, increase good events in training algorithm.', 'duration': 24.394, 'max_score': 2543.494, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2543494.jpg'}, {'end': 2611.62, 'src': 'embed', 'start': 2583.789, 'weight': 1, 'content': [{'end': 2593.631, 'text': 'The first term is this log likelihood term, the log likelihood of our policy, our probability of an action given our state.', 'start': 2583.789, 'duration': 9.842}, {'end': 2604.087, 'text': 'The second term is where we multiply this negative log likelihood by the total discounted return, R of t.', 'start': 2594.511, 'duration': 9.576}, {'end': 2611.62, 'text': "So let's assume that we get a lot of reward for an action that had very high log likelihood.", 'start': 2605.578, 'duration': 6.042}], 'summary': "Policy's log likelihood affects total discounted return.", 'duration': 27.831, 'max_score': 2583.789, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2583789.jpg'}, {'end': 2689.177, 'src': 'embed', 'start': 2657.224, 'weight': 2, 'content': [{'end': 2662.385, 'text': 'And again, just to reiterate once more, this policy gradient term consists of these two parts.', 'start': 2657.224, 'duration': 5.161}, {'end': 2665.166, 'text': 'One is the likelihood of an action, and the second is the reward.', 'start': 2662.485, 'duration': 2.681}, {'end': 2671.667, 'text': "If the action is very positive, very good, resulting in good reward, it's going to amplify that through this gradient term.", 'start': 2665.306, 'duration': 6.361}, {'end': 2680.032, 'text': 'If the action was not very probable, but it did result in a good reward,', 'start': 2672.188, 'duration': 7.844}, {'end': 2681.472, 'text': 'it will actually amplify it even further.', 'start': 2680.032, 'duration': 1.44}, {'end': 2689.177, 'text': 'so something that was not probable before will become probable because it resulted in a good return, and vice versa on the other side
as well.', 'start': 2681.472, 'duration': 7.705}], 'summary': 'Policy gradient amplifies positive actions based on rewards, making them more probable.', 'duration': 31.953, 'max_score': 2657.224, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2657224.jpg'}, {'end': 2727.174, 'src': 'embed', 'start': 2702.892, 'weight': 4, 'content': [{'end': 2711.74, 'text': 'because this is something that is of particular interest to the reinforcement learning field right now,', 'start': 2702.892, 'duration': 8.848}, {'end': 2719.007, 'text': "because applying these algorithms in the real world is something that's very difficult for one main reason,", 'start': 2711.74, 'duration': 7.267}, {'end': 2720.568, 'text': 'and that is this step right here.', 'start': 2719.007, 'duration': 1.561}, {'end': 2722.83, 'text': 'Running a policy until termination.', 'start': 2721.269, 'duration': 1.561}, {'end': 2727.174, 'text': "That's one thing I touched on, but I didn't spend too much time really dissecting it.", 'start': 2723.21, 'duration': 3.964}], 'summary': 'Reinforcement learning faces challenges in real-world application due to difficulty in running policy until termination.', 'duration': 24.282, 'max_score': 2702.892, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2702892.jpg'}, {'end': 2807.25, 'src': 'embed', 'start': 2779.071, 'weight': 0, 'content': [{'end': 2786.296, 'text': 'So one really cool result that we created was developing this type of simulation engine, here called Vista,', 'start': 2779.071, 'duration': 7.225}, {'end': 2792.38, 'text': 'and allows us to use real data of the world to simulate brand new virtual agents inside of the simulation.', 'start': 2786.296, 'duration': 6.084}, {'end': 2800.145, 'text': 'Now, the results here are incredibly photorealistic, as you can see, and it allows us to train agents using
reinforcement learning in simulation,', 'start': 2792.9, 'duration': 7.245}, {'end': 2807.25, 'text': 'using exactly the methods that we saw today, so that they can be directly deployed, without any transfer learning or domain adaptation,', 'start': 2800.145, 'duration': 7.105}], 'summary': 'Developed photorealistic simulation engine vista for training agents using real-world data.', 'duration': 28.179, 'max_score': 2779.071, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2779071.jpg'}], 'start': 2543.494, 'title': 'Policy gradient training and algorithm', 'summary': 'Discusses updating policy on every training iteration to decrease the probability of bad events and increase the probability of good events, and highlights the components and challenges of the policy gradient algorithm, while introducing a photorealistic simulation engine for real-world reinforcement learning.', 'chapters': [{'end': 2631.388, 'start': 2543.494, 'title': 'Policy gradient training and loss function', 'summary': 'Discusses updating the policy on every training iteration to decrease the probability of bad events and increase the probability of good events, focusing on points four and five in the training algorithm, while dissecting the loss function for training policy gradients to understand its impact on action selection.', 'duration': 87.894, 'highlights': ['The loss function for training policy gradients consists of a log likelihood term and a term involving the multiplication of negative log likelihood by the total discounted return, reinforcing actions with high returns and adjusting probabilities for actions with low returns.', 'The learning process aims to decrease the probabilities of bad events and increase the probabilities of good events, with a focus on points four and five in the training algorithm.', 'Understanding the impact of the loss function on action selection involves dissecting the log likelihood term and the 
term involving the multiplication of negative log likelihood by the total discounted return.']}, {'end': 2841.609, 'start': 2632.588, 'title': 'Policy gradient algorithm and real-world reinforcement learning', 'summary': 'Discusses the policy gradient algorithm, highlighting its components and challenges, and introduces a photorealistic simulation engine for real-world reinforcement learning, enabling direct deployment of trained agents into the real world without transfer learning or domain adaptation.', 'duration': 209.021, 'highlights': ['The policy gradient term in the algorithm computes the gradient over the policy part of the function and consists of two parts: the likelihood of an action and the reward, amplifying actions resulting in good rewards and making improbable actions probable (e.g., amplifying good returns).', 'Challenges in applying reinforcement learning algorithms in the real world stem from the difficulty of running a policy until termination, as real-world terminations often lead to negative consequences, and simulation training does not accurately transfer to the real world.', 'The development of a photorealistic simulation engine called Vista, based on real-world data, allows for training agents using reinforcement learning in simulation and direct deployment into the real world without transfer learning or domain adaptation, demonstrated through placing trained policies in a full-scale autonomous vehicle for autonomous driving.']}], 'duration': 298.115, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2543494.jpg', 'highlights': ['The development of a photorealistic simulation engine called Vista, based on real-world data, allows for training agents using reinforcement learning in simulation and direct deployment into the real world without transfer learning or domain adaptation, demonstrated through placing trained policies in a full-scale autonomous vehicle for autonomous driving.', 
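The loss these highlights describe, the negative log likelihood of each action multiplied by the total discounted return, can be written out concretely. This is a toy sketch with invented per-step numbers, not the lecture's implementation:

```python
import math

# Invented data for one episode: the probability pi(a_t|s_t) the policy
# assigned to each action actually taken, and the reward at each step.
action_probs = [0.9, 0.7, 0.6, 0.8]
rewards = [1.0, 1.0, 0.0, -2.0]  # e.g. the episode ends in a crash
GAMMA = 0.95                     # discount factor

def discounted_return(rewards, t, gamma=GAMMA):
    # R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Policy gradient loss: sum over the episode of -log pi(a_t|s_t) * R_t.
# Gradient descent on this loss increases the log likelihood of actions
# followed by high return and decreases it for actions followed by low
# (or negative) return.
loss = sum(-math.log(action_probs[t]) * discounted_return(rewards, t)
           for t in range(len(rewards)))
```

Note the amplification effect described in the transcript: a small `action_probs[t]` makes `-log` large, so an improbable action that happened to yield a high return receives an especially strong push toward higher probability.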
'The loss function for training policy gradients consists of a log likelihood term and a term involving the multiplication of negative log likelihood by the total discounted return, reinforcing actions with high returns and adjusting probabilities for actions with low returns.', 'The policy gradient term in the algorithm computes the gradient over the policy part of the function and consists of two parts: the likelihood of an action and the reward, amplifying actions resulting in good rewards and making improbable actions probable (e.g., amplifying good returns).', 'The learning process aims to decrease the probabilities of bad events and increase the probabilities of good events, with a focus on points four and five in the training algorithm.', 'Challenges in applying reinforcement learning algorithms in the real world stem from the difficulty of running a policy until termination, as real-world terminations often lead to negative consequences, and simulation training does not accurately transfer to the real world.']}, {'end': 3426.153, 'segs': [{'end': 2979.769, 'src': 'embed', 'start': 2954.219, 'weight': 2, 'content': [{'end': 2962.549, 'text': 'And several years ago, they actually developed a reinforcement learning-based pipeline that defeated champion Go players.', 'start': 2954.219, 'duration': 8.33}, {'end': 2968.537, 'text': "And the idea at its core is very simple and follows along with everything that we've learned in this lecture today.", 'start': 2962.87, 'duration': 5.667}, {'end': 2979.769, 'text': 'So first, a neural network was trained and it got to watch a lot of human expert Go players and basically learn to imitate their behaviors.', 'start': 2970.239, 'duration': 9.53}], 'summary': 'Reinforcement learning pipeline beat Go champions using neural network training.', 'duration': 25.55, 'max_score': 2954.219, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2954219.jpg'}, {'end': 3101.903,
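The "running a policy until termination" step that makes real-world training so difficult can be made concrete with a toy rollout loop. The environment here is entirely invented for illustration (an agent drifting on a line that "crashes" past a boundary); nothing in it comes from the lecture:

```python
import random

def run_episode(policy, max_steps=100):
    """Roll out a policy until termination, recording (state, action, reward)."""
    position = 0.0
    trajectory = []
    for _ in range(max_steps):
        action = policy(position)       # a continuous action in [-1, 1]
        position += action
        crashed = abs(position) > 3.0   # termination: drifted off the road
        reward = -10.0 if crashed else 1.0  # crashes are penalized heavily
        trajectory.append((position, action, reward))
        if crashed:
            break
    return trajectory

# In simulation we can afford thousands of crashes per training run; in
# the real world every termination is an actual crash, which is why the
# lecture turns to photorealistic, data-driven simulators such as Vista.
random.seed(0)
episode = run_episode(lambda state: random.uniform(-1.0, 1.0))
total_return = sum(reward for _, _, reward in episode)
```

Only after a full episode like this is collected can the policy gradient update be applied, which is exactly why a simulator whose policies transfer directly to the real vehicle is so valuable.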
'src': 'embed', 'start': 3072.777, 'weight': 0, 'content': [{'end': 3079.619, 'text': 'But it was still able to not only beat the humans, but it also beat the previous networks that were pre-trained with human data.', 'start': 3072.777, 'duration': 6.842}, {'end': 3087.858, 'text': 'Now, as recently as only last month, the next breakthrough in this line of work was released with what is called MuZero,', 'start': 3080.595, 'duration': 7.263}, {'end': 3092.499, 'text': 'where the algorithm now learned to master these environments without even knowing the rules.', 'start': 3087.858, 'duration': 4.641}, {'end': 3101.903, 'text': "I think the best way to describe MuZero is to compare and contrast its abilities with those previous advancements that we've already discussed today.", 'start': 3093.3, 'duration': 8.603}], 'summary': 'MuZero outperformed human-trained networks, mastering environments without knowing rules.', 'duration': 29.126, 'max_score': 3072.777, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ3072777.jpg'}, {'end': 3160.633, 'src': 'embed', 'start': 3116.393, 'weight': 1, 'content': [{'end': 3122.657, 'text': 'Then came AlphaGo Zero, which showed us that even better performance could be achieved entirely on its own,', 'start': 3116.393, 'duration': 6.264}, {'end': 3127.06, 'text': 'without pre-training from the human grandmasters, but instead directly learning from scratch.', 'start': 3122.657, 'duration': 4.403}, {'end': 3135.667, 'text': 'Then came AlphaZero, which extended this idea even further, beyond the game of Go, and also into chess and shogi,', 'start': 3127.981, 'duration': 7.686}, {'end': 3142.293, 'text': 'but still required the model to be given the rules of the games in order to learn from them.', 'start': 3135.667, 'duration': 6.626}, {'end': 3152.75, 'text': 'Now, last month the authors demonstrated superhuman performance on over 50 games, all without the
algorithm knowing the rules beforehand.', 'start': 3143.166, 'duration': 9.584}, {'end': 3160.633, 'text': 'It had to learn them, as well as actually learn how to play the game optimally, during its training process.', 'start': 3153.29, 'duration': 7.343}], 'summary': 'Alphago zero and alphazero achieved superhuman performance without pre-training, alphazero extended to chess and shogi, demonstrated superhuman performance on over 50 games without knowing the rules beforehand.', 'duration': 44.24, 'max_score': 3116.393, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ3116393.jpg'}, {'end': 3301.723, 'src': 'embed', 'start': 3276.515, 'weight': 4, 'content': [{'end': 3282.577, 'text': 'This is very similar to how we saw AlphaZero work, but now the key difference is that the dynamics model,', 'start': 3276.515, 'duration': 6.062}, {'end': 3286.078, 'text': 'as part of the tree search that we can see at each of these steps,', 'start': 3282.577, 'duration': 3.501}, {'end': 3293.52, 'text': 'is entirely learned and greatly opens up the possibilities for these techniques to be applied outside of rigid game scenarios.', 'start': 3286.078, 'duration': 7.442}, {'end': 3299.762, 'text': 'So in these scenarios, we do know the rules of the games very well, so we could use them to train our algorithms better.', 'start': 3293.88, 'duration': 5.882}, {'end': 3301.723, 'text': 'But in many scenarios,', 'start': 3300.362, 'duration': 1.361}], 'summary': 'Learning dynamics model expands ai applications beyond games.', 'duration': 25.208, 'max_score': 3276.515, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ3276515.jpg'}], 'start': 2842.009, 'title': 'Deep reinforcement learning applications and learning rules for game optimization', 'summary': 'Discusses the applications of deep reinforcement learning such as mastering the game of go and recent breakthrough with MuZero, as
well as an algorithm demonstrating superhuman performance on over 50 games without prior knowledge of the rules and introducing a tree search algorithm for planning into the future.', 'chapters': [{'end': 3135.667, 'start': 2842.009, 'title': 'Deep reinforcement learning applications', 'summary': 'Discusses the remarkable applications of deep reinforcement learning, including the training of an ai to master the game of go, defeating champion players, and the evolution of reinforcement learning-based solutions up to the recent breakthrough with MuZero.', 'duration': 293.658, 'highlights': ['Google DeepMind developed a reinforcement learning-based pipeline that defeated champion Go players in 2016.', 'AlphaZero extended the idea of self-play and model optimization entirely from scratch to games like chess, shogi, and Go.', 'The recent breakthrough in deep reinforcement learning is MuZero, where the algorithm learned to master environments without even knowing the rules.']}, {'end': 3426.153, 'start': 3135.667, 'title': 'Learning rules for game optimization', 'summary': 'Discusses an algorithm that demonstrated superhuman performance on over 50 games without prior knowledge of the rules, highlighting the importance of learning rules in environments where rules are unknown or too complicated, and introduces a tree search algorithm for planning into the future.', 'duration': 290.486, 'highlights': ['The authors demonstrated superhuman performance on over 50 games without the algorithm knowing the rules beforehand.', "The network is forced to learn the dynamics model of how to do this search, as it doesn't know the rules beforehand.", 'The dynamics model, as part of the tree search, is entirely learned, allowing these techniques to be applied outside of rigid game scenarios.']}], 'duration': 584.144, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/93M1l_nrhpQ/pics/93M1l_nrhpQ2842009.jpg', 'highlights': ['MuZero achieved mastery without knowing the
rules', 'AlphaZero mastered chess, shogi, and Go from scratch', "DeepMind's pipeline defeated champion Go players in 2016", 'Algorithm demonstrated superhuman performance on 50+ games', 'Tree search algorithm allows application outside rigid game scenarios']}], 'highlights': ['Reinforcement learning and deep learning create high-performing agents impacting robotics, self-driving cars, and strategic planning.', 'The AI achieved remarkably superhuman performance, beating the professional StarCraft player five games to zero, demonstrating the effectiveness of deep learning in competitive gaming.', 'The development of a photorealistic simulation engine called Vista, based on real-world data, allows for training agents using reinforcement learning in simulation and direct deployment into the real world without transfer learning or domain adaptation, demonstrated through placing trained policies in a full-scale autonomous vehicle for autonomous driving.', 'Continuous action spaces can handle infinite possible answers, such as specifying direction and speed as real numbers.', 'MuZero achieved mastery without knowing the rules']}