title
MIT 6.S191 (2020): Reinforcement Learning
description
MIT Introduction to Deep Learning 6.S191: Lecture 5
Deep Reinforcement Learning
Lecturer: Alexander Amini
January 2020
For all lectures, slides, and lab materials: http://introtodeeplearning.com
Lecture Outline
0:00 - Introduction
2:47 - Classes of learning problems
4:59 - Definitions
9:23 - The Q function
13:18 - Deeper into the Q function
17:17 - Deep Q Networks
21:44 - Atari results and limitations
24:13 - Policy learning algorithms
27:36 - Discrete vs continuous actions
30:11 - Training policy gradients
36:04 - RL in real life
37:40 - VISTA simulator
38:55 - AlphaGo and AlphaZero
42:51 - Summary
Subscribe to stay up to date with new deep learning lectures at MIT, or follow us @MITDeepLearning on Twitter and Instagram to stay fully-connected!!
detail
{'title': 'MIT 6.S191 (2020): Reinforcement Learning', 'heatmap': [{'end': 1645.174, 'start': 1612.161, 'weight': 1}], 'summary': 'Explores reinforcement learning, including deep reinforcement learning, basics, deep q networks, q-learning, policy gradient methods, policy gradient learning for self-driving car, and real-life applications such as training autonomous vehicles and ai agents for playing go.', 'chapters': [{'end': 297.574, 'segs': [{'end': 49.317, 'src': 'embed', 'start': 27.656, 'weight': 0, 'content': [{'end': 36.626, 'text': "But now we're moving away from that and we're thinking about scenarios where our deep learning model is its own self and it can act in an environment.", 'start': 27.656, 'duration': 8.97}, {'end': 42.032, 'text': "And when it takes actions in that environment, it's exploring the environment, learning how to solve some tasks,", 'start': 37.266, 'duration': 4.766}, {'end': 49.317, 'text': 'And we really get to explore these type of dynamic scenarios where you have an autonomous agent,', 'start': 43.371, 'duration': 5.946}], 'summary': 'Exploring scenarios where a deep learning model acts autonomously in an environment, learning to solve tasks.', 'duration': 21.661, 'max_score': 27.656, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w27656.jpg'}, {'end': 119.59, 'src': 'embed', 'start': 69.898, 'weight': 1, 'content': [{'end': 77.825, 'text': 'Now this has huge, obvious implications in fields like robotics, where you have self-driving cars and also manipulation,', 'start': 69.898, 'duration': 7.927}, {'end': 81.728, 'text': 'so having hands that can grasp different objects in the environment.', 'start': 77.825, 'duration': 3.903}, {'end': 87.753, 'text': 'But it also impacts the world of gameplay, and specifically strategy and planning.', 'start': 82.208, 'duration': 5.545}, {'end': 96.38, 'text': 'And you can imagine that if you combine these two worlds robotics and gameplay you can also create some pretty cool applications where you have a robot playing against a human.', 'start': 88.813, 'duration': 7.567}, {'end': 105.898, 'text': 'OK.', 'start': 105.638, 'duration': 0.26}, {'end': 109.521, 'text': 'So this is a little bit dramatized.', 'start': 107.44, 'duration': 2.081}, {'end': 114.746, 'text': 'And the robot here is not actually using deep reinforcement learning.', 'start': 111.543, 'duration': 3.203}, {'end': 115.746, 'text': "I'd like to say that first.", 'start': 114.786, 'duration': 0.96}, {'end': 119.59, 'text': 'So this is actually entirely choreographed for a TV ad.', 'start': 116.887, 'duration': 2.703}], 'summary': 'Robotic applications include self-driving cars, object manipulation, gameplay, and strategy.', 'duration': 49.692, 'max_score': 69.898, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w69898.jpg'}, {'end': 239.011, 'src': 'embed', 'start': 216.765, 'weight': 3, 'content': [{'end': 226.41, 'text': 'So states are the observations, or the inputs to the system, and the actions are the actions, well, that the agent wants to take in that environment.', 'start': 216.765, 'duration': 9.645}, {'end': 236.709, 'text': 'Now the goal of the agent in this world is just to maximize its own rewards, or to take actions that result in rewards,', 'start': 227.881, 'duration': 8.828}, {'end': 239.011, 'text': 'and in as many rewards as possible.', 'start': 236.709, 'duration': 2.302}], 'summary': 'Agents aim to maximize rewards by 
taking actions in the environment.', 'duration': 22.246, 'max_score': 216.765, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w216765.jpg'}, {'end': 290.449, 'src': 'embed', 'start': 258.122, 'weight': 4, 'content': [{'end': 265.448, 'text': 'our focus is gonna be just on this third realm of reinforcement learning and seeing how we can build deep neural networks that can solve these problems as well.', 'start': 258.122, 'duration': 7.326}, {'end': 274.785, 'text': 'And before I go any further, I want to start by building up some key vocabulary for all of you, just because in reinforcement learning,', 'start': 267.563, 'duration': 7.222}, {'end': 280.046, 'text': 'a lot of the vocabulary is a little bit different than in supervised or unsupervised learning.', 'start': 274.785, 'duration': 5.261}, {'end': 290.449, 'text': "So I think it's really important that we go back to the foundations and really define some important vocabulary that's going to be really crucial before we get to building up to the more complicated stuff later in this lecture.", 'start': 280.086, 'duration': 10.363}], 'summary': 'Focus on reinforcement learning and building deep neural networks, emphasizing key vocabulary for foundation.', 'duration': 32.327, 'max_score': 258.122, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w258122.jpg'}], 'start': 9.139, 'title': 'Reinforcement learning', 'summary': 'Explores the transition to deep reinforcement learning, focusing on its implications in dynamic environments and the need to build deep neural networks. it discusses the goal of maximizing rewards and emphasizes the shift from traditional learning methods.', 'chapters': [{'end': 136.937, 'start': 9.139, 'title': 'Deep reinforcement learning', 'summary': 'Explores the shift from using deep learning on fixed data sets to deep reinforcement learning, where models learn to act in dynamic environments, with implications in robotics, gameplay, and autonomous agent interactions in the real world.', 'duration': 127.798, 'highlights': ['The shift from using deep learning on fixed data sets to deep reinforcement learning allows models to act in dynamic environments, learning to solve tasks without human supervision or guidance.', 'The implications of deep reinforcement learning are significant in fields like robotics, including self-driving cars, manipulation, and gameplay.', 'The potential applications of deep reinforcement learning include creating scenarios where robots interact with humans, leading to efficient learning of the autonomous controllers defining their actions.']}, {'end': 297.574, 'start': 136.937, 'title': 'Reinforcement learning basics', 'summary': 'Discusses the transition from supervised and unsupervised learning to reinforcement learning, emphasizing on the goal of maximizing rewards by taking actions based on observations, and the need to build deep neural networks that can solve these problems.', 'duration': 160.637, 'highlights': ['Reinforcement learning focuses on maximizing rewards by taking actions based on observations, such as an agent learning to eat an apple to survive longer.', 'Unsupervised learning aims to find structure in data without any labels, while supervised learning involves learning a model to predict labels based on input data.', 'The transition from supervised and unsupervised learning to reinforcement learning is emphasized, with a focus on building deep neural networks for 
solving reinforcement learning problems.', 'Understanding the vocabulary specific to reinforcement learning is highlighted as crucial for building up to more complicated topics in the lecture.']}], 'duration': 288.435, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w9139.jpg', 'highlights': ['The shift to deep reinforcement learning allows models to act in dynamic environments', 'The implications of deep reinforcement learning are significant in fields like robotics', 'The potential applications of deep reinforcement learning include creating scenarios where robots interact with humans', 'Reinforcement learning focuses on maximizing rewards by taking actions based on observations', 'The transition from supervised and unsupervised learning to reinforcement learning is emphasized', 'Understanding the vocabulary specific to reinforcement learning is crucial for building up to more complicated topics']}, {'end': 1003.773, 'segs': [{'end': 345.524, 'src': 'embed', 'start': 298.895, 'weight': 3, 'content': [{'end': 300.656, 'text': "So first, we're gonna start with the agent.", 'start': 298.895, 'duration': 1.761}, {'end': 303.818, 'text': 'The agent is like the central part of the reinforcement learning algorithm.', 'start': 300.876, 'duration': 2.942}, {'end': 306.02, 'text': 'It is the neural network in this case.', 'start': 304.219, 'duration': 1.801}, {'end': 307.821, 'text': 'The agent is the thing that takes the actions.', 'start': 306.08, 'duration': 1.741}, {'end': 310.623, 'text': 'In real life, you are the agents, each of you.', 'start': 308.262, 'duration': 2.361}, {'end': 316.828, 'text': "If you're trying to learn a controller for a drone to make a delivery, the drone is the agent.", 'start': 311.864, 'duration': 4.964}, {'end': 320.95, 'text': 'The next one is the environment.', 'start': 319.309, 'duration': 1.641}, {'end': 325.312, 'text': 'The environment is simply the world in which the agent operates or acts.', 'start': 321.47, 'duration': 3.842}, {'end': 328.773, 'text': 'So in real life, again, the world is your environment.', 'start': 325.872, 'duration': 2.901}, {'end': 335.256, 'text': 'Now the agent can send commands to the environment in the form of what are called actions.', 'start': 330.634, 'duration': 4.622}, {'end': 345.524, 'text': 'Now, in many cases, we simplify this a little bit and say that the agent can pick from a finite set of actions that it can execute in that world.', 'start': 336.297, 'duration': 9.227}], 'summary': 'Reinforcement learning involves an agent interacting with an environment by taking actions, with the agent being a neural network and the environment being the world in which the agent operates.', 'duration': 46.629, 'max_score': 298.895, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w298895.jpg'}, {'end': 457.003, 'src': 'embed', 'start': 426.095, 'weight': 0, 'content': [{'end': 428.637, 'text': "So it doesn't have to be like every moment in time you're getting a reward.", 'start': 426.095, 'duration': 2.542}, {'end': 436.654, 'text': "These rewards, effectively, you can think about them as just evaluating all of the agent's actions.", 'start': 431.448, 'duration': 5.206}, {'end': 441.66, 'text': "So from them, you can get a sense of how well the agent is doing in that environment, and that's what we want to try and maximize.", 'start': 436.954, 'duration': 4.706}, {'end': 449.08, 'text': 'Now we can look at the total reward 
as just the summation of all of the individual rewards in time.', 'start': 443.558, 'duration': 5.522}, {'end': 457.003, 'text': 'So if you start at some time t, we can call capital R of t as the sum of all of the rewards from that point on to the future.', 'start': 449.3, 'duration': 7.703}], 'summary': "Maximize agent's total reward by evaluating its actions over time.", 'duration': 30.908, 'max_score': 426.095, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w426095.jpg'}, {'end': 580.45, 'src': 'embed', 'start': 556.382, 'weight': 1, 'content': [{'end': 563.244, 'text': 'multiplying it by a discounting factor and then adding on all future rewards also multiplied by their discounting factor as well.', 'start': 556.382, 'duration': 6.862}, {'end': 574.001, 'text': 'Now the Q function takes as input the current state of the agent and also takes as input the action that the agent executes at that time,', 'start': 564.448, 'duration': 9.553}, {'end': 580.45, 'text': 'and it returns the expected total discounted return that the agent could expect at that point in time.', 'start': 574.001, 'duration': 6.449}], 'summary': 'Q function calculates expected total discounted return for agent.', 'duration': 24.068, 'max_score': 556.382, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w556382.jpg'}, {'end': 692.633, 'src': 'embed', 'start': 664.576, 'weight': 2, 'content': [{'end': 668.782, 'text': "You pick the action that gives you the highest Q value, and that's the one that you execute at that time.", 'start': 664.576, 'duration': 4.206}, {'end': 672.44, 'text': "So let's actually go through this.", 'start': 671.459, 'duration': 0.981}, {'end': 676.285, 'text': 'So ultimately what we want is to take actions in the environment.', 'start': 672.46, 'duration': 3.825}, {'end': 687.678, 'text': 'The function that will take as input a state or an observation and predict or evaluate that to an action is called the policy denoted here as pi of s.', 'start': 676.705, 'duration': 10.973}, {'end': 692.633, 'text': 'And the strategy that we always want to take is just to maximize our Q value.', 'start': 689.031, 'duration': 3.602}], 'summary': 'Maximize q value to pick actions for environment.', 'duration': 28.057, 'max_score': 664.576, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w664576.jpg'}], 'start': 298.895, 'title': 'Reinforcement learning basics', 'summary': 'Introduces the key components of reinforcement learning, emphasizing the interaction between the agent, environment, actions, and observations, along with explaining the concept of reinforcement learning, including maximizing rewards, delayed rewards, discounted total rewards, the q function, the policy, two classes of reinforcement learning algorithms, and a practical example of estimating q values.', 'chapters': [{'end': 385.879, 'start': 298.895, 'title': 'Reinforcement learning basics', 'summary': 'Introduces the key components of reinforcement learning, including the agent, environment, actions, and observations, emphasizing the interaction between them.', 'duration': 86.984, 'highlights': ['The agent is the central part of the reinforcement learning algorithm, acting as the neural network and taking actions, such as controlling a drone for delivery.', 'The environment is where the agent operates, receiving commands in the form of actions and providing feedback through 
observations, which could be visual, auditory, etc.', 'The agent can select actions from a finite set, like moving forward, backwards, left, or right, and receives observations in the form of states, representing its immediate situation.']}, {'end': 1003.773, 'start': 385.999, 'title': 'Reinforcement learning basics', 'summary': "Explains the concept of reinforcement learning, emphasizing the agent's goal to maximize rewards, delayed rewards, discounted total rewards, the q function, and the policy, along with the two classes of reinforcement learning algorithms and a practical example of estimating q values.", 'duration': 617.774, 'highlights': ["The agent's goal is to maximize its own reward in the environment, which may involve receiving delayed rewards and evaluating the total reward as the summation of individual rewards in time.", 'The concept of discounted sum of rewards is introduced to discount future rewards, and a concrete example is provided to illustrate the reasoning behind discounting future rewards.', 'The Q function is defined as a function that takes the current state and action of the agent as input and returns the expected total discounted return, guiding the agent to take actions that maximize the Q value.', 'The concept of using the Q function to choose actions in the environment is explained, where the agent evaluates the Q function for possible actions and selects the action with the highest Q value to execute.', 'Two classes of reinforcement learning algorithms are outlined, focusing on learning the Q function and using it to define the policy, and directly learning the policy without the intermediate Q function.', 'A practical example featuring the Breakout game is presented to illustrate the challenges of estimating Q values and the unintuitive nature of actions that reinforcement learning agents can learn.']}], 'duration': 704.878, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w298895.jpg', 'highlights': ["The agent's goal is to maximize its own reward in the environment, which may involve receiving delayed rewards and evaluating the total reward as the summation of individual rewards in time.", 'The Q function is defined as a function that takes the current state and action of the agent as input and returns the expected total discounted return, guiding the agent to take actions that maximize the Q value.', 'The concept of using the Q function to choose actions in the environment is explained, where the agent evaluates the Q function for possible actions and selects the action with the highest Q value to execute.', 'The agent is the central part of the reinforcement learning algorithm, acting as the neural network and taking actions, such as controlling a drone for delivery.', 'The environment is where the agent operates, receiving commands in the form of actions and providing feedback through observations, which could be visual, auditory, etc.']}, {'end': 1331.629, 'segs': [{'end': 1102.689, 'src': 'embed', 'start': 1061.581, 'weight': 1, 'content': [{'end': 1066.722, 'text': 'The alternative is that you could have one network that takes as input that state,', 'start': 1061.581, 'duration': 5.141}, {'end': 1070.643, 'text': 'but now it has learned to output all of the different Q values for all of the different actions.', 'start': 1066.722, 'duration': 3.921}, {'end': 1079.206, 'text': 'So now here we have to just execute this once, we forward propagate once, and we can see that it gives us back the Q 
value for every single action.', 'start': 1071.143, 'duration': 8.063}, {'end': 1084.547, 'text': "We look at all of those Q values, we pick the one that's maximum, and take the action that corresponds.", 'start': 1079.546, 'duration': 5.001}, {'end': 1087.101, 'text': "Now that we've set up this network,", 'start': 1085.18, 'duration': 1.921}, {'end': 1094.324, 'text': 'how do we train it to actually output the true Q value at a particular instance or the Q function over many different states?', 'start': 1087.101, 'duration': 7.223}, {'end': 1101.268, 'text': 'Now what we want to do is to maximize the target return right?', 'start': 1095.545, 'duration': 5.723}, {'end': 1102.689, 'text': 'And that will train the agent.', 'start': 1101.548, 'duration': 1.141}], 'summary': 'Using one network to output q values for actions, maximizing target return to train the agent.', 'duration': 41.108, 'max_score': 1061.581, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1061581.jpg'}, {'end': 1184.658, 'src': 'embed', 'start': 1154.994, 'weight': 2, 'content': [{'end': 1159.856, 'text': 'So we take the best action now, and we take the best action at every future time as well.', 'start': 1154.994, 'duration': 4.862}, {'end': 1167.918, 'text': "Assuming we do that, we can just look at our data, see what the rewards were, add them all up and discount appropriately, and that's our true Q value.", 'start': 1160.036, 'duration': 7.882}, {'end': 1173.44, 'text': 'Now the predicted Q value is obviously just the output from the network.', 'start': 1169.699, 'duration': 3.741}, {'end': 1175.392, 'text': 'We can train these.', 'start': 1174.431, 'duration': 0.961}, {'end': 1177.353, 'text': 'we have a target, we have a predicted.', 'start': 1175.392, 'duration': 1.961}, {'end': 1184.658, 'text': "we can train this whole network end-to-end by subtracting the two, taking the squared difference, and that's our loss function.", 'start': 1177.353, 'duration': 7.305}], 'summary': 'Train network to minimize loss function for q value calculation.', 'duration': 29.664, 'max_score': 1154.994, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1154994.jpg'}, {'end': 1242.766, 'src': 'embed', 'start': 1202.608, 'weight': 3, 'content': [{'end': 1206.669, 'text': 'We get it as pixels coming in, and you can see that on the left-hand side.', 'start': 1202.608, 'duration': 4.061}, {'end': 1208.23, 'text': 'That gets fed into a neural network.', 'start': 1206.709, 'duration': 1.521}, {'end': 1212.511, 'text': "Our neural network outputs, in this case of Atari, it's gonna output three numbers.", 'start': 1208.53, 'duration': 3.981}, {'end': 1214.972, 'text': 'The Q value for each of the possible actions.', 'start': 1212.931, 'duration': 2.041}, {'end': 1217.993, 'text': "It can go left, it can go right, or it can stay and don't do anything.", 'start': 1215.052, 'duration': 2.941}, {'end': 1223.815, 'text': 'Each of those Q values will have a numerical value that the neural network will predict.', 'start': 1218.733, 'duration': 5.082}, {'end': 1229.221, 'text': 'Now again, how do we pick what action to take given this Q function?', 'start': 1225, 'duration': 4.221}, {'end': 1238.244, 'text': "We can just take the argmax of those Q values and just say OK, if I go left, I'm going to have an expected return of 20..", 'start': 1230.202, 'duration': 8.042}, {'end': 1242.766, 'text': "That means I'm going to probably break off 20 
colored blocks in the future.", 'start': 1238.244, 'duration': 4.522}], 'summary': 'Neural network outputs 3 numbers for atari, predicting q values for actions.', 'duration': 40.158, 'max_score': 1202.608, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1202608.jpg'}, {'end': 1298.597, 'src': 'embed', 'start': 1267.676, 'weight': 0, 'content': [{'end': 1271.4, 'text': 'The game repeats, the next frame goes, and this whole process loops again.', 'start': 1267.676, 'duration': 3.724}, {'end': 1279.042, 'text': 'Now, DeepMind actually showed how these networks, which are called deep Q networks,', 'start': 1273.598, 'duration': 5.444}, {'end': 1282.845, 'text': 'could actually be applied to solve a whole variety of Atari games,', 'start': 1279.042, 'duration': 3.803}, {'end': 1290.831, 'text': 'providing the state as input through pixels so just raw input state as pixels and showing how they could learn the Q function.', 'start': 1282.845, 'duration': 7.986}, {'end': 1298.597, 'text': "So all of the possible actions are shown on the right-hand side, and it's learning that Q function just by interacting with its environment.", 'start': 1291.272, 'duration': 7.325}], 'summary': "DeepMind demonstrated deep Q networks' ability to learn and solve various atari games using pixel input and interaction with the environment.", 'duration': 30.921, 'max_score': 1267.676, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1267676.jpg'}, {'end': 1341.557, 'src': 'embed', 'start': 1311.48, 'weight': 6, 'content': [{'end': 1317.601, 'text': "And it's actually amazing that this technique works so well because, to be honest, it is so simple and it is extremely clean.", 'start': 1311.48, 'duration': 6.121}, {'end': 1325.323, 'text': "How clean the idea is, it's very elegant in some sense, how simple it is.", 'start': 1319.081, 'duration': 6.242}, {'end': 1331.629, 'text': "And still, it's able to achieve superhuman performance, which means that it beat the human on over 50% of these Atari games.", 'start': 1325.523, 'duration': 6.106}, {'end': 1341.557, 'text': "So now that we saw the magic of Q-learning, I'd like to touch on some of the downsides that we haven't seen so far.", 'start': 1334.591, 'duration': 6.966}], 'summary': 'Q-learning achieves superhuman performance, beating human players on over 50% of Atari games.', 'duration': 30.077, 'max_score': 1311.48, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1311480.jpg'}], 'start': 1006.505, 'title': 'Deep q networks', 'summary': 'Discusses the use of deep q networks to train agents in atari games, achieving superhuman performance on over 50% of the games, using raw input state as pixels for learning, and the simplicity and elegance of the technique leading to its success.', 'chapters': [{'end': 1202.268, 'start': 1006.505, 'title': 'Deep q-learning for atari games', 'summary': 'Discusses the use of deep neural networks to train an agent to play atari games by outputting q values for different actions, maximizing target return, and training the network using the mean squared error loss function.', 'duration': 195.763, 'highlights': ['The network takes as input a state and action representation, returning the Q value for that state-action pair, reducing the need to run the network multiple times for different actions.', 'Training the network involves maximizing the target return over an infinite time 
horizon, using the target Q value obtained from rolling out the agent and the predicted Q value from the network to calculate the mean squared error loss function.', 'The predicted Q value is obtained from the network output, while the target Q value is composed of the reward obtained at the current time plus the maximum action value at every future time, and the network is trained end-to-end by minimizing the squared difference between the target and predicted Q values.']}, {'end': 1266.995, 'start': 1202.608, 'title': 'Reinforcement learning in atari', 'summary': 'Discusses how a neural network in atari outputs q values for possible actions, and how the argmax is used to determine the action that maximizes the total return, with examples of expected return values.', 'duration': 64.387, 'highlights': ['The neural network in Atari outputs Q values for each possible action, such as left, right, or staying, which are used to predict the expected returns for each action.', 'The argmax of the Q values is used to determine the action that maximizes the total return, with examples of expected return values like 20 for going left, 3 for staying, and 0 for going right.']}, {'end': 1331.629, 'start': 1267.676, 'title': 'Deep q networks for atari games', 'summary': "Discusses deep q networks' application in solving atari games, achieving superhuman performance on over 50% of the games, using raw input state as pixels for learning the q function, and the simplicity and elegance of the technique leading to its success.", 'duration': 63.953, 'highlights': ['Deep Q Networks achieved superhuman performance on over 50% of analyzed Atari games, using raw input state as pixels for learning the Q function', 'Demonstrated the simple and elegant nature of the technique, despite its ability to outperform humans on the majority of Atari games']}], 'duration': 325.124, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1006505.jpg', 'highlights': ['Deep Q Networks achieved superhuman performance on over 50% of analyzed Atari games, using raw input state as pixels for learning the Q function', 'The network takes as input a state and action representation, returning the Q value for that state-action pair, reducing the need to run the network multiple times for different actions', 'The predicted Q value is obtained from the network output, while the target Q value is composed of the reward obtained at the current time plus the maximum action value at every future time, and the network is trained end-to-end by minimizing the squared difference between the target and predicted Q values', 'The neural network in Atari outputs Q values for each possible action, such as left, right, or staying, which are used to predict the expected returns for each action', 'Training the network involves maximizing the target return over an infinite time horizon, using the target Q value obtained from rolling out the agent and the predicted Q value from the network to calculate the mean squared error loss function', 'The argmax of the Q values is used to determine the action that maximizes the total return, with examples of expected return values like 20 for going left, 3 for staying, and 0 for going right', 'Demonstrated the simple and elegant nature of the technique, despite its ability to outperform humans on the majority of Atari games']}, {'end': 1778.362, 'segs': [{'end': 1426.607, 'src': 'embed', 'start': 1360.31, 'weight': 0, 'content': [{'end': 1366.052, 'text': "So you can't 
effectively model or parameterize this problem to deal with continuous action spaces.", 'start': 1360.31, 'duration': 5.742}, {'end': 1372.975, 'text': 'There are ways that you can kind of tweak it, but at its core, what I presented today is not amenable to continuous action spaces.', 'start': 1366.072, 'duration': 6.903}, {'end': 1381.258, 'text': "It's really well suited for small action spaces where you have a small number of possible actions and discrete possibilities.", 'start': 1373.035, 'duration': 8.223}, {'end': 1383.799, 'text': 'So a finite number of possible actions at every given time.', 'start': 1381.498, 'duration': 2.301}, {'end': 1393.772, 'text': "It's also, its policy is also deterministic because you're always picking the action that maximizes your Q function.", 'start': 1384.949, 'duration': 8.823}, {'end': 1400.214, 'text': "And this can be challenging specifically when you're dealing with stochastic environments like we talked about before.", 'start': 1394.212, 'duration': 6.002}, {'end': 1406.797, 'text': 'So Q value learning is really well suited for deterministic action spaces.', 'start': 1400.555, 'duration': 6.242}, {'end': 1412.059, 'text': 'sorry, deterministic environments, discrete action spaces,', 'start': 1408.057, 'duration': 4.002}, {'end': 1420.584, 'text': "and we'll see how we can move past Q-learning to something like a policy gradient method which allows us to deal with continuous action spaces and potentially stochastic environments.", 'start': 1412.059, 'duration': 8.525}, {'end': 1426.607, 'text': "So next up we'll learn about policy learning to get around some of these problems,", 'start': 1421.364, 'duration': 5.243}], 'summary': 'Q value learning is suited for deterministic, discrete action spaces; policy gradient method is for continuous, stochastic environments.', 'duration': 66.297, 'max_score': 1360.31, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1360310.jpg'}, {'end': 1554.883, 'src': 'embed', 'start': 1527.914, 'weight': 3, 'content': [{'end': 1531.037, 'text': "You should not do that because you're definitely not going to get any return.", 'start': 1527.914, 'duration': 3.123}, {'end': 1537.831, 'text': 'Now with that probability distribution, that defines your policy, like that is your policy.', 'start': 1533.188, 'duration': 4.643}, {'end': 1541.694, 'text': 'You can then take an action simply by sampling from that distribution.', 'start': 1538.231, 'duration': 3.463}, {'end': 1547.678, 'text': 'So if you draw a sample from that probability distribution, that exactly tells you the action you should take.', 'start': 1541.974, 'duration': 5.704}, {'end': 1554.883, 'text': 'So if I sample from this probability distribution here, I might see that the action I select is A1 going left.', 'start': 1548.378, 'duration': 6.505}], 'summary': 'Policy defined by probability distribution, action determined by sampling.', 'duration': 26.969, 'max_score': 1527.914, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1527914.jpg'}, {'end': 1645.174, 'src': 'heatmap', 'start': 1612.161, 'weight': 1, 'content': [{'end': 1617.886, 'text': "but what that means is now we're not really constrained to dealing only with categorical action spaces.", 'start': 1612.161, 'duration': 5.725}, {'end': 1621.332, 'text': "We can parameterize this probability distribution however we'd like.", 'start': 1617.947, 'duration': 3.385}, {'end': 1623.956, 'text': 
'In fact, we could make it continuous pretty easily.', 'start': 1621.833, 'duration': 2.123}, {'end': 1626.3, 'text': "So let's take an example of what that might look like.", 'start': 1624.036, 'duration': 2.264}, {'end': 1628.163, 'text': 'This is the discrete action space.', 'start': 1626.701, 'duration': 1.462}, {'end': 1631.589, 'text': 'So we have three possible actions, left, right, or stay in the center.', 'start': 1628.684, 'duration': 2.905}, {'end': 1637.329, 'text': 'And a discrete action space is going to have all of its mass on these three points.', 'start': 1633.206, 'duration': 4.123}, {'end': 1642.352, 'text': "The summation is going to be one of those masses, but still, they're concentrated on three points.", 'start': 1637.709, 'duration': 4.643}, {'end': 1645.174, 'text': 'A continuous action space in this realm.', 'start': 1643.153, 'duration': 2.021}], 'summary': 'Agent can now handle continuous action spaces, not limited to categorical actions.', 'duration': 33.013, 'max_score': 1612.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1612161.jpg'}, {'end': 1733.253, 'src': 'embed', 'start': 1705.918, 'weight': 2, 'content': [{'end': 1710.679, 'text': 'we can parameterize that Gaussian or the output of that Gaussian with a mean and a variance.', 'start': 1705.918, 'duration': 4.761}, {'end': 1717.341, 'text': 'So at every point in time now, our network is going to predict the mean and the variance of that distribution.', 'start': 1710.799, 'duration': 6.542}, {'end': 1721.364, 'text': "So it's outputting actually a mean number and a variance number.", 'start': 1717.581, 'duration': 3.783}, {'end': 1725.567, 'text': "Now all we have to do then, let's suppose that mean and variance is minus 1 and 0.5.", 'start': 1721.384, 'duration': 4.183}, {'end': 1733.253, 'text': "So it's saying that the center of that distribution is minus 1 meters per second or moving 1 meter per second to the left.", 'start': 1725.567, 'duration': 7.686}], 'summary': 'A network predicts mean and variance for gaussian distribution; e.g., mean=-1, variance=0.5.', 'duration': 27.335, 'max_score': 1705.918, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1705918.jpg'}], 'start': 1334.591, 'title': 'Q-learning and policy gradient methods', 'summary': 'Discusses the challenges of q-learning for continuous action spaces and stochastic environments, emphasizing its limitations and introduces policy gradient methods as a potential solution. 
it also covers the advantages of directly modeling the policy instead of the q function, highlighting the flexibility of parameterizing probability distributions and continuous action spaces for policy gradient networks.', 'chapters': [{'end': 1502.659, 'start': 1334.591, 'title': 'Challenges with q-learning for continuous action spaces', 'summary': 'Discusses the limitations of q-learning, emphasizing its unsuitability for continuous action spaces and stochastic environments, and introduces policy gradient methods as a potential solution.', 'duration': 168.068, 'highlights': ['Q-learning is not amenable to continuous action spaces', 'Policy of Q-learning is deterministic', 'Introduction of policy gradient methods']}, {'end': 1778.362, 'start': 1502.679, 'title': 'Direct policy modeling and continuous action spaces', 'summary': 'Discusses the advantages of directly modeling the policy instead of the q function, highlighting the flexibility of parameterizing probability distributions and continuous action spaces for policy gradient networks.', 'duration': 275.683, 'highlights': ['The probability distribution defines the policy, allowing actions to be taken by sampling from it.', 'Directly modeling the policy provides flexibility in dealing with continuous action spaces, enabling parameterization of probability distributions for varied actions.', 'Parameterizing the output of a Gaussian distribution with a mean and variance allows for the prediction and sampling of continuous actions.']}], 'duration': 443.771, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1334591.jpg', 'highlights': ['Introduction of policy gradient methods', 'Directly modeling the policy provides flexibility in dealing with continuous action spaces, enabling parameterization of probability distributions for varied actions', 'Parameterizing the output of a Gaussian distribution with a mean and variance allows for the prediction and sampling of continuous actions', 'The probability distribution defines the policy, allowing actions to be taken by sampling from it', 'Q-learning is not amenable to continuous action spaces', 'Policy of Q-learning is deterministic']}, {'end': 2131.936, 'segs': [{'end': 1809.374, 'src': 'embed', 'start': 1780.262, 'weight': 1, 'content': [{'end': 1782.925, 'text': "Okay, so that's a lot of material.", 'start': 1780.262, 'duration': 2.663}, {'end': 1786.789, 'text': "So let's cover how policy gradients works in a concrete example now.", 'start': 1782.985, 'duration': 3.804}, {'end': 1788.451, 'text': "So let's walk through it.", 'start': 1787.59, 'duration': 0.861}, {'end': 1793.456, 'text': "And let's first start by going back to the original reinforcement learning loop.", 'start': 1789.012, 'duration': 4.444}, {'end': 1800.024, 'text': 'We have the agent, the environment, agent sends actions to the environment, environment sends observations back to the agent.', 'start': 1793.797, 'duration': 6.227}, {'end': 1809.374, 'text': "Let's think about how we could use this paradigm combined with policy gradients to train like a very, I guess, intuitive example.", 'start': 1801.189, 'duration': 8.185}], 'summary': 'Covering policy gradients in reinforcement learning loop with concrete example.', 'duration': 29.112, 'max_score': 1780.262, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1780262.jpg'}, {'end': 1856.191, 'src': 'embed', 'start': 1820.881, 'weight': 2, 'content': [{'end': 
1825.404, 'text': "The action it could take, let's say simple, it's just the steering wheel angle that it should execute at that time.", 'start': 1820.881, 'duration': 4.523}, {'end': 1826.825, 'text': 'This is a continuous variable.', 'start': 1825.504, 'duration': 1.321}, {'end': 1829.547, 'text': 'It can take any of the angles within some bounded set.', 'start': 1827.145, 'duration': 2.402}, {'end': 1834.524, 'text': "And finally, the reward, let's say, is the distance traveled before we crash.", 'start': 1831.443, 'duration': 3.081}, {'end': 1837.005, 'text': 'OK? Great.', 'start': 1835.944, 'duration': 1.061}, {'end': 1844.567, 'text': 'So the training algorithm for policy gradients is a little bit different than the training algorithm for Q function or Q deep neural networks.', 'start': 1838.325, 'duration': 6.242}, {'end': 1847.648, 'text': "So let's go through it step by step in this example.", 'start': 1844.867, 'duration': 2.781}, {'end': 1852.65, 'text': "So to train our self-driving car, what we're going to do is first initialize the agent.", 'start': 1847.708, 'duration': 4.942}, {'end': 1854.15, 'text': 'The agent is the self-driving car.', 'start': 1852.81, 'duration': 1.34}, {'end': 1856.191, 'text': "We're going to start the agent in the center of the road.", 'start': 1854.39, 'duration': 1.801}], 'summary': 'Training a self-driving car involves initializing the agent and rewarding based on distance traveled before a crash.', 'duration': 35.31, 'max_score': 1820.881, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1820881.jpg'}, {'end': 1961.652, 'src': 'embed', 'start': 1933.93, 'weight': 0, 'content': [{'end': 1937.391, 'text': 'increase the probability of actions farther from the crash.', 'start': 1933.93, 'duration': 3.461}, {'end': 1942.412, 'text': 'and just keep repeating this over and over until you see that the agent starts to perform better and better,', 'start': 1937.391, 'duration': 5.021}, {'end': 1945.253, 'text': 'drive farther and farther and accumulate more and more reward.', 'start': 1942.412, 'duration': 2.841}, {'end': 1949.254, 'text': 'Until eventually, it starts to follow the lanes without crashing.', 'start': 1946.473, 'duration': 2.781}, {'end': 1955.428, 'text': 'Now this is really awesome because we never taught anything about what are lane markers.', 'start': 1950.885, 'duration': 4.543}, {'end': 1956.909, 'text': "It's just seeing images of the road.", 'start': 1955.528, 'duration': 1.381}, {'end': 1959.851, 'text': 'We never taught anything about how to avoid crashes.', 'start': 1957.329, 'duration': 2.522}, {'end': 1961.652, 'text': 'It just learned this from sparse rewards.', 'start': 1959.971, 'duration': 1.681}], 'summary': "Repetition of actions increases agent's performance, leading to longer drives and more rewards, eventually achieving lane following without crashes.", 'duration': 27.722, 'max_score': 1933.93, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1933930.jpg'}, {'end': 2006.117, 'src': 'embed', 'start': 1976.246, 'weight': 3, 'content': [{'end': 1979.568, 'text': 'I think everything else, conceptually at least, is pretty clear, I hope.', 'start': 1976.246, 'duration': 3.322}, {'end': 1983.83, 'text': 'The question is how do we improve our policy over time?', 'start': 1980.068, 'duration': 3.762}, {'end': 1987.552, 'text': 'So to do that,', 'start': 1986.652, 'duration': 0.9}, {'end': 1993.636, 'text': "let's 
first look at the loss function for training policy gradients and then we'll dissect it to understand a little bit why this works.", 'start': 1987.552, 'duration': 6.084}, {'end': 1996.397, 'text': 'The loss consists of two terms.', 'start': 1994.776, 'duration': 1.621}, {'end': 2001.52, 'text': 'The first term is the log likelihood of selecting the action given the state that you were in.', 'start': 1996.597, 'duration': 4.923}, {'end': 2006.117, 'text': 'So this really tells us how likely was this action that you selected.', 'start': 2002.756, 'duration': 3.361}], 'summary': 'Improving policy over time by analyzing loss function for training policy gradients.', 'duration': 29.871, 'max_score': 1976.246, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1976246.jpg'}], 'start': 1780.262, 'title': 'Policy gradient learning', 'summary': 'Outlines policy gradient learning for self-driving car, involving agent-environment interaction, sensory information, unique training algorithm, penalizing actions leading to crashes, increasing probability of high-reward actions, and iterative training for better performance.', 'chapters': [{'end': 1856.191, 'start': 1780.262, 'title': 'Policy gradients for self-driving car', 'summary': 'Explores the application of policy gradients in training a self-driving car, explaining the agent-environment interaction, the use of sensory information for state, steering wheel angle as the action, and distance traveled as the reward, and the unique training algorithm for policy gradients.', 'duration': 75.929, 'highlights': ['The agent-environment interaction in training a self-driving car involves the agent sending actions to the environment and receiving observations back, with the agent being the vehicle, its state being sensory information, and the action being the steering wheel angle, which is a continuous variable within a bounded set.', "The reward for training the self-driving car is defined as the distance traveled before a crash, providing a quantifiable measure of the car's performance in the training process.", 'The unique training algorithm for policy gradients differs from the algorithm for Q function or Q deep neural networks, emphasizing the distinct approach required for training a self-driving car using policy gradients.']}, {'end': 2131.936, 'start': 1856.951, 'title': 'Policy gradient learning', 'summary': 'Outlines policy gradient learning, which involves penalizing actions leading to crashes, increasing the probability of actions with higher rewards, and iteratively training the agent until it performs better and accumulates more reward, without explicit teaching of lane-following or crash avoidance.', 'duration': 274.985, 'highlights': ['The agent iteratively learns by penalizing actions taken right before crashes and increasing the probability of actions leading to higher rewards, resulting in improved performance and reward accumulation.', "The loss function for training policy gradients consists of two terms - log likelihood of action selection given the state and the total discounted return, aiming to maximize the agent's likelihood of selecting actions with high rewards.", 'The method utilizes the policy gradient, taking the gradient of the policy function scaled by the return, to train the neural network and improve the policy over time.']}], 'duration': 351.674, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w1780262.jpg', 
'highlights': ['The agent iteratively learns by penalizing actions before crashes and increasing high-reward actions.', 'The agent-environment interaction involves sending actions and receiving observations.', 'The unique training algorithm for policy gradients differs from Q function or Q deep neural networks.', 'The loss function for training policy gradients consists of log likelihood of action selection and total discounted return.', 'The reward for training the self-driving car is defined as the distance traveled before a crash.']}, {'end': 2612.317, 'segs': [{'end': 2249.369, 'src': 'embed', 'start': 2215.516, 'weight': 0, 'content': [{'end': 2220.218, 'text': 'One really cool result that we created in my lab also with some of the TAs,', 'start': 2215.516, 'duration': 4.702}, {'end': 2229.601, 'text': 'so you can ask them if you have any questions has been developing a brand new type of photorealistic simulation engine for self-driving cars that is entirely data-driven.', 'start': 2220.218, 'duration': 9.383}, {'end': 2240.686, 'text': 'So the simulator we created was called Vista, and it allows us to use real data of the world to simulate virtual agents.', 'start': 2231.004, 'duration': 9.682}, {'end': 2249.369, 'text': 'So these are virtual reinforcement learning agents that can travel within these synthesized environments and the results are incredibly photorealistic,', 'start': 2240.706, 'duration': 8.663}], 'summary': 'Developed a new photorealistic simulation engine for self-driving cars called vista, using real world data to simulate virtual agents.', 'duration': 33.853, 'max_score': 2215.516, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w2215516.jpg'}, {'end': 2310.217, 'src': 'embed', 'start': 2280.441, 'weight': 1, 'content': [{'end': 2285.323, 'text': "On the left-hand side, you can actually see us sitting in the vehicle, but it's completely autonomous.", 'start': 2280.441, 'duration': 4.882}, {'end': 2290.565, 'text': "It's executing a policy that was trained using reinforcement learning entirely within the simulation engine.", 'start': 2285.403, 'duration': 5.162}, {'end': 2300.249, 'text': 'And this actually represented the first time ever a full-scale autonomous vehicle was trained using only reinforcement learning and able to successfully be deployed in the real world.', 'start': 2291.225, 'duration': 9.024}, {'end': 2302.63, 'text': 'So this was a really awesome result that we had.', 'start': 2300.369, 'duration': 2.261}, {'end': 2310.217, 'text': "And now we've covered some fundamentals of policy learning, also value learning with Q functions.", 'start': 2304.833, 'duration': 5.384}], 'summary': 'First full-scale autonomous vehicle trained using reinforcement learning deployed successfully in the real world.', 'duration': 29.776, 'max_score': 2280.441, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w2280441.jpg'}, {'end': 2399.7, 'src': 'embed', 'start': 2376.466, 'weight': 3, 'content': [{'end': 2385.211, 'text': 'So the objective of our AI is to learn this incredibly complex state space and learn how to not only beat other autonomous agents,', 'start': 2376.466, 'duration': 8.745}, {'end': 2390.215, 'text': 'but learn how to beat the existing gold standard human professional Go players.', 'start': 2385.211, 'duration': 5.004}, {'end': 2399.7, 'text': 'Now Google DeepMind rose to the challenge a couple years ago and developed a reinforcement learning 
pipeline which defeated champion players.', 'start': 2392.315, 'duration': 7.385}], 'summary': "Ai aims to surpass human go players, achieved by google deepmind's reinforcement learning pipeline.", 'duration': 23.234, 'max_score': 2376.466, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w2376466.jpg'}, {'end': 2449.349, 'src': 'embed', 'start': 2422.915, 'weight': 4, 'content': [{'end': 2430.478, 'text': 'They then used these pre-trained networks, trained from the expert Go players, to play against their own reinforcement learning agents.', 'start': 2422.915, 'duration': 7.563}, {'end': 2433.098, 'text': 'And the reinforcement learning policy network,', 'start': 2430.918, 'duration': 2.18}, {'end': 2440.841, 'text': 'which allowed the policy to go beyond what was imitating the humans and actually go beyond human level capabilities to go superhuman.', 'start': 2433.098, 'duration': 7.743}, {'end': 2449.349, 'text': 'The other trick that they had here, which really made all of this possible, was the usage of an auxiliary neural network,', 'start': 2442.145, 'duration': 7.204}], 'summary': 'Pre-trained networks from expert go players surpassed human level capabilities using reinforcement learning and auxiliary neural network.', 'duration': 26.434, 'max_score': 2422.915, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w2422915.jpg'}, {'end': 2529.486, 'src': 'embed', 'start': 2504.432, 'weight': 2, 'content': [{'end': 2514.08, 'text': 'So the previous example I showed used a buildup of expert data to imitate, and that was what started the foundation of the algorithm.', 'start': 2504.432, 'duration': 9.648}, {'end': 2516.702, 'text': 'Now in AlphaZero, they start from zero.', 'start': 2514.56, 'duration': 2.142}, {'end': 2520.045, 'text': 'They start from scratch and use entirely self-play from the beginning.', 'start': 2516.742, 'duration': 3.303}, {'end': 2529.486, 'text': "In these examples, they showed examples on Let's see, it was chess, Go, many other games as well.", 'start': 2521.106, 'duration': 8.38}], 'summary': 'Alphazero uses entirely self-play to learn various games including chess and go.', 'duration': 25.054, 'max_score': 2504.432, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w2504432.jpg'}], 'start': 2134.434, 'title': 'Reinforcement learning applications', 'summary': 'Covers the challenges of applying reinforcement learning in real life, proposing a photorealistic simulation engine called vista for training, and discusses the success in deploying a fully autonomous vehicle. 
additionally, it explores the use of deep reinforcement learning in training an ai agent to play go, achieving superhuman capabilities, and the development of alphazero.', 'chapters': [{'end': 2300.249, 'start': 2134.434, 'title': 'Reinforcement learning in real life', 'summary': 'Discusses the limitations of deploying reinforcement learning in real life, particularly in safety-critical domains like self-driving cars, and presents a solution involving the development of a photorealistic simulation engine called vista, enabling training of reinforcement learning agents entirely in simulation before deploying them in the real world, resulting in the successful deployment of a fully autonomous vehicle trained using reinforcement learning.', 'duration': 165.815, 'highlights': ['The development of a photorealistic simulation engine called Vista enabled training reinforcement learning agents entirely in simulation before deploying them in the real world.', 'The full-scale autonomous vehicle was trained using reinforcement learning entirely within the simulation engine, representing the first time ever a full-scale autonomous vehicle was trained using only reinforcement learning and successfully deployed in the real world.', 'The photorealistic simulation engine, Vista, allows the use of real data of the world to simulate virtual reinforcement learning agents, resulting in incredibly photorealistic results.']}, {'end': 2612.317, 'start': 2300.369, 'title': 'Deep reinforcement learning in go', 'summary': 'Discusses the application of deep reinforcement learning in training an ai agent to play go, achieving superhuman level capabilities, using expert data to imitate human strategies and optimizing networks entirely from scratch, leading to the development of alphazero.', 'duration': 311.948, 'highlights': ['Google DeepMind developed a reinforcement learning pipeline which defeated champion players in the game of Go.', 'The AI was trained using expert human data to imitate the moves of expert Go players and then go beyond human level capabilities to achieve superhuman performance.', 'AlphaZero used self-play from scratch to optimize networks for games like chess and Go without the need for pre-training with human experts.']}], 'duration': 477.883, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/nZfaHIxDD5w/pics/nZfaHIxDD5w2134434.jpg', 'highlights': ['The photorealistic simulation engine, Vista, allows the use of real data of the world to simulate virtual reinforcement learning agents, resulting in incredibly photorealistic results.', 'The full-scale autonomous vehicle was trained using reinforcement learning entirely within the simulation engine, representing the first time ever a full-scale autonomous vehicle was trained using only reinforcement learning and successfully deployed in the real world.', 'AlphaZero used self-play from scratch to optimize networks for games like chess and Go without the need for pre-training with human experts.', 'Google DeepMind developed a reinforcement learning pipeline which defeated champion players in the game of Go.', 'The AI was trained using expert human data to imitate the moves of expert Go players and then go beyond human level capabilities to achieve superhuman performance.', 'The development of a photorealistic simulation engine called Vista enabled training reinforcement learning agents entirely in simulation before deploying them in the real world.']}], 'highlights': ['Deep Q Networks achieved superhuman performance on over 
50% of analyzed Atari games, using raw input state as pixels for learning the Q function', 'The full-scale autonomous vehicle was trained using reinforcement learning entirely within the simulation engine, representing the first time ever a full-scale autonomous vehicle was trained using only reinforcement learning and successfully deployed in the real world', 'The AI was trained using expert human data to imitate the moves of expert Go players and then go beyond human level capabilities to achieve superhuman performance', 'The shift to deep reinforcement learning allows models to act in dynamic environments', 'The implications of deep reinforcement learning are significant in fields like robotics', 'The potential applications of deep reinforcement learning include creating scenarios where robots interact with humans', 'Understanding the vocabulary specific to reinforcement learning is crucial for building up to more complicated topics', "The agent's goal is to maximize its own reward in the environment, which may involve receiving delayed rewards and evaluating the total reward as the summation of individual rewards in time", 'The Q function is defined as a function that takes the current state and action of the agent as input and returns the expected total discounted return, guiding the agent to take actions that maximize the Q value', 'The concept of using the Q function to choose actions in the environment is explained, where the agent evaluates the Q function for possible actions and selects the action with the highest Q value to execute', 'The agent is the central part of the reinforcement learning algorithm, acting as the neural network and taking actions, such as controlling a drone for delivery', 'The environment is where the agent operates, receiving commands in the form of actions and providing feedback through observations, which could be visual, auditory, etc', 'The photorealistic simulation engine, Vista, allows the use of real data of the world to simulate virtual reinforcement learning agents, resulting in incredibly photorealistic results', 'AlphaZero used self-play from scratch to optimize networks for games like chess and Go without the need for pre-training with human experts', 'Google DeepMind developed a reinforcement learning pipeline which defeated champion players in the game of Go', 'The predicted Q value is obtained from the network output, while the target Q value is composed of the reward obtained at the current time plus the maximum action value at every future time, and the network is trained end-to-end by minimizing the squared difference between the target and predicted Q values', 'Directly modeling the policy provides flexibility in dealing with continuous action spaces, enabling parameterization of probability distributions for varied actions', 'The probability distribution defines the policy, allowing actions to be taken by sampling from it', 'The agent iteratively learns by penalizing actions before crashes and increasing high-reward actions', 'The agent-environment interaction involves sending actions and receiving observations', 'The unique training algorithm for policy gradients differs from Q function or Q deep neural networks', 'The loss function for training policy gradients consists of log likelihood of action selection and total discounted return', 'The reward for training the self-driving car is defined as the distance traveled before a crash', 'Reinforcement learning focuses on maximizing rewards by taking actions based on observations', 'The 
transition from supervised and unsupervised learning to reinforcement learning is emphasized', 'Introduction of policy gradient methods', 'The argmax of the Q values is used to determine the action that maximizes the total return, with examples of expected return values like 20 for going left, 3 for staying, and 0 for going right', 'The network takes as input a state and action representation, returning the Q value for that state-action pair, reducing the need to run the network multiple times for different actions', 'The neural network in Atari outputs Q values for each possible action, such as left, right, or staying, which are used to predict the expected returns for each action', 'Training the network involves maximizing the target return over an infinite time horizon, using the target Q value obtained from rolling out the agent and the predicted Q value from the network to calculate the mean squared error loss function', 'The development of a photorealistic simulation engine called Vista enabled training reinforcement learning agents entirely in simulation before deploying them in the real world']}
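
The definitions covered around 4:59-13:18 can be written compactly in standard reinforcement-learning notation (the usual textbook form, not copied from the slides): the total discounted return from time t, the Q function as its expected value for a given state-action pair, and the greedy policy that picks the highest-Q action.

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \qquad 0 < \gamma < 1

Q(s_t, a_t) = \mathbb{E}\big[ R_t \mid s_t, a_t \big]

\pi^*(s) = \arg\max_{a} Q(s, a)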
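
The Deep Q Network training described around 17:17-21:44 (the network outputs one Q value per discrete action and is trained by regressing the predicted Q value toward the reward plus the discounted best next-step Q value, using a mean squared error loss) can be sketched as below. This is a minimal, illustrative PyTorch sketch with assumed state shapes, layer sizes, and hypothetical names (QNetwork, q_learning_loss); the lecture's Atari example uses convolutional layers over raw pixels rather than the small fully connected network shown here.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q value per discrete action (e.g. left, stay, right)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def q_learning_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.95):
    """Mean squared error between the predicted Q value and the target
    r + gamma * max_a' Q(s', a').  Practical DQN implementations usually
    evaluate the target with a separate, slowly updated copy of the network."""
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * q_next
    return nn.functional.mse_loss(q_pred, target)

Acting with the trained network is then just the greedy policy from the formulas above: action = q_net(state).argmax(dim=-1).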
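
For the continuous-action setting discussed around 27:36-30:11, where the policy network outputs the mean and variance of a Gaussian over an action such as steering-wheel angle and the action is obtained by sampling, a minimal sketch (again with assumed layer sizes and an illustrative GaussianPolicy name) could be:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs a Normal distribution over a continuous action (e.g. steering angle)."""
    def __init__(self, state_dim, action_dim=1):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, action_dim)
        self.log_std_head = nn.Linear(64, action_dim)

    def forward(self, state):
        h = self.body(state)
        mean = self.mean_head(h)
        std = self.log_std_head(h).exp()  # exponentiate so the standard deviation stays positive
        return torch.distributions.Normal(mean, std)

Sampling an action is then dist = policy(state); action = dist.sample(), and dist.log_prob(action) gives the log likelihood used by the policy gradient loss sketched next. A discrete-action policy would instead put a softmax over the finite action set and sample from the resulting categorical distribution.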
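
Finally, the policy gradient procedure described around 30:11-36:04 runs an episode, computes the total discounted return at every step, and then increases the log likelihood of actions that were followed by high return while decreasing it for actions followed by low return (in the driving example, the ones taken just before a crash). A rough sketch of that loss, with illustrative helper names and none of the practical refinements such as baselines:

import torch

def discounted_returns(rewards, gamma=0.95):
    """Total discounted return R_t for every step of one rollout (list of per-step rewards)."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return torch.tensor(returns)

def policy_gradient_loss(log_probs, returns):
    """loss = -sum_t log pi(a_t | s_t) * R_t: minimizing it raises the probability
    of actions associated with high return and lowers it for low-return actions."""
    log_probs = torch.stack([lp.sum() for lp in log_probs])  # one scalar log prob per step
    return -(log_probs * returns).sum()

One gradient step on this loss, repeated over many rollouts, is the update loop the lecture walks through for the simulated self-driving car.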