title

Lecture 14 | Deep Reinforcement Learning

description

In Lecture 14 we move from supervised learning to reinforcement learning (RL), in which an agent must learn to interact with an environment in order to maximize its reward. We formalize reinforcement learning using the language of Markov Decision Processes (MDPs), policies, value functions, and Q-Value functions. We discuss different algorithms for reinforcement learning including Q-Learning, policy gradients, and Actor-Critic. We show how deep reinforcement learning has been used to play Atari games and to achieve super-human Go performance in AlphaGo.
Keywords: Reinforcement learning, RL, Markov decision process, MDP, Q-Learning, policy gradients, REINFORCE, actor-critic, Atari games, AlphaGo
Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf
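As a rough illustration of the MDP and Q-value ideas listed above, here is a minimal sketch of tabular Q-value iteration on a small grid world, where the Bellman equation is applied as an iterative update. It is not code from the lecture. The 4x4 grid, the two corner terminal cells, the deterministic moves, the reward of -1 per step, and the names q_value_iteration and state_value are all assumptions made for this example.

```python
def q_value_iteration(n=4, gamma=1.0, step_reward=-1.0, max_iters=1000):
    """Tabular Q-value iteration on an n x n grid world (illustrative sketch)."""
    terminals = {(0, 0), (n - 1, n - 1)}          # assumed terminal cells
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
    states = [(r, c) for r in range(n) for c in range(n)]
    q = {(s, a): 0.0 for s in states for a in range(len(moves))}

    for _ in range(max_iters):
        q_new = {}
        for (r, c) in states:
            for a, (dr, dc) in enumerate(moves):
                if (r, c) in terminals:
                    # No further reward once the episode has ended.
                    q_new[((r, c), a)] = 0.0
                    continue
                # Assumed dynamics: moving off the grid leaves the agent in place.
                nxt = (min(max(r + dr, 0), n - 1), min(max(c + dc, 0), n - 1))
                # Bellman backup: reward plus discounted best Q-value at the next state.
                best_next = max(q[(nxt, b)] for b in range(len(moves)))
                q_new[((r, c), a)] = step_reward + gamma * best_next
        if q_new == q:  # values are exact here, so equality marks convergence
            break
        q = q_new
    return q


def state_value(q, state, n_actions=4):
    """Value of a state under the greedy policy: the best Q-value over actions."""
    return max(q[(state, a)] for a in range(n_actions))


q_star = q_value_iteration()
print(state_value(q_star, (0, 1)))  # cell next to a terminal state
```

Under these assumptions the value of each cell is the negative of its distance in steps to the nearest terminal cell. The table holds one entry per state-action pair, which is exactly what becomes infeasible for pixel-based states such as Atari frames.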
--------------------------------------------------------------------------------------
Convolutional Neural Networks for Visual Recognition
Instructors:
Fei-Fei Li: http://vision.stanford.edu/feifeili/
Justin Johnson: http://cs.stanford.edu/people/jcjohns/
Serena Yeung: http://ai.stanford.edu/~syyeung/
Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This lecture collection is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. From this lecture collection, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision.
Website:
http://cs231n.stanford.edu/
For additional learning opportunities please visit:
http://online.stanford.edu/

detail

{'title': 'Lecture 14 | Deep Reinforcement Learning', 'heatmap': [{'end': 935.519, 'start': 883.283, 'weight': 1}], 'summary': "Covers stanford's reinforcement learning update, introduction to reinforcement learning and classic problems, mdp and optimal policy, deep q-learning and experience replay, methods like policy gradients, actor-critic model, recurrent attention model, and alphago's success using policy gradients and reinforcement learning.", 'chapters': [{'end': 52.123, 'segs': [{'end': 52.123, 'src': 'embed', 'start': 4.918, 'weight': 0, 'content': [{'end': 6.039, 'text': 'Stanford University.', 'start': 4.918, 'duration': 1.121}, {'end': 11.262, 'text': "Okay, let's get started.", 'start': 10.241, 'duration': 1.021}, {'end': 16.285, 'text': 'All right, so welcome to lecture 14.', 'start': 13.403, 'duration': 2.882}, {'end': 18.627, 'text': "And today we'll be talking about reinforcement learning.", 'start': 16.285, 'duration': 2.342}, {'end': 23.01, 'text': 'So some administrative details first.', 'start': 21.248, 'duration': 1.762}, {'end': 26.552, 'text': 'Update on grades, midterm grades were released last night.', 'start': 23.99, 'duration': 2.562}, {'end': 29.954, 'text': 'So see Piazza for more information and statistics about that.', 'start': 27.012, 'duration': 2.942}, {'end': 34.709, 'text': 'And we also have A2 and milestone grades scheduled for later this week.', 'start': 30.686, 'duration': 4.023}, {'end': 40.654, 'text': 'Also about your projects, all teams must register your projects.', 'start': 37.211, 'duration': 3.443}, {'end': 45.178, 'text': 'So, on Piazza, we have a form posted, so you should go there, and this is required.', 'start': 41.114, 'duration': 4.064}, {'end': 52.123, 'text': "every team should go and fill out this form with information about your project that we'll use for final grading and the poster session.", 'start': 45.178, 'duration': 6.945}], 'summary': 'Stanford university lecture 14 on reinforcement learning, 
midterm grades released, a2 and milestone grades scheduled, project registration required on piazza.', 'duration': 47.205, 'max_score': 4.918, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE4918.jpg'}], 'start': 4.918, 'title': "Stanford's reinforcement learning update", 'summary': 'Provides an update on grades, including the release of midterm grades, the schedule for a2 and milestone grades, and the requirement for all teams to register their projects via a form on piazza for final grading and the poster session at stanford.', 'chapters': [{'end': 52.123, 'start': 4.918, 'title': 'Reinforcement learning update at stanford', 'summary': 'Covers an update on grades, including the release of midterm grades and the schedule for a2 and milestone grades, along with the requirement for all teams to register their projects via a form on piazza for final grading and the poster session.', 'duration': 47.205, 'highlights': ['Midterm grades were released last night, with further information and statistics available on Piazza.', 'A2 and milestone grades are scheduled for later this week.', 'All teams are required to register their projects via a form on Piazza for final grading and the poster session.']}], 'duration': 47.205, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE4918.jpg', 'highlights': ['Midterm grades released with statistics on Piazza.', 'A2 and milestone grades scheduled for later this week.', 'Teams must register projects on Piazza for final grading.']}, {'end': 343.87, 'segs': [{'end': 172.653, 'src': 'embed', 'start': 145.343, 'weight': 0, 'content': [{'end': 150.625, 'text': "and then we'll talk about two major classes of RL algorithms Q-learning and policy gradients.", 'start': 145.343, 'duration': 5.282}, {'end': 158.129, 'text': 'So in the reinforcement learning setup, what we have is we have an agent and we have an environment.', 'start': 
153.347, 'duration': 4.782}, {'end': 161.671, 'text': 'And so the environment gives the agent a state.', 'start': 159.35, 'duration': 2.321}, {'end': 172.653, 'text': 'In turn, the agent is going to take an action and then the environment is going to give back a reward as well as this next state.', 'start': 163.686, 'duration': 8.967}], 'summary': 'Reinforcement learning involves q-learning and policy gradients, with an agent interacting with an environment to receive states and rewards.', 'duration': 27.31, 'max_score': 145.343, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE145343.jpg'}, {'end': 305.784, 'src': 'embed', 'start': 278.874, 'weight': 1, 'content': [{'end': 284.256, 'text': 'So for example here we have Atari games which are a classic success of deep reinforcement learning.', 'start': 278.874, 'duration': 5.382}, {'end': 288.938, 'text': 'And so here the objective is to complete these games with the highest possible score.', 'start': 284.796, 'duration': 4.142}, {'end': 292.159, 'text': "So your agent is basically a player that's trying to play these games.", 'start': 288.958, 'duration': 3.201}, {'end': 297.501, 'text': 'And the state that you have is going to be the raw pixels of the game state.', 'start': 293.179, 'duration': 4.322}, {'end': 302.723, 'text': "So these are just the pixels on the screen that you would see as you're playing the game.", 'start': 298.101, 'duration': 4.622}, {'end': 305.784, 'text': 'And then the actions that you have are your game controls.', 'start': 303.583, 'duration': 2.201}], 'summary': "Atari games showcase deep reinforcement learning's success in achieving high scores through game completion.", 'duration': 26.91, 'max_score': 278.874, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE278874.jpg'}, {'end': 350.476, 'src': 'embed', 'start': 322, 'weight': 2, 'content': [{'end': 324.641, 'text': 'And 
finally here we have another example of a game here.', 'start': 322, 'duration': 2.641}, {'end': 335.267, 'text': "It's Go, which is something that was a huge achievement of deep reinforcement learning last year, when DeepMind's AlphaGo beat Lee Sedol,", 'start': 324.762, 'duration': 10.505}, {'end': 343.87, 'text': 'which is one of the best Go players of the last few years, and this is actually in the news again, as some of you may have seen.', 'start': 335.267, 'duration': 8.603}, {'end': 350.476, 'text': "There's another Go competition going on now with AlphaGo versus a top ranked Go player.", 'start': 344.211, 'duration': 6.265}], 'summary': "Deepmind's alphago beat top go player lee sedol, sparking ongoing competitions.", 'duration': 28.476, 'max_score': 322, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE322000.jpg'}], 'start': 53.745, 'title': 'Reinforcement learning', 'summary': 'Introduces reinforcement learning, discussing its problem setup, topics of markov decision processes, q-learning, and policy gradients. 
it also explores classic reinforcement learning problems like cart-pole balancing, robot locomotion, atari games, and the game of go, with key objectives, states, actions, rewards, and examples of successful applications.', 'chapters': [{'end': 172.653, 'start': 53.745, 'title': 'Reinforcement learning overview', 'summary': "Introduces reinforcement learning, discussing its problem setup, including the agent's actions in the environment and its goal to maximize reward, as well as outlining the topics of markov decision processes, q-learning, and policy gradients.", 'duration': 118.908, 'highlights': ['The reinforcement learning problem involves an agent taking actions in an environment to maximize its reward.', 'The chapter outlines the topics of Markov decision processes, Q-learning, and policy gradients as major classes of RL algorithms.', 'The environment gives the agent a state, and in turn, the agent takes an action, receiving a reward and the next state in return.', 'The Tiny ImageNet evaluation server is now online for the Tiny ImageNet challenge, and a course survey on Piazza is available for feedback.', 'The chapter will discuss supervised learning, unsupervised learning, and reinforcement learning as different types of problem setups.']}, {'end': 343.87, 'start': 173.074, 'title': 'Reinforcement learning applications', 'summary': 'Discusses classic reinforcement learning problems such as cart-pole balancing, robot locomotion, atari games, and the game of go, highlighting key objectives, states, actions, rewards, and examples of successful applications.', 'duration': 170.796, 'highlights': ['Atari games as a classic success of deep reinforcement learning The objective is to complete games with the highest possible score, with the state being the raw pixels of the game and the actions being game controls, leading to a score increase or decrease at each time step.', "DeepMind's AlphaGo beating Lee Sedol in Go AlphaGo's victory over Lee Sedol, one of the best 
Go players, was a significant achievement of deep reinforcement learning, garnering attention in the news.", 'Cart-pole problem as a classic RL problem The objective is to balance a pole on top of a movable cart, with the reward of one at each time step if the pole is upright, showcasing a fundamental RL challenge.', 'Robot locomotion and the actions involving torques applied to joints The objective is to make the robot move forward, with the state describing the angle and positions of all the joints, and the actions involving the application of torques onto these joints.']}], 'duration': 290.125, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE53745.jpg', 'highlights': ['The chapter outlines the topics of Markov decision processes, Q-learning, and policy gradients as major classes of RL algorithms.', 'Atari games as a classic success of deep reinforcement learning The objective is to complete games with the highest possible score, with the state being the raw pixels of the game and the actions being game controls, leading to a score increase or decrease at each time step.', "DeepMind's AlphaGo beating Lee Sedol in Go AlphaGo's victory over Lee Sedol, one of the best Go players, was a significant achievement of deep reinforcement learning, garnering attention in the news.", 'The reinforcement learning problem involves an agent taking actions in an environment to maximize its reward.', 'The environment gives the agent a state, and in turn, the agent takes an action, receiving a reward and the next state in return.']}, {'end': 926.02, 'segs': [{'end': 422.889, 'src': 'embed', 'start': 396.621, 'weight': 0, 'content': [{'end': 403.043, 'text': 'And an MDP here is defined by a tuple of objects consisting of S, which is the set of possible states.', 'start': 396.621, 'duration': 6.422}, {'end': 405.924, 'text': 'We have A, our set of possible actions.', 'start': 403.523, 'duration': 2.401}, {'end': 411.686, 
'text': 'We also have R, our distribution of our reward given a state action pair.', 'start': 406.404, 'duration': 5.282}, {'end': 415.207, 'text': "So it's a function mapping from state action to your reward.", 'start': 412.386, 'duration': 2.821}, {'end': 422.889, 'text': "You also have P, which is a transition probability distribution over your next state that you're gonna transition to given your state action pair.", 'start': 415.647, 'duration': 7.242}], 'summary': 'Mdp defined by s, a, r, and p: s= set of states, a= set of actions, r= reward distribution, p= transition probability', 'duration': 26.268, 'max_score': 396.621, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE396621.jpg'}, {'end': 524.837, 'src': 'embed', 'start': 500.492, 'weight': 1, 'content': [{'end': 507.582, 'text': 'And our objective now is going to be to find your optimal policy pi star that maximizes your cumulative discounted rewards.', 'start': 500.492, 'duration': 7.09}, {'end': 514.087, 'text': 'So we can see here we have our sum of future rewards which can be also discounted by your discount factor.', 'start': 507.622, 'duration': 6.465}, {'end': 518.812, 'text': "So let's look at an example of a simple MDP.", 'start': 516.289, 'duration': 2.523}, {'end': 524.837, 'text': 'And here we have grid world which is this task where we have this grid of states.', 'start': 519.693, 'duration': 5.144}], 'summary': 'Find pi star for maximizing cumulative discounted rewards in mdp', 'duration': 24.345, 'max_score': 500.492, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE500492.jpg'}, {'end': 579.35, 'src': 'embed', 'start': 549.751, 'weight': 2, 'content': [{'end': 551.933, 'text': 'And this can be something like r equals negative one.', 'start': 549.751, 'duration': 2.182}, {'end': 559.973, 'text': 'And so your objective is going to be to reach one of the terminal states, which are 
the gray states shown here, in the least number of actions.', 'start': 552.394, 'duration': 7.579}, {'end': 565.778, 'text': "So the longer that you take to reach your terminal state, you're going to keep accumulating these negative rewards.", 'start': 560.653, 'duration': 5.125}, {'end': 575.967, 'text': "Okay, so if you look at a random policy here, a random policy would consist of basically, at any given state or cell that you're in,", 'start': 568, 'duration': 7.967}, {'end': 579.35, 'text': "just sampling randomly which direction that you're gonna move in next.", 'start': 575.967, 'duration': 3.383}], 'summary': 'Reach terminal state in fewest actions to minimize negative rewards.', 'duration': 29.599, 'max_score': 549.751, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE549751.jpg'}, {'end': 697.511, 'src': 'embed', 'start': 652.605, 'weight': 3, 'content': [{'end': 662.19, 'text': 'So formally, we can write our optimal policy pi star as maximizing this expected sum of future rewards over policies pi,', 'start': 652.605, 'duration': 9.585}, {'end': 665.691, 'text': 'where we have our initial state sampled from our state distribution,', 'start': 662.19, 'duration': 3.501}, {'end': 674.476, 'text': 'we have our actions sampled from our policy given the state and then we have our next states sampled from our transition probability distributions.', 'start': 665.691, 'duration': 8.785}, {'end': 684.362, 'text': "Okay, so, before we talk about exactly how we're going to find this policy, let's first talk about a few definitions", 'start': 677.237, 'duration': 7.125}, {'end': 686.484, 'text': 'that are going to be helpful for us in doing so.', 'start': 684.362, 'duration': 2.122}, {'end': 691.467, 'text': 'So specifically, the value function and the Q value function.', 'start': 687.484, 'duration': 3.983}, {'end': 697.511, 'text': "So, as we follow the policy, we're going to sample trajectories or paths right 
for every episode.", 'start': 692.107, 'duration': 5.404}], 'summary': 'Optimal policy pi star maximizes expected sum of future rewards over policies pi, involving initial state, actions, and transition probabilities.', 'duration': 44.906, 'max_score': 652.605, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE652605.jpg'}, {'end': 777.875, 'src': 'embed', 'start': 750.263, 'weight': 5, 'content': [{'end': 756.126, 'text': 'So, then, the optimal Q value function that we can get is going to be Q star,', 'start': 750.263, 'duration': 5.863}, {'end': 763.849, 'text': 'which is the maximum expected cumulative reward that we can get from a given state action pair defined here.', 'start': 756.126, 'duration': 7.723}, {'end': 771.612, 'text': "So now we're going to see one important thing in reinforcement learning which is called the Bellman equation.", 'start': 765.61, 'duration': 6.002}, {'end': 777.875, 'text': "So let's consider this Q value function from the optimal policy Q star.", 'start': 772.433, 'duration': 5.442}], 'summary': 'The optimal q value function, q star, represents the maximum expected cumulative reward from a state-action pair.', 'duration': 27.612, 'max_score': 750.263, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE750263.jpg'}, {'end': 877.735, 'src': 'embed', 'start': 850.402, 'weight': 6, 'content': [{'end': 855.464, 'text': "that's following this and just taking the action that's going to lead to best reward.", 'start': 850.402, 'duration': 5.062}, {'end': 861.107, 'text': 'Okay, so how can we solve for this optimal policy?', 'start': 857.826, 'duration': 3.281}, {'end': 866.39, 'text': 'So one way we can solve for this is something called a value iteration algorithm,', 'start': 861.827, 'duration': 4.563}, {'end': 869.651, 'text': "where we're going to use this Bellman equation as an iterative update.", 'start': 866.39, 'duration': 
3.261}, {'end': 877.735, 'text': "So at each step, we're going to refine our approximation of Q star by trying to enforce the Bellman equation.", 'start': 870.211, 'duration': 7.524}], 'summary': 'Solving for optimal policy using value iteration algorithm and bellman equation.', 'duration': 27.333, 'max_score': 850.402, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE850402.jpg'}, {'end': 935.519, 'src': 'embed', 'start': 904.137, 'weight': 7, 'content': [{'end': 908.562, 'text': 'for every state action pair in order to make our iterative updates.', 'start': 904.137, 'duration': 4.425}, {'end': 917.151, 'text': 'But then this is a problem if, for example, if we look at the state of, for example, an Atari game that we had earlier,', 'start': 909.263, 'duration': 7.888}, {'end': 918.953, 'text': "it's going to be your screen of pixels.", 'start': 917.151, 'duration': 1.802}, {'end': 926.02, 'text': "And this is a huge state space and it's basically computationally infeasible to compute this for the entire state space.", 'start': 919.493, 'duration': 6.527}, {'end': 935.519, 'text': "Okay, so what's the solution to this? 
Well, we can use a function approximator to estimate Q of SA.", 'start': 929.096, 'duration': 6.423}], 'summary': 'Using function approximators to estimate q of sa for large state spaces', 'duration': 31.382, 'max_score': 904.137, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE904137.jpg'}], 'start': 344.211, 'title': 'Reinforcement learning basics and finding optimal policy in mdp', 'summary': 'Introduces markov decision process (mdp) and discusses finding the optimal policy, pi star, in mdp by maximizing the expected sum of future rewards using value iteration algorithm, which is not scalable for large state spaces.', 'chapters': [{'end': 605.685, 'start': 344.211, 'title': 'Reinforcement learning basics', 'summary': 'Introduces the concept of markov decision process (mdp) in reinforcement learning, detailing the key elements such as states, actions, rewards, transition probabilities, and discount factor, aiming to find an optimal policy that maximizes cumulative discounted rewards.', 'duration': 261.474, 'highlights': ['Markov Decision Process (MDP) is defined by a tuple of objects consisting of possible states, actions, reward distribution, transition probability distribution, and a discount factor. The MDP is defined by a tuple of objects consisting of S (set of possible states), A (set of possible actions), R (distribution of reward given a state action pair), P (transition probability distribution over next state), and gamma (a discount factor).', 'The objective is to find the optimal policy that maximizes cumulative discounted rewards, aiming to reach the terminal states in the least number of actions to avoid accumulating negative rewards. 
The objective is to find the optimal policy pi star that maximizes cumulative discounted rewards, aiming to reach the terminal states in the least number of actions to avoid accumulating negative rewards.', 'A random policy involves sampling randomly the direction to move in next, while an optimal policy aims to move closest to a terminal state to minimize negative rewards. A random policy involves sampling randomly the direction to move in next, while an optimal policy aims to move closest to a terminal state to minimize negative rewards.']}, {'end': 926.02, 'start': 609.71, 'title': 'Finding optimal policy in mdp', 'summary': 'Discusses finding the optimal policy, pi star, in markov decision process (mdp) by maximizing the expected sum of future rewards, defining the value function, q value function, and bellman equation, and solving for the optimal policy using a value iteration algorithm, which is not scalable for large state spaces.', 'duration': 316.31, 'highlights': ['The optimal policy, pi star, is defined as maximizing the expected sum of future rewards over policies pi, considering randomness in MDP and working with maximizing the expected sum of the rewards. The chapter emphasizes the importance of the optimal policy in maximizing the expected sum of future rewards, considering the randomness in MDP, and working with maximizing the expected sum of the rewards.', 'Definition of value function and Q value function, representing the expected cumulative reward from following the policy from a current state and taking an action in a state. The definition of the value function and Q value function is explained, representing the expected cumulative reward from following the policy from a current state and taking an action in a state.', 'The Bellman equation is introduced, representing the optimal Q value function and specifying the maximum future reward that can be obtained from any action. 
The introduction of the Bellman equation is highlighted, representing the optimal Q value function and specifying the maximum future reward that can be obtained from any action.', 'Explanation of the value iteration algorithm as a method to solve for the optimal policy by refining the approximation of Q star through iterative updates. The explanation of the value iteration algorithm as a method to solve for the optimal policy is provided, emphasizing the iterative refinement of the approximation of Q star.', 'Challenges with scalability in solving for the optimal policy, especially in large state spaces such as in Atari games, due to computational infeasibility. The chapter discusses the challenges with scalability in solving for the optimal policy, particularly in large state spaces like Atari games, due to computational infeasibility.']}], 'duration': 581.809, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE344211.jpg', 'highlights': ['MDP is defined by a tuple of objects consisting of S, A, R, P, and gamma.', 'Objective is to find the optimal policy pi star that maximizes cumulative discounted rewards.', 'Optimal policy aims to move closest to a terminal state to minimize negative rewards.', 'Optimal policy, pi star, maximizes the expected sum of future rewards over policies pi.', 'Definition of value function and Q value function is explained.', 'Bellman equation represents the optimal Q value function and specifies the maximum future reward.', 'Value iteration algorithm is a method to solve for the optimal policy through iterative updates.', 'Challenges with scalability in solving for the optimal policy, especially in large state spaces.']}, {'end': 1659.454, 'segs': [{'end': 978.907, 'src': 'embed', 'start': 949.232, 'weight': 5, 'content': [{'end': 953.794, 'text': "Okay, so this is gonna take us to our formulation of Q-learning that we're going to look at.", 'start': 949.232, 'duration': 4.562}, 
{'end': 962.319, 'text': "And so what we're going to do is we're going to use a function approximator in order to estimate our action value function.", 'start': 954.975, 'duration': 7.344}, {'end': 970.823, 'text': "And if this function approximator is a deep neural network, which is what's been used recently, then this is going to be called deep Q-learning.", 'start': 963.239, 'duration': 7.584}, {'end': 978.907, 'text': "And so this is something that you'll hear around as one of the common approaches to deep reinforcement learning that's in use.", 'start': 971.223, 'duration': 7.684}], 'summary': 'Using deep q-learning with function approximator for estimating action value function.', 'duration': 29.675, 'max_score': 949.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE949232.jpg'}, {'end': 1047.233, 'src': 'embed', 'start': 1020.821, 'weight': 3, 'content': [{'end': 1028.626, 'text': 'or how far Q of SA is from its target, which is the Yi here, the right-hand side of the Bellman equation that we saw earlier.', 'start': 1020.821, 'duration': 7.805}, {'end': 1038.929, 'text': "So we're basically going to take these forward passes of our loss function, trying to minimize this error, and then our backward pass,", 'start': 1030.345, 'duration': 8.584}, {'end': 1041.23, 'text': 'our gradient update, is just going to be.', 'start': 1038.929, 'duration': 2.301}, {'end': 1047.233, 'text': 'we just take the gradient of this loss with respect to our network parameters, theta.', 'start': 1041.23, 'duration': 6.003}], 'summary': 'Minimize error through forward and backward passes to update network parameters.', 'duration': 26.412, 'max_score': 1020.821, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE1020821.jpg'}, {'end': 1100.379, 'src': 'embed', 'start': 1074.085, 'weight': 0, 'content': [{'end': 1078.367, 'text': "And so we're gonna look at this problem 
that we saw earlier of playing Atari games,", 'start': 1074.085, 'duration': 4.282}, {'end': 1082.128, 'text': 'where our objective was to complete the game with the highest score.', 'start': 1078.367, 'duration': 3.761}, {'end': 1089.454, 'text': 'and remember, our state is going to be the raw pixel inputs of the game state and we can take these actions of moving left, right,', 'start': 1082.128, 'duration': 7.326}, {'end': 1091.955, 'text': 'up down or whatever actions of the particular game.', 'start': 1089.454, 'duration': 2.501}, {'end': 1094.456, 'text': 'And our reward.', 'start': 1093.456, 'duration': 1}, {'end': 1100.379, 'text': "at each time step we're going to get a reward of our score increase or decrease that we got at this time step,", 'start': 1094.456, 'duration': 5.923}], 'summary': 'Using raw pixel inputs to play atari games to maximize score.', 'duration': 26.294, 'max_score': 1074.085, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE1074085.jpg'}, {'end': 1176.932, 'src': 'embed', 'start': 1153.179, 'weight': 1, 'content': [{'end': 1161.524, 'text': 'Okay, so the question is are we saying here that our network is going to approximate our Q value function for different state action pairs,', 'start': 1153.179, 'duration': 8.345}, {'end': 1162.425, 'text': 'for example, four of these?', 'start': 1161.524, 'duration': 0.901}, {'end': 1167.508, 'text': "Yeah, that's correct, and we'll talk about that in a few slides.", 'start': 1162.945, 'duration': 4.563}, {'end': 1176.932, 'text': "So no, so we don't have a softmax layer after the connected because here our goal is to directly predict our Q value functions.", 'start': 1170.406, 'duration': 6.526}], 'summary': 'Network approximates q value function for state-action pairs without softmax layer.', 'duration': 23.753, 'max_score': 1153.179, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE1153179.jpg'}, {'end': 1419.582, 'src': 'embed', 'start': 1389.563, 'weight': 4, 'content': [{'end': 1396.467, 'text': 'And so now what we can do is that we can now train our Q-network on random mini-batches of transitions from the replay memory.', 'start': 1389.563, 'duration': 6.904}, {'end': 1405.512, 'text': "So, instead of using consecutive samples, we're now gonna sample across these transitions that we've accumulated random samples of these,", 'start': 1397.107, 'duration': 8.405}, {'end': 1409.974, 'text': 'and this breaks all of these correlation problems that we had earlier.', 'start': 1405.512, 'duration': 4.462}, {'end': 1419.582, 'text': 'And then also as another side benefit is that each of these transitions can also contribute to potentially multiple weight updates.', 'start': 1411.975, 'duration': 7.607}], 'summary': 'Training q-network on random mini-batches breaks correlation problems and allows multiple weight updates.', 'duration': 30.019, 'max_score': 1389.563, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE1389563.jpg'}, {'end': 1578.262, 'src': 'embed', 'start': 1550.506, 'weight': 2, 'content': [{'end': 1557.214, 'text': "We're going to be continuously playing this game and then also sampling mini-batches,", 'start': 1550.506, 'duration': 6.708}, {'end': 1561.696, 'text': 'using experience replay to update our weights of our Q-network and then continuing in this fashion.', 'start': 1557.214, 'duration': 4.482}, {'end': 1566.177, 'text': "Okay, so let's see.", 'start': 1564.417, 'duration': 1.76}, {'end': 1567.058, 'text': "Let's see if I can.", 'start': 1566.197, 'duration': 0.861}, {'end': 1568.658, 'text': 'Is this playing?', 'start': 1567.058, 'duration': 1.6}, {'end': 1578.262, 'text': "Okay, so let's take a look at this deep Q-learning algorithm from Google DeepMind, 
trained on an Atari game of Breakout.", 'start': 1569.439, 'duration': 8.823}], 'summary': 'Using deep q-learning algorithm to train on atari breakout game.', 'duration': 27.756, 'max_score': 1550.506, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE1550506.jpg'}], 'start': 929.096, 'title': 'Deep q-learning and experience replay', 'summary': "Discusses the use of deep q-learning with a deep neural network as a function approximator, emphasizing training to minimize error and application in playing atari games. it also explores experience replay, addressing problems with correlated samples and data efficiency, leading to efficient learning and improved performance in game-playing tasks, as demonstrated by google deepmind's training on an atari game of breakout.", 'chapters': [{'end': 1144.413, 'start': 929.096, 'title': 'Deep q-learning for reinforcement learning', 'summary': 'Discusses the use of a deep neural network as a function approximator to estimate the action value function in q-learning, emphasizing the training process to minimize the error of the bellman equation and its application in playing atari games.', 'duration': 215.317, 'highlights': ['The use of a deep neural network as a function approximator in Q-learning A deep neural network is used to estimate the action value function in Q-learning, serving as a function approximator for the Q function.', 'Training process to minimize the error of the Bellman equation A training process involving forward passes and backward passes is employed to minimize the error of the Bellman equation, with the objective of iteratively making the Q function closer to the target value.', 'Application in playing Atari games The approach is applied to playing Atari games, where the objective is to complete the game with the highest score using raw pixel inputs as states, and taking actions such as moving left, right, up, or down.']}, {'end': 1659.454, 'start': 
1153.179, 'title': 'Deep q learning & experience replay', 'summary': "Discusses the use of deep q learning with experience replay, utilizing a neural network to approximate q value functions for different state-action pairs, and addressing problems with correlated samples and data efficiency, ultimately leading to efficient learning and improved performance in game-playing tasks, as demonstrated by google deepmind's training on an atari game of breakout.", 'duration': 506.275, 'highlights': ['Using a neural network to approximate Q value functions for different state-action pairs The network aims to directly predict Q value functions for state-action pairs, with the output layer providing Q values for each action, allowing for efficient computation of Q values for all functions from the current state.', 'Addressing problems with correlated samples and data efficiency through experience replay Experience replay is utilized to break correlation problems by continuously updating a replay memory table and training the Q network on random mini-batches of transitions from the replay memory, leading to greater data efficiency and improved learning.', "Demonstration of improved performance in game-playing tasks, as shown by Google DeepMind's training on an Atari game of Breakout Google DeepMind's training on an Atari game of Breakout showcases the efficacy of deep Q learning with experience replay, as the trained model demonstrates improved performance over time, efficiently following the ball and removing most of the blocks after several hours of training."]}], 'duration': 730.358, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE929096.jpg', 'highlights': ['Application in playing Atari games The approach is applied to playing Atari games, where the objective is to complete the game with the highest score using raw pixel inputs as states, and taking actions such as moving left, right, up, or down.', 'Using a neural 
network to approximate Q value functions for different state-action pairs The network aims to directly predict Q value functions for state-action pairs, with the output layer providing Q values for each action, allowing for efficient computation of Q values for all functions from the current state.', "Demonstration of improved performance in game-playing tasks, as shown by Google DeepMind's training on an Atari game of Breakout Google DeepMind's training on an Atari game of Breakout showcases the efficacy of deep Q learning with experience replay, as the trained model demonstrates improved performance over time, efficiently following the ball and removing most of the blocks after several hours of training.", 'Training process to minimize the error of the Bellman equation A training process involving forward passes and backward passes is employed to minimize the error of the Bellman equation, with the objective of iteratively making the Q function closer to the target value.', 'Addressing problems with correlated samples and data efficiency through experience replay Experience replay is utilized to break correlation problems by continuously updating a replay memory table and training the Q network on random mini-batches of transitions from the replay memory, leading to greater data efficiency and improved learning.', 'The use of a deep neural network as a function approximator in Q-learning A deep neural network is used to estimate the action value function in Q-learning, serving as a function approximator for the Q function.']}, {'end': 2265.906, 'segs': [{'end': 1756.068, 'src': 'embed', 'start': 1684.444, 'weight': 0, 'content': [{'end': 1687.326, 'text': 'Well, the problem can be that the Q function is very complicated.', 'start': 1684.444, 'duration': 2.882}, {'end': 1692.283, 'text': "So we're saying that we want to learn the value of every state action pair.", 'start': 1688.041, 'duration': 4.242}, {'end': 1697.365, 'text': "So let's say you have something, 
for example, a robot grasping, wanting to grasp an object.", 'start': 1693.063, 'duration': 4.302}, {'end': 1699.766, 'text': 'You can have a really high dimensional state.', 'start': 1697.925, 'duration': 1.841}, {'end': 1706.668, 'text': "Let's say you have all of your even just joint positions and angles.", 'start': 1701.726, 'duration': 4.942}, {'end': 1713.591, 'text': 'And so learning the exact value of every state action pair that you have can be really, really hard to do.', 'start': 1707.268, 'duration': 6.323}, {'end': 1718.787, 'text': 'But on the other hand, your policy can be much simpler.', 'start': 1715.845, 'duration': 2.942}, {'end': 1724.351, 'text': 'Like what you want this robot to do may be just to have this simple motion of just closing your hand.', 'start': 1719.367, 'duration': 4.984}, {'end': 1728.093, 'text': 'Just move your fingers in this particular direction and keep going.', 'start': 1725.031, 'duration': 3.062}, {'end': 1733.997, 'text': 'And so that leads to the question of can we just learn this policy directly??', 'start': 1729.134, 'duration': 4.863}, {'end': 1738.88, 'text': 'Is it possible, maybe, to just find the best policy from a collection of policies,', 'start': 1734.517, 'duration': 4.363}, {'end': 1744.464, 'text': 'without trying to go through this process of estimating your Q value and then using that to infer your policy?', 'start': 1738.88, 'duration': 5.584}, {'end': 1756.068, 'text': "So this is an approach that we're going to call policy gradients.", 'start': 1747.205, 'duration': 8.863}], 'summary': 'Learning q-values can be complex, but simpler policies can be learned directly using policy gradients.', 'duration': 71.624, 'max_score': 1684.444, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE1684444.jpg'}, {'end': 1835.796, 'src': 'embed', 'start': 1810.678, 'weight': 2, 'content': [{'end': 1817.481, 'text': "We've learned that, given some objective, that we 
have some parameters, we can just use gradient ascent,", 'start': 1810.678, 'duration': 6.803}, {'end': 1820.762, 'text': 'in order to continuously improve our parameters.', 'start': 1817.481, 'duration': 3.281}, {'end': 1828.746, 'text': "And so let's talk more specifically about how we can do this, which we're going to call here the REINFORCE algorithm.", 'start': 1823.624, 'duration': 5.122}, {'end': 1835.796, 'text': 'So mathematically, we can write out our expected future reward over trajectories.', 'start': 1829.611, 'duration': 6.185}], 'summary': 'Using gradient ascent to continuously improve the policy parameters with the REINFORCE algorithm.', 'duration': 25.118, 'max_score': 1810.678, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE1810678.jpg'}, {'end': 2265.906, 'src': 'embed', 'start': 2238.104, 'weight': 3, 'content': [{'end': 2243.969, 'text': 'So this leads to the question of is there anything that we can do to reduce the variance and improve the estimator?', 'start': 2238.104, 'duration': 5.865}, {'end': 2259.401, 'text': 'And so variance reduction is an important area of research in policy gradients, in coming up with ways to improve the estimator and require fewer samples.', 'start': 2247.071, 'duration': 12.33}, {'end': 2263.304, 'text': "So let's look at a couple of ideas of how we can do this.", 'start': 2260.261, 'duration': 3.043}, {'end': 2265.906, 'text': 'So, given our gradient estimator.', 'start': 2264.605, 'duration': 1.301}], 'summary': 'Variance reduction research aims to improve the policy gradient estimator so that it requires fewer samples.', 'duration': 27.802, 'max_score': 2238.104, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE2238104.jpg'}], 'start': 1661.535, 'title': 'Reinforcement learning methods', 'summary': 'Covers deep q learning for atari games, discussing the use of deep q learning to train an agent for playing 
atari games, as well as policy gradients in reinforcement learning, which introduces policy gradients as a method to find the best policy from a collection of policies.', 'chapters': [{'end': 1728.093, 'start': 1661.535, 'title': 'Deep q learning for atari games', 'summary': 'Discusses the use of deep q learning to train an agent for playing atari games, highlighting the complexity of learning the value of state-action pairs and the simplicity of the policy.', 'duration': 66.558, 'highlights': ["The Q function's complexity in learning the value of every state-action pair can pose a significant challenge, especially for high-dimensional states and actions.", 'Contrastingly, the policy for the agent can be much simpler, focusing on specific motions and actions to achieve the desired task.']}, {'end': 2265.906, 'start': 1729.134, 'title': 'Policy gradients in reinforcement learning', 'summary': 'Introduces policy gradients as a method to find the best policy from a collection of policies, using gradient ascent on policy parameters, and discusses the challenges of intractability and high variance in the estimator.', 'duration': 536.772, 'highlights': ['Policy gradients as a method to find the best policy The chapter introduces policy gradients as a method to find the best policy from a collection of policies.', 'Using gradient ascent on policy parameters The chapter discusses using gradient ascent on policy parameters to continuously improve parameters and find the optimal policy.', 'Challenges of intractability and high variance in the estimator The chapter discusses the challenges of intractability and high variance in the estimator when differentiating the expected future reward over trajectories.']}], 'duration': 604.371, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE1661535.jpg', 'highlights': ["The Q function's complexity in learning the value of every state-action pair can pose a significant challenge, 
especially for high-dimensional states and actions.", 'Contrastingly, the policy for the agent can be much simpler, focusing on specific motions and actions to achieve the desired task.', 'Using gradient ascent on policy parameters The chapter discusses using gradient ascent on policy parameters to continuously improve parameters and find the optimal policy.', 'Challenges of intractability and high variance in the estimator The chapter discusses the challenges of intractability and high variance in the estimator when differentiating the expected future reward over trajectories.', 'Policy gradients as a method to find the best policy The chapter introduces policy gradients as a method to find the best policy from a collection of policies.']}, {'end': 2780.275, 'segs': [{'end': 2291.92, 'src': 'embed', 'start': 2267.145, 'weight': 0, 'content': [{'end': 2277.235, 'text': 'So the first idea is that we can push up the probabilities of an action only by its effect on future rewards from that state.', 'start': 2267.145, 'duration': 10.09}, {'end': 2285.217, 'text': 'So now, instead of scaling this likelihood or pushing up this likelihood of this action by our total reward of this trajectory,', 'start': 2277.475, 'duration': 7.742}, {'end': 2291.92, 'text': "let's look more specifically at just the sum of rewards coming from this time, step on to the end.", 'start': 2285.217, 'duration': 6.703}], 'summary': 'Push up action probabilities based on future rewards, not total reward trajectory.', 'duration': 24.775, 'max_score': 2267.145, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE2267145.jpg'}, {'end': 2339.739, 'src': 'embed', 'start': 2315.026, 'weight': 1, 'content': [{'end': 2324.013, 'text': 'which is saying that our discount factor is gonna tell us how much we care about just the rewards that are coming up soon.', 'start': 2315.026, 'duration': 8.987}, {'end': 2327.136, 'text': 'versus rewards that came much 
later on.', 'start': 2325.276, 'duration': 1.86}, {'end': 2332.198, 'text': "So we're going to now say how good or bad an action is,", 'start': 2327.817, 'duration': 4.381}, {'end': 2339.739, 'text': 'looking more at the local neighborhood of actions that it generates in the immediate near future and downgrading the ones that come later on.', 'start': 2332.198, 'duration': 7.541}], 'summary': 'Discount factor determines weight of immediate vs. delayed rewards in evaluating actions.', 'duration': 24.713, 'max_score': 2315.026, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE2315026.jpg'}, {'end': 2617.404, 'src': 'embed', 'start': 2589.937, 'weight': 2, 'content': [{'end': 2596.602, 'text': "So can we learn these? And the answer is yes, using Q learning, what we've already talked about before.", 'start': 2589.937, 'duration': 6.665}, {'end': 2606.518, 'text': "So we can combine policy gradients what we've just been talking about with Q learning, by training both an actor, which is the policy,", 'start': 2597.123, 'duration': 9.395}, {'end': 2614.122, 'text': 'as well as a critic, a Q function, which is gonna tell us how good we think a state is, and an action in this state.', 'start': 2606.518, 'duration': 7.604}, {'end': 2617.404, 'text': 'So, using this in approach,', 'start': 2615.243, 'duration': 2.161}], 'summary': 'Combining policy gradients with q learning can train an actor and a critic to evaluate states and actions.', 'duration': 27.467, 'max_score': 2589.937, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE2589937.jpg'}, {'end': 2715.734, 'src': 'embed', 'start': 2688.504, 'weight': 3, 'content': [{'end': 2693.266, 'text': 'So using this, we can put together our full actor-critic algorithm.', 'start': 2688.504, 'duration': 4.762}, {'end': 2701.029, 'text': "And so what this looks like is that we're going to start off with by initializing our 
policy parameters, theta,", 'start': 2694.306, 'duration': 6.723}, {'end': 2703.47, 'text': "and our critic parameters that we'll call phi.", 'start': 2701.029, 'duration': 2.441}, {'end': 2712.653, 'text': "And then for each for iterations of training, we're going to sample m trajectories under the current policy right?", 'start': 2704.43, 'duration': 8.223}, {'end': 2715.734, 'text': "We're gonna play our policy and get these trajectories s0, a0, r0, s1, and so on.", 'start': 2712.673, 'duration': 3.061}], 'summary': 'Using actor-critic algorithm, initialize theta and phi, sample m trajectories for training.', 'duration': 27.23, 'max_score': 2688.504, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE2688504.jpg'}], 'start': 2267.145, 'title': 'Reinforcement learning concepts and baseline and variance reduction', 'summary': 'Discusses pushing up action probabilities based on future rewards, using a discount factor, using baselines to reduce variance, integrating q learning with policy gradients, and training an actor-critic model.', 'chapters': [{'end': 2339.739, 'start': 2267.145, 'title': 'Reinforcement learning concepts', 'summary': 'Discusses the concept of pushing up action probabilities based on future rewards and using a discount factor to ignore delayed effects in reinforcement learning.', 'duration': 72.594, 'highlights': ['The importance of an action is determined by the future rewards it generates, rather than the total reward of the trajectory.', 'The use of a discount factor to prioritize immediate rewards and downgrade delayed effects.']}, {'end': 2780.275, 'start': 2341.46, 'title': 'Baseline and variance reduction in reinforcement learning', 'summary': 'Discusses the use of baselines to reduce variance in reinforcement learning, emphasizing the importance of relative rewards and the integration of q learning with policy gradients to train an actor-critic model.', 'duration': 438.815, 
'highlights': ['The importance of using a baseline function to address the relative value of rewards is emphasized, with the suggestion of using a moving average of rewards as a simple baseline. The baseline function is introduced to address the relative importance of rewards, with the recommendation of using a moving average of rewards as a simple baseline.', 'The integration of Q learning with policy gradients is discussed, highlighting the training of an actor (policy) and a critic (Q function) to determine the quality of states and actions. The integration of Q learning with policy gradients involves training an actor (policy) and a critic (Q function) to assess the quality of states and actions, alleviating the task of learning Q values for every state-action pair.', 'The full actor-critic algorithm is outlined, involving the initialization of policy and critic parameters, sampling of trajectories, computation of gradients, and iterative training of policy and critic functions. The actor-critic algorithm includes initializing policy and critic parameters, sampling trajectories, computing gradients, and iteratively training policy and critic functions to optimize the reinforcement learning model.']}], 'duration': 513.13, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE2267145.jpg', 'highlights': ['The importance of an action is determined by the future rewards it generates, rather than the total reward of the trajectory.', 'The use of a discount factor to prioritize immediate rewards and downgrade delayed effects.', 'The integration of Q learning with policy gradients involves training an actor (policy) and a critic (Q function) to assess the quality of states and actions, alleviating the task of learning Q values for every state-action pair.', 'The full actor-critic algorithm is outlined, involving the initialization of policy and critic parameters, sampling of trajectories, computation of gradients, and 
iterative training of policy and critic functions.']}, {'end': 3353.965, 'segs': [{'end': 2849.43, 'src': 'embed', 'start': 2798.482, 'weight': 0, 'content': [{'end': 2808.836, 'text': "which is a model also referred to as hard attention that you'll see a lot recently in computer vision tasks for various purposes.", 'start': 2798.482, 'duration': 10.354}, {'end': 2812.638, 'text': 'And so the idea behind this is here.', 'start': 2809.697, 'duration': 2.941}, {'end': 2823.2, 'text': "I've talked about the original work on hard attention, which is on image classification, and your goal is to still predict the image class,", 'start': 2812.638, 'duration': 10.562}, {'end': 2826.801, 'text': "but now you're going to do this by taking a sequence of glimpses around the image.", 'start': 2823.2, 'duration': 3.601}, {'end': 2836.483, 'text': "You're going to look at local regions around an image and you're basically going to selectively focus on these parts and build up information as you're looking around.", 'start': 2827.101, 'duration': 9.382}, {'end': 2844.929, 'text': 'And so the reason that we want to do this is, well, first of all, it has some nice inspiration from human perception and eye movements.', 'start': 2837.706, 'duration': 7.223}, {'end': 2849.43, 'text': "Let's say we're looking at a complex image and we want to determine what's in the image.", 'start': 2845.189, 'duration': 4.241}], 'summary': 'Model uses hard attention for computer vision tasks, focusing on local image regions to predict image class.', 'duration': 50.948, 'max_score': 2798.482, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE2798482.jpg'}, {'end': 2987.218, 'src': 'embed', 'start': 2962.501, 'weight': 4, 'content': [{'end': 2969.666, 'text': 'this is why we need to use a reinforcement learning formulation and learn policies for how to take these glimpse actions,', 'start': 2962.501, 'duration': 7.165}, {'end': 2971.027, 'text': 
'and we can train this using REINFORCE.', 'start': 2969.666, 'duration': 1.361}, {'end': 2981.297, 'text': "Given the state of glimpses so far, the core of our model is going to be this RNN that we're going to use to model the state,", 'start': 2973.515, 'duration': 7.782}, {'end': 2987.218, 'text': "and then we're going to use our policy parameters in order to output the next action.", 'start': 2981.297, 'duration': 5.921}], 'summary': 'Using reinforcement learning (REINFORCE) to train a policy, with an RNN modeling the state, that chooses glimpse actions.', 'duration': 24.717, 'max_score': 2962.501, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE2962501.jpg'}, {'end': 3249.215, 'src': 'embed', 'start': 3200.34, 'weight': 2, 'content': [{'end': 3205.723, 'text': 'it usually learns to go closer to where the digit is and then look at the relevant parts of the digit, right?', 'start': 3200.34, 'duration': 5.383}, {'end': 3207.084, 'text': 'So this is pretty cool,', 'start': 3205.783, 'duration': 1.301}, {'end': 3219.351, 'text': 'and this follows kind of what you would expect, right, if you were to choose places to look next in order to most efficiently determine what digit this is.', 'start': 3207.084, 'duration': 12.267}, {'end': 3225.338, 'text': 'And so this idea of hard attention, of recurrent attention models,', 'start': 3220.515, 'duration': 4.823}, {'end': 3230.682, 'text': 'has also been used in a lot of tasks in computer vision in the last couple of years.', 'start': 3225.338, 'duration': 5.344}, {'end': 3235.325, 'text': "So you'll see this used, for example, in fine-grained image recognition.", 'start': 3231.142, 'duration': 4.183}, {'end': 3246.193, 'text': 'So I mentioned earlier that one of the useful benefits of this can be both to save on computation,', 'start': 3235.365, 'duration': 10.828}, {'end': 3249.215, 'text': 'as well as to ignore clutter and irrelevant parts of the image.', 'start': 3246.193, 'duration': 
3.022}], 'summary': 'Hard attention and recurrent attention models have been utilized in computer vision tasks, such as fine-grained image recognition, to efficiently determine digits and ignore clutter.', 'duration': 48.875, 'max_score': 3200.34, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE3200340.jpg'}], 'start': 2781.196, 'title': 'Recurrent attention model and reinforcement learning formulation', 'summary': 'Discusses the recurrent attention model, a hard attention model that enables selective focus on local regions of an image, saving computational resources, improving scalability, and enhancing classification performance. it also explores reinforcement learning formulation for image classification using glimpses, utilizing a recurrent neural network to model the state and learn policies, trained on mnist with a focus on computational efficiency and ignoring clutter, leading to efficient determination of the digit and has been utilized in tasks such as fine-grained image recognition, image captioning, and visual question answering.', 'chapters': [{'end': 2916.228, 'start': 2781.196, 'title': 'Recurrent attention model', 'summary': 'Discusses the recurrent attention model, a hard attention model that enables selective focus on local regions of an image, saving computational resources, improving scalability, and enhancing classification performance.', 'duration': 135.032, 'highlights': ['The recurrent attention model allows for selective focus on local regions of an image, inspired by human perception and eye movements, leading to improved classification performance and computational resource savings.', 'By taking a sequence of glimpses around the image and selectively focusing on specific parts, the model can efficiently process larger images and ignore clutter and irrelevant parts, enhancing scalability and classification performance.', 'Looking at low resolution image first and then focusing on high 
res portions of the image helps in saving computational resources, benefiting scalability and improving classification performance.']}, {'end': 3353.965, 'start': 2917.99, 'title': 'Reinforcement learning formulation', 'summary': 'Discusses the reinforcement learning formulation for image classification using glimpses, utilizing a recurrent neural network to model the state and learn policies, trained on mnist with a focus on computational efficiency and ignoring clutter, leading to efficient determination of the digit and has been utilized in tasks such as fine-grained image recognition, image captioning, and visual question answering.', 'duration': 435.975, 'highlights': ['The model uses a reinforcement learning formulation for image classification, with a focus on computational efficiency and ignoring clutter to efficiently determine the digit. The model utilizes a reinforcement learning formulation to efficiently determine the digit by focusing on relevant parts of the image, leading to computational efficiency and ignoring clutter.', 'The model has been trained on MNIST and follows the idea of hard attention and recurrent attention models, showing efficient determination of the digit. The model has been trained on MNIST and follows the idea of hard attention and recurrent attention models, showing efficient determination of the digit.', 'The reinforcement learning formulation involves utilizing a recurrent neural network to model the state and learn policies for image classification. The reinforcement learning formulation involves utilizing a recurrent neural network to model the state and learn policies for image classification.', "The model's application extends to tasks such as fine-grained image recognition, image captioning, and visual question answering, demonstrating its versatility and usefulness. 
The model's application extends to tasks such as fine-grained image recognition, image captioning, and visual question answering, demonstrating its versatility and usefulness.", 'The reinforcement learning formulation focuses on computational efficiency, saving on computational resources by focusing on specific, smaller parts of the image. The reinforcement learning formulation focuses on computational efficiency, saving on computational resources by focusing on specific, smaller parts of the image.']}], 'duration': 572.769, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE2781196.jpg', 'highlights': ['The recurrent attention model allows for selective focus on local regions of an image, inspired by human perception and eye movements, leading to improved classification performance and computational resource savings.', 'By taking a sequence of glimpses around the image and selectively focusing on specific parts, the model can efficiently process larger images and ignore clutter and irrelevant parts, enhancing scalability and classification performance.', 'The model uses a reinforcement learning formulation for image classification, with a focus on computational efficiency and ignoring clutter to efficiently determine the digit.', 'The model has been trained on MNIST and follows the idea of hard attention and recurrent attention models, showing efficient determination of the digit.', 'The reinforcement learning formulation involves utilizing a recurrent neural network to model the state and learn policies for image classification.', "The model's application extends to tasks such as fine-grained image recognition, image captioning, and visual question answering, demonstrating its versatility and usefulness."]}, {'end': 3837.232, 'segs': [{'end': 3418.046, 'src': 'embed', 'start': 3384.874, 'weight': 0, 'content': [{'end': 3389.396, 'text': "that's been in the news a lot in the past, last year and this year.", 
'start': 3384.874, 'duration': 4.522}, {'end': 3394.369, 'text': "And yesterday, yes, that's correct.", 'start': 3392.648, 'duration': 1.721}, {'end': 3399.933, 'text': 'So this is very, very exciting recent news as well.', 'start': 3395.21, 'duration': 4.723}, {'end': 3411.481, 'text': 'So last year, a first version of AlphaGo was put into a competition against one of the best Go players of recent years, Lee Sedol.', 'start': 3400.013, 'duration': 11.468}, {'end': 3418.046, 'text': 'And the agent was able to beat him four to one in a game of five matches.', 'start': 3412.202, 'duration': 5.844}], 'summary': 'Alphago beat lee sedol 4-1 in a go competition last year.', 'duration': 33.172, 'max_score': 3384.874, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE3384874.jpg'}, {'end': 3476.445, 'src': 'embed', 'start': 3444.842, 'weight': 2, 'content': [{'end': 3446.963, 'text': "It's pretty interesting to hear the commentary.", 'start': 3444.842, 'duration': 2.121}, {'end': 3456.664, 'text': "So what is this AlphaGo agent from DeepMind? 
And it's based on a lot of what we've talked about so far in this lecture.", 'start': 3450.118, 'duration': 6.546}, {'end': 3458.646, 'text': 'And what it is,', 'start': 3457.345, 'duration': 1.301}, {'end': 3468.996, 'text': "it's a mix of supervised learning and reinforcement learning, as well as a mix of some older methods for Go, Monte Carlo tree search,", 'start': 3458.646, 'duration': 10.35}, {'end': 3472.019, 'text': 'as well as the recent deep RL approaches.', 'start': 3468.996, 'duration': 3.023}, {'end': 3476.445, 'text': 'Okay, so how does AlphaGo beat the Go World Champion?', 'start': 3473.604, 'duration': 2.841}], 'summary': 'AlphaGo is a mix of supervised learning, reinforcement learning, Monte Carlo tree search, and deep RL approaches, beating the Go world champion.', 'duration': 31.603, 'max_score': 3444.842, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE3444842.jpg'}, {'end': 3688.048, 'src': 'embed', 'start': 3635.58, 'weight': 1, 'content': [{'end': 3643.787, 'text': "And they train this, I think, over the version of AlphaGo that's being used in these matches is like, I think,", 'start': 3635.58, 'duration': 8.207}, {'end': 3648.39, 'text': 'maybe a couple thousand CPUs plus a couple hundred GPUs, putting all of this together.', 'start': 3643.787, 'duration': 4.603}, {'end': 3660.219, 'text': "So it's a huge, huge amount of training that's going on, right? 
And yeah, so you guys should follow the game this week.", 'start': 3648.41, 'duration': 11.809}, {'end': 3662.721, 'text': "It's pretty exciting.", 'start': 3661.38, 'duration': 1.341}, {'end': 3670.418, 'text': "Okay, so in summary today we've talked about policy gradients, right, which are general.", 'start': 3663.894, 'duration': 6.524}, {'end': 3677.842, 'text': "You're just directly taking gradient descent or ascent on your policy parameters.", 'start': 3671.458, 'duration': 6.384}, {'end': 3684.286, 'text': 'So this works well for a large class of problems, but it also suffers from high variance,', 'start': 3678.803, 'duration': 5.483}, {'end': 3688.048, 'text': 'so it requires a lot of samples, and your challenge here is sample efficiency.', 'start': 3684.286, 'duration': 3.762}], 'summary': 'AlphaGo was trained on roughly a couple thousand CPUs and a couple hundred GPUs. Policy gradients are general but suffer from high variance, so sample efficiency is the challenge.', 'duration': 52.468, 'max_score': 3635.58, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE3635580.jpg'}, {'end': 3809.317, 'src': 'embed', 'start': 3767.839, 'weight': 3, 'content': [{'end': 3775.928, 'text': "Okay, so we've talked about policy gradients and Q-learning, and just another look at some of these, some of the guarantees that you have.", 'start': 3767.839, 'duration': 8.089}, {'end': 3783.156, 'text': "With policy gradients, one thing we do know that's really nice is that this will always converge to a local minimum of J of theta,", 'start': 3775.948, 'duration': 7.208}, {'end': 3791.731, 'text': "because we're just directly doing gradient ascent, and this local minimum is often just pretty good.", 'start': 3784.929, 'duration': 6.802}, {'end': 3794.032, 'text': 'And in Q-learning, on the other hand,', 'start': 3792.612, 'duration': 1.42}, {'end': 3801.755, 'text': "we don't have any guarantees because here we're trying to approximate this Bellman equation with a complicated function 
approximator,", 'start': 3794.032, 'duration': 7.723}, {'end': 3809.317, 'text': 'and so in this case this is the problem, with Q-learning being a little bit trickier to train in terms of applicability to a wide range of problems.', 'start': 3801.755, 'duration': 7.562}], 'summary': 'Policy gradients converge to a local maximum, q-learning lacks guarantees due to complex function approximation.', 'duration': 41.478, 'max_score': 3767.839, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE3767839.jpg'}], 'start': 3354.506, 'title': 'Alphago and reinforcement learning', 'summary': "Discusses alphago's success using policy gradients, supervised and reinforcement learning, and monte carlo tree search, achieving significant success in go competitions. it also provides an overview of policy gradients and q-learning, highlighting their challenges and advantages, including high variance in policy gradients, sample efficiency in q-learning, and the convergence guarantee in policy gradients.", 'chapters': [{'end': 3662.721, 'start': 3354.506, 'title': 'Alphago and policy gradient learning', 'summary': 'Discusses the use of policy gradients in training alphago, which competed against top go players and utilized a mix of supervised and reinforcement learning, as well as monte carlo tree search, achieving significant success in go competitions with a huge amount of training.', 'duration': 308.215, 'highlights': ['The AlphaGo agent, trained with policy gradients, won against top Go players like Lee Sedol and Ke Jie, achieving a significant victory of four to one and winning the first game against Ke Jie.', 'The training process of AlphaGo involved a mix of supervised learning from professional Go games, reinforcement learning through self-play, and the integration of a value network and a Monte Carlo tree search algorithm to select actions by look ahead search.', 'The version of AlphaGo used in matches involved training with thousands 
of CPUs and hundreds of GPUs, highlighting the extensive computational resources utilized for training.']}, {'end': 3837.232, 'start': 3663.894, 'title': 'Reinforcement learning overview', 'summary': 'Discussed policy gradients and q-learning, highlighting the challenges and advantages of each, such as high variance in policy gradients, sample efficiency in q-learning, and the convergence guarantee in policy gradients.', 'duration': 173.338, 'highlights': ['Policy gradients converge to a local maximum of J(theta), which is often good enough in practice.', 'Q-learning is more sample efficient than policy gradients for certain problems, such as Atari, but lacks guarantees due to approximating the Bellman equation with a complex function approximator.', 'Policy gradients suffer from high variance, requiring a large number of samples, leading to challenges in sample efficiency.']}], 'duration': 482.726, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/lvoHnicueoE/pics/lvoHnicueoE3354506.jpg', 'highlights': ['The AlphaGo agent, trained with policy gradients, won against top Go players like Lee Sedol and Ke Jie, achieving a significant victory of four to one and winning the first game against Ke Jie.', 'The version of AlphaGo used in matches involved training with thousands of CPUs and hundreds of GPUs, highlighting the extensive computational resources utilized for training.', 'The training process of AlphaGo involved a mix of supervised learning from professional Go games, reinforcement learning through self-play, and the integration of a value network and a Monte Carlo tree search algorithm to select actions by look ahead search.', 'Policy gradients converge to a local maximum of J(theta), which is often good enough in practice.', 'Q-learning is more sample efficient than policy gradients for certain problems, such as Atari, but lacks guarantees due to approximating the Bellman equation with a complex function approximator.', 'Policy gradients 
suffer from high variance, requiring a large number of samples, leading to challenges in sample efficiency.']}], 'highlights': ['The AlphaGo agent, trained with policy gradients, won against top Go players like Lee Sedol and Ke Jie, achieving a significant victory of four to one and winning the first game against Ke Jie.', 'The version of AlphaGo used in matches involved training with thousands of CPUs and hundreds of GPUs, highlighting the extensive computational resources utilized for training.', 'The training process of AlphaGo involved a mix of supervised learning from professional Go games, reinforcement learning through self-play, and the integration of a value network and a Monte Carlo tree search algorithm to select actions by look ahead search.', 'The recurrent attention model allows for selective focus on local regions of an image, inspired by human perception and eye movements, leading to improved classification performance and computational resource savings.', 'By taking a sequence of glimpses around the image and selectively focusing on specific parts, the model can efficiently process larger images and ignore clutter and irrelevant parts, enhancing scalability and classification performance.']}