title
Q Learning Intro/Table - Reinforcement Learning p.1

description
Welcome to a reinforcement learning tutorial. In this part, we're going to focus on Q-Learning. Q-Learning is a model-free form of machine learning, in the sense that the AI "agent" does not need to know or have a model of the environment that it will be in. The same algorithm can be used across a variety of environments. For a given environment, everything is broken down into "states" and "actions." The states are observations and samplings that we pull from the environment, and the actions are the choices the agent has made based on the observation. For the purposes of the rest of this tutorial, we'll use the context of our environment to exemplify how this works. Text-based tutorial and sample code: https://pythonprogramming.net/q-learning-reinforcement-learning-python-tutorial/

detail
{'title': 'Q Learning Intro/Table - Reinforcement Learning p.1', 'heatmap': [{'end': 347.459, 'start': 314.837, 'weight': 0.797}], 'summary': "Introduces q learning in gym, a model-free reinforcement learning algorithm applicable to any environment, emphasizing the importance of updating q values over time and rewarding for long-term goals, with a focus on the mountain car environment and three available actions, while also covering the process of exploring and updating q values, challenges of dealing with continuous values, creating a Q table with a range divided into 20 chunks, and the q-learning algorithm's working principle of initializing a q table and gradually updating the q values over time to make optimal decisions.", 'chapters': [{'end': 331.266, 'segs': [{'end': 55.393, 'src': 'embed', 'start': 20.998, 'weight': 1, 'content': [{'end': 23.419, 'text': "in this case, though, There really shouldn't be anything.", 'start': 20.998, 'duration': 2.421}, {'end': 29.221, 'text': 'if you know basic Python, you should be able to follow along with the Q learning tutorials pretty well.', 'start': 23.419, 'duration': 5.802}, {'end': 32.202, 'text': 'now, when we start getting into deep q learning.', 'start': 29.221, 'duration': 2.981}, {'end': 34.023, 'text': 'uh, that changes now.', 'start': 32.202, 'duration': 1.821}, {'end': 36.004, 'text': 'you need to know tensorflow and keras and stuff.', 'start': 34.023, 'duration': 1.981}, {'end': 38.525, 'text': 'so so uh, just keep that in mind.', 'start': 36.004, 'duration': 2.521}, {'end': 42.607, 'text': 'um, but at least to start pretty much anybody should be able to follow along here.', 'start': 38.525, 'duration': 4.082}, {'end': 45.888, 'text': 'so, first of all, what actually is q learning?', 'start': 42.607, 'duration': 3.281}, {'end': 55.393, 'text': 'uh, the idea of q learning is to have these like q values for every action you could possibly take given a state.', 'start': 45.888, 'duration': 9.505}], 'summary': 
'Q learning tutorials are accessible with basic python knowledge, but deep q learning requires tensorflow and keras.', 'duration': 34.395, 'max_score': 20.998, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH820998.jpg'}, {'end': 116.899, 'src': 'embed', 'start': 68.403, 'weight': 0, 'content': [{'end': 74.547, 'text': "Now that's done by rewarding some sort of agent as they get through this environment.", 'start': 68.403, 'duration': 6.144}, {'end': 78.05, 'text': 'And the idea is to kind of reward for the long-term goal.', 'start': 74.587, 'duration': 3.463}, {'end': 81.652, 'text': 'rather than in any immediate short-term actions.', 'start': 78.89, 'duration': 2.762}, {'end': 85.254, 'text': "Now, Q-learning specifically is what's called model-free learning.", 'start': 81.752, 'duration': 3.502}, {'end': 94.781, 'text': 'The idea is that the Q-learning model that we rewrite, basically the Q-learning algorithm, is applicable to any environment.', 'start': 85.795, 'duration': 8.986}, {'end': 97.042, 'text': "It's not really environment-specific.", 'start': 94.861, 'duration': 2.181}, {'end': 106.43, 'text': 'So the code that we write should be mostly applicable to any other environment that we might want to use it in,', 'start': 97.122, 'duration': 9.308}, {'end': 108.372, 'text': "as long as that environment's simple enough.", 'start': 106.43, 'duration': 1.942}, {'end': 114.777, 'text': "So Q-learning and the way that you know what we're about to work with is good for really basic environments,", 'start': 108.412, 'duration': 6.365}, {'end': 116.899, 'text': "but it's not going to do anything really complicated.", 'start': 114.777, 'duration': 2.122}], 'summary': 'Q-learning rewards long-term goals, adaptable to simple environments.', 'duration': 48.496, 'max_score': 68.403, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH868403.jpg'}, {'end': 190.378, 
'src': 'embed', 'start': 162.355, 'weight': 4, 'content': [{'end': 170.084, 'text': "So even though this environment doesn't matter much to our AI, it matters to you to understand how Q-learning actually works.", 'start': 162.355, 'duration': 7.729}, {'end': 174.368, 'text': "So what I'm going to go ahead and do is fix this first of all.", 'start': 170.764, 'duration': 3.604}, {'end': 174.809, 'text': 'There we go.', 'start': 174.429, 'duration': 0.38}, {'end': 179.833, 'text': 'Import, import, gym.', 'start': 175.65, 'duration': 4.183}, {'end': 187.597, 'text': "and then what we're going to say is the environment is a gym dot make, and the environment we're going to use here is mountain car.", 'start': 179.833, 'duration': 7.764}, {'end': 190.378, 'text': 'dash of v zero.', 'start': 187.597, 'duration': 2.781}], 'summary': 'Demonstrating q-learning with mountain car environment in python.', 'duration': 28.023, 'max_score': 162.355, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8162355.jpg'}, {'end': 241.491, 'src': 'embed', 'start': 217.47, 'weight': 6, 'content': [{'end': 228.86, 'text': "Now again For your benefit. 
I think it's useful to understand how the environment works, but in terms of your agent, it doesn't need to actually know.", 'start': 217.47, 'duration': 11.39}, {'end': 235.325, 'text': 'But for your own information, Mountain Car, this environment, has three actions that you can take.', 'start': 228.96, 'duration': 6.365}, {'end': 241.491, 'text': "Now, in a little bit, I'll show you, you can actually pull all of these gym environments to figure out how many actions are there.", 'start': 235.626, 'duration': 5.865}], 'summary': 'Mountain car environment has three actions available for agents.', 'duration': 24.021, 'max_score': 217.47, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8217470.jpg'}, {'end': 295.07, 'src': 'embed', 'start': 266.397, 'weight': 5, 'content': [{'end': 268.898, 'text': "And in fact, I think I'll already populate this.", 'start': 266.397, 'duration': 2.501}, {'end': 270.438, 'text': "That way we don't have to worry.", 'start': 269.318, 'duration': 1.12}, {'end': 272.159, 'text': "We'll hit an error if I don't do this.", 'start': 270.498, 'duration': 1.661}, {'end': 281.802, 'text': "So the next thing we're going to say is new underscore state, comma, reward, comma, done, comma, underscore equals env.stepaction.", 'start': 272.219, 'duration': 9.583}, {'end': 286.685, 'text': 'So every time we step with an action, we get a new state from the environment.', 'start': 282.522, 'duration': 4.163}, {'end': 292.909, 'text': "So the state is the things that we're sensing from the environment, basically.", 'start': 286.725, 'duration': 6.184}, {'end': 295.07, 'text': 'In this case, the state has two values.', 'start': 293.009, 'duration': 2.061}], 'summary': 'Creating a new state and acquiring environment sensing data.', 'duration': 28.673, 'max_score': 266.397, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8266397.jpg'}], 'start': 
1.27, 'title': 'Q-learning in gym environment', 'summary': 'Introduces q learning in gym, a model-free reinforcement learning algorithm, applicable to any environment, and emphasizes the importance of updating q values over time and rewarding for long-term goals, with a focus on mountain car environment and three available actions.', 'chapters': [{'end': 116.899, 'start': 1.27, 'title': 'Q learning tutorial introduction', 'summary': 'Introduces q learning, a model-free reinforcement learning algorithm, applicable to any environment, and suitable for those with basic knowledge of python, but requiring tensorflow and keras for deep q learning, emphasizing the importance of updating q values over time and rewarding for long-term goals.', 'duration': 115.629, 'highlights': ['Q-learning is a model-free reinforcement learning algorithm, where Q values for every action given a state are updated over time to produce a good result, applicable to any environment (Relevance: 5)', 'Basic knowledge of Python is sufficient to follow along with Q learning tutorials, while deep Q learning requires TensorFlow and Keras (Relevance: 4)', 'The Q-learning model is designed to reward for the long-term goal rather than short-term actions (Relevance: 3)', "The Q-learning algorithm is not environment-specific and can be applied to any environment as long as it's simple enough (Relevance: 2)"]}, {'end': 331.266, 'start': 117.559, 'title': 'Q-learning in gym environment', 'summary': 'Explains the process of initializing a q-learning environment in gym, iteratively taking actions, and obtaining new states and rewards, with a focus on mountain car environment and three available actions.', 'duration': 213.707, 'highlights': ['The chapter explains the process of initializing a Q-learning environment in Gym Describes the process of setting up a Q-learning environment in Gym for learning purposes.', 'Iteratively taking actions and obtaining new states and rewards is demonstrated in the transcript 
Demonstrates the iterative process of taking actions and obtaining new states and rewards in the Q-learning environment.', 'The focus is on the mountain car environment and three available actions Emphasizes the usage of the mountain car environment and describes the availability of three actions within the environment.']}], 'duration': 329.996, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH81270.jpg', 'highlights': ['Q-learning is a model-free reinforcement learning algorithm, updating Q values for every action given a state over time (Relevance: 5)', 'Basic knowledge of Python is sufficient for Q learning tutorials, while deep Q learning requires TensorFlow and Keras (Relevance: 4)', 'The Q-learning model is designed to reward for the long-term goal rather than short-term actions (Relevance: 3)', 'The Q-learning algorithm is not environment-specific and can be applied to any environment (Relevance: 2)', 'The chapter explains the process of initializing a Q-learning environment in Gym (Relevance: 1)', 'Iteratively taking actions and obtaining new states and rewards is demonstrated in the transcript (Relevance: 1)', 'The focus is on the mountain car environment and three available actions (Relevance: 1)']}, {'end': 855.659, 'segs': [{'end': 358.654, 'src': 'embed', 'start': 331.266, 'weight': 1, 'content': [{'end': 342.315, 'text': "close, I'm gonna save that, I'm gonna run it from terminal Python Q, learn one pie and we should get great.", 'start': 331.266, 'duration': 11.049}, {'end': 346.378, 'text': 'So, as you can see, the car is trying to get up this mountain for this hill.', 'start': 342.315, 'duration': 4.063}, {'end': 347.459, 'text': "I'm not sure.", 'start': 347.098, 'duration': 0.361}, {'end': 348.559, 'text': "I'd call that a mountain,", 'start': 347.459, 'duration': 1.1}, {'end': 357.072, 'text': "but Anyway it just clearly doesn't quite have the horsepower and instead what it needs to do is 
build momentum by going up here,", 'start': 348.559, 'duration': 8.513}, {'end': 358.654, 'text': 'swinging down here and then building up.', 'start': 357.072, 'duration': 1.582}], 'summary': 'Car lacks the horsepower to climb the hill directly and must build momentum instead.', 'duration': 27.388, 'max_score': 331.266, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8331266.jpg'}, {'end': 413.829, 'src': 'embed', 'start': 390.411, 'weight': 0, 'content': [{'end': 397.197, 'text': 'But basically the way that this is going to work is we want to create this Q table.', 'start': 390.411, 'duration': 6.786}, {'end': 404.203, 'text': "It's this large table that, given any combination of states, we can just look it up on the table, right?", 'start': 397.918, 'duration': 6.285}, {'end': 408.306, 'text': 'So, given any combination of state, of position and velocity,', 'start': 404.603, 'duration': 3.703}, {'end': 413.829, 'text': 'For every combination of position and velocity, we just want to look up on this table.', 'start': 408.346, 'duration': 5.483}], 'summary': 'Create a q table to look up any state combination for position and velocity.', 'duration': 23.418, 'max_score': 390.411, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8390411.jpg'}, {'end': 469.416, 'src': 'embed', 'start': 437.767, 'weight': 2, 'content': [{'end': 446.414, 'text': 'so so actually, initially our agent is going to do what we call explore a lot, which means do random stuff, uh,', 'start': 437.767, 'duration': 8.647}, {'end': 450.036, 'text': 'and then slowly update those q values as time goes on.', 'start': 446.414, 'duration': 3.622}, {'end': 453.519, 'text': "so anyway, that's a lot of information to throw at you.", 'start': 450.036, 'duration': 3.483}, {'end': 460.646, 'text': "So I think the first thing we're going to do is build the Q table and then go from there.", 'start': 453.519, 
'duration': 7.127}, {'end': 468.154, 'text': 'So first of all, let me just print the new state.', 'start': 460.847, 'duration': 7.307}, {'end': 469.416, 'text': 'Save that.', 'start': 468.935, 'duration': 0.481}], 'summary': 'Agent explores and updates q values; the first step is building the Q table.', 'duration': 31.649, 'max_score': 437.767, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8437767.jpg'}, {'end': 546.841, 'src': 'embed', 'start': 515.803, 'weight': 3, 'content': [{'end': 520.187, 'text': 'I mean, by the time we were done, quantum computing would be a huge thing already.', 'start': 515.803, 'duration': 4.384}, {'end': 523.169, 'text': "So we don't want to wait that long.", 'start': 521.508, 'duration': 1.661}, {'end': 531.774, 'text': 'So now, what we need to do is convert these continuous values to what are referred to as discrete values.', 'start': 523.65, 'duration': 8.124}, {'end': 537.316, 'text': 'Basically, we want to bucket this information into buckets of some sort of size.', 'start': 531.874, 'duration': 5.442}, {'end': 546.841, 'text': "Now what size those buckets are is going to be yet another one of many variables that we're going to set and likely need to tweak as time goes on.", 'start': 537.396, 'duration': 9.445}], 'summary': 'To keep learning time manageable, continuous values are converted to discrete values by grouping them into buckets of a tunable size.', 'duration': 31.038, 'max_score': 515.803, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8515803.jpg'}, {'end': 662.867, 'src': 'embed', 'start': 638.449, 'weight': 4, 'content': [{'end': 647.233, 'text': 'But anyway, we can query any of these gym environments this way to know exactly how many actions are possible.', 'start': 638.449, 'duration': 8.784}, {'end': 655.601, 'text': "Okay, so the next thing that we want to do is create, we want to figure out how 
we're going to create this Q table.", 'start': 648.635, 'duration': 6.966}, {'end': 662.867, 'text': "So we want this Q table to be of a size that's at least manageable.", 'start': 655.681, 'duration': 7.186}], 'summary': 'Creating a Q table for gym environments with manageable size.', 'duration': 24.418, 'max_score': 638.449, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8638449.jpg'}, {'end': 772.462, 'src': 'embed', 'start': 745.359, 'weight': 5, 'content': [{'end': 751.837, 'text': 'So with that entire range, so from 0.6 to negative 1.2,', 'start': 745.359, 'duration': 6.478}, {'end': 761.579, 'text': 'We want to separate that range into 20 chunks or 20 buckets or 20 discrete values, okay?', 'start': 751.837, 'duration': 9.742}, {'end': 764.16, 'text': "However you want to call it, that's what we want to do.", 'start': 761.839, 'duration': 2.321}, {'end': 770.781, 'text': 'And then we want to do the same thing for the range of 0.07 to negative 0.07.', 'start': 764.54, 'duration': 6.241}, {'end': 772.462, 'text': 'We want 20 chunks and buckets.', 'start': 770.781, 'duration': 1.681}], 'summary': 'Range from 0.6 to -1.2 divided into 20 chunks, and range from 0.07 to -0.07 also divided into 20 chunks.', 'duration': 27.103, 'max_score': 745.359, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8745359.jpg'}, {'end': 855.659, 'src': 'embed', 'start': 828.63, 'weight': 6, 'content': [{'end': 832.673, 'text': "And then what I'm going to do is I'm just going to print this out just so you can see what I'm talking about.", 'start': 828.63, 'duration': 4.043}, {'end': 839.219, 'text': 'So now print, not in all caps though, print discrete_os_win_size.', 'start': 832.713, 'duration': 6.506}, {'end': 840.9, 'text': "We'll run that real quickly.", 'start': 839.459, 'duration': 1.441}, {'end': 847.666, 'text': 'And you can see now, okay, so for positions, 
each bucket will be of a range that is 0.09 long.', 'start': 841.3, 'duration': 6.366}, {'end': 849.227, 'text': 'And then over here, 0.007 long.', 'start': 847.726, 'duration': 1.501}, {'end': 855.659, 'text': 'for the velocity chunks.', 'start': 854.437, 'duration': 1.222}], 'summary': 'Printing discrete_os_win_size shows the bucket widths: 0.09 for position and 0.007 for velocity.', 'duration': 27.029, 'max_score': 828.63, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8828630.jpg'}], 'start': 331.266, 'title': 'Reinforcement learning concepts', 'summary': 'Covers q learning for mountain car, converting continuous values to discrete, and creating a Q table for reinforcement learning, emphasizing the process of exploring and updating q values, challenges of dealing with continuous values, and the process of creating a Q table with a range divided into 20 chunks.', 'chapters': [{'end': 493.584, 'start': 331.266, 'title': 'Q learning for mountain car', 'summary': 'Explains the concept of q learning with a focus on building a q table to determine the best actions for a car to reach a goal area, while emphasizing the process of exploring and updating q values over time.', 'duration': 162.318, 'highlights': ['The process involves creating a Q table to look up the Q values for any combination of position and velocity, enabling the car to pick the action with the maximum Q value for exploitation, initially initialized with random and useless values and gradually updated over time. (Relevance: 5)', 'The car needs to build momentum by moving up and then swinging down the hill to eventually reach the yellow flag area, demonstrating the challenge of insufficient horsepower to overcome the obstacle. (Relevance: 4)', 'The chapter highlights the initial exploration phase where the agent performs random actions and gradually updates the Q values, indicating the transition from exploration to exploitation in the learning process. 
(Relevance: 3)']}, {'end': 591.778, 'start': 493.584, 'title': 'Converting continuous values to discrete', 'summary': 'Discusses the challenges of dealing with continuous values and the need to convert them to discrete values to avoid memory blowout and long learning times, with the necessity of setting and tweaking variables, as well as the determination of the number of actions that can be taken.', 'duration': 98.194, 'highlights': ['The need to convert continuous values to discrete values to avoid memory blowout and long learning times.', 'The necessity of setting and tweaking variables, including the determination of the number of actions that can be taken.', 'Challenges of dealing with continuous values and the impact on memory and learning times.']}, {'end': 855.659, 'start': 592.138, 'title': 'Creating Q table for reinforcement learning', 'summary': 'Discusses the process of creating a Q table for reinforcement learning, including determining the size and range of discrete observation space, with a focus on dividing the range into 20 chunks and calculating the size of each chunk.', 'duration': 263.521, 'highlights': ["Determining the size of the Q table based on a manageable size and the discrete observation space size of 20 by 20 The Q table is created to be of a size that's at least manageable, with the discrete observation space size set at 20 by 20.", 'Dividing the range of the observation space into 20 chunks to avoid an infinitely continuous range The range of the observation space is separated into 20 chunks to prevent an infinitely continuous range, ensuring easier learning.', 'Calculating the size of each bucket in the observation space, resulting in a range of 0.09 for positions and 0.007 for velocity chunks The size of each bucket in the observation space is calculated, resulting in a range of 0.09 for positions and 0.007 for velocity chunks.']}], 'duration': 524.393, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8331266.jpg', 'highlights': ['Creating a Q table to look up Q values for any combination of position and velocity, enabling the car to pick the action with the maximum Q value for exploitation.', 'The car needs to build momentum by moving up and then swinging down the hill to eventually reach the yellow flag area, demonstrating the challenge of insufficient horsepower to overcome the obstacle.', 'The initial exploration phase where the agent performs random actions and gradually updates the Q values, indicating the transition from exploration to exploitation in the learning process.', 'Converting continuous values to discrete values to avoid memory blowout and long learning times.', 'Determining the size of the Q table based on a manageable size and the discrete observation space size of 20 by 20.', 'Dividing the range of the observation space into 20 chunks to avoid an infinitely continuous range, ensuring easier learning.', 'Calculating the size of each bucket in the observation space, resulting in a range of 0.09 for positions and 0.007 for velocity chunks.']}, {'end': 1071.487, 'segs': [{'end': 908.859, 'src': 'embed', 'start': 877.644, 'weight': 1, 'content': [{'end': 879.585, 'text': 'And by work, I mean just the code should run.', 'start': 877.644, 'duration': 1.941}, {'end': 892.133, 'text': 'But really you would take your Q-learning scripts and make them far more dynamic and probably also run maybe even a few hundred episodes,', 'start': 880.526, 'duration': 11.607}, {'end': 894.455, 'text': 'to tweak these numbers a bit.', 'start': 892.133, 'duration': 2.322}, {'end': 899.957, 'text': "um, but because this is not our stopping point, I'm not going to spend a bunch of time doing that.", 'start': 895.255, 'duration': 4.702}, {'end': 904.478, 'text': "like we're going to get into deep Q learning and stuff, so I don't want to spend too much time there.", 
'start': 899.957, 'duration': 4.521}, {'end': 908.859, 'text': 'uh, we will spend a little bit of time talking about how we can optimize these numbers.', 'start': 904.478, 'duration': 4.381}], 'summary': 'Enhance q-learning scripts by running a few hundred episodes to optimize numbers.', 'duration': 31.215, 'max_score': 877.644, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8877644.jpg'}, {'end': 952.453, 'src': 'embed', 'start': 925.935, 'weight': 3, 'content': [{'end': 929.98, 'text': 'The objective at the end of the day for us is to have this like massive Q table.', 'start': 925.935, 'duration': 4.045}, {'end': 932.143, 'text': "Okay. And so I'm just going to draw.", 'start': 930.18, 'duration': 1.963}, {'end': 937.361, 'text': "just the best table you've probably ever seen in your life.", 'start': 933.918, 'duration': 3.443}, {'end': 942.405, 'text': "And as you've seen, we have three actions right?", 'start': 938.942, 'duration': 3.463}, {'end': 945.908, 'text': "You've got action zero, action one and action two right?", 'start': 942.425, 'duration': 3.483}, {'end': 952.453, 'text': "And then over here you've got every possible combination of.", 'start': 946.328, 'duration': 6.125}], 'summary': 'Objective: create a massive Q table with 3 actions and all combinations.', 'duration': 26.518, 'max_score': 925.935, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8925935.jpg'}, {'end': 1048.759, 'src': 'embed', 'start': 974.477, 'weight': 0, 'content': [{'end': 980.662, 'text': "it's going to come to this table and it's going to see what are the Q values for this specific combination.", 'start': 974.477, 'duration': 6.185}, {'end': 985.605, 'text': "So for action two, maybe it's like a zero, it's a two and a one.", 'start': 980.702, 'duration': 4.903}, {'end': 987.427, 'text': 'Okay. And these are the Q values.', 'start': 985.866, 'duration': 1.561}, 
{'end': 991.45, 'text': "So then it's going to look, what is the highest Q value? Well, it's a two.", 'start': 987.907, 'duration': 3.543}, {'end': 997.336, 'text': "So it's going to perform action number one, because number one has the largest Q value.", 'start': 991.93, 'duration': 5.406}, {'end': 1004.783, 'text': "So over time, at least initially, the agent is going to do a lot of exploration, so it's just going to pick a random value.", 'start': 997.416, 'duration': 7.367}, {'end': 1010.189, 'text': 'But then, over time, as it picks these things and eventually reaches a reward, then,', 'start': 1005.444, 'duration': 4.745}, {'end': 1017.094, 'text': "using the Q function for this lovely thing that we'll probably talk about in the next video,", 'start': 1010.189, 'duration': 6.905}, {'end': 1028.102, 'text': 'it will slowly kind of back propagate that reward to make for higher Q values for the actions that, when chained together, lead to that lovely reward.', 'start': 1017.094, 'duration': 11.008}, {'end': 1034.067, 'text': 'But for now, we just need to initialize this Q table so later we can update this Q table.', 'start': 1028.202, 'duration': 5.865}, {'end': 1048.759, 'text': "So the way that we're going to initialize this Q table is we're going to say q_table is equal to np.random.uniform.", 'start': 1034.827, 'duration': 13.932}], 'summary': 'Using q-learning, agent explores and updates q table for higher rewards.', 'duration': 74.282, 'max_score': 974.477, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8974477.jpg'}], 'start': 856.76, 'title': 'Q-learning algorithm', 'summary': 'Discusses creating a Q table for q-learning, emphasizing the need for dynamic q-learning scripts, optimization of values through multiple episodes, and visualization of a Q table. 
It also explains how the q-learning algorithm works by initializing a q table and gradually updating the q values over time to make optimal decisions, allowing the agent to perform actions with the highest q values and learn from rewards.', 'chapters': [{'end': 952.453, 'start': 856.76, 'title': 'Creating Q table for q-learning', 'summary': 'Discusses the process of creating a Q table for q-learning, emphasizing the need for dynamic q-learning scripts and the optimization of values through multiple episodes, and finally, the visualization of a Q table with three actions and their combinations.', 'duration': 95.693, 'highlights': ['The process of creating a Q table for Q-learning involves dynamic script adjustments and optimization through multiple episodes to ensure the code runs efficiently.', 'The visualization of the Q table includes three actions and their combinations, aiming to create a comprehensive and dynamic Q table for Q-learning.', 'Optimizing values through multiple episodes is crucial, possibly requiring adjustments of a few hundred episodes to ensure efficient Q-learning scripts.']}, {'end': 1071.487, 'start': 952.453, 'title': 'Q-learning algorithm overview', 'summary': 'Explains how q-learning algorithm works by initializing a q table and gradually updating the q values over time to make optimal decisions, allowing the agent to perform actions with the highest q values and learn from rewards.', 'duration': 119.034, 'highlights': ['The Q-learning algorithm works by initializing a Q table with Q values for different combinations of position and velocity, allowing the agent to make optimal decisions based on the highest Q values for actions. (relevance: 5)', 'The agent initially does a lot of exploration by picking random actions, but gradually updates the Q values based on rewards using the Q function, back propagating the rewards to actions that lead to higher rewards. 
(relevance: 4)', 'The Q table is initialized using np.random.uniform with low and high values set to -2 and 0 respectively, which may need to be adjusted for different environments. (relevance: 3)']}], 'duration': 214.727, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH8856760.jpg', 'highlights': ['The Q-learning algorithm works by initializing a Q table with Q values for different combinations of position and velocity, allowing the agent to make optimal decisions based on the highest Q values for actions.', 'The process of creating a Q table for Q-learning involves dynamic script adjustments and optimization through multiple episodes to ensure the code runs efficiently.', 'The agent initially does a lot of exploration by picking random actions, but gradually updates the Q values based on rewards using the Q function, back propagating the rewards to actions that lead to higher rewards.', 'The visualization of the Q table includes three actions and their combinations, aiming to create a comprehensive and dynamic Q table for Q-learning.', 'Optimizing values through multiple episodes is crucial, possibly requiring adjustments of a few hundred episodes to ensure efficient Q-learning scripts.', 'The Q table is initialized using np.random.uniform with low and high values set to -2 and 0 respectively, which may need to be adjusted for different environments.']}, {'end': 1443.378, 'segs': [{'end': 1143.848, 'src': 'embed', 'start': 1117.384, 'weight': 1, 'content': [{'end': 1125.763, 'text': "And again, to your agent doesn't really matter all that much, but to you it might, and that so basically it's good.", 'start': 1117.384, 'duration': 8.379}, {'end': 1135.205, 'text': 'the reward is going to be a negative one until you reach that flag, at which point your reward will be 
zero.', 'start': 1125.763, 'duration': 9.442}, {'end': 1139.407, 'text': "um, so so yeah, so so that's how.", 'start': 1135.205, 'duration': 4.202}, {'end': 1143.848, 'text': 'um, once it finally reaches that flag, it gets a zero reward, which is higher than negative one.', 'start': 1139.407, 'duration': 4.441}], 'summary': 'Training reward goes from negative to zero upon reaching flag.', 'duration': 26.464, 'max_score': 1117.384, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH81117384.jpg'}, {'end': 1272.95, 'src': 'embed', 'start': 1251.517, 'weight': 0, 'content': [{'end': 1261.765, 'text': "But basically you can just think of it as it's every combination of possible environment observations times three actions that we could take.", 'start': 1251.517, 'duration': 10.248}, {'end': 1268.37, 'text': "And then inside of those cells, or whatever you want to call them, you're going to have your three at this point,", 'start': 1262.105, 'duration': 6.265}, {'end': 1272.95, 'text': 'random Q values between negative 2 and 0..', 'start': 1268.37, 'duration': 4.58}], 'summary': 'The q-learning algorithm generates random q values between -2 and 0 for all combinations of environment observations and actions.', 'duration': 21.433, 'max_score': 1251.517, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH81251517.jpg'}, {'end': 1368.14, 'src': 'embed', 'start': 1314.934, 'weight': 2, 'content': [{'end': 1327.741, 'text': "um, it's every you know possible observation combination and then the q values for at every possible observation combination and every combination of or every possible action at every possible combination,", 'start': 1314.934, 'duration': 12.807}, {'end': 1330.25, 'text': 'We have a random starting Q value.', 'start': 1328.188, 'duration': 2.062}, {'end': 1341.783, 'text': "Okay, so I think I'm going to cut it here, and in the next video we will talk 
about how to actually start training this model,", 'start': 1333.074, 'duration': 8.709}, {'end': 1347.63, 'text': 'iterating and going over the steps per episode and all the episodes and all that.', 'start': 1341.783, 'duration': 5.847}, {'end': 1355.249, 'text': 'that. so, uh, with that a shout out to my longest term channel members.', 'start': 1348.183, 'duration': 7.066}, {'end': 1359.833, 'text': 'these are people who have been with me for, uh, a year or more.', 'start': 1355.249, 'duration': 4.584}, {'end': 1360.794, 'text': "that's incredible.", 'start': 1359.833, 'duration': 0.961}, {'end': 1362.616, 'text': "i can't believe it's been a year that i've had these.", 'start': 1360.794, 'duration': 1.822}, {'end': 1368.14, 'text': 'but, uh, stenosco or stenosco, uh, my longest term member, thank you very much.', 'start': 1362.616, 'duration': 5.524}], 'summary': 'Discussion on q-values and training model, acknowledging long-term channel members.', 'duration': 53.206, 'max_score': 1314.934, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH81314934.jpg'}, {'end': 1415.816, 'src': 'embed', 'start': 1389.316, 'weight': 5, 'content': [{'end': 1393.961, 'text': 'If you guys want to support the channel, you can click on that blue join button.', 'start': 1389.316, 'duration': 4.645}, {'end': 1400.429, 'text': "I know some people don't have access to it for some reason, but if you have it and you want to support the channel, you can click on that right there.", 'start': 1394.001, 'duration': 6.428}, {'end': 1407.653, 'text': "Otherwise, if you've got questions, comments, whatever about Q-Learning, please ask them below.", 'start': 1401.73, 'duration': 5.923}, {'end': 1415.816, 'text': "I can't tell you how many Q-Learning tutorials I ended up having to go through before slowly figuring it out.", 'start': 1407.673, 'duration': 8.143}], 'summary': 'Encourages support for the channel through the join button and invites 
questions about q-learning.', 'duration': 26.5, 'max_score': 1389.316, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH81389316.jpg'}, {'end': 1443.378, 'src': 'embed', 'start': 1431.643, 'weight': 6, 'content': [{'end': 1439.353, 'text': 'So anyway, in the next video, we will hopefully finish up our agent and actually have it driving on to that flag every single time.', 'start': 1431.643, 'duration': 7.71}, {'end': 1443.378, 'text': 'Anyway, I will see you guys in that video.', 'start': 1440.575, 'duration': 2.803}], 'summary': 'In the next video, the goal is to finish the agent and have it drive to the flag every time.', 'duration': 11.735, 'max_score': 1431.643, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH81431643.jpg'}], 'start': 1071.507, 'title': 'Q-table training and channel member appreciation', 'summary': 'Delves into q-table training, covering the intuition behind q-tables, initializing values, dimensions, and the impact of rewards and actions. 
it also addresses channel member appreciation, promoting support for long-term members, and offering q-learning tutorial guidance.', 'chapters': [{'end': 1341.783, 'start': 1071.507, 'title': 'Reinforcement learning q-table training', 'summary': 'Discusses the intuition behind q-tables in reinforcement learning, initializing q-table values, and the dimensions of the q-table, including the impact of rewards and actions on the table values.', 'duration': 270.276, 'highlights': ['The rewards are consistently negative until the flag is reached, at which point the reward becomes zero, influencing the Q-table values.', 'The Q-table is initialized with values between negative 2 and 0, and its dimensions are 20 by 20 by 3, representing every combination of observations and actions.', "The Q-table values are initially random but will be optimized over time through agent exploration and exploitation, impacting the Q-table's values and eventual training of the model.", "The Q-table's shape is confirmed to be 20 by 20 by 3, representing the combination of observations and actions, with starting Q-values between negative 2 and 0."]}, {'end': 1443.378, 'start': 1341.783, 'title': 'Channel member appreciation & q-learning tutorial', 'summary': 'Discusses appreciation for long-term channel members, encouraging support for the channel, and offering help and guidance for understanding q-learning tutorials.', 'duration': 101.595, 'highlights': ['Long-term channel members are appreciated for supporting free education, with a special mention to the longest term member, stenosco.', 'The speaker encourages viewers to support the channel by clicking the join button and offers assistance for understanding Q-Learning tutorials.', 'In the next video, the speaker aims to complete the agent and ensure it consistently reaches the flag, providing a glimpse into the upcoming content.']}], 'duration': 371.871, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yMk_XtIEzH8/pics/yMk_XtIEzH81071507.jpg', 'highlights': ['The Q-table is initialized with values between negative 2 and 0, and its dimensions are 20 by 20 by 3, representing every combination of observations and actions.', 'The rewards are consistently negative until the flag is reached, at which point the reward becomes zero, influencing the Q-table values.', "The Q-table values are initially random but will be optimized over time through agent exploration and exploitation, impacting the Q-table's values and eventual training of the model.", "The Q-table's shape is confirmed to be 20 by 20 by 3, representing the combination of observations and actions, with starting Q-values between negative 2 and 0.", 'Long-term channel members are appreciated for supporting free education, with a special mention to the longest term member, stenosco.', 'The speaker encourages viewers to support the channel by clicking the join button and offers assistance for understanding Q-Learning tutorials.', 'In the next video, the speaker aims to complete the agent and ensure it consistently reaches the flag, providing a glimpse into the upcoming content.']}], 'highlights': ['Q-learning is a model-free reinforcement learning algorithm, updating Q values for every action given a state over time (Relevance: 5)', 'The Q-learning model is designed to reward for the long-term goal rather than short-term actions (Relevance: 3)', 'The Q-learning algorithm is not environment-specific and can be applied to any environment (Relevance: 2)', 'The Q-table is initialized with values between negative 2 and 0, and its dimensions are 20 by 20 by 3, representing every combination of observations and actions (Relevance: 1)', 'The rewards are consistently negative until the flag is reached, at which point the reward becomes zero, influencing the Q-table values (Relevance: 1)', 'Creating a Q table to look up Q values for any combination of position 
and velocity, enabling the car to pick the action with the maximum Q value for exploitation (Relevance: 1)', 'The initial exploration phase where the agent performs random actions and gradually updates the Q values, indicating the transition from exploration to exploitation in the learning process (Relevance: 1)', 'Converting continuous values to discrete values to avoid excessive memory use and long learning times (Relevance: 1)', 'Determining the size of the Q-table based on a manageable size and the discrete observation space size of 20 by 20 (Relevance: 1)', 'The process of creating a Q-table for Q-learning involves dynamic script adjustments and optimization through multiple episodes to ensure the code runs efficiently (Relevance: 1)']}