title
A. I. Learns to Play Starcraft 2 (Reinforcement Learning)

description
Tinkering with reinforcement learning via Stable Baselines 3 and Starcraft 2.
Code and model: https://github.com/Sentdex/SC2RL
Stable Baselines 3 tutorial: https://pythonprogramming.net/introduction-reinforcement-learning-stable-baselines-3-tutorial/
Neural Networks from Scratch book: https://nnfs.io
Channel membership: https://www.youtube.com/channel/UCfzlCWGWYyIQ0aLC5w48gBQ/join
Discord: https://discord.gg/sentdex
Reddit: https://www.reddit.com/r/sentdex/
Support the content: https://pythonprogramming.net/support-donate/
Twitter: https://twitter.com/sentdex
Instagram: https://instagram.com/sentdex
Facebook: https://www.facebook.com/pythonprogramming.net/
Twitch: https://www.twitch.tv/sentdex
#artificialintelligence #machinelearning #python

detail
{'title': 'A. I. Learns to Play Starcraft 2 (Reinforcement Learning)', 'heatmap': [{'end': 981.765, 'start': 962.904, 'weight': 0.774}], 'summary': 'Explores playing Starcraft 2 using Python libraries and deep reinforcement learning from Stable Baselines 3, emphasizing the importance of determining inputs and actions, the challenges of benchmarking agent performance and designing reward mechanisms, and the result: a peak reward of about 200 and a win rate of around 70% against the hard computer bot.', 'chapters': [{'end': 50.476, 'segs': [{'end': 50.476, 'src': 'embed', 'start': 0.109, 'weight': 0, 'content': [{'end': 5.075, 'text': 'Starcraft 2 is a multiplayer game where the objective is simply to eliminate the other players.', 'start': 0.109, 'duration': 4.966}, {'end': 7.418, 'text': 'You do so with an army of attack units.', 'start': 5.355, 'duration': 2.063}, {'end': 8.399, 'text': 'We like this.', 'start': 7.658, 'duration': 0.741}, {'end': 9.941, 'text': 'We do not like this.', 'start': 8.96, 'duration': 0.981}, {'end': 15.588, 'text': 'Those attack units cost resources to make, so you probably need to set up a base, build various buildings,', 'start': 10.301, 'duration': 5.287}, {'end': 18.051, 'text': 'collect resources and then you can build that army.', 'start': 15.588, 'duration': 2.463}, {'end': 21.954, 'text': 'All of this involves seemingly endless strategizing and knowledge of the game,', 'start': 18.231, 'duration': 3.723}, {'end': 25.937, 'text': 'along with years of tuning muscle memory to achieve high actions per minute.', 'start': 21.954, 'duration': 3.983}, {'end': 31.261, 'text': "But we're programmers, and we can just use libraries that let us use Python code to play the game for us.", 'start': 25.997, 'duration': 5.264}, {'end': 34.704, 'text': 'Even this, however, takes a while to master all of the strategies.', 'start': 31.421, 'duration': 3.283}, {'end': 37.166, 'text': "But we're also machine learning practitioners.", 
'start': 35.084, 'duration': 2.082}, {'end': 42.37, 'text': 'Could we just simply import some reinforcement learning from Stable Baselines 3 and play the game?', 'start': 37.266, 'duration': 5.104}, {'end': 44.252, 'text': "I guess it's going to be a little harder than that.", 'start': 42.47, 'duration': 1.782}, {'end': 49.175, 'text': "But let's say that we really do want to use deep reinforcement learning to play some Starcraft 2.", 'start': 44.532, 'duration': 4.643}, {'end': 50.476, 'text': 'How might we actually do that?', 'start': 49.175, 'duration': 1.301}], 'summary': 'Starcraft 2 involves building an army to eliminate the other players, which takes strategy, game knowledge, and fast execution; the question is whether deep reinforcement learning can play it instead.', 'duration': 50.367, 'max_score': 0.109, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4109.jpg'}], 'start': 0.109, 'title': 'Playing Starcraft 2 with deep reinforcement learning', 'summary': 'Explores the complexities of playing Starcraft 2, including strategizing, game knowledge, and muscle memory, and discusses the potential of using Python libraries and deep reinforcement learning from Stable Baselines 3 for gameplay.', 'chapters': [{'end': 50.476, 'start': 0.109, 'title': 'Playing Starcraft 2 with deep reinforcement learning', 'summary': 'Explores the complexities of playing Starcraft 2, including the need for strategizing, knowledge of the game, and muscle memory, and discusses the potential of using Python libraries and deep reinforcement learning from Stable Baselines 3 for gameplay.', 'duration': 50.367, 'highlights': ['Using Python libraries to play the game involves a learning curve but is accessible to programmers.', 'Playing Starcraft 2 requires strategizing, knowledge of the game, and muscle memory for high actions per minute.', 'Exploring the use of deep reinforcement learning from Stable Baselines 3 for playing Starcraft 2.']}], 'duration': 50.367, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4109.jpg', 'highlights': ['Exploring the use of deep reinforcement learning from Stable Baselines 3 for playing Starcraft 2.', 'Playing Starcraft 2 requires strategizing, knowledge of the game, and muscle memory for high actions per minute.', 'Using Python libraries to play the game involves a learning curve but is accessible to programmers.']}, {'end': 373.014, 'segs': [{'end': 99.46, 'src': 'embed', 'start': 66.048, 'weight': 0, 'content': [{'end': 70.33, 'text': 'In this case, it could quite literally be the game visuals, like the frames from the game.', 'start': 66.048, 'duration': 4.282}, {'end': 77.953, 'text': 'it could be some other graphical representation, or it could just be a vector of values like number of units and minerals, and locations of things,', 'start': 70.33, 'duration': 7.623}, {'end': 78.473, 'text': 'and so on.', 'start': 77.953, 'duration': 0.52}, {'end': 83.575, 'text': 'Then, for the output of this model, we need to consider what this neural network can actually decide.', 'start': 78.613, 'duration': 4.962}, {'end': 88.837, 'text': 'Essentially, with deep reinforcement learning in this case, this is going to be the actions that the model can actually take.', 'start': 83.675, 'duration': 5.162}, {'end': 93.838, 'text': 'whether to expand and build out our base, to attack the enemy, and so on.', 'start': 89.717, 'duration': 4.121}, {'end': 99.46, 'text': 'For the input into the model, I think something like the minimap is actually a really good candidate for model input.', 'start': 94.098, 'duration': 5.362}], 'summary': 'Using game visuals or vector values as input, the neural network decides actions like base expansion or attacking the enemy in deep reinforcement learning.', 'duration': 33.412, 'max_score': 66.048, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ466048.jpg'}, {'end': 144.226, 
'src': 'embed', 'start': 115.905, 'weight': 3, 'content': [{'end': 120.008, 'text': "For the input map, we're going to start with a blank canvas that is just simply the size of the game map.", 'start': 115.905, 'duration': 4.103}, {'end': 121.869, 'text': "Then first, we're going to draw the minerals.", 'start': 120.028, 'duration': 1.841}, {'end': 126.633, 'text': "Not only are we going to draw them, but we'll also make them more bright the more minerals there are left.", 'start': 122.03, 'duration': 4.603}, {'end': 132.777, 'text': "We'll use the same concept for minerals as well as the gas, but also a similar idea for building and unit health.", 'start': 126.773, 'duration': 6.004}, {'end': 134.519, 'text': "Here's the result of this so far.", 'start': 132.878, 'duration': 1.641}, {'end': 140.883, 'text': 'We can only gather information on minerals that we can see with units, which is why some of them are just very dim, and then some are very bright.', 'start': 134.699, 'duration': 6.184}, {'end': 144.226, 'text': "The dim ones we just, we haven't seen, so we have no idea how many minerals are there.", 'start': 140.923, 'duration': 3.303}], 'summary': 'A game map is initialized with minerals and gas, their brightness changes based on quantity, and unexplored minerals are dim.', 'duration': 28.321, 'max_score': 115.905, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4115905.jpg'}, {'end': 237.801, 'src': 'embed', 'start': 200.316, 'weight': 4, 'content': [{'end': 206.404, 'text': "And just by watching games with this representation, I think it's really easy to see what's going on, how we're doing, and all that.", 'start': 200.316, 'duration': 6.088}, {'end': 210.569, 'text': 'I think you could definitely make good decisions with this information.', 'start': 206.424, 'duration': 4.145}, {'end': 214.595, 'text': 'So I think this will be a pretty good starting point for input to our model.', 'start': 210.95, 
'duration': 3.645}, {'end': 216.097, 'text': 'Will it work? Maybe.', 'start': 214.915, 'duration': 1.182}, {'end': 219.2, 'text': 'Remember, the best thing that we can do is keep things simple to start.', 'start': 216.377, 'duration': 2.823}, {'end': 224.606, 'text': 'One issue I can already imagine here is that this is just a ton of mostly noise per input.', 'start': 219.441, 'duration': 5.165}, {'end': 229.592, 'text': "The image is mostly empty, so it's going to be an area that we could probably improve upon later.", 'start': 224.726, 'duration': 4.866}, {'end': 233.136, 'text': 'But now that we have network input, we need to handle for the output logic,', 'start': 229.672, 'duration': 3.464}, {'end': 237.801, 'text': "which will be the macro actions that we're going to allow the reinforcement learning agent to take.", 'start': 233.136, 'duration': 4.665}], 'summary': 'Watching games for representation, simplifying for good decisions. potential issues with noise and empty inputs to improve upon.', 'duration': 37.485, 'max_score': 200.316, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4200316.jpg'}, {'end': 314.327, 'src': 'embed', 'start': 276.071, 'weight': 7, 'content': [{'end': 282.377, 'text': 'Right now, the code is limiting one stargate per base, but this could be changed to allow the AI to make endless stargates.', 'start': 276.071, 'duration': 6.306}, {'end': 287.022, 'text': 'Theoretically speeding up the ability for you to build void rays faster.', 'start': 282.857, 'duration': 4.165}, {'end': 290.667, 'text': 'For action number two, if we can afford and we can build a void ray, build one.', 'start': 287.062, 'duration': 3.605}, {'end': 292.869, 'text': 'Action number three, this is to scout.', 'start': 290.787, 'duration': 2.082}, {'end': 294.491, 'text': 'Just not too often.', 'start': 293.13, 'duration': 1.361}, {'end': 300.298, 'text': "Essentially scouting is sending a poor probe to a 
certain death and we're just doing this to see what the enemy is up to.", 'start': 294.692, 'duration': 5.606}, {'end': 303.08, 'text': "Initially we're going to start out completely random.", 'start': 301.039, 'duration': 2.041}, {'end': 310.265, 'text': 'so theoretically, one out of every six frames, we could be sending a probe, and this would be faster than we could possibly build probes.', 'start': 303.08, 'duration': 7.185}, {'end': 314.327, 'text': "So just to limit this, we're adding a limit of once every 200 frames.", 'start': 310.665, 'duration': 3.662}], 'summary': 'The code allows one stargate per base, a limit that could be lifted; random action selection could trigger scouting about one frame in six, so scouting is limited to once every 200 frames.', 'duration': 38.256, 'max_score': 276.071, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4276071.jpg'}, {'end': 382.18, 'src': 'embed', 'start': 352.208, 'weight': 6, 'content': [{'end': 356.875, 'text': 'The Void Rays should also be capable of just simply returning to the base after a fight.', 'start': 352.208, 'duration': 4.667}, {'end': 359.879, 'text': "So it's not just to flee, it's also just to bring our Void Rays back home.", 'start': 356.915, 'duration': 2.964}, {'end': 367.19, 'text': "The Void Ray units are really our only base defense here, so if they're not busy elsewhere, we should bring them back sometimes.", 'start': 360.48, 'duration': 6.71}, {'end': 373.014, 'text': 'At this point, the last thing that we need to do is calculate a reward for the agent to learn from.', 'start': 367.851, 'duration': 5.163}, {'end': 376.797, 'text': 'But first, is this even a worthy experiment at all?', 'start': 373.655, 'duration': 3.142}, {'end': 382.18, 'text': "I can already hear some people complaining that the agent isn't controlling every single action in the game,", 'start': 376.877, 'duration': 5.303}], 'summary': 'Void rays should return to base after a fight for defense, worth 
considering for agent learning.', 'duration': 29.972, 'max_score': 352.208, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4352208.jpg'}], 'start': 50.596, 'title': 'Deep reinforcement learning considerations', 'summary': 'Emphasizes the importance of determining inputs and actions in deep reinforcement learning, with a focus on using a minimap as a model input. it also discusses game map visualization and network input for informed decision-making and outlines an ai strategy for void ray management in the game.', 'chapters': [{'end': 115.705, 'start': 50.596, 'title': 'Deep reinforcement learning inputs and outputs', 'summary': 'Emphasizes the key considerations for inputs and outputs in deep reinforcement learning, highlighting the importance of determining the characteristics of inputs and actions for the neural network, with a focus on the potential use of a minimap as a model input.', 'duration': 65.109, 'highlights': ['Determining the characteristics of inputs and actions for the neural network is crucial for deep reinforcement learning, with a focus on the potential use of a minimap as a model input.', "The model's inputs could range from game visuals like frames, other graphical representations, to a vector of values including number of units, minerals, and locations.", 'The output of the model in deep reinforcement learning involves deciding actions such as base expansion, attacking the enemy, etc.']}, {'end': 237.801, 'start': 115.905, 'title': 'Game map visualization and network input', 'summary': "Discusses the visualization of the game map, including the rendering of minerals, enemy units, structures, and the player's own buildings, providing a clear representation for making informed decisions. 
the network input will be used for handling the output logic for reinforcement learning agents.", 'duration': 121.896, 'highlights': ["The visualization includes rendering minerals, enemy units, structures, and the player's own buildings in distinct colors to provide a clear representation of the game map. The visualization process involves rendering minerals, enemy units, structures, and the player's own buildings in distinct colors, such as bright minerals, red enemy units and structures, light blue Nexus buildings, greenish other structures, and pink Vespene. This approach aims to provide a clear representation for making informed decisions during gameplay.", "The visualization aims to make it easy to see what's happening in the game and facilitate informed decision-making. The visualization process aims to make it easy to see what's happening in the game, enabling informed decision-making based on the representation of the game map. This approach is expected to provide valuable insights for strategy and gameplay assessment.", 'The network input will be used to handle the output logic for reinforcement learning agents. The network input will play a crucial role in handling the output logic for reinforcement learning agents, enabling the implementation of macro actions that the reinforcement learning agent can take. This aspect is essential for the effective functioning of the reinforcement learning system in the game environment.']}, {'end': 373.014, 'start': 238.121, 'title': 'Void ray ai strategy', 'summary': 'Outlines the ai strategy for building and managing void rays in the game, from resource collection to scouting, attacking, and retreating, aiming to optimize the void ray production and utilization to gain a strategic advantage.', 'duration': 134.893, 'highlights': ["The chapter outlines the AI strategy for building and managing Void Rays in the game, from resource collection to scouting, attacking, and retreating. 
This provides an overview of the AI strategy's scope and comprehensiveness.", 'The code limits one stargate per base, but it could be changed to allow the AI to make endless stargates, theoretically speeding up the ability to build void rays faster. This highlights the potential for optimizing Void Ray production by adjusting stargate limitations.', 'Scouting is limited to once every 200 frames to balance information gathering without compromising resource allocation. This quantifies the limitation imposed on scouting actions to maintain a balance in resource usage.']}], 'duration': 322.418, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ450596.jpg', 'highlights': ['Determining inputs and actions is crucial for deep reinforcement learning, with a focus on using a minimap as a model input.', "The model's inputs range from game visuals to a vector of values including units, minerals, and locations.", 'The output of the model involves deciding actions such as base expansion, attacking the enemy, etc.', "The visualization includes rendering minerals, enemy units, structures, and the player's own buildings in distinct colors to provide a clear representation of the game map.", "The visualization aims to make it easy to see what's happening in the game and facilitate informed decision-making.", 'The network input plays a crucial role in handling the output logic for reinforcement learning agents.', 'The chapter outlines the AI strategy for building and managing Void Rays in the game, from resource collection to scouting, attacking, and retreating.', 'The code limits one stargate per base, but it could be changed to allow the AI to make endless stargates, theoretically speeding up the ability to build void rays faster.', 'Scouting is limited to once every 200 frames to balance information gathering without compromising resource allocation.']}, {'end': 565.832, 'segs': [{'end': 405.79, 'src': 'embed', 'start': 
373.655, 'weight': 3, 'content': [{'end': 376.797, 'text': 'But first, is this even a worthy experiment at all?', 'start': 373.655, 'duration': 3.142}, {'end': 382.18, 'text': "I can already hear some people complaining that the agent isn't controlling every single action in the game,", 'start': 376.877, 'duration': 5.303}, {'end': 387.703, 'text': 'and we must ask ourselves: how will we know if this agent is good or not?', 'start': 382.18, 'duration': 5.523}, {'end': 389.584, 'text': 'We need a benchmark to compare to.', 'start': 387.703, 'duration': 1.881}, {'end': 394.466, 'text': 'So, before we waste our time on a reward mechanism and all the R&D required there, which, honestly,', 'start': 389.844, 'duration': 4.622}, {'end': 399.548, 'text': 'is often the hardest part of any reinforcement learning development, we really do need to test random.', 'start': 394.466, 'duration': 5.082}, {'end': 405.79, 'text': 'Which is really just a question of how good does an agent do if it is just randomly picking actions.', 'start': 400.128, 'duration': 5.662}], 'summary': 'Before investing in reward design, the agent needs a benchmark: how well does an agent do when it picks actions at random?', 'duration': 32.135, 'max_score': 373.655, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4373655.jpg'}, {'end': 453.82, 'src': 'embed', 'start': 408.412, 'weight': 2, 'content': [{'end': 416.518, 'text': "Isn't it really just possible that randomly picking actions will also win games or at least be remotely comparable to any reinforcement learning agent?", 'start': 408.412, 'duration': 8.106}, {'end': 417.999, 'text': 'Well, we can test this really easily.', 'start': 416.698, 'duration': 1.301}, {'end': 424.504, 'text': 'After running 200 games running random actions, we have not won a single game.', 'start': 418.439, 'duration': 6.065}, {'end': 427.987, 'text': 'Okay, I think we have something we can certainly work for.', 
'start': 425.905, 'duration': 2.082}, {'end': 430.728, 'text': "Alright, now let's talk reward.", 'start': 428.487, 'duration': 2.241}, {'end': 436.591, 'text': 'In order for the agent to learn anything at all, we need to reward it for doing quote-unquote good.', 'start': 431.549, 'duration': 5.042}, {'end': 440.193, 'text': 'Rewarding agents in reinforcement learning is often the hardest part of all.', 'start': 437.252, 'duration': 2.941}, {'end': 441.914, 'text': 'Take our problem for example.', 'start': 440.693, 'duration': 1.221}, {'end': 444.535, 'text': 'The actual goal here is to win games.', 'start': 441.994, 'duration': 2.541}, {'end': 445.116, 'text': "It's pretty simple.", 'start': 444.575, 'duration': 0.541}, {'end': 448.497, 'text': 'If we win, we could give the agent a reward.', 'start': 445.476, 'duration': 3.021}, {'end': 450.819, 'text': 'And if we lose, a big negative reward.', 'start': 448.678, 'duration': 2.141}, {'end': 453.82, 'text': "The issue is, as we've seen already,", 'start': 451.539, 'duration': 2.281}], 'summary': 'After 200 games of random actions, no wins, highlighting importance of rewarding in reinforcement learning.', 'duration': 45.408, 'max_score': 408.412, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4408412.jpg'}, {'end': 518.631, 'src': 'embed', 'start': 492.896, 'weight': 0, 'content': [{'end': 498.259, 'text': 'To do this, you can just give a very small reward per step for things that you think might be helpful.', 'start': 492.896, 'duration': 5.363}, {'end': 502.061, 'text': 'And then you can still give one large reward at the end for actually winning the game.', 'start': 498.479, 'duration': 3.582}, {'end': 505.163, 'text': 'So what might we reward to help win?', 'start': 502.381, 'duration': 2.782}, {'end': 510.786, 'text': 'We could reward resource gathering, but this would probably just result in an agent prioritizing,', 'start': 505.523, 'duration': 5.263}, 
{'end': 517.13, 'text': "lengthening the game so it can collect as many resources as possible, and it probably wouldn't want to build many attack units,", 'start': 510.786, 'duration': 6.344}, {'end': 518.631, 'text': 'since these cost resources.', 'start': 517.13, 'duration': 1.501}], 'summary': 'Use small per-step rewards for helpful behavior plus a large reward for winning; rewarding resource gathering alone would likely prolong games and discourage building attack units.', 'duration': 25.735, 'max_score': 492.896, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4492896.jpg'}, {'end': 582.29, 'src': 'embed', 'start': 550.406, 'weight': 1, 'content': [{'end': 554.487, 'text': 'This incentivizes actually eliminating or at least attacking the enemy.', 'start': 550.406, 'duration': 4.081}, {'end': 561.81, 'text': 'It will still possibly be the case that keeping the enemy just barely alive to make more units and buildings for us to destroy will happen,', 'start': 554.847, 'duration': 6.963}, {'end': 565.832, 'text': 'but my hope is that a large reward for actually finishing the game will overcome this.', 'start': 561.81, 'duration': 4.022}, {'end': 573.761, 'text': "Okay, so we've got some great ideas here, but we still have to actually connect Stable Baselines 3 in this Python StarCraft 2 environment.", 'start': 566.132, 'duration': 7.629}, {'end': 582.29, 'text': 'And to do that, we also need to convert the Python SC2 environment to an OpenAI Gym environment in order to work with Stable Baselines 3.', 'start': 574.081, 'duration': 8.209}], 'summary': 'Rewarding attacks incentivizes eliminating the enemy rather than keeping it barely alive; the next step is connecting Stable Baselines 3 to the Python SC2 environment.', 'duration': 31.884, 'max_score': 550.406, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4550406.jpg'}], 'start': 373.655, 'title': 'Reinforcement learning challenges and rewards', 'summary': "Discusses the challenges of benchmarking an 
agent's performance and the difficulty of designing a reward mechanism in reinforcement learning, as well as the initial failure of random action testing in winning games. it also explores the challenge of providing rewards in reinforcement learning, proposing the use of intermediary rewards and a focus on incentivizing attacking as the primary goal.", 'chapters': [{'end': 476.949, 'start': 373.655, 'title': 'Reinforcement learning challenges', 'summary': "Discusses the challenges of benchmarking an agent's performance and the difficulty of designing a reward mechanism in reinforcement learning, as well as the initial failure of random action testing in winning games.", 'duration': 103.294, 'highlights': ['The initial failure of randomly picking actions in winning any games after running 200 games.', 'The difficulty of designing a reward mechanism in reinforcement learning, particularly in the context of a large number of observations and actions per game.', "The need for a benchmark to compare the agent's performance and the consideration of testing random actions as a baseline for comparison."]}, {'end': 565.832, 'start': 477.449, 'title': 'Reinforcement learning rewards for winning', 'summary': 'Discusses the challenge of providing rewards in reinforcement learning and proposes using intermediary rewards, such as a small reward per step for helpful actions, and a large reward for winning the game, with a focus on incentivizing attacking as the primary goal.', 'duration': 88.383, 'highlights': ['An intermediary reward system is proposed, involving small rewards per step for helpful actions and a large reward for winning the game, to aid in reinforcement learning (RL) (e.g., a small reward for actively attacking with the void ray unit).', 'The challenge of incentivizing winning in reinforcement learning (RL) is discussed, with considerations for avoiding incentivizing actions that lengthen the game or discourage winning (e.g., rewarding resource gathering may lead 
to prioritizing lengthening the game to collect resources, rather than attacking).', 'The focus on incentivizing attacking as the primary goal in reinforcement learning (RL) is emphasized, with the hope that a large reward for finishing the game will overcome the potential issue of keeping the enemy alive to make more units and buildings to destroy.']}], 'duration': 192.177, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4373655.jpg', 'highlights': ['An intermediary reward system is proposed, involving small rewards per step for helpful actions and a large reward for winning the game, to aid in reinforcement learning (RL) (e.g., a small reward for actively attacking with the void ray unit).', 'The focus on incentivizing attacking as the primary goal in reinforcement learning (RL) is emphasized, with the hope that a large reward for finishing the game will overcome the potential issue of keeping the enemy alive to make more units and buildings to destroy.', 'The difficulty of designing a reward mechanism in reinforcement learning, particularly in the context of a large number of observations and actions per game.', "The need for a benchmark to compare the agent's performance and the consideration of testing random actions as a baseline for comparison.", 'The challenge of incentivizing winning in reinforcement learning (RL) is discussed, with considerations for avoiding incentivizing actions that lengthen the game or discourage winning (e.g., rewarding resource gathering may lead to prioritizing lengthening the game to collect resources, rather than attacking).', 'The initial failure of randomly picking actions in winning any games after running 200 games.']}, {'end': 1026.242, 'segs': [{'end': 590.996, 'src': 'embed', 'start': 566.132, 'weight': 1, 'content': [{'end': 573.761, 'text': "Okay, so we've got some great ideas here, but we still have to actually connect StableBaselines 3 in this Python StarCraft 2 
environment.", 'start': 566.132, 'duration': 7.629}, {'end': 582.29, 'text': 'And to do that, we also need to convert the Python SC2 environment to an OpenAI Gym environment in order to work with Stable Baselines 3.', 'start': 574.081, 'duration': 8.209}, {'end': 590.996, 'text': 'Due to the way everything runs and can or cannot communicate while it runs, this presented a far more difficult challenge than I expected.', 'start': 582.29, 'duration': 8.706}], 'summary': 'Connecting Stable Baselines 3 to the Python SC2 environment requires converting it to an OpenAI Gym environment, which proved harder than expected.', 'duration': 24.864, 'max_score': 566.132, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4566132.jpg'}, {'end': 631.511, 'src': 'embed', 'start': 602.003, 'weight': 0, 'content': [{'end': 605.626, 'text': 'So for everything to communicate here, I ended up going with a state file.', 'start': 602.003, 'duration': 3.623}, {'end': 611.832, 'text': 'This file contains the state, which is going to be our mini-map, the reward, action, and whether or not the game is done.', 'start': 605.987, 'duration': 5.845}, {'end': 615.455, 'text': "And this shared file is how we'll communicate between the systems here.", 'start': 611.972, 'duration': 3.483}, {'end': 617.717, 'text': 'Is there a better way? Absolutely.', 'start': 616.075, 'duration': 1.642}, {'end': 619.599, 'text': 'Will this work though? 
I hope so.', 'start': 618.017, 'duration': 1.582}, {'end': 624.463, 'text': 'So this is definitely going to be very confusing, with everything that has to connect and wait on each other and all this.', 'start': 619.899, 'duration': 4.564}, {'end': 626.645, 'text': "but let's just go in logical order.", 'start': 624.463, 'duration': 2.182}, {'end': 631.511, 'text': "So to actually run everything, we're just going to have a training script, train.py.", 'start': 627.285, 'duration': 4.226}], 'summary': 'A shared state file carries the state, reward, action, and done flag between the systems; a training script, train.py, runs everything.', 'duration': 29.508, 'max_score': 602.003, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4602003.jpg'}, {'end': 823.412, 'src': 'embed', 'start': 796.043, 'weight': 4, 'content': [{'end': 803.709, 'text': 'I spent quite a while tinkering with various rewards and logic, but the one that worked best was just a simple reward for attacking and, of course,', 'start': 796.043, 'duration': 7.666}, {'end': 804.49, 'text': 'winning or losing.', 'start': 803.709, 'duration': 0.781}, {'end': 809.008, 'text': 'While the models were training, I was obviously tracking things like overall reward.', 'start': 804.827, 'duration': 4.181}, {'end': 812.129, 'text': 'The reward mechanism itself changed quite a few times,', 'start': 809.128, 'duration': 3.001}, {'end': 823.412, 'text': "so each variant isn't necessarily directly comparable to the others. I have also personally found that it's best to try multiple times with the same exact reinforcement learning algorithm,", 'start': 812.129, 'duration': 11.283}], 'summary': 'After tinkering with rewards and logic, a simple reward for attacking plus a win or loss reward worked best; overall reward was tracked during training, and the same algorithm was run multiple times.', 'duration': 27.369, 'max_score': 796.043, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4796043.jpg'}, {'end': 
893.583, 'src': 'embed', 'start': 834.335, 'weight': 2, 'content': [{'end': 835.495, 'text': 'As you can probably tell,', 'start': 834.335, 'duration': 1.16}, {'end': 845.118, 'text': 'I made many attempts before settling on that very simple reward mechanism of just simply rewarding for combat and then at the end a reward or punishment for a win or loss.', 'start': 835.495, 'duration': 9.623}, {'end': 849.259, 'text': 'The best model landed on a reward of about 200 at its peak.', 'start': 845.458, 'duration': 3.801}, {'end': 851.64, 'text': 'That said, what does that actually translate to?', 'start': 849.719, 'duration': 1.921}, {'end': 857.664, 'text': "As models trained, I also kept a log of wins and losses, since that's actually what we're truly after,", 'start': 851.94, 'duration': 5.724}, {'end': 859.766, 'text': 'so I wanted to measure the winning rate.', 'start': 857.664, 'duration': 2.102}, {'end': 868.512, 'text': 'In the case of this model, at the same time as its 200 reward peak, we can see game victories at around 70% against the hard computer bot,', 'start': 860.066, 'duration': 8.446}, {'end': 869.513, 'text': "which isn't too bad.", 'start': 868.512, 'duration': 1.001}, {'end': 873.774, 'text': 'Especially considering random was literally 0%, we were never winning.', 'start': 869.753, 'duration': 4.021}, {'end': 880.916, 'text': 'Despite the amount of effort put in so far, I would still consider this just the beginning of any reinforcement learning with Starcraft 2.', 'start': 874.014, 'duration': 6.902}, {'end': 886.658, 'text': "And I will have to think more on how I might expand this to do better at both macro tasks, like we've done here,", 'start': 880.916, 'duration': 5.742}, {'end': 893.583, 'text': 'as well as some sort of micro reinforcement learning algorithm, probably in the form of two separate algorithms running at the same time,', 'start': 886.658, 'duration': 6.925}], 'summary': 'Achieved a 70% win rate with a peak reward of 200, 
showing progress in reinforcement learning for starcraft 2.', 'duration': 59.248, 'max_score': 834.335, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4834335.jpg'}, {'end': 990.713, 'src': 'heatmap', 'start': 962.904, 'weight': 0.774, 'content': [{'end': 966.688, 'text': "I've hosted all the code, as well as a decent model that you can play with if you'd like.", 'start': 962.904, 'duration': 3.784}, {'end': 969.01, 'text': 'You can find that in the description of this video.', 'start': 966.728, 'duration': 2.282}, {'end': 975.137, 'text': "If you're interested in learning more about neural networks and how they work, then you might want to check out Neural Networks from Scratch,", 'start': 969.15, 'duration': 5.987}, {'end': 981.765, 'text': "a book by myself and Daniel Kukieła, inside of which you'll learn how to code neurons, activation functions, how to calculate loss,", 'start': 975.137, 'duration': 6.628}, {'end': 986.791, 'text': "do optimization and backpropagation and, of course, apply everything that you've just learned to a problem.", 'start': 981.765, 'duration': 5.026}, {'end': 990.713, 'text': 'The book itself is in full color for graphs and code syntax highlighting.', 'start': 987.111, 'duration': 3.602}], 'summary': 'The transcript offers a book on neural networks by the speaker and Daniel Kukieła with detailed coding and full-color visuals.', 'duration': 27.809, 'max_score': 962.904, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4962904.jpg'}], 'start': 566.132, 'title': 'Reinforcement learning in starcraft 2', 'summary': 'Details integrating stablebaselines 3 with python sc2, converting the environment to an openai gym environment, using a state file for inter-system communication, training a reinforcement learning model, scripts involved, and achieving a peak reward of about 200 and a game victory rate of around 70% against the hard 
computer bot.', 'chapters': [{'end': 619.599, 'start': 566.132, 'title': 'Integrating stablebaselines 3 with python sc2', 'summary': 'Details the challenges of connecting stablebaselines 3 with python sc2, including the need to convert the python sc2 environment to an openai gym environment for communication, and the use of a state file for inter-system communication.', 'duration': 53.467, 'highlights': ['The use of a state file for inter-system communication, containing the mini-map, reward, action, and game status, was implemented to enable communication between the systems, presenting a challenging task due to the complexities of the communication setup.', 'The need to convert the Python SC2 environment to an OpenAI gym environment in order to work with StableBaselines 3 was highlighted as a crucial step to enable communication between the systems.', 'A link to a tutorial series for learning about Stable Baselines 3 and custom environments was provided for those new to the topic, emphasizing its simplicity and ease of understanding compared to the complex integration process discussed in the chapter.']}, {'end': 1026.242, 'start': 619.899, 'title': 'Reinforcement learning with starcraft 2', 'summary': 'Explains the process of training a reinforcement learning model for starcraft 2 using stable baselines 3, detailing the scripts involved and the rewards achieved, with the best model reaching a peak reward of about 200 and a game victory rate of around 70% against the hard computer bot.', 'duration': 406.343, 'highlights': ['The best model landed on a reward of about 200 at its peak. The best reinforcement learning model achieved a peak reward of about 200.', 'we can see game victories at around 70% against the hard computer bot. 
The trained model achieved a game victory rate of around 70% against the hard computer bot.', 'I spent quite a while tinkering with various rewards in Logic, but the one that worked best was just a simple reward for attacking and, of course, winning or losing. The author experimented with different reward mechanisms and found that a simple reward for attacking and winning or losing worked best.', "I also kept a log of wins and losses, since that's actually what we're truly after. so I wanted to measure the winning rate. The author kept a log of wins and losses to measure the winning rate of the trained model.", 'I would still consider this just the beginning of any reinforcement learning with Starcraft 2. The author views the current progress as just the beginning of reinforcement learning with Starcraft 2 and plans to explore further improvements.']}], 'duration': 460.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q59wap1ELQ4/pics/q59wap1ELQ4566132.jpg', 'highlights': ['The use of a state file for inter-system communication was implemented to enable communication between the systems, presenting a challenging task due to the complexities of the communication setup.', 'The need to convert the Python SC2 environment to an OpenAI gym environment was highlighted as a crucial step to enable communication between the systems.', 'The best model landed on a reward of about 200 at its peak.', 'The trained model achieved a game victory rate of around 70% against the hard computer bot.', 'The author experimented with different reward mechanisms and found that a simple reward for attacking and winning or losing worked best.', 'The author kept a log of wins and losses to measure the winning rate of the trained model.', 'The author views the current progress as just the beginning of reinforcement learning with Starcraft 2 and plans to explore further improvements.']}], 'highlights': ['The best model achieved a game victory rate of around 70% 
against the hard computer bot.', 'Using Python libraries to play the game involves a learning curve but is accessible to programmers.', "The need for a benchmark to compare the agent's performance and the consideration of testing random actions as a baseline for comparison.", "The model's inputs range from game visuals to a vector of values including units, minerals, and locations.", "The visualization aims to make it easy to see what's happening in the game and facilitate informed decision-making.", 'The focus on incentivizing attacking as the primary goal in reinforcement learning (RL) is emphasized, with the hope that a large reward for finishing the game will overcome the potential issue of keeping the enemy alive to make more units and buildings to destroy.', 'The trained model achieved a game victory rate of around 70% against the hard computer bot.', 'The need to convert the Python SC2 environment to an OpenAI gym environment was highlighted as a crucial step to enable communication between the systems.', 'The best model landed on a reward of about 200 at its peak.', 'The difficulty of designing a reward mechanism in reinforcement learning, particularly in the context of a large number of observations and actions per game.']}