title
CS231n Winter 2016: Lecture 4: Backpropagation, Neural Networks 1
description
Stanford Winter Quarter 2016 class: CS231n: Convolutional Neural Networks for Visual Recognition. Lecture 4.
Get in touch on Twitter @cs231n, or on Reddit /r/cs231n.
detail
{'title': 'CS231n Winter 2016: Lecture 4: Backpropagation, Neural Networks 1', 'heatmap': [{'end': 527.661, 'start': 429.003, 'weight': 0.785}, {'end': 1198.997, 'start': 1096.909, 'weight': 0.833}, {'end': 2677.005, 'start': 2578.942, 'weight': 0.93}], 'summary': 'Covers upcoming deadlines, office hour changes, midterms preparation, back propagation in neural networks, max gate influence on gradients, neural network layers in torch, jacobian matrices, neural network training and architecture, and neural network capacity, emphasizing computational graph implementation, backpropagation efficiency, and network flexibility.', 'chapters': [{'end': 64.846, 'segs': [{'end': 32.47, 'src': 'embed', 'start': 4.062, 'weight': 0, 'content': [{'end': 10.504, 'text': 'OK, so let me dive into some administrative points first.', 'start': 4.062, 'duration': 6.442}, {'end': 14.845, 'text': 'So again, recall that assignment one is due next Wednesday.', 'start': 11.944, 'duration': 2.901}, {'end': 16.905, 'text': 'You have about 150 hours left.', 'start': 14.865, 'duration': 2.04}, {'end': 20.386, 'text': "And I use hours because there's a more imminent sense of doom.", 'start': 17.245, 'duration': 3.141}, {'end': 23.827, 'text': "And remember that a third of those hours, you'll be unconscious.", 'start': 21.226, 'duration': 2.601}, {'end': 25.847, 'text': "So you don't have that much time.", 'start': 24.047, 'duration': 1.8}, {'end': 26.707, 'text': "It's really running out.", 'start': 25.907, 'duration': 0.8}, {'end': 32.47, 'text': 'You might think that you have late days and so on, but these assignments just get harder over time.', 'start': 29.364, 'duration': 3.106}], 'summary': 'Assignment one is due next wednesday, with about 150 hours left, a third of which will be unconscious, indicating limited time to complete the task.', 'duration': 28.408, 'max_score': 4.062, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo4062.jpg'}, {'end': 75.696, 'src': 'embed', 'start': 47.52, 'weight': 2, 'content': [{'end': 49.621, 'text': "So I'll be moving my office hours from Monday to Wednesday.", 'start': 47.52, 'duration': 2.101}, {'end': 56.863, 'text': "Usually I have my office hours at 6 p.m. Instead I'll have them at 5 p.m. And usually it's in gates to 60, but now I'll be in gates to 59.", 'start': 50.101, 'duration': 6.762}, {'end': 57.923, 'text': 'So minus one on both.', 'start': 56.863, 'duration': 1.06}, {'end': 59.504, 'text': 'And yeah.', 'start': 58.944, 'duration': 0.56}, {'end': 64.846, 'text': "And also to note when you're going to be studying for midterm, that's coming up in a few weeks.", 'start': 60.484, 'duration': 4.362}, {'end': 67.529, 'text': 'make sure you go through the lecture notes as well, which are really part of this class.', 'start': 64.846, 'duration': 2.683}, {'end': 71.613, 'text': 'And I kind of pick and choose some of the things that I think are most valuable to present in a lecture.', 'start': 67.87, 'duration': 3.743}, {'end': 75.696, 'text': "But there's quite a bit of more material to be aware of that might pop up in the midterm,", 'start': 71.633, 'duration': 4.063}], 'summary': 'Office hours moved to wednesday at 5 p.m. in gates to 59. 
midterm review advised in lecture notes.', 'duration': 28.176, 'max_score': 47.52, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo47520.jpg'}], 'start': 4.062, 'title': 'Upcoming deadlines and office hours changes', 'summary': 'Covers the upcoming deadline for assignment one with approximately 150 hours left, a change in office hours from monday to wednesday at 5 p.m. in gates 59, and a reminder of the midterm in a few weeks.', 'chapters': [{'end': 64.846, 'start': 4.062, 'title': 'Assignment deadline, office hours, and midterm reminder', 'summary': 'Covers the upcoming deadline for assignment one, with approximately 150 hours left, a change in office hours from monday to wednesday at 5 p.m. in gates 59, and a reminder of the midterm in a few weeks.', 'duration': 60.784, 'highlights': ['The upcoming deadline for assignment one is next Wednesday, with approximately 150 hours left for completion, highlighting the urgency of starting early to avoid time running out.', 'The change in office hours from Monday to Wednesday at 5 p.m. in gates 59, along with a mention of holding makeup office hours to discuss projects and related matters.', 'A reminder to note the upcoming midterm in a few weeks, emphasizing the need for preparation and planning for studying.']}], 'duration': 60.784, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo4062.jpg', 'highlights': ['The upcoming deadline for assignment one is next Wednesday, with approximately 150 hours left for completion, highlighting the urgency of starting early to avoid time running out.', 'A reminder to note the upcoming midterm in a few weeks, emphasizing the need for preparation and planning for studying.', 'The change in office hours from Monday to Wednesday at 5 p.m. 
in gates 59, along with a mention of holding makeup office hours to discuss projects and related matters.']}, {'end': 707.55, 'segs': [{'end': 100.751, 'src': 'embed', 'start': 64.846, 'weight': 0, 'content': [{'end': 67.529, 'text': 'make sure you go through the lecture notes as well, which are really part of this class.', 'start': 64.846, 'duration': 2.683}, {'end': 71.613, 'text': 'And I kind of pick and choose some of the things that I think are most valuable to present in a lecture.', 'start': 67.87, 'duration': 3.743}, {'end': 75.696, 'text': "But there's quite a bit of more material to be aware of that might pop up in the midterm,", 'start': 71.633, 'duration': 4.063}, {'end': 77.977, 'text': "even though I'm covering some of the most important stuff usually in the lecture.", 'start': 75.696, 'duration': 2.281}, {'end': 79.839, 'text': 'So do read through those lecture notes.', 'start': 78.458, 'duration': 1.381}, {'end': 82.121, 'text': "They're complementary to the lectures.", 'start': 79.879, 'duration': 2.242}, {'end': 87.965, 'text': 'And so the material for the midterm will be drawn from both the lectures and the notes.', 'start': 84.122, 'duration': 3.843}, {'end': 93.005, 'text': "OK So having said all that, we're going to dive into the material.", 'start': 88.445, 'duration': 4.56}, {'end': 97.529, 'text': 'So where we are right now, just as a reminder, we have the score function.', 'start': 93.625, 'duration': 3.904}, {'end': 100.751, 'text': 'We looked at several loss functions, such as the SVM loss function last time.', 'start': 97.649, 'duration': 3.102}], 'summary': 'Lecture notes are crucial, covering valuable material for the midterm, including score and loss functions.', 'duration': 35.905, 'max_score': 64.846, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo64846.jpg'}, {'end': 219.807, 'src': 'embed', 'start': 192.738, 'weight': 2, 'content': [{'end': 197.279, 'text': "But the point I'd like to make is that you should think much more of this in terms of computational graphs,", 'start': 192.738, 'duration': 4.541}, {'end': 205.703, 'text': "instead of just thinking of one giant expression that you're going to derive with pen and paper, the expression for the gradient.", 'start': 197.279, 'duration': 8.424}, {'end': 207.703, 'text': 'And the reason for that.', 'start': 206.423, 'duration': 1.28}, {'end': 214.126, 'text': "so here we're thinking about these values flowing through a computational graph where you have these operations along circles.", 'start': 207.703, 'duration': 6.423}, {'end': 219.807, 'text': "And they're basically little function pieces that transform your inputs all the way to the loss function at the end.", 'start': 214.466, 'duration': 5.341}], 'summary': 'Think of computations in terms of graphs, not just as one expression.', 'duration': 27.069, 'max_score': 192.738, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo192738.jpg'}, {'end': 527.661, 'src': 'heatmap', 'start': 429.003, 'weight': 0.785, 'content': [{'end': 431.265, 'text': "And now we're going to go backwards through this graph.", 'start': 429.003, 'duration': 2.262}, {'end': 438.35, 'text': 'So we want the gradient of f with respect to z.', 'start': 432.208, 'duration': 6.142}, {'end': 444.731, 'text': 'So what is that in this computational graph? 
x plus 4.', 'start': 438.35, 'duration': 6.381}, {'end': 446.092, 'text': "It's q.", 'start': 444.731, 'duration': 1.361}, {'end': 447.252, 'text': 'So we have that written out right here.', 'start': 446.092, 'duration': 1.16}, {'end': 455.3, 'text': "And what is q in this particular example? It's 3, right? So the gradient on z, according to this, will become just 3.", 'start': 447.392, 'duration': 7.908}, {'end': 461.361, 'text': "So I'm going to be writing the gradients under the lines in red, and the values are in green above the lines.", 'start': 455.3, 'duration': 6.061}, {'end': 467.083, 'text': 'So we have the gradient in the front is 1, and now the gradient on z is 3.', 'start': 461.981, 'duration': 5.102}, {'end': 480.606, 'text': "And what 3 is telling you really intuitively keep in mind the interpretation of a gradient is what that's saying is that the influence of z on the final value is positive and with sort of a force of 3..", 'start': 467.083, 'duration': 13.523}, {'end': 489.968, 'text': "So if I increment z by a small amount h, then the output of the circuit will react by increasing, because it's a positive 3, will increase by 3h.", 'start': 480.606, 'duration': 9.362}, {'end': 493.889, 'text': 'So a small change will result in a positive change in the output.', 'start': 490.588, 'duration': 3.301}, {'end': 504.671, 'text': 'Now the gradient on q in this case will be, so df by dq is z.', 'start': 494.989, 'duration': 9.682}, {'end': 506.792, 'text': 'What is z? Negative 4.', 'start': 504.671, 'duration': 2.121}, {'end': 511.296, 'text': 'Okay, so we get a gradient of negative four on that part of the circuit.', 'start': 506.792, 'duration': 4.504}, {'end': 516.241, 'text': "And what that's saying is that if q were to increase, then the output of the circuit will decrease.", 'start': 511.637, 'duration': 4.604}, {'end': 521.405, 'text': 'Okay, if you increase by h, the output of the circuit will decrease by four h.', 'start': 516.26, 'duration': 5.145}, {'end': 522.587, 'text': "That's the slope, is negative four.", 'start': 521.405, 'duration': 1.182}, {'end': 527.661, 'text': "OK, now we're going to continue this recursive process through this plus gate.", 'start': 523.919, 'duration': 3.742}], 'summary': 'Analyzing the graph and interpreting gradients, including z=3 and the influence on the output.', 'duration': 98.658, 'max_score': 429.003, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo429003.jpg'}, {'end': 610.831, 'src': 'embed', 'start': 583.614, 'weight': 3, 'content': [{'end': 588.156, 'text': "And now we'd like to know the local influence of y on q.", 'start': 583.614, 'duration': 4.542}, {'end': 597.82, 'text': "And that local influence of y on q is one, because that's the local, as I'll refer to as the local derivative of y for the plus gate.", 'start': 588.156, 'duration': 9.664}, {'end': 601.603, 'text': 'And so the chain rule tells us that the correct thing to do to chain these two gradients,', 'start': 598.32, 'duration': 3.283}, {'end': 607.909, 'text': 'the local gradient of y on q and the global gradient of q on the output of the circuit is to multiply them.', 'start': 601.603, 'duration': 6.306}, {'end': 610.831, 'text': "So we'll get negative 4 times 1.", 'start': 608.289, 'duration': 2.542}], 'summary': 'Local influence of y on q is 1, resulting in a product of -4.', 'duration': 27.217, 'max_score': 583.614, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo583614.jpg'}], 'start': 64.846, 'title': 'Preparing for midterms and back propagation', 'summary': 'Emphasizes reviewing lecture notes and the score function, and explores back propagation in computational graphs for efficiently computing gradients, focusing on local and global variable influences.', 'chapters': [{'end': 100.751, 'start': 64.846, 'title': 'Midterm preparation reminder', 'summary': 'Emphasizes the importance of reviewing lecture notes alongside the presented material to prepare for the midterm, with a reminder about the coverage of the score function and various loss functions like the svm loss function.', 'duration': 35.905, 'highlights': ['The lecture notes are complementary to the lectures and are important for midterm preparation, as the material for the midterm will be drawn from both the lectures and the notes.', 'The chapter covers the score function and various loss functions, including the SVM loss function, as part of the material for the midterm.']}, {'end': 707.55, 'start': 101.672, 'title': 'Back propagation in computational graphs', 'summary': 'Explores the process of deriving gradients in computational graphs for optimization, highlighting the importance of back propagation and chain rule in efficiently computing the gradients, with a focus on understanding the local and global influences of variables in the graph.', 'duration': 605.878, 'highlights': ['The process of deriving gradients in computational graphs for optimization, highlighting the importance of back propagation and chain rule in efficiently computing the gradients. Deriving gradients in computational graphs, importance of back propagation and chain rule, efficient computation of gradients for optimization', 'Emphasizing the significance of understanding the local and global influences of variables in the graph, particularly in relation to the chain rule and back propagation. 
Significance of understanding local and global influences, relation to chain rule and back propagation']}], 'duration': 642.704, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo64846.jpg', 'highlights': ['The lecture notes are important for midterm preparation, as the material for the midterm will be drawn from both the lectures and the notes.', 'The chapter covers the score function, various loss functions, including the SVM loss function, as part of the material for the midterm.', 'The process of deriving gradients in computational graphs for optimization, highlighting the importance of back propagation and chain rule in efficiently computing the gradients.', 'Emphasizing the significance of understanding the local and global influences of variables in the graph, particularly in relation to the chain rule and back propagation.']}, {'end': 1470.135, 'segs': [{'end': 799.682, 'src': 'embed', 'start': 770.506, 'weight': 1, 'content': [{'end': 773.389, 'text': "And then we're proceeding recursively in the reverse order backwards.", 'start': 770.506, 'duration': 2.883}, {'end': 782.993, 'text': "But before that, actually before I get to that part right away, when I get x and y the thing, I'd like to point out that during the forward pass,", 'start': 775.428, 'duration': 7.565}, {'end': 786.655, 'text': "if you're this gate and you get your values x and y, you compute your output z.", 'start': 782.993, 'duration': 3.662}, {'end': 791.677, 'text': "And there's another thing you can compute right away, and that is the local gradients on x and y.", 'start': 786.655, 'duration': 5.022}, {'end': 796.8, 'text': "So I can compute those right away because I'm just a gate and I know what I'm performing, like, say, addition or multiplication.", 'start': 791.677, 'duration': 5.123}, {'end': 799.682, 'text': 'So I know the influence that x and y have on my output value.', 'start': 797.1, 'duration': 2.582}], 'summary': 'During the forward pass, gates compute output and local gradients for x and y.', 'duration': 29.176, 'max_score': 770.506, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo770506.jpg'}, {'end': 851.066, 'src': 'embed', 'start': 827.705, 'weight': 2, 'content': [{'end': 834.41, 'text': "And it turns out that the correct thing to do here by chain rule really what it's saying is the correct thing to do is to multiply your local gradient with that gradient.", 'start': 827.705, 'duration': 6.705}, {'end': 837.453, 'text': 'And that actually gives you the dl by dx.', 'start': 834.931, 'duration': 2.522}, {'end': 840.895, 'text': 'That gives you the influence of x on the final output of the circuit.', 'start': 837.753, 'duration': 3.142}, {'end': 851.066, 'text': "So really chain rule is just this added multiplication where we take our what I'll call global gradient of this gate on the output and we chain it through the local gradient.", 'start': 841.596, 'duration': 9.47}], 'summary': 'Applying chain rule involves multiplying local and global gradients to find the influence of x on the final output.', 'duration': 23.361, 'max_score': 827.705, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo827705.jpg'}, {'end': 921.281, 'src': 'embed', 'start': 884.343, 'weight': 0, 'content': [{'end': 887.366, 'text': 'And these just gets all multiplied through the circuit by these local gradients.', 'start': 884.343, 'duration': 3.023}, 
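The toy circuit walked through above, f = (x + y) * z with q = x + y = 3 and z = -4, can be reproduced in a few lines. This is a minimal sketch for illustration, not code from the lecture; the particular split x = -2, y = 5 is an assumption chosen only so that q = 3 as stated in the transcript.

```python
# Minimal sketch of the f = (x + y) * z circuit discussed above.
x, y, z = -2.0, 5.0, -4.0   # assumed values consistent with q = 3, z = -4

# Forward pass: compute the intermediate value q and the output f.
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: start from df/df = 1 and apply the chain rule gate by gate.
df = 1.0
dz = q * df        # multiply gate: local gradient wrt z is q  -> 3
dq = z * df        # multiply gate: local gradient wrt q is z  -> -4
dx = 1.0 * dq      # add gate: local gradient is 1, so dq is routed through -> -4
dy = 1.0 * dq      # add gate: local gradient is 1, so dq is routed through -> -4

print(dx, dy, dz)  # -4.0 -4.0 3.0, matching the values quoted in the transcript
```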
{'end': 891.249, 'text': 'And you end up with, and this process is called back propagation.', 'start': 888.307, 'duration': 2.942}, {'end': 896.174, 'text': "It's a way of computing, through a recursive application of chain rule, through computational graph,", 'start': 891.369, 'duration': 4.805}, {'end': 900.217, 'text': 'the influence of every single intermediate value in that graph on the final loss function.', 'start': 896.174, 'duration': 4.043}, {'end': 902.8, 'text': "And so we'll see many examples of this throughout this lecture.", 'start': 900.938, 'duration': 1.862}, {'end': 908.974, 'text': "I'll go into a specific example that is slightly larger, and we'll work through it in detail.", 'start': 904.731, 'duration': 4.243}, {'end': 912.556, 'text': "But I don't know if there are any questions at this point that anyone would like to ask.", 'start': 909.574, 'duration': 2.982}, {'end': 912.776, 'text': 'Go ahead.', 'start': 912.576, 'duration': 0.2}, {'end': 919.5, 'text': "What happens if z is used by two other nodes? If z is used by multiple nodes, I'm going to come back to that.", 'start': 912.796, 'duration': 6.704}, {'end': 921.281, 'text': 'You add the gradients.', 'start': 920.261, 'duration': 1.02}], 'summary': 'Back propagation computes influence of intermediate values on final loss function.', 'duration': 36.938, 'max_score': 884.343, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo884343.jpg'}, {'end': 1198.997, 'src': 'heatmap', 'start': 1096.909, 'weight': 0.833, 'content': [{'end': 1100.67, 'text': 'The derivative of 1 over x, the local gradient, is negative 1 over x squared.', 'start': 1096.909, 'duration': 3.761}, {'end': 1105.712, 'text': 'So that 1 over x gate, during the forward pass, received input 1.37.', 'start': 1101.171, 'duration': 4.541}, {'end': 1109.634, 'text': 'And right away, that 1 over x gate could have computed what the local gradient was.', 'start': 1105.712, 'duration': 3.922}, {'end': 1111.815, 'text': 'The local gradient was negative 1 over x squared.', 'start': 1109.934, 'duration': 1.881}, {'end': 1121.398, 'text': 'And now, during back propagation, it has to, by chain rule, multiply that local gradient by the gradient of it on the final output of the circuit,', 'start': 1112.355, 'duration': 9.043}, {'end': 1123.159, 'text': 'which is easy because it happens to be at the end.', 'start': 1121.398, 'duration': 1.761}, {'end': 1128.841, 'text': 'So what ends up being the expression for the back-propagated gradient here from the 1 over x gate?', 'start': 1123.919, 'duration': 4.922}, {'end': 1140.765, 'text': 'The chain rule always has two pieces: local gradient times the gradient from the top or from above.', 'start': 1134.643, 'duration': 6.122}, {'end': 1150.835, 'text': 'Minus, yeah, OK.', 'start': 1149.574, 'duration': 1.261}, {'end': 1152.176, 'text': "Yeah, so that's correct.", 'start': 1151.535, 'duration': 0.641}, {'end': 1155.479, 'text': 'So we get minus 1 over x squared, which is the gradient df by dx.', 'start': 1152.536, 'duration': 2.943}, {'end': 1165.827, 'text': 'So that is the local gradient negative 1 over 1.37 squared and then multiplied by 1.0, which is the gradient from above, which is really just 1,', 'start': 1156.119, 'duration': 9.708}, {'end': 1166.748, 'text': "because we've just started.", 'start': 1165.827, 'duration': 0.921}, {'end': 1173.595, 'text': "And so I'm applying chain rule right away here, and the output is negative 0.53.", 'start': 1167.008, 
'duration': 6.587}, {'end': 1177.723, 'text': "So that's the gradient on that piece of the wire where this valley was flowing.", 'start': 1173.595, 'duration': 4.128}, {'end': 1179.888, 'text': 'So it has negative effect on the output.', 'start': 1178.304, 'duration': 1.584}, {'end': 1182.626, 'text': 'And you might expect that right?', 'start': 1181.025, 'duration': 1.601}, {'end': 1189.831, 'text': 'Because if you were to increase this value and then it goes through a gate of 1 over x, then if you increase this, then 1 over x gets smaller.', 'start': 1182.666, 'duration': 7.165}, {'end': 1194.033, 'text': "So that's why you're seeing negative gradient, right? So we're going to continue back propagation here.", 'start': 1189.971, 'duration': 4.062}, {'end': 1198.997, 'text': "The next gate in the circuit, it's adding a constant of 1.", 'start': 1194.774, 'duration': 4.223}], 'summary': 'During back propagation, the 1 over x gate computed a gradient of -0.53 from an input of 1.37.', 'duration': 102.088, 'max_score': 1096.909, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1096909.jpg'}, {'end': 1128.841, 'src': 'embed', 'start': 1101.171, 'weight': 6, 'content': [{'end': 1105.712, 'text': 'So that 1 over x gate, during the forward pass, received input 1.37.', 'start': 1101.171, 'duration': 4.541}, {'end': 1109.634, 'text': 'And right away, that 1 over x gate could have computed what the local gradient was.', 'start': 1105.712, 'duration': 3.922}, {'end': 1111.815, 'text': 'The local gradient was negative 1 over x squared.', 'start': 1109.934, 'duration': 1.881}, {'end': 1121.398, 'text': 'And now, during back propagation, it has to, by chain rule, multiply that local gradient by the gradient of it on the final output of the circuit,', 'start': 1112.355, 'duration': 9.043}, {'end': 1123.159, 'text': 'which is easy because it happens to be at the end.', 'start': 1121.398, 'duration': 1.761}, {'end': 1128.841, 'text': 'So what ends up being the expression for the back-propagated gradient here from the 1 over x gate?', 'start': 1123.919, 'duration': 4.922}], 'summary': "During back propagation, the 1/x gate's local gradient was -1/x^2.", 'duration': 27.67, 'max_score': 1101.171, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1101171.jpg'}, {'end': 1426.042, 'src': 'embed', 'start': 1396.52, 'weight': 7, 'content': [{'end': 1400.803, 'text': 'So a plus gate is kind of like a gradient distributor where if something flows in from the top,', 'start': 1396.52, 'duration': 4.283}, {'end': 1404.885, 'text': 'it will just spread out all the gradients equally to all of its children.', 'start': 1400.803, 'duration': 4.082}, {'end': 1409.088, 'text': "And so we've already received one of the inputs as gradient 0.2.", 'start': 1405.486, 'duration': 3.602}, {'end': 1412.112, 'text': 'here on the very final output of the circuit.', 'start': 1409.088, 'duration': 3.024}, {'end': 1416.778, 'text': 'And so this influence has been computed through a series of applications of chain rule along the way.', 'start': 1412.673, 'duration': 4.105}, {'end': 1421.621, 'text': "So now there was another plus gate that I've skipped over.", 'start': 1419.14, 'duration': 2.481}, {'end': 1426.042, 'text': 'And so this 0.2 kind of distributes to both 0.2, 0.2 equally.', 'start': 1422.061, 'duration': 3.981}], 'summary': 'A plus gate distributes gradients equally, with a 0.2 influence computed through chain rule.', 
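The two gate behaviors quoted just above, the 1/x gate at its recorded input of 1.37 and the plus gate spreading an incoming gradient of 0.2 to its children, can be checked numerically. This is a short sketch for verification only, not lecture code.

```python
# 1/x gate: local gradient is -1/x^2, evaluated at the forward-pass input 1.37.
x = 1.37
local_grad = -1.0 / x**2      # about -0.53
upstream = 1.0                # gradient arriving from the output of the circuit
dx = local_grad * upstream    # chain rule: local gradient times upstream gradient
print(round(dx, 2))           # -0.53, the value quoted in the transcript

# Plus gate: local gradient is 1 on every input, so it distributes whatever
# gradient arrives (0.2 in the example) equally to all of its children.
upstream = 0.2
da, db = 1.0 * upstream, 1.0 * upstream   # both inputs receive 0.2
```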
'duration': 29.522, 'max_score': 1396.52, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1396520.jpg'}], 'start': 708.45, 'title': 'Backpropagation in neural networks', 'summary': 'Discusses backpropagation and chain rule in neural networks, emphasizing the computation of local gradients and their impact on the final loss function, with specific examples and applications.', 'chapters': [{'end': 979.407, 'start': 708.45, 'title': 'Back propagation and chain rule in neural networks', 'summary': 'Explains the process of back propagation and the application of chain rule in the computational graph to compute the influence of every intermediate value on the final loss function, with an emphasis on local gradients and recursive application.', 'duration': 270.957, 'highlights': ['The process of back propagation involves computing the influence of every single intermediate value in the graph on the final loss function through a recursive application of the chain rule. Explanation of the back propagation process and its recursive application in computing the influence of intermediate values on the final loss function.', 'The local gradients on input values are computed during the forward pass, allowing for the immediate knowledge of their influence on the output value. Explanation of the computation of local gradients on input values during the forward pass.', "The correct application of the chain rule involves multiplying the local gradient with the global gradient to compute the influence of an input value on the final output of the circuit. Explanation of the correct application of the chain rule through the multiplication of local and global gradients for computing the input value's influence on the final output.", 'In the case of z being used by multiple nodes in the circuit, the backward flows are added to compute the gradients. 
Explanation of how gradients are added when a value is used by multiple nodes in the circuit.']}, {'end': 1470.135, 'start': 979.988, 'title': 'Backpropagation through computational graph', 'summary': 'Explains the process of backpropagation through a computational graph, detailing the influence of input values on the output and computing gradients through chain rule, with specific examples of local gradients and their impact on the final output.', 'duration': 490.147, 'highlights': ['The chapter explains the process of backpropagation through a computational graph, detailing the influence of input values on the output and computing gradients through chain rule, with specific examples of local gradients and their impact on the final output.', 'The local gradient of the 1 over x gate is computed as -1/3.7^2, resulting in a backpropagated gradient of -0.53, illustrating the negative impact of the input value on the output.', 'The backpropagation through a plus gate demonstrates the distribution of gradients equally to all of its inputs, showing that the local gradient of all inputs is 1 and leading to a consistent spread of gradients across the children of the plus gate.', 'The process of backpropagation involves calculating gradients for various operations such as addition, multiplication, and exponentiation, utilizing local gradients derived from calculus to determine the impact of input values on the final output.']}], 'duration': 761.685, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo708450.jpg', 'highlights': ['Explanation of the back propagation process and its recursive application in computing the influence of intermediate values on the final loss function.', 'Explanation of the computation of local gradients on input values during the forward pass.', "Explanation of the correct application of the chain rule through the multiplication of local and global gradients for computing the input value's influence on the final output.", 'Explanation of how gradients are added when a value is used by multiple nodes in the circuit.', 'The chapter explains the process of backpropagation through a computational graph, detailing the influence of input values on the output and computing gradients through chain rule, with specific examples of local gradients and their impact on the final output.', 'The process of backpropagation involves calculating gradients for various operations such as addition, multiplication, and exponentiation, utilizing local gradients derived from calculus to determine the impact of input values on the final output.', 'The local gradient of the 1 over x gate is computed as -1/3.7^2, resulting in a backpropagated gradient of -0.53, illustrating the negative impact of the input value on the output.', 'The backpropagation through a plus gate demonstrates the distribution of gradients equally to all of its inputs, showing that the local gradient of all inputs is 1 and leading to a consistent spread of gradients across the children of the plus gate.']}, {'end': 1856.974, 'segs': [{'end': 1549.901, 'src': 'embed', 'start': 1526.057, 'weight': 0, 'content': [{'end': 1533.061, 'text': 'Any other questions at this point? 
So this process takes on the order of the same time it takes to forward propagate.', 'start': 1526.057, 'duration': 7.004}, {'end': 1534.002, 'text': "That's right.", 'start': 1533.681, 'duration': 0.321}, {'end': 1537.263, 'text': 'So the cost of forward and backward propagation is roughly equal.', 'start': 1534.202, 'duration': 3.061}, {'end': 1541.726, 'text': 'Is it actually within a constant??', 'start': 1538.424, 'duration': 3.302}, {'end': 1543.347, 'text': 'Do you know what the constant is?', 'start': 1542.066, 'duration': 1.281}, {'end': 1545.176, 'text': 'Well, it should be.', 'start': 1544.535, 'duration': 0.641}, {'end': 1548.199, 'text': 'it almost always ends up being basically equal when you look at timings.', 'start': 1545.176, 'duration': 3.023}, {'end': 1549.901, 'text': 'Usually the backward pass is slightly slower.', 'start': 1548.279, 'duration': 1.622}], 'summary': 'Forward and backward propagation take roughly equal time.', 'duration': 23.844, 'max_score': 1526.057, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1526057.jpg'}, {'end': 1614.482, 'src': 'embed', 'start': 1585.286, 'weight': 3, 'content': [{'end': 1589.588, 'text': "And so there's a sigmoid gate here, and I could have done that in a single go, sort of.", 'start': 1585.286, 'duration': 4.302}, {'end': 1597.533, 'text': 'And what I would have had to do if I wanted to have that gate is I need to compute an expression for how this.', 'start': 1590.169, 'duration': 7.364}, {'end': 1600.475, 'text': 'So what is the local gradient for the sigmoid gate, basically? So what is the.', 'start': 1597.533, 'duration': 2.942}, {'end': 1602.576, 'text': 'gradient of the sigmoid gate on its input.', 'start': 1601.095, 'duration': 1.481}, {'end': 1606.458, 'text': "And I have to go through some math, which I'm not going to go into detail, but you end up with that expression over there.", 'start': 1602.836, 'duration': 3.622}, {'end': 1609.84, 'text': 'It ends up being 1 minus sigmoid of x times sigmoid of x.', 'start': 1606.958, 'duration': 2.882}, {'end': 1610.84, 'text': "That's the local gradient.", 'start': 1609.84, 'duration': 1}, {'end': 1614.482, 'text': 'And that allows me to now put this piece into a computational graph.', 'start': 1611.44, 'duration': 3.042}], 'summary': 'The local gradient for the sigmoid gate is 1 minus sigmoid of x times sigmoid of x.', 'duration': 29.196, 'max_score': 1585.286, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1585286.jpg'}, {'end': 1721.43, 'src': 'embed', 'start': 1694.897, 'weight': 4, 'content': [{'end': 1700.001, 'text': "So if you notice that there's some piece of operation you'd like to do over and over again and it has a very simple local gradient,", 'start': 1694.897, 'duration': 5.104}, {'end': 1703.404, 'text': "then that's something very appealing to actually create a single unit out of.", 'start': 1700.001, 'duration': 3.403}, {'end': 1706.386, 'text': "And we'll see some of those examples actually in a bit, I think.", 'start': 1704.184, 'duration': 2.202}, {'end': 1716.246, 'text': "OK, I'd like to also point out that the reason I like to think about these computational graphs is it really helps your intuition to think about how gradients flow in a neural network.", 'start': 1707.458, 'duration': 8.788}, {'end': 1718.848, 'text': "You don't want this to be a black box to you.", 'start': 1717.226, 'duration': 1.622}, {'end': 1721.43, 'text': 'You 
want to understand intuitively how this happens.', 'start': 1718.868, 'duration': 2.562}], 'summary': 'Creating single units for repetitive operations with simple local gradients helps understand gradient flow in neural networks.', 'duration': 26.533, 'max_score': 1694.897, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1694897.jpg'}, {'end': 1749.041, 'src': 'embed', 'start': 1722.091, 'weight': 1, 'content': [{'end': 1726.514, 'text': 'And you start to develop, after a while of looking at computational graphs, intuitions about how these gradients flow.', 'start': 1722.091, 'duration': 4.423}, {'end': 1732.002, 'text': "And this, by the way, helps you debug some issues like, say, we'll go to vanishing gradient problem.", 'start': 1727.575, 'duration': 4.427}, {'end': 1737.749, 'text': "It's much easier to understand exactly what's going wrong in your optimization if you understand how gradients flow in networks.", 'start': 1732.262, 'duration': 5.487}, {'end': 1740.433, 'text': 'It will help you debug these networks much more efficiently.', 'start': 1738.03, 'duration': 2.403}, {'end': 1747.18, 'text': 'And so some intuitions, for example, we already saw that the 8th ad gate It has a local gradient of 1 to all of its inputs.', 'start': 1741.114, 'duration': 6.066}, {'end': 1749.041, 'text': "So it's just a gradient distributor.", 'start': 1747.44, 'duration': 1.601}], 'summary': 'Understanding gradient flow in networks aids in efficient debugging and optimization.', 'duration': 26.95, 'max_score': 1722.091, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1722091.jpg'}, {'end': 1828.089, 'src': 'embed', 'start': 1792.627, 'weight': 2, 'content': [{'end': 1794.227, 'text': "and that's what ends up propagating through the gate.", 'start': 1792.627, 'duration': 1.6}, {'end': 1797.808, 'text': 'So you end up with a gradient of one on the larger one of the inputs.', 'start': 1795.127, 'duration': 2.681}, {'end': 1801.769, 'text': "And so that's why max gate is a gradient router.", 'start': 1798.468, 'duration': 3.301}, {'end': 1806.92, 'text': "If I'm a max gate and I have received several inputs, one of them was largest of all of them,", 'start': 1802.918, 'duration': 4.002}, {'end': 1808.761, 'text': "and that's the value that I propagated through the circuit.", 'start': 1806.92, 'duration': 1.841}, {'end': 1814.943, 'text': "At back propagation time, I'm just going to receive my gradient from above, and I'm going to route it to whoever was my largest input.", 'start': 1809.481, 'duration': 5.462}, {'end': 1816.584, 'text': "So it's a gradient router.", 'start': 1815.744, 'duration': 0.84}, {'end': 1820.746, 'text': 'And the multiply gate is a gradient switcher.', 'start': 1817.865, 'duration': 2.881}, {'end': 1828.089, 'text': "I actually don't think that's a very good way to look at it, but I'm referring to the fact that it's not actually, nevermind about that part.", 'start': 1821.066, 'duration': 7.023}], 'summary': "Max gate acts as a gradient router, propagating the largest input's value.", 'duration': 35.462, 'max_score': 1792.627, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1792627.jpg'}], 'start': 1470.595, 'title': 'Backward propagation and gradients in neural networks', 'summary': 'Explains the process of back-propagation in neural networks, discussing the equality of cost between forward and backward 
propagation, and emphasizes the importance of understanding gradients, highlighting the efficiency gained from analyzing computational graphs and the specific roles of different gates.', 'chapters': [{'end': 1670.729, 'start': 1470.595, 'title': 'Backward propagation in neural networks', 'summary': 'Explains the process of back-propagation in neural networks, highlighting that the cost of forward and backward propagation is roughly equal and discussing the arbitrary setting of gates and the potential to collapse them into a single gate.', 'duration': 200.134, 'highlights': ['The cost of forward and backward propagation is roughly equal, with the backward pass usually slightly slower.', 'The gates in neural networks are arbitrary and can be collapsed into a single gate, such as the sigmoid gate, simplifying the computational graph and local gradient computations.', 'The local gradient for the sigmoid gate is calculated as 1 minus sigmoid of x times sigmoid of x, allowing for back propagation through the gate using the chain rule.']}, {'end': 1856.974, 'start': 1671.43, 'title': 'Understanding gradients in neural networks', 'summary': 'Discusses the importance of understanding gradients in neural networks, emphasizing the efficiency and intuition gained from analyzing computational graphs and highlighting the specific roles of different gates, such as the max gate as a gradient router and the multiply gate as a gradient switcher.', 'duration': 185.544, 'highlights': ['The max gate serves as a gradient router, with a gradient of 1 on the larger input and 0 on the smaller input, efficiently propagating gradients through the circuit.', 'Understanding computational graphs helps in intuitively grasping gradient flow in neural networks, aiding in efficient debugging and addressing issues like the vanishing gradient problem.', 'Identifying simple operations with straightforward local gradients enables the creation of single units, enhancing computational efficiency in neural networks.', 'The multiply gate functions as a gradient switcher, although further details are not provided in the transcript.', 'Analyzing computational graphs provides intuition on gradient distribution in networks, enabling efficient debugging and optimization.', 'Recognition of operations with simple local gradients facilitates the creation of single units, optimizing computational efficiency.']}], 'duration': 386.379, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1470595.jpg', 'highlights': ['The cost of forward and backward propagation is roughly equal, with the backward pass usually slightly slower.', 'Understanding computational graphs helps in intuitively grasping gradient flow in neural networks, aiding in efficient debugging and addressing issues like the vanishing gradient problem.', 'The max gate serves as a gradient router, with a gradient of 1 on the larger input and 0 on the smaller input, efficiently propagating gradients through the circuit.', 'The gates in neural networks are arbitrary and can be collapsed into a single gate, such as the sigmoid gate, simplifying the computational graph and local gradient computations.', 'Identifying simple operations with straightforward local gradients enables the creation of single units, enhancing computational efficiency in neural networks.']}, {'end': 2266.604, 'segs': [{'end': 1888.491, 'src': 'embed', 'start': 1858.945, 'weight': 0, 'content': [{'end': 1861.55, 'text': 'Yeah, but that basically never happens in 
actual practice.', 'start': 1858.945, 'duration': 2.605}, {'end': 1868.6, 'text': 'OK, so max gradient here, actually I have an example.', 'start': 1865.578, 'duration': 3.022}, {'end': 1871.681, 'text': 'So z here was larger than w.', 'start': 1868.86, 'duration': 2.821}, {'end': 1874.223, 'text': 'So only z has an influence on the output of this max gate.', 'start': 1871.681, 'duration': 2.542}, {'end': 1877.945, 'text': 'So when 2 flows into the max gate, it gets routed to z.', 'start': 1874.983, 'duration': 2.962}, {'end': 1881.407, 'text': 'And w gets a 0 gradient, because its effect on the circuit is nothing.', 'start': 1877.945, 'duration': 3.462}, {'end': 1882.427, 'text': "There's 0.", 'start': 1881.507, 'duration': 0.92}, {'end': 1888.491, 'text': "Because when you change it, it doesn't matter when you change it, because z is the larger value going through the computational graph.", 'start': 1882.427, 'duration': 6.064}], 'summary': 'In practice, z has a larger influence on the output of the max gate, resulting in w receiving a 0 gradient.', 'duration': 29.546, 'max_score': 1858.945, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1858945.jpg'}, {'end': 1921.161, 'src': 'embed', 'start': 1898.488, 'weight': 1, 'content': [{'end': 1911.216, 'text': 'that if you have these circuits and sometimes you have a value that branches out into a circuit and is used in multiple parts of the circuit the correct thing to do by multivariate chain rule is to actually add up the contributions at the operation.', 'start': 1898.488, 'duration': 12.728}, {'end': 1912.777, 'text': 'So gradients add.', 'start': 1911.717, 'duration': 1.06}, {'end': 1916.219, 'text': 'when they backpropagate backwards through the circuit.', 'start': 1914.278, 'duration': 1.941}, {'end': 1920.001, 'text': 'If they ever flow, they add up in this backward flow.', 'start': 1916.259, 'duration': 3.742}, {'end': 1921.161, 'text': 'All right.', 'start': 1920.941, 'duration': 0.22}], 'summary': 'Circuits with value branching should add contributions for gradients during backward flow.', 'duration': 22.673, 'max_score': 1898.488, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1898488.jpg'}, {'end': 1964.931, 'src': 'embed', 'start': 1939.473, 'weight': 2, 'content': [{'end': 1944.278, 'text': "Because what we'll do is we will take a recurrent neural network and we'll unfold it through time steps.", 'start': 1939.473, 'duration': 4.805}, {'end': 1951.244, 'text': "And this will all become, there will never be a loop in the unfolded graph where we've copy pasted that small recurrent piece over time.", 'start': 1944.358, 'duration': 6.886}, {'end': 1953.186, 'text': "You'll see that more when we actually get into it.", 'start': 1951.665, 'duration': 1.521}, {'end': 1954.948, 'text': 'But these are always DAGs.', 'start': 1953.627, 'duration': 1.321}, {'end': 1955.889, 'text': "There's no loops.", 'start': 1955.148, 'duration': 0.741}, {'end': 1959.01, 'text': 'OK, awesome.', 'start': 1958.41, 'duration': 0.6}, {'end': 1962.091, 'text': "So let's look at implementation and how this is actually implemented in practice.", 'start': 1959.37, 'duration': 2.721}, {'end': 1964.931, 'text': 'And I think it will help make this more concrete as well.', 'start': 1962.131, 'duration': 2.8}], 'summary': 'Explaining the unfolding of a recurrent neural network through time steps to avoid loops in the unfolded graph.', 'duration': 
25.458, 'max_score': 1939.473, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1939473.jpg'}, {'end': 2059.114, 'src': 'embed', 'start': 2029.89, 'weight': 3, 'content': [{'end': 2033.393, 'text': 'And they all get chained up and computing the analytic gradient at the back.', 'start': 2029.89, 'duration': 3.503}, {'end': 2037.276, 'text': 'So really, a net object is a very thin wrapper around all these gates.', 'start': 2033.993, 'duration': 3.283}, {'end': 2039.318, 'text': "Or as we'll see, they're called layers.", 'start': 2037.536, 'duration': 1.782}, {'end': 2040.819, 'text': 'Layers are gates.', 'start': 2040.119, 'duration': 0.7}, {'end': 2042.22, 'text': "I'm going to use those interchangeably.", 'start': 2041.059, 'duration': 1.161}, {'end': 2047.765, 'text': "And they're just very thin wrappers around connectivity structure of these gates and calling a forward and a backward function on them.", 'start': 2042.881, 'duration': 4.884}, {'end': 2051.929, 'text': "And then let's look at a specific example of one of the gates and how this might be implemented.", 'start': 2048.746, 'duration': 3.183}, {'end': 2054.071, 'text': 'And this is not just a pseudocode.', 'start': 2052.79, 'duration': 1.281}, {'end': 2059.114, 'text': 'This is actually more like correct implementation in some sense, like this might run at the end.', 'start': 2054.091, 'duration': 5.023}], 'summary': 'Neural network layers are thin wrappers around gates, allowing for computation of analytic gradients.', 'duration': 29.224, 'max_score': 2029.89, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2029890.jpg'}, {'end': 2201.803, 'src': 'embed', 'start': 2175.02, 'weight': 4, 'content': [{'end': 2179.764, 'text': "So, basically, when we end up running these networks at runtime, just always keep in mind that, as you're doing this forward pass,", 'start': 2175.02, 'duration': 4.744}, {'end': 2181.645, 'text': 'a huge amount of stuff gets cached in your memory.', 'start': 2179.764, 'duration': 1.881}, {'end': 2185.949, 'text': 'And that all has to stick around, because during back propagation, you might need access to some of those variables.', 'start': 2181.966, 'duration': 3.983}, {'end': 2189.732, 'text': 'And so your memory ends up ballooning up during the forward pass.', 'start': 2186.609, 'duration': 3.123}, {'end': 2191.613, 'text': 'And then in backward pass, it gets all consumed.', 'start': 2189.832, 'duration': 1.781}, {'end': 2194.435, 'text': 'And you need all those intermediates to actually compute the proper backward pass.', 'start': 2191.753, 'duration': 2.682}, {'end': 2201.803, 'text': "Is that where you can do a forward pass if you're not going to do a backward pass fast, like a lot faster? Yes.", 'start': 2195.356, 'duration': 6.447}], 'summary': 'During runtime, network caching leads to memory ballooning in forward pass, essential for proper backward pass computation.', 'duration': 26.783, 'max_score': 2175.02, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2175020.jpg'}], 'start': 1858.945, 'title': 'Max gate influence on gradient and neural network implementation', 'summary': 'Discusses the influence of the max gate on gradients, highlighting its impact on inputs and emphasizing the importance of correct backpropagation. 
it also covers the implementation of neural networks, emphasizing the absence of loops in graphs, the role of net objects, and the necessity of caching variables for proper computation.', 'chapters': [{'end': 1921.161, 'start': 1858.945, 'title': 'Max gate influence on gradient', 'summary': 'Discusses the influence of the max gate on gradients, emphasizing that when one input is larger than the other, it solely affects the output, resulting in 0 gradient for the smaller input, and highlights the importance of adding up contributions at the operation for correct backpropagation through circuits.', 'duration': 62.216, 'highlights': ['The influence of the max gate on gradients is illustrated, showing that when one input is larger than the other, it solely affects the output, resulting in 0 gradient for the smaller input.', 'The importance of adding up contributions at the operation for correct backpropagation through circuits is emphasized, indicating that gradients add during the backward flow.']}, {'end': 2266.604, 'start': 1921.621, 'title': 'Neural network implementation', 'summary': 'Discusses the implementation of neural networks, emphasizing the absence of loops in computational graphs, the role of net objects in maintaining connectivity structure, and the necessity of caching variables during forward pass for proper backward pass computation.', 'duration': 344.983, 'highlights': ['The absence of loops in computational graphs, including recurrent neural networks, is emphasized, ensuring that the unfolded graph is always a Directed Acyclic Graph (DAG) with proper connectivity patterns.', 'The role of net objects in maintaining proper connectivity structure of gates, handling forward and backward passes by iterating through the graph in topological order, thus ensuring proper communication of gradients and computing analytic gradients at the back.', 'The necessity of caching variables during the forward pass to remember inputs and intermediate calculations for proper backward pass computation, leading to increased memory consumption during the forward pass and potential memory optimization strategies for specific scenarios.']}], 'duration': 407.659, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo1858945.jpg', 'highlights': ['The influence of the max gate on gradients is illustrated, showing that when one input is larger than the other, it solely affects the output, resulting in 0 gradient for the smaller input.', 'The importance of adding up contributions at the operation for correct backpropagation through circuits is emphasized, indicating that gradients add during the backward flow.', 'The absence of loops in computational graphs, including recurrent neural networks, is emphasized, ensuring that the unfolded graph is always a Directed Acyclic Graph (DAG) with proper connectivity patterns.', 'The role of net objects in maintaining proper connectivity structure of gates, handling forward and backward passes by iterating through the graph in topological order, thus ensuring proper communication of gradients and computing analytic gradients at the back.', 'The necessity of caching variables during the forward pass to remember inputs and intermediate calculations for proper backward pass computation, leading to increased memory consumption during the forward pass and potential memory optimization strategies for specific scenarios.']}, {'end': 2605.401, 'segs': [{'end': 2323.836, 'src': 'embed', 'start': 2287.682, 'weight': 0, 
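The idea in the surrounding segments, that layers are gates, each exposing a forward and a backward function, and that inputs are cached during the forward pass for use in the backward pass, can be made concrete with a toy class. This is a hedged sketch: the name MultiplyGate and its method signatures are illustrative assumptions, not actual Torch or Caffe APIs.

```python
# Toy gate object mirroring the forward/backward API described in the transcript.
class MultiplyGate:
    def forward(self, x, y):
        # Cache the inputs: the backward pass needs them, which is why memory
        # grows during the forward pass, as discussed above.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: local gradients times the gradient flowing in from above.
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)   # forward pass: out = -12.0
dx, dy = gate.backward(1.0)     # backward pass: dx = -4.0, dy = 3.0
```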
'content': [{'end': 2293.908, 'text': 'Torch is a deep learning framework, which we might go into a bit near the end of the class, that some of you might end up using for your projects.', 'start': 2287.682, 'duration': 6.226}, {'end': 2301.675, 'text': "If you go into the GitHub repo for Torch and you look at it, basically it's just a giant collection of these layer objects.", 'start': 2295.629, 'duration': 6.046}, {'end': 2302.576, 'text': 'And these are the gates.', 'start': 2301.856, 'duration': 0.72}, {'end': 2303.818, 'text': 'Layers, gates, the same thing.', 'start': 2302.797, 'duration': 1.021}, {'end': 2305.439, 'text': "So there's all these layers.", 'start': 2304.318, 'duration': 1.121}, {'end': 2307.021, 'text': "That's really what a deep learning framework is.", 'start': 2305.519, 'duration': 1.502}, {'end': 2312.867, 'text': "It's just a whole bunch of layers and a very thin computational graph thing that keeps track of all the layer connectivity.", 'start': 2307.081, 'duration': 5.786}, {'end': 2317.412, 'text': 'And so really, the image to have in mind is all these things are your Lego blocks.', 'start': 2313.788, 'duration': 3.624}, {'end': 2319.114, 'text': "And then we're building up.", 'start': 2318.033, 'duration': 1.081}, {'end': 2323.836, 'text': 'these computational graphs out of your Lego blocks, out of the layers.', 'start': 2319.995, 'duration': 3.841}], 'summary': 'Torch is a deep learning framework with a collection of layer objects for building computational graphs.', 'duration': 36.154, 'max_score': 2287.682, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2287682.jpg'}, {'end': 2365.954, 'src': 'embed', 'start': 2337.199, 'weight': 3, 'content': [{'end': 2340.16, 'text': 'And that function piece knows how to do a forward, and it knows how to do a backward.', 'start': 2337.199, 'duration': 2.961}, {'end': 2345.041, 'text': "So just to give a specific example, let's look at the MulConstant", 'start': 2340.92, 'duration': 4.121}, {'end': 2348.003, 'text': 'layer in Torch.', 'start': 2347.162, 'duration': 0.841}, {'end': 2352.565, 'text': 'The MulConstant layer performs just a scaling by a scalar.', 'start': 2348.543, 'duration': 4.022}, {'end': 2355.007, 'text': 'So it takes some tensor x.', 'start': 2352.946, 'duration': 2.061}, {'end': 2357.989, 'text': "So this is not just a scalar, but it's actually like an array of numbers, basically.", 'start': 2355.007, 'duration': 2.982}, {'end': 2361.291, 'text': 'Because when we actually work with these, we do a lot of vectorized operations.', 'start': 2358.869, 'duration': 2.422}, {'end': 2365.954, 'text': 'So we receive a tensor, which is really just an n-dimensional array, and we scale it by a constant.', 'start': 2361.331, 'duration': 4.623}], 'summary': 'The MulConstant layer in Torch performs scaling by a scalar on an n-dimensional array of numbers.', 'duration': 28.755, 'max_score': 2337.199, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2337199.jpg'}, {'end': 2572.566, 'src': 'embed', 'start': 2544.147, 'weight': 2, 'content': [{'end': 2550.091, 'text': 'So forward computes the loss, backward computes your gradient, and then the update uses the gradient to increment your weights a bit.', 'start': 2544.147, 'duration': 5.944}, {'end': 2552.413, 'text': "So that's what keeps happening in the loop.", 'start': 2550.771, 'duration': 1.642}, {'end': 2554.034, 'text': "When you train a neural 
network, that's all that's happening.", 'start': 2552.433, 'duration': 1.601}, {'end': 2556.315, 'text': 'Forward, backward, update, forward, backward, update.', 'start': 2554.354, 'duration': 1.961}, {'end': 2557.136, 'text': "We'll see that in a bit.", 'start': 2556.495, 'duration': 0.641}, {'end': 2558.236, 'text': 'Go ahead.', 'start': 2557.156, 'duration': 1.08}, {'end': 2566.322, 'text': 'The reason for having for loop there is simply because this is a four vector format.', 'start': 2558.256, 'duration': 8.066}, {'end': 2572.566, 'text': "You're asking about the for loop? In the backward Oh, is there a for loop here? I didn't even notice.", 'start': 2566.762, 'duration': 5.804}], 'summary': 'Neural network training involves forward, backward, and update in a loop.', 'duration': 28.419, 'max_score': 2544.147, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2544147.jpg'}], 'start': 2267.385, 'title': 'Neural network layers and computational graphs in torch', 'summary': 'Explains the concept of layers in a deep learning framework, using torch as an example, involving building computational graphs, implementing forward and backward functions, with a specific example of the mall constant layer performing a scaling operation on a tensor. it also discusses the implementation of the forward-backward api in deep learning frameworks, emphasizing the computation of gradients and the iterative process of forward pass, backward pass, and update for training neural networks.', 'chapters': [{'end': 2374.659, 'start': 2267.385, 'title': 'Neural network layers and computational graphs in torch', 'summary': 'Explains the concept of layers in a deep learning framework, using torch as an example, which involves building computational graphs from layers and implementing forward and backward functions, with a specific example of the mall constant layer performing a scaling operation on a tensor.', 'duration': 107.274, 'highlights': ['The concept of deep learning frameworks is built on a collection of layers, which are essentially the gates for building computational graphs, with Torch being an example (relevance: 5)', 'The layers in a deep learning framework are like Lego blocks that are used to build computational graphs, allowing flexibility in achieving various objectives (relevance: 4)', 'Each layer in a deep learning framework implements a small function piece that performs forward and backward operations, with the example of the mall constant layer in Torch performing a scaling operation on a tensor (relevance: 3)', 'The mall constant layer in Torch performs a scaling operation by a scalar on a tensor, using vectorized operations, and consists of 40 lines of code in Lua (relevance: 2)']}, {'end': 2605.401, 'start': 2375.22, 'title': 'Understanding forward-backward api in deep learning', 'summary': 'Discusses the implementation of the forward-backward api in deep learning frameworks like torch and cafe, emphasizing the computation of gradients and the iterative process of forward pass, backward pass, and update for training neural networks.', 'duration': 230.181, 'highlights': ['The forward-backward API involves the forward pass, which computes the loss, the backward pass, which computes the gradient, and the update, which uses the gradient to increment the weights. 
The forward-backward API in deep learning involves the iterative process of forward pass, backward pass, and update, where the forward pass computes the loss, the backward pass computes the gradient, and the update uses the gradient to increment the weights.', 'The implementation of the forward-backward API in deep learning frameworks like Torch and Caffe involves copying gradients, scaling gradients, and computing the local gradients of layers. The implementation of the forward-backward API in deep learning frameworks like Torch and Caffe involves copying gradients, scaling gradients, and computing the local gradients of layers to facilitate the computation of gradients during the backward pass.', 'The sigmoid layer in Caffe, for example, computes the sigmoid function element-wise in the forward pass and calculates the gradient using the chain rule in the backward pass. In Caffe, the sigmoid layer computes the sigmoid function element-wise in the forward pass and calculates the gradient using the chain rule in the backward pass to update the weights of the neural network.']}], 'duration': 338.016, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2267385.jpg', 'highlights': ['The concept of deep learning frameworks is built on a collection of layers, which are essentially the gates for building computational graphs, with Torch being an example (relevance: 5)', 'The layers in a deep learning framework are like Lego blocks that are used to build computational graphs, allowing flexibility in achieving various objectives (relevance: 4)', 'The forward-backward API involves the forward pass, which computes the loss, the backward pass, which computes the gradient, and the update, which uses the gradient to increment the weights.
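As a hedged illustration of that forward/backward/update loop, here is a self-contained numpy toy; the single linear layer, the L2-style loss, and the learning rate are stand-ins chosen for brevity, not the lecture's actual framework code.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(100, 3)                      # 100 examples with 3 features each
y = np.random.randn(100, 5)                      # 100 target vectors with 5 outputs each
W = 0.01 * np.random.randn(3, 5)                 # weights of a single linear layer
learning_rate = 1e-2

for step in range(200):
    scores = x.dot(W)                            # forward pass
    loss = np.sum((scores - y) ** 2) / x.shape[0]    # forward pass ends in the loss
    dscores = 2 * (scores - y) / x.shape[0]      # backward pass: gradient on the scores
    dW = x.T.dot(dscores)                        # chain rule back onto the weights
    W -= learning_rate * dW                      # update: step against the gradient
```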
The forward-backward API in deep learning involves the iterative process of forward pass, backward pass, and update, where the forward pass computes the loss, the backward pass computes the gradient, and the update uses the gradient to increment the weights.', 'The layers in a deep learning framework implement small function pieces that perform forward and backward operations, with the example of the MulConstant layer in Torch performing a scaling operation on a tensor (relevance: 3)']}, {'end': 3421.362, 'segs': [{'end': 2797.061, 'src': 'embed', 'start': 2767.124, 'weight': 0, 'content': [{'end': 2772.089, 'text': 'the Jacobian is still a giant 4096 by 4096 matrix, but it has special structure right?', 'start': 2767.124, 'duration': 4.965}, {'end': 2774.031, 'text': 'And what is that special structure?', 'start': 2772.97, 'duration': 1.061}, {'end': 2774.932, 'text': 'Go ahead.', 'start': 2774.431, 'duration': 0.501}, {'end': 2777.875, 'text': 'Yeah, so this Jacobian is huge.', 'start': 2774.952, 'duration': 2.923}, {'end': 2779.797, 'text': "So it's 4096 by 4096 matrix.", 'start': 2778.375, 'duration': 1.422}, {'end': 2789.258, 'text': "But there's only elements on the diagonal, because this is an element-wise operation.", 'start': 2785.897, 'duration': 3.361}, {'end': 2792.7, 'text': "And moreover, they're not just ones.", 'start': 2789.979, 'duration': 2.721}, {'end': 2797.061, 'text': 'But for whichever element was less than zero, it was clamped to zero.', 'start': 2793.62, 'duration': 3.441}], 'summary': 'The jacobian is a 4096x4096 matrix with diagonal elements only, clamped to zero if less than zero.', 'duration': 29.937, 'max_score': 2767.124, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2767124.jpg'}, {'end': 2849.699, 'src': 'embed', 'start': 2819.093, 'weight': 1, 'content': [{'end': 2821.756, 'text': "because there's special structure that we want to take advantage of.", 'start': 2819.093, 'duration': 2.663}, {'end': 2827.481, 'text': 'And so in particular, the gradient, the backward pass for this operation is very, very easy.', 'start': 2822.256, 'duration': 5.225}, {'end': 2835.207, 'text': 'Because you just want to look at all the dimensions where your input was less than 0, and you want to kill the gradient in those dimensions.', 'start': 2828.281, 'duration': 6.926}, {'end': 2837.329, 'text': 'You want to set the gradient to 0 in those dimensions.', 'start': 2835.227, 'duration': 2.102}, {'end': 2843.053, 'text': 'So you take the grad output here, and whichever numbers were less than 0, just set them to 0.', 'start': 2837.789, 'duration': 5.264}, {'end': 2845.736, 'text': 'Set those gradients to 0, and then you continue backward pass.', 'start': 2843.053, 'duration': 2.683}, {'end': 2849.699, 'text': 'So very simple operations in the end in terms of efficiency.', 'start': 2846.556, 'duration': 3.143}], 'summary': 'Efficient gradient backward pass by setting negative gradients to 0.', 'duration': 30.606, 'max_score': 2819.093, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2819093.jpg'}, {'end': 2963.869, 'src': 'embed', 'start': 2938.228, 'weight': 2, 'content': [{'end': 2945.373, 'text': 'But all the examples in a mini-batch are processed independently of each other, in parallel, and so this Jacobian matrix really ends up being 400,', 'start': 2938.228, 'duration': 7.145}, {'end': 2946.433, 'text': '000 by 400, 000..', 'start': 2945.373, 'duration':
1.06}, {'end': 2949.276, 'text': 'So you never form these, basically.', 'start': 2946.434, 'duration': 2.842}, {'end': 2955.241, 'text': 'And you take care to actually take advantage of the sparsity structure in that Jacobian.', 'start': 2949.597, 'duration': 5.644}, {'end': 2956.843, 'text': 'And you hand code operations.', 'start': 2955.602, 'duration': 1.241}, {'end': 2960.586, 'text': "You don't actually write the fully generalized chain rule inside any gate implementation.", 'start': 2956.863, 'duration': 3.723}, {'end': 2963.869, 'text': 'OK, cool.', 'start': 2962.367, 'duration': 1.502}], 'summary': 'Mini-batch examples processed independently in parallel, resulting in a 400,000 by 400,000 jacobian matrix. sparsity structure is utilized, and operations are hand-coded.', 'duration': 25.641, 'max_score': 2938.228, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2938228.jpg'}, {'end': 3099.592, 'src': 'embed', 'start': 3071.862, 'weight': 3, 'content': [{'end': 3075.504, 'text': 'And this communication is always along like vectors being passed around.', 'start': 3071.862, 'duration': 3.642}, {'end': 3080.406, 'text': "In practice, when we write these implementations, what we're passing around are these n-dimensional tensors.", 'start': 3076.145, 'duration': 4.261}, {'end': 3084.647, 'text': 'Really what that means is just an n-dimensional array, so like a numpy array.', 'start': 3081.126, 'duration': 3.521}, {'end': 3086.488, 'text': 'Those are what goes between the gates.', 'start': 3084.807, 'duration': 1.681}, {'end': 3090.569, 'text': 'And then internally, every single gate knows what to do in the forward and the backward pass.', 'start': 3086.888, 'duration': 3.681}, {'end': 3097.211, 'text': "OK, so at this point, I'm going to end with backpropagation, and I'm going to go into neural networks.", 'start': 3092.43, 'duration': 4.781}, {'end': 3099.592, 'text': 'So any questions before we move on from backprop? 
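Before moving on from backprop, here is a small numpy sketch of the point just made about the ReLU non-linearity: the giant Jacobian is never materialized, and the backward pass simply zeroes the incoming gradient wherever the input was clamped; the array shapes below are illustrative.

```python
import numpy as np

x = np.random.randn(100, 4096)            # a mini-batch of 100 inputs, 4096 numbers each
out = np.maximum(0, x)                    # ReLU forward pass (element-wise threshold at 0)

grad_out = np.random.randn(*out.shape)    # gradient arriving from the layer above
grad_x = grad_out * (x > 0)               # backward pass: kill the gradient where x was negative
# The (100*4096) x (100*4096) Jacobian is never formed; its diagonal structure is used directly.
```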
Go ahead.', 'start': 3097.531, 'duration': 2.061}], 'summary': 'Communication involves passing n-dimensional tensors, like numpy arrays, between gates in backpropagation.', 'duration': 27.73, 'max_score': 3071.862, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo3071862.jpg'}, {'end': 3281.476, 'src': 'embed', 'start': 3250.439, 'weight': 4, 'content': [{'end': 3255.483, 'text': 'So hidden vector h of 100 numbers, say, or whatever you want your size of the neural network to be.', 'start': 3250.439, 'duration': 5.044}, {'end': 3258.826, 'text': "So this is a hyperparameter that's, say, 100.", 'start': 3255.944, 'duration': 2.882}, {'end': 3260.707, 'text': 'And we go through this intermediate representation.', 'start': 3258.826, 'duration': 1.881}, {'end': 3265.869, 'text': 'So matrix multiply gives us 100 numbers, threshold at 0, and then one more matrix multiply to get the scores.', 'start': 3260.767, 'duration': 5.102}, {'end': 3271.111, 'text': 'And since we have more numbers, we have more wiggle to do more interesting things.', 'start': 3266.929, 'duration': 4.182}, {'end': 3281.476, 'text': 'So one particular example of something interesting you might think that a neural network could do is going back to this example of interpreting linear classifiers on CIFAR-10..', 'start': 3271.651, 'duration': 9.825}], 'summary': 'Neural network uses hidden vector of 100 numbers, offering more flexibility.', 'duration': 31.037, 'max_score': 3250.439, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo3250439.jpg'}, {'end': 3363.272, 'src': 'embed', 'start': 3330.653, 'weight': 5, 'content': [{'end': 3333.234, 'text': 'So now we can have a template for all these different modes.', 'start': 3330.653, 'duration': 2.581}, {'end': 3335.955, 'text': 'And so these neurons turn on or off.', 'start': 3334.194, 'duration': 1.761}, {'end': 3346.029, 'text': "if they find the thing, they're looking for a car of some specific type, and then this W2 matrix can sum across all those little car templates.", 'start': 3335.955, 'duration': 10.074}, {'end': 3348.929, 'text': 'So now we have, say, 20 car templates of what cars could look like.', 'start': 3346.309, 'duration': 2.62}, {'end': 3355.871, 'text': "And now to compute the score of car classifier, there's an additional matrix multiplier, so we have a choice of doing a weighted sum over them.", 'start': 3349.59, 'duration': 6.281}, {'end': 3363.272, 'text': 'And so if any one of them turn on, then through my weighted sum with positive weights, presumably, I would be adding up and getting a higher score.', 'start': 3356.371, 'duration': 6.901}], 'summary': 'Neurons activate templates to compute scores for car classification.', 'duration': 32.619, 'max_score': 3330.653, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo3330653.jpg'}, {'end': 3413.939, 'src': 'embed', 'start': 3386.842, 'weight': 6, 'content': [{'end': 3391.185, 'text': 'So the question is if h had less than 10 units, would it be inferior to a linear classifier??', 'start': 3386.842, 'duration': 4.343}, {'end': 3395.087, 'text': "I think that's actually not obvious to me.", 'start': 3391.825, 'duration': 3.262}, {'end': 3396.228, 'text': "It's an interesting question.", 'start': 3395.307, 'duration': 0.921}, {'end': 3399.41, 'text': 'I think you could make that work.', 'start': 3396.648, 'duration': 2.762}, {'end': 3400.511, 'text': 
'I think you could make it work.', 'start': 3399.85, 'duration': 0.661}, {'end': 3404.695, 'text': 'Yeah, I think that would actually work.', 'start': 3403.674, 'duration': 1.021}, {'end': 3406.896, 'text': 'Someone should try that for extra points on the assignment.', 'start': 3405.055, 'duration': 1.841}, {'end': 3409.497, 'text': "So you'll have a section on the assignment, do something fun or extra.", 'start': 3407.336, 'duration': 2.161}, {'end': 3413.939, 'text': "And so you get to come up with whatever you think is interesting experiment, and we'll give you some bonus points.", 'start': 3410.017, 'duration': 3.922}], 'summary': 'Exploring if h had less than 10 units and its impact on linear classifier, proposing it as an interesting experiment for extra points on the assignment.', 'duration': 27.097, 'max_score': 3386.842, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo3386842.jpg'}], 'start': 2606.083, 'title': 'Neural network computations', 'summary': 'Covers jacobian matrices in neural networks, focusing on their structure, impact on backpropagation efficiency, and handling high-dimensional data. it also discusses computational structures, n-dimensional tensors, hidden layers, and their applications in multimodal classification.', 'chapters': [{'end': 3052.456, 'start': 2606.083, 'title': 'Understanding jacobian matrices in neural networks', 'summary': 'Discusses the concept of jacobian matrices in neural networks, specifically focusing on the special structure of the jacobian matrix, the implications for backpropagation efficiency, and the practical considerations in handling high-dimensional data and mini-batches.', 'duration': 446.373, 'highlights': ['Special Structure of Jacobian Matrix The Jacobian matrix for a common non-linearity operation in neural networks has a huge size of 4096 by 4096, but it exhibits a special structure with elements mainly on the diagonal and some being zero, indicating the element-wise nature of the operation.', 'Efficiency in Backpropagation The special structure of the Jacobian matrix allows for a simplified backward pass, where gradients are set to 0 in dimensions where the input was less than 0, leading to simple operations and efficiency in backpropagation.', 'Handling High-Dimensional Data and Mini-Batches The practical implementation of neural network operations involves considering high-dimensional data, such as 4096-dimensional vectors, and processing mini-batches of 100 elements in parallel, leading to the need for efficient handling and taking advantage of the sparsity structure in the Jacobian matrix.']}, {'end': 3421.362, 'start': 3054.324, 'title': 'Neural networks and backpropagation', 'summary': 'Discusses the computational structures and communication between nodes in neural networks, the use of n-dimensional tensors for passing data between gates, and the implementation of neural networks with hidden layers and their applications in multimodal classification.', 'duration': 367.038, 'highlights': ['The communication between nodes in neural networks is facilitated by passing n-dimensional tensors, which are essentially numpy arrays, and each gate knows how to process data in the forward and backward passes.', 'The implementation of neural networks involves matrix multiplication, thresholding at 0, and activation functions to generate scores, allowing for the creation of hidden layers that can capture multimodal features for improved classification.', "The introduction of 
hidden layers in neural networks enables the creation of templates for different modes, allowing neurons to activate based on specific features in the input data and thereby improving the model's ability to classify complex patterns.", 'The potential of using a neural network with less than 10 units in the hidden layer compared to a linear classifier is highlighted as an interesting experiment that could be explored for bonus points in an assignment.']}], 'duration': 815.279, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo2606083.jpg', 'highlights': ['Special Structure of Jacobian Matrix: The Jacobian matrix exhibits a special structure with elements mainly on the diagonal and some being zero, indicating the element-wise nature of the operation.', 'Efficiency in Backpropagation: The special structure of the Jacobian matrix allows for a simplified backward pass, leading to simple operations and efficiency in backpropagation.', 'Handling High-Dimensional Data and Mini-Batches: Practical implementation involves considering high-dimensional data, such as 4096-dimensional vectors, and processing mini-batches of 100 elements in parallel, leading to the need for efficient handling and taking advantage of the sparsity structure in the Jacobian matrix.', 'Communication between Nodes: The communication between nodes in neural networks is facilitated by passing n-dimensional tensors, which are essentially numpy arrays, and each gate knows how to process data in the forward and backward passes.', 'Hidden Layers and Multimodal Classification: The implementation of neural networks involves matrix multiplication, thresholding at 0, and activation functions to generate scores, allowing for the creation of hidden layers that can capture multimodal features for improved classification.', "Neural Network's Hidden Layers: The introduction of hidden layers in neural networks enables the creation of templates for different modes, allowing neurons to activate based on specific features in the input data and thereby improving the model's ability to classify complex patterns.", 'Potential of Using a Neural Network with Less than 10 Units: The potential of using a neural network with less than 10 units in the hidden layer compared to a linear classifier is highlighted as an interesting experiment that could be explored for bonus points in an assignment.']}, {'end': 4078.53, 'segs': [{'end': 3518.869, 'src': 'embed', 'start': 3485.117, 'weight': 2, 'content': [{'end': 3490.161, 'text': "So the number 100 for h was like chosen arbitrary? 
That's right.", 'start': 3485.117, 'duration': 5.044}, {'end': 3492.222, 'text': "So that's the size of a hidden layer, and that's a hyperparameter.", 'start': 3490.201, 'duration': 2.021}, {'end': 3492.943, 'text': 'We get to choose that.', 'start': 3492.242, 'duration': 0.701}, {'end': 3494.723, 'text': 'So I chose 100.', 'start': 3494.023, 'duration': 0.7}, {'end': 3496.884, 'text': "Usually that's going to be.", 'start': 3494.723, 'duration': 2.161}, {'end': 3499.524, 'text': "usually you'll see that with neural networks we'll go into this a lot,", 'start': 3496.884, 'duration': 2.64}, {'end': 3503.425, 'text': 'but usually you want them to be as big as possible as it fits in your computer and so on.', 'start': 3499.524, 'duration': 3.901}, {'end': 3504.626, 'text': 'so more is better.', 'start': 3503.425, 'duration': 1.201}, {'end': 3505.746, 'text': "But we'll go into that.", 'start': 3504.646, 'duration': 1.1}, {'end': 3507.026, 'text': 'Go ahead.', 'start': 3506.866, 'duration': 0.16}, {'end': 3518.869, 'text': "Is it always the max of 0 and h? So you're asking do we always take max of 0 and h? And we don't, and I'll get, it's like five slides away.", 'start': 3507.046, 'duration': 11.823}], 'summary': 'The size of a hidden layer is a hyperparameter, typically chosen to be as large as possible within computer constraints.', 'duration': 33.752, 'max_score': 3485.117, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo3485117.jpg'}, {'end': 3573.524, 'src': 'embed', 'start': 3544.624, 'weight': 0, 'content': [{'end': 3552.008, 'text': 'Now, one other slide I wanted to flash is that Training a two-layer neural network is actually quite simple when it comes down to it.', 'start': 3544.624, 'duration': 7.384}, {'end': 3554.01, 'text': 'So this is a slide borrowed from a blog post I found.', 'start': 3552.028, 'duration': 1.982}, {'end': 3555.611, 'text': 'And basically,', 'start': 3554.67, 'duration': 0.941}, {'end': 3563.956, 'text': 'it suffices roughly 11 lines of Python to implement a two-layer neural network doing binary classification on what is this two-dimensional data?', 'start': 3555.611, 'duration': 8.345}, {'end': 3566.437, 'text': 'So you have a two-dimensional data matrix, x.', 'start': 3564.416, 'duration': 2.021}, {'end': 3573.524, 'text': "you have sorry, it's three-dimensional, and you have binary labels for y and then sin zero, sin one.", 'start': 3567.441, 'duration': 6.083}], 'summary': 'Training a two-layer neural network requires roughly 11 lines of python code for binary classification on three-dimensional data.', 'duration': 28.9, 'max_score': 3544.624, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo3544624.jpg'}, {'end': 3611.638, 'src': 'embed', 'start': 3586.07, 'weight': 3, 'content': [{'end': 3592.672, 'text': "What you're seeing here is we're computing the first layer activations, but this is using a sigmoid nonlinearity, not a max of 0 and x.", 'start': 3586.07, 'duration': 6.602}, {'end': 3595.612, 'text': "And we'll go into a bit of what these nonlinearities might be.", 'start': 3592.672, 'duration': 2.94}, {'end': 3597.373, 'text': 'So sigmoid is one form.', 'start': 3596.213, 'duration': 1.16}, {'end': 3603.314, 'text': "It's computing the first layer, and then it's computing the second layer, and then it's computing here right away the backward pass.", 'start': 3597.773, 'duration': 5.541}, {'end': 3605.675, 'text': 'So this is the L2 delta 
as the gradient on L2.', 'start': 3603.674, 'duration': 2.001}, {'end': 3611.638, 'text': 'the gradient on L1, and the gradient, and this is an update here.', 'start': 3606.415, 'duration': 5.223}], 'summary': 'Discussing computation of layer activations using sigmoid nonlinearity and backward pass for gradient updates.', 'duration': 25.568, 'max_score': 3586.07, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo3586070.jpg'}, {'end': 3816.078, 'src': 'embed', 'start': 3787.241, 'weight': 4, 'content': [{'end': 3791.947, 'text': 'And so a neuron, just very briefly, just to give you an idea about where this is all coming from.', 'start': 3787.241, 'duration': 4.706}, {'end': 3799.058, 'text': "you have the cell body or a soma, as people like to call it, and it's got all these dendrites that are connected to other neurons.", 'start': 3793.291, 'duration': 5.767}, {'end': 3805.625, 'text': "So there's a cluster of other neurons and cell bodies over here, and dendrites are really these appendages that listen to them.", 'start': 3799.118, 'duration': 6.507}, {'end': 3808.869, 'text': 'So this is your inputs to a neuron,', 'start': 3806.346, 'duration': 2.523}, {'end': 3816.078, 'text': "and then it's got a single axon that comes out of a neuron that carries the output of the computation that this neuron performs.", 'start': 3808.869, 'duration': 7.209}], 'summary': 'A neuron consists of a cell body, dendrites for inputs, and a single axon for output.', 'duration': 28.837, 'max_score': 3787.241, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo3787241.jpg'}, {'end': 4086.197, 'src': 'embed', 'start': 4057.396, 'weight': 1, 'content': [{'end': 4061.079, 'text': 'So right now, if you wanted a default choice for non-linearity, use ReLU.', 'start': 4057.396, 'duration': 3.683}, {'end': 4063.02, 'text': "That's the current default recommendation.", 'start': 4061.299, 'duration': 1.721}, {'end': 4067.003, 'text': "And then there's a few kind of hipster activation functions here.", 'start': 4063.621, 'duration': 3.382}, {'end': 4070.365, 'text': 'And so leaky ReLUs were proposed a few years ago.', 'start': 4067.623, 'duration': 2.742}, {'end': 4071.926, 'text': 'Maxout is interesting.', 'start': 4070.985, 'duration': 0.941}, {'end': 4073.667, 'text': 'Very recently, ELU.', 'start': 4072.186, 'duration': 1.481}, {'end': 4078.53, 'text': 'And so you can come up with different activation functions, and you can describe why these might work better or not.', 'start': 4074.188, 'duration': 4.342}, {'end': 4083.214, 'text': 'And so this is an active area of research is trying to come up with these activation functions that perform,', 'start': 4078.711, 'duration': 4.503}, {'end': 4086.197, 'text': 'that have better properties in one way or another.', 'start': 4084.255, 'duration': 1.942}], 'summary': 'Current default recommendation for non-linearity is relu, with various other hipster activation functions like leaky relus, maxout, and elu being proposed in recent years, forming an active area of research.', 'duration': 28.801, 'max_score': 4057.396, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo4057396.jpg'}], 'start': 3421.422, 'title': 'Neural network training and architecture', 'summary': 'Discusses the architecture and training of neural networks, including the allocation of hidden layers, simplicity of training a two-layer neural
network, and use of sigmoid nonlinearity. it also explains the computation steps involved, structure of neurons, and recommends using relu as the default activation function.', 'chapters': [{'end': 3644.6, 'start': 3421.422, 'title': 'Neural network training and architecture', 'summary': 'Discusses the architecture and training of a neural network, including the allocation of the hidden layer, the simplicity of training a two-layer neural network, and the use of a sigmoid nonlinearity in the computational process.', 'duration': 223.178, 'highlights': ["The size of the hidden layer, denoted as 'h', is a hyperparameter and was chosen as 100, aiming for a larger size as long as it fits in the computer.", 'Training a two-layer neural network for binary classification can be implemented with roughly 11 lines of Python code, demonstrating the simplicity of the process.', 'The computational process involves computing the first layer activations using a sigmoid nonlinearity, followed by computing the second layer and the backward pass for updating the weights.']}, {'end': 4078.53, 'start': 3644.74, 'title': 'Neural network training and structure', 'summary': 'Explains the process of training neural networks, including the computation steps involved and the structure of neurons, with a recommendation to use relu as the default activation function.', 'duration': 433.79, 'highlights': ['The process of training neural networks involves simple computation steps such as forward pass, backward pass, and updating, with very few lines of code required. The process of training neural networks involves simple computation steps such as forward pass, backward pass, and updating, with very few lines of code required.', 'A recommendation is made to use ReLU as the default choice for non-linearity due to its faster convergence compared to other activation functions like sigmoid and tanh. A recommendation is made to use ReLU as the default choice for non-linearity due to its faster convergence compared to other activation functions like sigmoid and tanh.', 'The structure of neurons is explained, illustrating the components such as dendrites, soma, and axon, along with a simple model of neuron computation involving weighted inputs and an activation function. 
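As a minimal sketch of that weighted-sum-plus-nonlinearity neuron model (the particular weights, bias, and the sigmoid choice below are illustrative):

```python
import numpy as np

def neuron_forward(x, w, b):
    # Inputs arriving on the dendrites are combined as a weighted sum plus a bias,
    # then squashed by a non-linearity (sigmoid here), standing in for a firing rate.
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

rate = neuron_forward(x=np.array([1.0, -2.0, 3.0]),
                      w=np.array([0.5, 0.1, -0.3]),
                      b=0.2)
```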
The structure of neurons is explained, illustrating the components such as dendrites, soma, and axon, along with a simple model of neuron computation involving weighted inputs and an activation function.']}], 'duration': 657.108, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo3421422.jpg', 'highlights': ['Training a two-layer neural network for binary classification can be implemented with roughly 11 lines of Python code, demonstrating the simplicity of the process.', 'A recommendation is made to use ReLU as the default choice for non-linearity due to its faster convergence compared to other activation functions like sigmoid and tanh.', "The size of the hidden layer, denoted as 'h', is a hyperparameter and was chosen as 100, aiming for a larger size as long as it fits in the computer.", 'The computational process involves computing the first layer activations using a sigmoid nonlinearity, followed by computing the second layer and the backward pass for updating the weights.', 'The structure of neurons is explained, illustrating the components such as dendrites, soma, and axon, along with a simple model of neuron computation involving weighted inputs and an activation function.']}, {'end': 4771.181, 'segs': [{'end': 4168.287, 'src': 'embed', 'start': 4141.09, 'weight': 2, 'content': [{'end': 4147.612, 'text': 'So, instead of having an amorphous blob of neurons and every one of them has to be computed independently having them in layers allows us to use vectorized operations.', 'start': 4141.09, 'duration': 6.522}, {'end': 4153.956, 'text': 'And so we can compute an entire set of neurons in a single hidden layer at a single time as a matrix multiply.', 'start': 4148.152, 'duration': 5.804}, {'end': 4160.321, 'text': "And that's why we arranged them in these layers, where neurons inside a layer can be evaluated completely in parallel and they all see the same input.", 'start': 4154.277, 'duration': 6.044}, {'end': 4163.462, 'text': "So it's a computational trick to arrange them in layers.", 'start': 4160.841, 'duration': 2.621}, {'end': 4168.287, 'text': 'So this is a three-layer neural net, and this is how you would compute it.', 'start': 4163.964, 'duration': 4.323}], 'summary': 'Neural net with 3 layers allows vectorized computations, improving parallel evaluation.', 'duration': 27.197, 'max_score': 4141.09, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo4141090.jpg'}, {'end': 4202.894, 'src': 'embed', 'start': 4175.054, 'weight': 0, 'content': [{'end': 4177.988, 'text': "So now I'd like to show you a demo of how these neural networks work.", 'start': 4175.054, 'duration': 2.934}, {'end': 4181.941, 'text': "So this is a JavaScript demo that I'll show you in a bit.", 'start': 4180.1, 'duration': 1.841}, {'end': 4189.406, 'text': 'But basically, this is an example of a two-layer neural network doing a binary classification task.', 'start': 4182.562, 'duration': 6.844}, {'end': 4191.006, 'text': 'So we have two classes, red and green.', 'start': 4189.685, 'duration': 1.321}, {'end': 4193.008, 'text': 'And so we have these points in two dimensions.', 'start': 4191.506, 'duration': 1.502}, {'end': 4195.95, 'text': "And I'm drawing the decision boundaries by the neural network.", 'start': 4193.528, 'duration': 2.422}, {'end': 4202.894, 'text': 'And so what you can see is when I train a neural network on this data, the more hidden neurons I have in my hidden layer,', 'start': 
4196.51, 'duration': 6.384}], 'summary': 'Demo of two-layer neural network for binary classification in javascript', 'duration': 27.84, 'max_score': 4175.054, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo4175054.jpg'}, {'end': 4243.415, 'src': 'embed', 'start': 4215.635, 'weight': 1, 'content': [{'end': 4219.998, 'text': "So you can see that when you insist that your w's are very small, you end up with very smooth functions.", 'start': 4215.635, 'duration': 4.363}, {'end': 4223.24, 'text': "So they don't have as much variance.", 'start': 4220.798, 'duration': 2.442}, {'end': 4227.772, 'text': "So these neural networks, there's not as much wiggle that they can give you.", 'start': 4225.071, 'duration': 2.701}, {'end': 4232.753, 'text': 'And then as you decrease the regularization, these neural networks can do more and more complex tasks.', 'start': 4228.352, 'duration': 4.401}, {'end': 4237.854, 'text': 'So they can kind of get in and get these little squeezed out points to cover them in the training data.', 'start': 4232.793, 'duration': 5.061}, {'end': 4243.415, 'text': 'So let me show you what this looks like during training.', 'start': 4238.174, 'duration': 5.241}], 'summary': "Smaller w's lead to smoother functions, less variance in neural networks, while decreasing regularization allows for more complex tasks.", 'duration': 27.78, 'max_score': 4215.635, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo4215635.jpg'}, {'end': 4633.068, 'src': 'embed', 'start': 4602.961, 'weight': 3, 'content': [{'end': 4604.182, 'text': 'More is always better.', 'start': 4602.961, 'duration': 1.221}, {'end': 4608.099, 'text': "It's usually computational constraint.", 'start': 4606.638, 'duration': 1.461}, {'end': 4612.02, 'text': 'So more will always work better, but then you have to be careful to regularize it properly.', 'start': 4608.339, 'duration': 3.681}, {'end': 4617.082, 'text': 'So the correct way to constrain your neural networks to not overfit your data is not by making the network smaller.', 'start': 4612.36, 'duration': 4.722}, {'end': 4619.643, 'text': 'The correct way to do it is to increase your regularization.', 'start': 4617.522, 'duration': 2.121}, {'end': 4624.985, 'text': 'So you always want to use as large of a network as you want, but then you have to make sure to properly regularize it.', 'start': 4620.503, 'duration': 4.482}, {'end': 4627.266, 'text': 'But most of the time because computational reasons.', 'start': 4625.385, 'duration': 1.881}, {'end': 4628.406, 'text': 'you have finite amount of time.', 'start': 4627.266, 'duration': 1.14}, {'end': 4630.487, 'text': "you don't want to wait forever to train your networks.", 'start': 4628.406, 'duration': 2.081}, {'end': 4633.068, 'text': "you'll use smaller ones for practical reasons.", 'start': 4630.487, 'duration': 2.581}], 'summary': 'Larger networks work better, but proper regularization is key to avoid overfitting. 
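The recipe above, keep the network as large as is practical and control overfitting through the regularization strength, can be sketched as follows; the L2 form and the reg value here are illustrative assumptions.

```python
import numpy as np

def total_loss(data_loss, W1, W2, reg=1e-3):
    # L2 regularization pulls weights toward zero: a larger reg gives the smoother,
    # lower-variance decision functions seen in the demo; a smaller reg allows more wiggle.
    reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
    return data_loss + reg_loss
# Each weight's gradient then simply picks up an extra reg * W term during the backward pass.
```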
smaller networks are used for practical reasons.', 'duration': 30.107, 'max_score': 4602.961, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo4602961.jpg'}, {'end': 4706.29, 'src': 'embed', 'start': 4681.875, 'weight': 4, 'content': [{'end': 4688.099, 'text': "Question What's the difference between having more neurons in your hidden layer versus just making the vector size bigger?", 'start': 4681.875, 'duration': 6.224}, {'end': 4690.14, 'text': 'So instead of 100, you can make it 500?', 'start': 4688.119, 'duration': 2.021}, {'end': 4692.561, 'text': 'So what is the trade-off between depth and size roughly?', 'start': 4690.14, 'duration': 2.421}, {'end': 4693.462, 'text': 'How do you allocate?', 'start': 4692.801, 'duration': 0.661}, {'end': 4695.603, 'text': 'Not a good answer for that, unfortunately.', 'start': 4693.922, 'duration': 1.681}, {'end': 4702.087, 'text': "So you want depth is good, but maybe after 10 layers, maybe if you have simple data set, it's not really adding too much.", 'start': 4696.404, 'duration': 5.683}, {'end': 4705.489, 'text': 'We have one more minute, so I can still take some questions.', 'start': 4702.107, 'duration': 3.382}, {'end': 4706.29, 'text': 'You had a question for a while.', 'start': 4705.509, 'duration': 0.781}], 'summary': 'Discussion on trade-offs between depth and size in neural networks.', 'duration': 24.415, 'max_score': 4681.875, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo4681875.jpg'}], 'start': 4078.711, 'title': 'Neural network capacity', 'summary': "Explores the impact of neuron number and regularization strength on a neural network's flexibility, showcasing a javascript demo of a two-layer neural network for binary classification, and discusses the trade-off between depth and size in network performance.", 'chapters': [{'end': 4503.073, 'start': 4078.711, 'title': 'Neural network activation functions', 'summary': "Discusses the arrangement of neurons into layers within a neural network, showcasing the effect of the number of neurons and regularization strength on the network's flexibility and complexity, and demonstrates a javascript demo of a two-layer neural network for binary classification.", 'duration': 424.362, 'highlights': ["The demonstration of a JavaScript demo of a two-layer neural network for binary classification showcases the impact of the number of neurons and regularization strength on the network's flexibility and complexity, offering insights into how the network warps the input space and the minimum number of neurons required for separability.", 'The explanation of arranging neurons into layers within a neural network highlights the computational efficiency achieved through vectorized operations and the use of matrix multiplies, enabling parallel evaluation of neurons in a single hidden layer.', 'The discussion of the impact of regularization strength on neural networks demonstrates how higher regularization leads to smoother functions with reduced variance, while lower regularization allows for more complex tasks and the ability to cover squeezed out points in the training data.']}, {'end': 4771.181, 'start': 4503.153, 'title': 'Neural networks and capacity', 'summary': 'Discusses arranging neural networks into fully connected layers, the impact of more neurons on network performance, the trade-off between depth and size, and the use of different activation functions for different layers.', 
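To illustrate the layer-wise, fully connected evaluation this chapter describes, here is a hedged numpy sketch of a two-layer network's forward pass; the input size, the 100 hidden units, and the 10 output scores are illustrative choices, not the lecture's exact setup.

```python
import numpy as np

# A whole fully connected layer is evaluated in parallel as one matrix multiply.
x = np.random.randn(3072)                # e.g. a flattened 32x32x3 CIFAR-10 image
W1 = 0.01 * np.random.randn(100, 3072)   # first layer: 100 hidden units (a hyperparameter)
b1 = np.zeros(100)
W2 = 0.01 * np.random.randn(10, 100)     # second layer: maps hidden units to 10 class scores
b2 = np.zeros(10)

h = np.maximum(0, W1.dot(x) + b1)        # hidden layer: matrix multiply, then ReLU threshold at 0
scores = W2.dot(h) + b2                  # class scores
```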
'duration': 268.028, 'highlights': ['Arranging neural networks into fully connected layers and the impact of backpropagation on computational graphs The chapter discusses arranging neural networks into fully connected layers and explores backpropagation and its impact on computational graphs.', 'The impact of more neurons on network performance and the need for proper regularization The chapter emphasizes that having more neurons in a neural network is generally better, but proper regularization is necessary to avoid overfitting.', 'The trade-off between depth and size in neural networks The chapter addresses the trade-off between depth and size in neural networks, highlighting that depth is crucial for images but less critical for simpler datasets.', 'The use of different activation functions for different layers and the preference for using a single activation function throughout The chapter discusses the use of different activation functions for different layers in neural networks and indicates that using a single activation function throughout, such as ReLU, is the common practice.']}], 'duration': 692.47, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/i94OvYb6noo/pics/i94OvYb6noo4078711.jpg', 'highlights': ["The demonstration of a JavaScript demo of a two-layer neural network for binary classification showcases the impact of the number of neurons and regularization strength on the network's flexibility and complexity, offering insights into how the network warps the input space and the minimum number of neurons required for separability.", 'The discussion of the impact of regularization strength on neural networks demonstrates how higher regularization leads to smoother functions with reduced variance, while lower regularization allows for more complex tasks and the ability to cover squeezed out points in the training data.', 'The explanation of arranging neurons into layers within a neural network highlights the computational efficiency achieved through vectorized operations and the use of matrix multiplies, enabling parallel evaluation of neurons in a single hidden layer.', 'The impact of more neurons on network performance and the need for proper regularization The chapter emphasizes that having more neurons in a neural network is generally better, but proper regularization is necessary to avoid overfitting.', 'The trade-off between depth and size in neural networks The chapter addresses the trade-off between depth and size in neural networks, highlighting that depth is crucial for images but less critical for simpler datasets.']}], 'highlights': ['The upcoming deadline for assignment one is next Wednesday, with approximately 150 hours left for completion, highlighting the urgency of starting early to avoid time running out.', 'Understanding computational graphs helps in intuitively grasping gradient flow in neural networks, aiding in efficient debugging and addressing issues like the vanishing gradient problem.', 'The max gate serves as a gradient router, with a gradient of 1 on the larger input and 0 on the smaller input, efficiently propagating gradients through the circuit.', 'The concept of deep learning frameworks is built on a collection of layers, which are essentially the gates for building computational graphs, with Torch being an example.', 'Efficiency in Backpropagation: The special structure of the Jacobian matrix allows for a simplified backward pass, leading to simple operations and efficiency in backpropagation.', "The demonstration
of a JavaScript demo of a two-layer neural network for binary classification showcases the impact of the number of neurons and regularization strength on the network's flexibility and complexity, offering insights into how the network warps the input space and the minimum number of neurons required for separability."]}
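Tying the pieces above together, here is a compact numpy sketch in the spirit of the roughly-11-line, sigmoid-based two-layer network mentioned in the lecture (with syn0/syn1 weight matrices); the toy data, layer sizes, and iteration count are made-up choices for illustration, not the blog post's exact code.

```python
import numpy as np

# Toy data: 4 examples with 3 input features each, and binary labels.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1      # first-layer weights
syn1 = 2 * np.random.random((4, 1)) - 1      # second-layer weights

for _ in range(10000):
    l1 = 1 / (1 + np.exp(-X.dot(syn0)))      # forward: first layer, sigmoid non-linearity
    l2 = 1 / (1 + np.exp(-l1.dot(syn1)))     # forward: second layer, sigmoid non-linearity
    l2_delta = (y - l2) * (l2 * (1 - l2))    # backward: error signal on the second layer
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1 - l1))  # backward: propagate to the first layer
    syn1 += l1.T.dot(l2_delta)               # update second-layer weights
    syn0 += X.T.dot(l1_delta)                # update first-layer weights
```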