title

Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 3 – Neural Networks

description

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3kzqrg1
Professor Christopher Manning
Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science
Director, Stanford Artificial Intelligence Laboratory (SAIL)
To follow along with the course schedule and syllabus, visit: http://web.stanford.edu/class/cs224n/index.html#schedule
0:00 Introduction
0:21 Course plan: coming up
2:02 Homeworks
6:49 Classification setup and notation
7:08 Classification intuition
8:52 Details of the softmax classifier
12:01 Background: What is "cross entropy" loss/error?
16:56 Traditional ML optimization
19:01 Neural Network Classifiers: softmax (≈ logistic regression) alone is not very powerful
22:43 Classification difference with word vectors
26:10 Neural computation
27:16 An artificial neuron
28:49 A neuron can be a binary logistic regression unit
35:17 Matrix notation for a layer
46:34 Binary word window classification
48:08 Window classification: Softmax
51:10 Neural Network Feed-forward Computation
55:08 Computing Gradients by Hand
59:44 Jacobian Matrix: Generalization of the Gradient
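
The softmax classifier (8:52) and cross-entropy loss (12:01) segments can be followed with a small NumPy sketch. This is only an illustration under stated assumptions, not code from the lecture or the homework: the weight matrix `W`, the input `x`, and the helper names `softmax` and `cross_entropy` are all hypothetical, chosen to match the lecture's notation of C classes and a d-dimensional input.

```python
import numpy as np


def softmax(scores):
    """Turn a vector of class scores into a probability distribution."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def cross_entropy(p_true, q_model):
    """Cross entropy H(p, q) = -sum_c p(c) * log q(c)."""
    return -np.sum(p_true * np.log(q_model))


# Hypothetical toy setup: C = 3 classes, d = 4 input dimensions.
W = np.array([[0.2, -0.1, 0.0, 0.5],
              [0.1, 0.3, -0.2, 0.0],
              [-0.3, 0.0, 0.4, 0.1]])
x = np.array([1.0, 2.0, 0.5, -1.0])

probs = softmax(W @ x)                # model distribution q over the classes
one_hot = np.array([0.0, 1.0, 0.0])   # true distribution p: all mass on class 1

# With a one-hot p, the sum collapses to the negative log probability
# that the model assigns to the correct class.
loss = cross_entropy(one_hot, probs)
```

Replacing `one_hot` with a soft target such as `np.array([0.0, 0.5, 0.5])` gives the "probability half on both of them" case the lecture mentions for uncertain labels.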

detail

{'title': 'Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 3 – Neural Networks', 'heatmap': [{'end': 2941.263, 'start': 2835.133, 'weight': 0.867}, {'end': 3078.695, 'start': 2969.561, 'weight': 0.756}], 'summary': 'The lecture series covers fundamental topics in neural networks and machine learning, including backpropagation, classification, softmax regression, cross-entropy loss, and their applications in natural language processing. it also explores the origins, evolution, and structure of neural networks, emphasizing non-linearity, named entity recognition, word classification, neural network layers, and calculus in deep learning.', 'chapters': [{'end': 371.874, 'segs': [{'end': 115.799, 'src': 'embed', 'start': 43.476, 'weight': 1, 'content': [{'end': 50.662, 'text': 'and how we can learn good neural networks by backpropagation, which means in particular,', 'start': 43.476, 'duration': 7.186}, {'end': 57.528, 'text': "we're gonna be sort of talking about the training algorithms and doing calculus to work out gradients for improving them.", 'start': 50.662, 'duration': 6.866}, {'end': 67.052, 'text': "So we'll look at a little bit at, um, um, word window classification, named entity recognition.", 'start': 59.468, 'duration': 7.584}, {'end': 70.914, 'text': "So there's a teeny bit of natural language processing in there.", 'start': 67.112, 'duration': 3.802}, {'end': 82.578, 'text': 'But basically, sort of, week two is sort of, um, math of deep learning and neural network models and sort of really neural network fundamentals.', 'start': 71.174, 'duration': 11.404}, {'end': 94.042, 'text': 'Um, but the hope is that that will give you kind of a good understanding of how these things really work and will give you all the information you need to do um the coming up homework.', 'start': 82.598, 'duration': 11.444}, {'end': 98.185, 'text': 'And so then in week three, we kind of flip.', 'start': 94.542, 'duration': 3.643}, {'end': 103.549, 'text': 'So 
then week three is going to be mainly about natural language processing.', 'start': 98.265, 'duration': 5.284}, {'end': 108.293, 'text': "So we're then gonna talk about how to put syntactic structures over sentences.", 'start': 103.589, 'duration': 4.704}, {'end': 114.138, 'text': "um, for building dependency, parsers of sentences, which is then actually what's used in homework three.", 'start': 108.293, 'duration': 5.845}, {'end': 115.799, 'text': "So we're chugging along rapidly.", 'start': 114.278, 'duration': 1.521}], 'summary': 'Learning neural networks and natural language processing for week two and three, including training algorithms and syntactic structures.', 'duration': 72.323, 'max_score': 43.476, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo43476.jpg'}, {'end': 216.677, 'src': 'embed', 'start': 164.1, 'weight': 4, 'content': [{'end': 172.448, 'text': "So, on the first part of it, we're expecting you to grind through some math problems of working out gradient derivations, um.", 'start': 164.1, 'duration': 8.348}, {'end': 179.23, 'text': 'And then the second part of it is then implementing your own version of Word2Vec, making use of NumPy.', 'start': 172.768, 'duration': 6.462}, {'end': 184.251, 'text': "And so this time, sort of writing a Python program, it's no longer an IPython notebook.", 'start': 179.63, 'duration': 4.621}, {'end': 191.692, 'text': 'Um, encourage you to get early, um, look at the, um, materials, um, on the web.', 'start': 184.791, 'duration': 6.901}, {'end': 195.033, 'text': "I mean in particular corresponding to today's lecture.", 'start': 191.953, 'duration': 3.08}, {'end': 201.895, 'text': "there's um some quite good tutorial materials that are available on the website, and so also encourage you to look at those.", 'start': 195.033, 'duration': 6.862}, {'end': 207.632, 'text': 'Um, more generally just to make a couple more comments on things.', 'start': 203.57, 
'duration': 4.062}, {'end': 212.455, 'text': 'I mean, I guess this is true of a lot of classes at Stanford.', 'start': 208.052, 'duration': 4.403}, {'end': 216.677, 'text': 'But you know, when we get the course reviews for this class,', 'start': 212.735, 'duration': 3.942}], 'summary': 'Expect to work on math problems and implement word2vec using numpy.', 'duration': 52.577, 'max_score': 164.1, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo164100.jpg'}, {'end': 281.466, 'src': 'embed', 'start': 255.43, 'weight': 7, 'content': [{'end': 264.134, 'text': "We'd like everyone to succeed, but also like graduate level class, um, we'd like you to, you know, take some initiative in your success.", 'start': 255.43, 'duration': 8.704}, {'end': 269.858, 'text': "meaning. if there are things that you need to know to do the assignments and you don't know them, um,", 'start': 264.694, 'duration': 5.164}, {'end': 273.801, 'text': 'then you should be taking some initiative to find some tutorials,', 'start': 269.858, 'duration': 3.943}, {'end': 281.466, 'text': 'come to office hours and talk to people and get any help you need and learn to sort of for any holes in your knowledge.', 'start': 273.801, 'duration': 7.665}], 'summary': 'Encouraging students to take initiative for success by seeking help and learning independently.', 'duration': 26.036, 'max_score': 255.43, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo255430.jpg'}, {'end': 328.386, 'src': 'embed', 'start': 296.479, 'weight': 0, 'content': [{'end': 299.18, 'text': 'So talk a little bit about classification.', 'start': 296.479, 'duration': 2.701}, {'end': 311.009, 'text': 'um introduce neural networks um little detour into named entity recognition, then sort of show a model of doing um window- word window classification.', 'start': 299.18, 'duration': 11.829}, {'end': 319.319, 'text': 'And then at the end 
part, um, we sort of then dive deeper into what kind of tools we need, um, to learn neural networks.', 'start': 311.029, 'duration': 8.29}, {'end': 328.386, 'text': "And so today, um, we're gonna go through um somewhere between review and primer of um matrix calculus,", 'start': 319.679, 'duration': 8.707}], 'summary': 'Introduction to classification and neural networks, with a dive into named entity recognition and matrix calculus.', 'duration': 31.907, 'max_score': 296.479, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo296479.jpg'}], 'start': 5.214, 'title': 'Neural network and machine learning lecture', 'summary': 'Covers neural network fundamentals including training algorithms and backpropagation, emphasizing the math of deep learning. it also highlights the importance of submitting homework on time and outlines the plan for the machine learning lecture, including topics like classification, neural networks, and matrix calculus.', 'chapters': [{'end': 191.692, 'start': 5.214, 'title': 'Week 2: neural network fundamentals', 'summary': 'Covers the fundamentals of neural networks, including training algorithms, backpropagation, and word window classification, with a focus on the math of deep learning. 
It also emphasizes the importance of submitting homework on time and introduces the tasks for homework two.', 'duration': 186.478, 'highlights': ['The chapter emphasizes the fundamentals of neural networks, including training algorithms, backpropagation, and word window classification, with a focus on the math of deep learning.', 'It introduces the importance of submitting homework on time and mentions the tasks for homework two, which include working out gradient derivations and implementing Word2Vec using NumPy.', 'Week two focuses on the math of deep learning and neural network models, providing fundamental understanding for upcoming homework.', 'Week three will mainly cover natural language processing, including putting syntactic structures over sentences and discussing the probability of a sentence, leading into neural language models.', "Homework one was due and students are encouraged to submit it quickly to avoid using late days, while homework two corresponds to the week's lectures, involving math problems and implementing Word2Vec using NumPy."]}, {'end': 371.874, 'start': 191.953, 'title': 'Machine learning lecture update', 'summary': 'The chapter covers the importance of tutorial materials, encourages initiative in learning, and outlines the plan for the lecture, including topics like classification, neural networks, named entity recognition, and matrix calculus.', 'duration': 179.921, 'highlights': ['The importance of tutorial materials is emphasized, encouraging students to utilize them for learning.', 'Initiative in learning is encouraged, particularly in seeking help and filling knowledge gaps for assignments.', 'The lecture plan includes topics such as classification, neural networks, named entity recognition, and matrix calculus, catering to varied levels of understanding and aiming to provide a useful review for a large percentage of students.']}], 'duration': 366.66, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo5214.jpg', 'highlights': ['The lecture plan includes topics like classification, neural networks, named entity recognition, and matrix calculus.', 'Week three will cover natural language processing, including putting syntactic structures over sentences and discussing the probability of a sentence, leading into neural language models.', 'Week two focuses on the math of deep learning and neural network models, providing fundamental understanding for upcoming homework.', 'The chapter emphasizes the fundamentals of neural networks, including training algorithms, backpropagation, and word window classification, with a focus on the math of deep learning.', 'The importance of submitting homework on time and mentions the tasks for homework two, which include working out gradient derivations and implementing Word2Vec using NumPy.', "Homework one was due and students are encouraged to submit it quickly to avoid using late days, while homework two corresponds to the week's lectures, involving math problems and implementing Word2Vec using NumPy.", 'The importance of tutorial materials is emphasized, encouraging students to utilize them for learning.', 'Initiative in learning is encouraged, particularly in seeking help and filling knowledge gaps for assignments.']}, {'end': 1078.917, 'segs': [{'end': 416.875, 'src': 'embed', 'start': 389.443, 'weight': 0, 'content': [{'end': 400.668, 'text': 'So we have assumed we have uh training dataset where we have these um, vector x, um of our x points, and then for each of one of them we have a class.', 'start': 389.443, 'duration': 11.225}, {'end': 406.151, 'text': 'Um, so the inputs might be words or sentences, documents or something.', 'start': 401.328, 'duration': 4.823}, {'end': 408.031, 'text': "They're a d-dimensional vector.", 'start': 406.391, 'duration': 1.64}, {'end': 416.875, 'text': "Um, the yi are the labels or classes that we want to 
classify to, and we've got a set of c classes that we're trying to predict.", 'start': 408.472, 'duration': 8.403}], 'summary': 'Training dataset with d-dimensional vectors, c classes for classification.', 'duration': 27.432, 'max_score': 389.443, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo389443.jpg'}, {'end': 537.573, 'src': 'embed', 'start': 506.696, 'weight': 1, 'content': [{'end': 517.565, 'text': "Okay And in particular, if you've got a softmax classifier or a logistic regression classifier, these are what are called linear classifiers.", 'start': 506.696, 'duration': 10.869}, {'end': 525.832, 'text': 'So, the decision boundary between two classes here is a line in some suitably high-dimensional space.', 'start': 517.905, 'duration': 7.927}, {'end': 529.916, 'text': "So, it's a plane or a hyperplane once you've got a bigger x vector.", 'start': 525.852, 'duration': 4.064}, {'end': 537.573, 'text': "Okay So, here's our softmax classifier, um, and there are sort of two parts to that.", 'start': 531.368, 'duration': 6.205}], 'summary': 'Softmax and logistic regression classifiers are linear classifiers with decision boundaries in high-dimensional space.', 'duration': 30.877, 'max_score': 506.696, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo506696.jpg'}, {'end': 750.443, 'src': 'embed', 'start': 722.648, 'weight': 2, 'content': [{'end': 730.475, 'text': 'Um, so the concept of cross-entropy comes from baby information theory which is about the amount of information theory I know.', 'start': 722.648, 'duration': 7.827}, {'end': 740.88, 'text': "Um, so we're assuming that there's some true probability distribution P and our model, we've built some probability distribution Q.", 'start': 730.495, 'duration': 10.385}, {'end': 743.441, 'text': "That's what we've built with our softmax regression.", 'start': 740.88, 'duration': 2.561}, {'end': 
750.443, 'text': 'And we want to have a measure of whether our estimated probability distribution is a good one.', 'start': 743.781, 'duration': 6.662}], 'summary': 'Cross-entropy measures how well the estimated probability distribution Q from softmax regression matches the true distribution P.', 'duration': 27.795, 'max_score': 722.648, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo722648.jpg'}, {'end': 952.36, 'src': 'embed', 'start': 907.967, 'weight': 4, 'content': [{'end': 912.909, 'text': "So if human beings said I'm not sure whether this should be class three or four,", 'start': 907.967, 'duration': 4.942}, {'end': 918.212, 'text': 'you could imagine that we can make training data where we put probability half on both of them.', 'start': 912.909, 'duration': 5.303}, {'end': 921.314, 'text': "um, and that wouldn't be a crazy thing to do.", 'start': 918.212, 'duration': 3.102}, {'end': 926.278, 'text': "And so then you'd, yeah, have a true cross entropy loss using more of a distribution.", 'start': 921.655, 'duration': 4.623}, {'end': 934.244, 'text': "Um, the case where it's much more commonly used in actual practice is.", 'start': 926.298, 'duration': 7.946}, {'end': 939.428, 'text': 'there are many circumstances in which people want to do semi-supervised learning.', 'start': 934.244, 'duration': 5.184}, {'end': 948.016, 'text': "So I guess this is a topic that both my group and Chris Ré's group have worked on quite a lot, where we don't actually have fully labeled data,", 'start': 939.529, 'duration': 8.487}, {'end': 952.36, 'text': "but we've got some means of guessing what the labels of the data are.", 'start': 948.016, 'duration': 4.344}], 'summary': 'Cross-entropy against a full label distribution is used when annotators are uncertain and, more commonly in practice, in semi-supervised learning.', 'duration': 44.393, 'max_score': 907.967, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo907967.jpg'}, {'end': 1029.426, 
'src': 'embed', 'start': 1005.522, 'weight': 3, 'content': [{'end': 1016.49, 'text': "So, we can sort of simplify what we're writing here and we can sort of use matrix notation and just work directly in terms of the matrix W.", 'start': 1005.522, 'duration': 10.968}, {'end': 1027.126, 'text': 'Okay So, for traditional, ML optimization, our parameters are these sets of weights, um, for the different classes.', 'start': 1016.49, 'duration': 10.636}, {'end': 1029.426, 'text': 'So for each of the classes,', 'start': 1027.205, 'duration': 2.221}], 'summary': 'Using matrix notation to work with matrix w in traditional ml optimization.', 'duration': 23.904, 'max_score': 1005.522, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1005522.jpg'}], 'start': 371.975, 'title': 'Classification, softmax regression, and cross-entropy loss', 'summary': 'Covers classification, softmax regression, and cross-entropy loss in machine learning, focusing on classification setup, intuition behind linear classifiers, and the concept of cross-entropy loss. 
it also explains the use of cross-entropy loss in semi-supervised learning and optimization in machine learning.', 'chapters': [{'end': 879.571, 'start': 371.975, 'title': 'Classification and softmax regression', 'summary': 'Covers the concepts of classification, softmax regression, and cross-entropy loss in machine learning, aiming to bring everyone up to speed in week two with a focus on classification setup, intuition behind linear classifiers, and the concept of cross-entropy loss.', 'duration': 507.596, 'highlights': ['The chapter introduces the classification setup with d-dimensional vector inputs, yi labels, and c classes, aiming to bring everyone up to speed in week two.', 'The chapter explains the intuition behind linear classifiers, discussing the concept of a decision boundary in a high-dimensional space and the implementation of softmax and logistic regression classifiers.', 'The concept of cross-entropy loss is introduced, explaining its origins in information theory, the comparison of estimated and true probability distributions, and the calculation of the cross-entropy measure.']}, {'end': 1078.917, 'start': 879.591, 'title': 'Understanding cross-entropy loss and semi-supervised learning', 'summary': "Explains the concept of cross-entropy loss, including scenarios where it's used, such as semi-supervised learning, and delves into the use of matrix notation for optimization in machine learning.", 'duration': 199.326, 'highlights': ['The concept of cross-entropy loss is introduced, which includes scenarios where human uncertainty leads to the use of probability distributions in training data.', 'Semi-supervised learning is discussed, highlighting the need to guess labels for unlabeled data and the use of probability distributions for this purpose.', 'Explanation of the use of matrix notation for optimization in traditional machine learning, particularly in relation to parameter weights and gradient descent updates.']}], 'duration': 706.942, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo371975.jpg', 'highlights': ['The chapter introduces the classification setup with d-dimensional vector inputs, yi labels, and c classes, aiming to bring everyone up to speed in week two.', 'The chapter explains the intuition behind linear classifiers, discussing the concept of a decision boundary in a high-dimensional space and the implementation of softmax and logistic regression classifiers.', 'The concept of cross-entropy loss is introduced, explaining its origins in information theory, the comparison of estimated and true probability distributions, and the calculation of the cross-entropy measure.', 'Explanation of the use of matrix notation for optimization in traditional machine learning, particularly in relation to parameter weights and gradient descent updates.', 'The concept of cross-entropy loss is introduced, which includes scenarios where human uncertainty leads to the use of probability distributions in training data.', 'Semi-supervised learning is discussed, highlighting the need to guess labels for unlabeled data and the use of probability distributions for this purpose.']}, {'end': 1520.152, 'segs': [{'end': 1163.627, 'src': 'embed', 'start': 1141.044, 'weight': 4, 'content': [{'end': 1148.505, 'text': 'In particular, those are all linear classifiers, which are going to classify by drawing a line or, in the higher dimensional space,', 'start': 1141.044, 'duration': 7.461}, {'end': 1150.405, 'text': 'by drawing some kind of plane.', 'start': 1148.505, 'duration': 1.9}, {'end': 1151.765, 'text': 'that separates examples.', 'start': 1150.405, 'duration': 1.36}, {'end': 1157.946, 'text': 'And having a simple classifier like that, um, can be useful in certain circumstances.', 'start': 1152.205, 'duration': 5.741}, {'end': 1162.007, 'text': 'I mean, that gives you what in machine learning is a high bias classifiers.', 'start': 1158.327, 'duration': 
3.68}, {'end': 1163.627, 'text': "There's lots of talk of that in CS229.", 'start': 1162.047, 'duration': 1.58}], 'summary': 'Linear classifiers separate examples with a line or, in higher dimensions, a plane; they are high-bias classifiers, a topic covered in CS229.', 'duration': 22.583, 'max_score': 1141.044, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1141044.jpg'}, {'end': 1235.683, 'src': 'embed', 'start': 1189.685, 'weight': 0, 'content': [{'end': 1198.995, 'text': 'when you have natural signals so those are things like um speech, language, images and things like that you have a ton of data,', 'start': 1189.685, 'duration': 9.31}, {'end': 1210.103, 'text': 'so you could learn a quite sophisticated classifier, um, but representing the classes in terms of the input data is sort of very complex.', 'start': 1198.995, 'duration': 11.108}, {'end': 1213.906, 'text': 'You could never do it by just drawing a line between the two classes.', 'start': 1210.143, 'duration': 3.763}, {'end': 1218.749, 'text': "And so, you'd like to use some more complicated kind of classifier.", 'start': 1214.406, 'duration': 4.343}, {'end': 1225.113, 'text': "And so neural networks, the multi-layer neural networks that we're gonna be starting to get into now.", 'start': 1219.229, 'duration': 5.884}, {'end': 1235.683, 'text': 'precisely what they do is provide you a way to learn very complex you know, almost limitlessly complex, in fact classifiers,', 'start': 1225.113, 'duration': 10.57}], 'summary': 'Neural networks allow learning complex classifiers from natural signals like speech and images.', 'duration': 45.998, 'max_score': 1189.685, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1189685.jpg'}, {'end': 1433.699, 'src': 'embed', 'start': 1399.91, 'weight': 2, 'content': [{'end': 1403.713, 'text': "One's, this sort of um word vector representation learning.", 'start': 1399.91, 'duration': 3.803}, {'end': 
1408.716, 'text': "and then the second one is that we're gonna start looking at deeper, multi-layer neural networks.", 'start': 1403.713, 'duration': 5.003}, {'end': 1424.025, 'text': 'Um sort of hidden over here on the slide is the observation that really you can think of word- vector embedding as just putting your- having a model with one more neural network layer.', 'start': 1409.297, 'duration': 14.728}, {'end': 1433.699, 'text': 'So if you imagine that each word was a one hot vector, um, for the different word types in your model.', 'start': 1424.446, 'duration': 9.253}], 'summary': 'Introduction to word vector representation learning and deeper, multi-layer neural networks.', 'duration': 33.789, 'max_score': 1399.91, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1399910.jpg'}], 'start': 1078.917, 'title': 'Neural networks for nlp', 'summary': 'Discusses neural network classifiers and their advantages over classic linear classifiers, particularly in handling complex datasets and natural language processing. 
it emphasizes the use of word vectors and multi-layer networks to optimize classification performance.', 'chapters': [{'end': 1262.883, 'start': 1078.917, 'title': 'Neural network classifiers', 'summary': 'Discusses the difference between classic linear classifiers and neural network classifiers, pointing out that neural networks provide a way to learn highly complex classifiers, making them more suitable for datasets with natural signals and a ton of data.', 'duration': 183.966, 'highlights': ['Neural networks provide a way to learn highly complex classifiers, making them more suitable for datasets with natural signals and a ton of data.', 'Classic classifiers such as naive Bayes models, basic support vector machines, softmax or logistic regressions are linear classifiers.', 'High bias classifiers, like those mentioned, may not do a very good job at classifying all the points correctly, particularly in datasets that require more powerful classifiers.', 'Deep learning has been empowered by the need for more sophisticated classifiers due to the complexity of representing classes in terms of the input data, especially for natural signals like speech, language, and images.']}, {'end': 1520.152, 'start': 1263.163, 'title': 'Neural net for nlp', 'summary': 'Discusses the utilization of word vectors and deeper multi-layer networks to enhance the classification performance of neural nets for natural language processing, emphasizing the simultaneous optimization of word representations and weights.', 'duration': 256.989, 'highlights': ['Utilization of word vectors to enhance classification performance', 'Simultaneous optimization of word representations and weights', 'Introduction of deeper multi-layer neural networks']}], 'duration': 441.235, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1078917.jpg', 'highlights': ['Neural networks learn complex classifiers for datasets with natural signals and large data', 'Deep 
learning empowered by need for sophisticated classifiers for natural signals', 'Word vectors enhance classification performance', 'Deeper multi-layer neural networks introduced', 'Classic classifiers like naive Bayes, SVM, softmax are linear and may have high bias']}, {'end': 1904.225, 'segs': [{'end': 1833.438, 'src': 'embed', 'start': 1781.88, 'weight': 0, 'content': [{'end': 1789.806, 'text': 'Okay, so really we can just say these artificial neurons are sort of like binary logistic regression units.', 'start': 1781.88, 'duration': 7.926}, {'end': 1798.672, 'text': "Or we can make variants of binary logistic regression units by using some different f function, and we'll come back to that again pretty soon.", 'start': 1790.326, 'duration': 8.346}, {'end': 1804.519, 'text': 'Okay Um, well, so that gives us one neuron.', 'start': 1800.856, 'duration': 3.663}, {'end': 1809.203, 'text': 'So, one neuron is a logistic regression unit for current purposes.', 'start': 1804.579, 'duration': 4.624}, {'end': 1816.208, 'text': "So, crucially, what we're wanting to do with neural networks is, say well, why only run one logistic regression?", 'start': 1809.623, 'duration': 6.585}, {'end': 1817.669, 'text': "Why don't we?", 'start': 1816.609, 'duration': 1.06}, {'end': 1821.673, 'text': 'um run a whole bunch of logistic regressions at the same time?', 'start': 1817.669, 'duration': 4.004}, {'end': 1827.496, 'text': "So you know, here are our inputs and here's our little logistic regression unit.", 'start': 1822.153, 'duration': 5.343}, {'end': 1833.438, 'text': 'um, but we could run three logistic regressions at the same time or we can run any number of them.', 'start': 1827.496, 'duration': 5.942}], 'summary': 'Artificial neurons are similar to logistic regression units. 
neural networks aim to run multiple logistic regressions simultaneously.', 'duration': 51.558, 'max_score': 1781.88, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1781880.jpg'}, {'end': 1882.117, 'src': 'embed', 'start': 1855.833, 'weight': 2, 'content': [{'end': 1871.454, 'text': "And so the secret of sort of then building bigger neural networks is to say we don't actually wanna decide ahead of time what those little orange logistic regressions are trying to capture.", 'start': 1855.833, 'duration': 15.621}, {'end': 1882.117, 'text': 'We want the neural network to self-organize so that those orange logistic regression, um, units learn something useful.', 'start': 1871.874, 'duration': 10.243}], 'summary': 'The secret to building bigger neural networks is to let them self-organize and learn useful information.', 'duration': 26.284, 'max_score': 1855.833, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1855833.jpg'}], 'start': 1520.372, 'title': 'Neural networks origins and structure', 'summary': 'Delves into the origins of neural networks and their modeling as binary logistic regression units, highlighting self-organization and practical applications.', 'chapters': [{'end': 1904.225, 'start': 1520.372, 'title': 'Neural networks origins and structure', 'summary': 'Discusses the origins of neural networks and how they can be modeled as binary logistic regression units, while emphasizing the concept of self-organization within neural networks and their practical applications.', 'duration': 383.853, 'highlights': ['Neural networks can be modeled as binary logistic regression units, with a single set of parameters, z, and the probability of one class being determined by the input to logistic regression.', 'The concept of self-organization within neural networks is crucial, allowing the network to learn something useful without deciding ahead of time what 
the orange logistic regression units are trying to capture.', 'Neural networks are designed to run multiple logistic regressions simultaneously, with the aim of self-organization and learning tasks such as sentiment analysis.']}], 'duration': 383.853, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1520372.jpg', 'highlights': ['Neural networks can be modeled as binary logistic regression units, with a single set of parameters, z, and the probability of one class being determined by the input to logistic regression.', 'Neural networks are designed to run multiple logistic regressions simultaneously, with the aim of self-organization and learning tasks such as sentiment analysis.', 'The concept of self-organization within neural networks is crucial, allowing the network to learn something useful without deciding ahead of time what the orange logistic regression units are trying to capture.']}, {'end': 2347.665, 'segs': [{'end': 1981.85, 'src': 'embed', 'start': 1904.225, 'weight': 2, 'content': [{'end': 1911.869, 'text': "um, and we're gonna have, uh, logistic regression classifier there, telling us positive or negative.", 'start': 1904.225, 'duration': 7.644}, {'end': 1918.574, 'text': "Um, but the inputs to that aren't going to directly be something like words in the document.", 'start': 1911.889, 'duration': 6.685}, {'end': 1923.118, 'text': "They're going to be this intermediate layer of logistic regression units.", 'start': 1918.854, 'duration': 4.264}, {'end': 1928.582, 'text': "And we're gonna train this whole thing to minimize our cross-entropy loss.", 'start': 1923.658, 'duration': 4.924}, {'end': 1938.149, 'text': "And essentially what we're going to want to have happen and the back propagation algorithm will do for us is to say you things in the middle.", 'start': 1929.122, 'duration': 9.027}, {'end': 1949.173, 'text': "it's your job to find some useful way to calculate values from the underlying 
data such that it'll help our final classifier make a good decision.", 'start': 1938.149, 'duration': 11.024}, {'end': 1954.115, 'text': 'And I mean in particular, you know, back to this picture.', 'start': 1949.193, 'duration': 4.922}, {'end': 1956.516, 'text': 'you know the final classifier.', 'start': 1954.115, 'duration': 2.401}, {'end': 1960.897, 'text': "it's just a linear classifier, a softmax or a logistic regression.", 'start': 1956.516, 'duration': 4.381}, {'end': 1962.418, 'text': "it's gonna have a line like this", 'start': 1960.897, 'duration': 1.521}, {'end': 1972.024, 'text': 'But if the intermediate classifiers, they are like a word embedding, they can kind of sort of re-represent the space and shift things around.', 'start': 1962.998, 'duration': 9.026}, {'end': 1981.85, 'text': "So they can learn to shift things around in such a way as you're learning a highly non-linear function of the original input space.", 'start': 1972.364, 'duration': 9.486}], 'summary': 'Logistic regression classifier trained to minimize cross-entropy loss for non-linear function learning.', 'duration': 77.625, 'max_score': 1904.225, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1904225.jpg'}, {'end': 2041.5, 'src': 'embed', 'start': 2013.569, 'weight': 1, 'content': [{'end': 2024.451, 'text': 'which is essentially when people had a model of a single neuron like this and then only gradually worked out how it related to more conventional statistics.', 'start': 2013.569, 'duration': 10.882}, {'end': 2033.215, 'text': 'Then there was um, the second version of neural networks, which are sort of the 80s and early 90s,', 'start': 2024.952, 'duration': 8.263}, {'end': 2041.5, 'text': 'where people built neural networks like this that had this one hidden layer where a representation could be learned in the middle.', 'start': 2033.215, 'duration': 8.285}], 'summary': 'Neural networks evolved from single neurons to hidden 
layer models in the 80s and 90s.', 'duration': 27.931, 'max_score': 2013.569, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2013569.jpg'}, {'end': 2089.166, 'src': 'embed', 'start': 2061.899, 'weight': 0, 'content': [{'end': 2072.931, 'text': "that precisely the motivating question is um we believe we'll be able to do even more sophisticated um classification for more complex tasks,", 'start': 2061.899, 'duration': 11.032}, {'end': 2075.835, 'text': 'things like speech recognition and image recognition.', 'start': 2072.931, 'duration': 2.904}, {'end': 2084.842, 'text': 'If we could have a deeper network which will be able to uh more effectively, learn more sophisticated functions of the input,', 'start': 2076.235, 'duration': 8.607}, {'end': 2089.166, 'text': 'which will allow us to do things like recognize sounds of a language.', 'start': 2084.842, 'duration': 4.324}], 'summary': 'Developing deeper networks to improve classification for more complex tasks like speech and image recognition.', 'duration': 27.267, 'max_score': 2061.899, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2061899.jpg'}, {'end': 2324.829, 'src': 'embed', 'start': 2294.944, 'weight': 5, 'content': [{'end': 2304.592, 'text': 'which is if you want to have a neural network learn anything interesting you have to stick in some function f,', 'start': 2294.944, 'duration': 9.648}, {'end': 2309.997, 'text': 'which is a non-linear function such as um, the logistic curve I showed before.', 'start': 2304.592, 'duration': 5.405}, {'end': 2324.829, 'text': "And the reason for that is, um that if you're sort of doing linear transforms like wx plus b and then w2, z1 plus b, w3,", 'start': 2310.577, 'duration': 14.252}], 'summary': 'To train a neural network, a non-linear function like the logistic curve must be included.', 'duration': 29.885, 'max_score': 2294.944, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2294944.jpg'}], 'start': 1904.225, 'title': 'Neural networks evolution', 'summary': 'Discusses logistic regression units and back propagation for training a classifier, and the evolution of neural networks from single-neuron models to deeper networks in the context of deep learning, aiming for more effective classification in tasks like speech and image recognition.', 'chapters': [{'end': 1981.85, 'start': 1904.225, 'title': 'Logistic regression and back propagation', 'summary': 'Discusses the use of logistic regression units in an intermediate layer to train a classifier, utilizing back propagation to find useful ways to calculate values and shift the input space for learning a highly non-linear function.', 'duration': 77.625, 'highlights': ['The inputs to the logistic regression classifier are an intermediate layer of logistic regression units, not directly words in the document.', 'The back propagation algorithm aims to find useful ways to calculate values from the underlying data to help the final classifier make a good decision.', 'The intermediate classifiers, like word embeddings, can learn to shift things around in such a way as to learn a highly non-linear function of the original input space.']}, {'end': 2347.665, 'start': 1987.859, 'title': 'Evolution of neural networks', 'summary': 'Discusses the evolution of neural networks, from single-neuron models in the 50s to the development of deeper networks in the context of deep learning, aiming for more effective classification in tasks like speech and image recognition.', 'duration': 359.806, 'highlights': ['The development of deeper networks in deep learning aims for more effective classification in tasks like speech and image recognition.', 'The historical progression of neural networks from the 50s to the present, focusing on the transition from single-neuron models to the development of deeper networks.', 'The 
importance of non-linear functions in neural networks to enable the learning of interesting patterns, as linear transforms alone would not capture complex relationships.']}], 'duration': 443.44, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo1904225.jpg', 'highlights': ['The development of deeper networks in deep learning aims for more effective classification in tasks like speech and image recognition.', 'The historical progression of neural networks from the 50s to the present, focusing on the transition from single-neuron models to the development of deeper networks.', 'The intermediate classifiers, like word embeddings, can learn to shift things around in such a way as to learn a highly non-linear function of the original input space.', 'The back propagation algorithm aims to find useful ways to calculate values from the underlying data to help the final classifier make a good decision.', 'The inputs to the logistic regression classifier are an intermediate layer of logistic regression units, not directly words in the document.', 'The importance of non-linear functions in neural networks to enable the learning of interesting patterns, as linear transforms alone would not capture complex relationships.']}, {'end': 2779.289, 'segs': [{'end': 2413.528, 'src': 'embed', 'start': 2377.683, 'weight': 0, 'content': [{'end': 2381.585, 'text': 'uh, non-linearity, thinking about probabilities or something like that.', 'start': 2377.683, 'duration': 3.902}, {'end': 2389.87, 'text': 'Our general picture is, well, we want to be able to do effective function approximation or curve fitting.', 'start': 2382.025, 'duration': 7.845}, {'end': 2392.171, 'text': "We'd like to learn a space like this,", 'start': 2390.17, 'duration': 2.001}, {'end': 2402.359, 'text': "and we can only do that if we're sort of putting in some non-linearities which allow us to learn these kind of curvy decision um patterns.", 'start': 2392.791, 
'duration': 9.568}, {'end': 2413.528, 'text': 'And so, so f is used effectively for doing accurate function approximation or sort of pattern matching as you go along.', 'start': 2403.54, 'duration': 9.988}], 'summary': 'Non-linearities enable accurate function approximation and pattern matching.', 'duration': 35.845, 'max_score': 2377.683, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2377683.jpg'}, {'end': 2482.194, 'src': 'embed', 'start': 2458.447, 'weight': 4, 'content': [{'end': 2467.809, 'text': "people often say well, something that's really important for classification is looking at the pair of feature four and feature seven.", 'start': 2458.447, 'duration': 9.362}, {'end': 2473.611, 'text': 'Um, that, you know, if both of those are true at the same time, something important happens.', 'start': 2467.829, 'duration': 5.782}, {'end': 2477.352, 'text': "And so that's referred to normally in stats as an interaction term.", 'start': 2473.671, 'duration': 3.681}, {'end': 2482.194, 'text': 'And you can by hand, add interaction terms to your model.', 'start': 2477.632, 'duration': 4.562}], 'summary': 'Feature 4 and feature 7 interaction is important for classification.', 'duration': 23.747, 'max_score': 2458.447, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2458447.jpg'}, {'end': 2548.093, 'src': 'embed', 'start': 2515.399, 'weight': 3, 'content': [{'end': 2523.242, 'text': "So here's a brief little interlude on a teeny bit more of NLP, which is sort of a kind of problem we're gonna look at for a moment.", 'start': 2515.399, 'duration': 7.843}, {'end': 2528.624, 'text': 'So this is the task of named entity recognition that I very briefly mentioned last time.', 'start': 2523.522, 'duration': 5.102}, {'end': 2535.48, 'text': 'So, um, if we have some text, Okay.', 'start': 2529.024, 'duration': 6.456}, {'end': 2548.093, 'text': "If we have some text, 
something that in all sorts of places people want to do is they'd like to find the names of things that are mentioned.", 'start': 2537.682, 'duration': 10.411}], 'summary': 'Nlp task: named entity recognition for finding names in text.', 'duration': 32.694, 'max_score': 2515.399, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2515399.jpg'}, {'end': 2599.522, 'src': 'embed', 'start': 2571.416, 'weight': 1, 'content': [{'end': 2575.719, 'text': 'Um People when they do question answering that a lot of the time.', 'start': 2571.416, 'duration': 4.303}, {'end': 2579.602, 'text': 'the answers to questions are what we call named entities.', 'start': 2575.719, 'duration': 3.883}, {'end': 2587.027, 'text': 'the names of people, locations, organizations, pop songs, movie names, all of those kind of things are named entities.', 'start': 2579.602, 'duration': 7.425}, {'end': 2594.095, 'text': 'And if you want to sort of start building up a knowledge base automatically from a lot of text, well,', 'start': 2589.148, 'duration': 4.947}, {'end': 2599.522, 'text': 'what you normally want to do is get out the named entities and get out relations between them.', 'start': 2594.095, 'duration': 5.427}], 'summary': 'Named entities in question answering are key for knowledge base building.', 'duration': 28.106, 'max_score': 2571.416, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2571416.jpg'}, {'end': 2649.756, 'src': 'embed', 'start': 2618.171, 'weight': 2, 'content': [{'end': 2623.553, 'text': "And what we're gonna do is run a classifier and we're going to assign them a class.", 'start': 2618.171, 'duration': 5.382}, {'end': 2628.075, 'text': "So we're gonna say first word is organization, second word is organization.", 'start': 2623.613, 'duration': 4.462}, {'end': 2630.456, 'text': "third word isn't a named entity.", 'start': 2628.075, 'duration': 2.381}, {'end': 
2631.677, 'text': 'fourth word is a person.', 'start': 2630.456, 'duration': 1.221}, {'end': 2634.059, 'text': 'fifth word is a person and continue down.', 'start': 2631.677, 'duration': 2.382}, {'end': 2642.284, 'text': "So, we're running a classification of a word within a position in the text, so it's got surrounding words around it.", 'start': 2634.339, 'duration': 7.945}, {'end': 2649.756, 'text': 'Um, and so, to say what the entities are, many entities are multi-word terms.', 'start': 2643.613, 'duration': 6.143}], 'summary': 'Using a classifier to assign classes to words based on position and surrounding words in the text.', 'duration': 31.585, 'max_score': 2618.171, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2618171.jpg'}], 'start': 2348.005, 'title': 'Non-linearity & named entity recognition', 'summary': 'Emphasizes the significance of non-linearity in deep networks for better function approximation, and it delves into named entity recognition, covering its importance, common approaches, and challenges such as ambiguity and defining entity boundaries.', 'chapters': [{'end': 2482.194, 'start': 2348.005, 'title': 'Power of non-linearity in neural networks', 'summary': 'Explains the importance of non-linearity in deep networks for effective function approximation and pattern matching, demonstrating that adding non-linearities allows for learning curvy decision patterns.', 'duration': 134.189, 'highlights': ['Non-linearity in deep networks provides additional power for effective function approximation and pattern matching, as it allows for learning curvy decision patterns.', 'In conventional statistics, the importance of interaction terms, such as the pair of feature four and feature seven, for classification can be added by hand to the model.', 'Multiple linear transforms in a classifier do not provide extra power, but the addition of any non-linearity results in increased power.']}, {'end': 
2779.289, 'start': 2482.595, 'title': 'Named entity recognition', 'summary': 'Discusses the task of named entity recognition, its importance, common approaches, and challenges, including ambiguity and difficulty in classifying and defining entity boundaries.', 'duration': 296.694, 'highlights': ['Named entity recognition involves classifying names of organizations, people, and places, which is crucial for tasks like question answering and knowledge base building.', 'The common approach for named entity recognition is to run a classifier to assign a class to each word in the text, considering the surrounding words as well.', 'The difficulty in named entity recognition lies in determining entity boundaries, working out the class of an entity, and addressing ambiguous entities, which poses challenges in accurately identifying named entities.']}], 'duration': 431.284, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2348005.jpg', 'highlights': ['Non-linearity in deep networks enables effective function approximation and pattern matching.', 'Named entity recognition is crucial for tasks like question answering and knowledge base building.', 'The common approach for named entity recognition is to run a classifier to assign a class to each word in the text.', 'The difficulty in named entity recognition lies in determining entity boundaries, working out the class of an entity, and addressing ambiguous entities.', 'In conventional statistics, the importance of interaction terms for classification can be added by hand to the model.']}, {'end': 3418.041, 'segs': [{'end': 2808.649, 'src': 'embed', 'start': 2779.63, 'weight': 1, 'content': [{'end': 2784.493, 'text': "So there's sort of a fair bit of understanding variously that's needed to get it right.", 'start': 2779.63, 'duration': 4.863}, {'end': 2799.803, 'text': 'Okay Um, so what are we gonna do with that? 
And so this suggests, um, what we wanna do is build classifiers for language that work inside a context.', 'start': 2786.114, 'duration': 13.689}, {'end': 2806.528, 'text': "Um, so, you know, in general, it's not very interesting classifying a word outside a context.", 'start': 2800.364, 'duration': 6.164}, {'end': 2808.649, 'text': "We don't actually do that much in NLP.", 'start': 2806.568, 'duration': 2.081}], 'summary': 'NLP requires understanding context for effective language classification.', 'duration': 29.019, 'max_score': 2779.63, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2779630.jpg'}, {'end': 2941.263, 'src': 'heatmap', 'start': 2835.133, 'weight': 0.867, 'content': [{'end': 2843.224, 'text': "something can either mean to plant seeds and things, so you're seeding the soil, or it can take seeds out of something like a watermelon right?", 'start': 2835.133, 'duration': 8.091}, {'end': 2845.948, 'text': 'You just need to know the context as to which it is.', 'start': 2843.264, 'duration': 2.684}, {'end': 2856.752, 'text': 'Okay So that suggests the task that we can classify a word in its context of neighboring words, and NER is an example of that.', 'start': 2847.87, 'duration': 8.882}, {'end': 2859.253, 'text': 'And the question is how might we do that?', 'start': 2857.052, 'duration': 2.201}, {'end': 2869.775, 'text': 'And a very simple way to do it might be to say well, we have a bunch of words in a row which each have a word vector, from something like word2vec.', 'start': 2859.613, 'duration': 10.162}, {'end': 2876.257, 'text': 'Um, maybe we could just average those word vectors and then classify the resulting vector.', 'start': 2870.175, 'duration': 6.082}, {'end': 2881.802, 'text': "And the problem is that doesn't work very well because you lose position information.", 'start': 2876.817, 'duration': 4.985}, {'end': 2887.868, 'text': "You don't actually know anymore which of those word 
vectors is the one that you're meant to be classifying.", 'start': 2882.142, 'duration': 5.726}, {'end': 2895.977, 'text': "So a simple way to do better than that is to say well, why don't we make a big vector of a word window?", 'start': 2888.589, 'duration': 7.388}, {'end': 2900.342, 'text': 'So here are words, and they each have a word vector.', 'start': 2896.537, 'duration': 3.805}, {'end': 2907.07, 'text': 'And so to classify the middle word in a context of here plus or minus two words,', 'start': 2900.843, 'duration': 6.227}, {'end': 2915.24, 'text': "we're simply gonna concatenate these five vectors together and say now we have a bigger vector and let's build a classifier", 'start': 2907.07, 'duration': 8.17}, {'end': 2926.197, 'text': "over that vector, so we're classifying this x_window, which is then a vector in R^{5d} if we're using d-dimensional word vectors.", 'start': 2916.309, 'duration': 9.888}, {'end': 2941.263, 'text': 'And we can do that, um, in the kind of way that we did previously, which is, um that we could say okay for that big vector,', 'start': 2927.738, 'duration': 13.525}], 'summary': 'Classifying words in context using word vectors and NER with a window of 5 words', 'duration': 106.13, 'max_score': 2835.133, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2835133.jpg'}, {'end': 2907.07, 'src': 'embed', 'start': 2882.142, 'weight': 0, 'content': [{'end': 2887.868, 'text': "You don't actually know anymore which of those word vectors is the one that you're meant to be classifying.", 'start': 2882.142, 'duration': 5.726}, {'end': 2895.977, 'text': "So a simple way to do better than that is to say well, why don't we make a big vector of a word window?", 'start': 2888.589, 'duration': 7.388}, {'end': 2900.342, 'text': 'So here are words, and they each have a word vector.', 'start': 2896.537, 'duration': 3.805}, {'end': 2907.07, 'text': 'And so to classify the middle word in a context of here 
plus or minus two words.', 'start': 2900.843, 'duration': 6.227}], 'summary': 'Word vectors improve classification accuracy by considering word windows.', 'duration': 24.928, 'max_score': 2882.142, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2882142.jpg'}, {'end': 3078.695, 'src': 'heatmap', 'start': 2969.561, 'weight': 0.756, 'content': [{'end': 2980.971, 'text': 'And indeed, um, for the handout on the website that we suggest you look at, it does do it with the softmax classifier of precisely this kind.', 'start': 2969.561, 'duration': 11.41}, {'end': 2986.536, 'text': 'Um, but for the example I do in class, I try to make it a bit simpler.', 'start': 2981.472, 'duration': 5.064}, {'end': 2991.16, 'text': "Um, and I want to do this I think very quickly because I'm fast running out of time.", 'start': 2987.037, 'duration': 4.123}, {'end': 3002.07, 'text': 'So one of the famous early papers of neural NLP, um, was this paper by Collobert and Weston, which was first an ICML paper in 2008, which actually,', 'start': 2991.561, 'duration': 10.509}, {'end': 3005.033, 'text': 'just a couple of weeks ago, um won the ICML 2018 Test of Time Award.', 'start': 3002.07, 'duration': 2.963}, {'end': 3013.68, 'text': "Um, and then there's a more recent journal version of it, 2011.", 'start': 3005.053, 'duration': 8.627}, {'end': 3024.169, 'text': 'And, um, they use this idea of window classification to assign classes like named entities to- to words in context.', 'start': 3013.68, 'duration': 10.489}, {'end': 3028.013, 'text': 'Um, but they did it in a slightly different way.', 'start': 3024.69, 'duration': 3.323}, {'end': 3033.137, 'text': "So what they said is well, we've got these windows,", 'start': 3028.373, 'duration': 4.764}, {'end': 3041.285, 'text': 'and this is one with a um location named entity in the middle and this is one without a location entity in the middle.', 'start': 3033.137, 'duration': 8.148}, 
{'end': 3052.115, 'text': 'And so what we want to do is have a system that returns a score, and it should return a high score, just as a real number in this case,', 'start': 3041.665, 'duration': 10.45}, {'end': 3060.103, 'text': "and it can- should return a low score if it- if there isn't a location name in the middle of the window in this case.", 'start': 3052.115, 'duration': 7.988}, {'end': 3064.686, 'text': 'And so explicitly the model just returned the score.', 'start': 3060.623, 'duration': 4.063}, {'end': 3073.112, 'text': 'And so if you had the top level of your neural network A and you just then dot product it with a vector u,', 'start': 3065.126, 'duration': 7.986}, {'end': 3078.695, 'text': 'you then kind of with that final dot product you just return a real number.', 'start': 3073.112, 'duration': 5.583}], 'summary': 'Discussion of the ICML 2008 paper by Collobert and Weston, winner of the ICML 2018 Test of Time Award.', 'duration': 109.134, 'max_score': 2969.561, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2969561.jpg'}, {'end': 3024.169, 'src': 'embed', 'start': 2991.561, 'weight': 5, 'content': [{'end': 3002.07, 'text': 'So one of the famous early papers of neural NLP, um, was this paper by Collobert and Weston, which was first an ICML paper in 2008, which actually,', 'start': 2991.561, 'duration': 10.509}, {'end': 3005.033, 'text': 'just a couple of weeks ago, um won the ICML 2018 Test of Time Award.', 'start': 3002.07, 'duration': 2.963}, {'end': 3013.68, 'text': "Um, and then there's a more recent journal version of it, 2011.", 'start': 3005.053, 'duration': 8.627}, {'end': 3024.169, 'text': 'And, um, they use this idea of window classification to assign classes like named entities to- to words in context.', 'start': 3013.68, 'duration': 10.489}], 'summary': "Collobert and Weston's 2008 ICML paper, which used window classification in NLP, won the ICML 2018 Test of Time Award.", 'duration': 32.608, 'max_score': 
2991.561, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2991561.jpg'}, {'end': 3187.752, 'src': 'embed', 'start': 3159.154, 'weight': 3, 'content': [{'end': 3165.236, 'text': "Let's just stick a softmax or logistic classification on top to say yes or no for location.", 'start': 3159.154, 'duration': 6.082}, {'end': 3175.863, 'text': 'But by putting in that extra hidden layer precisely, this extra hidden layer can calculate non-linear interactions between the input word vectors.', 'start': 3165.516, 'duration': 10.347}, {'end': 3187.752, 'text': 'So it can calculate things like if the first word is a word like museum and the second- and the second word is a word like the preposition in or around,', 'start': 3176.223, 'duration': 11.529}], 'summary': 'Adding an extra hidden layer can calculate non-linear interactions between input word vectors for location classification.', 'duration': 28.598, 'max_score': 3159.154, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo3159154.jpg'}, {'end': 3330.599, 'src': 'embed', 'start': 3300.463, 'weight': 4, 'content': [{'end': 3306.387, 'text': "and then that'll lead into sort of discussing, and more generally, the backpropagation algorithm.", 'start': 3300.463, 'duration': 5.924}, {'end': 3307.568, 'text': 'um, for the next one.', 'start': 3306.387, 'duration': 1.181}, {'end': 3316.472, 'text': "Okay So, if we're doing, um, gradients by hand, well, we're doing multivariable calculus, multivariable derivatives.", 'start': 3308.388, 'duration': 8.084}, {'end': 3324.396, 'text': 'But in particular, normally the most useful way to think about this is as doing matrix calculus,', 'start': 3316.912, 'duration': 7.484}, {'end': 3330.599, 'text': "which means we're directly working with vectors and matrices to work out our gradients.", 'start': 3324.396, 'duration': 6.203}], 'summary': 'Discussed backpropagation algorithm and 
matrix calculus for working with gradients.', 'duration': 30.136, 'max_score': 3300.463, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo3300463.jpg'}], 'start': 2779.63, 'title': 'NLP word classification and neural network layers', 'summary': 'Discusses the challenges of contextual word classification in NLP and proposes a method using word vectors and window classification. It also introduces neural network layers for non-linear interactions and backpropagation for parameter updates.', 'chapters': [{'end': 3024.169, 'start': 2779.63, 'title': 'Contextual word classification', 'summary': 'Discusses the need for contextual word classification in NLP, explaining the challenges of classifying words outside a context and proposing a method to classify a word in its context using word vectors and window classification. It also mentions the use of window classification in a famous NLP paper by Collobert and Weston.', 'duration': 244.539, 'highlights': ['The need for contextual word classification in NLP', 'Challenges of classifying words outside a context', 'Method of classifying a word in its context using word vectors and window classification', 'Use of window classification in a famous NLP paper by Collobert and Weston']}, {'end': 3418.041, 'start': 3024.69, 'title': 'Neural network layers and backpropagation', 'summary': 'Discusses a neural network model for classification, emphasizing the use of extra layers to calculate non-linear interactions between input word vectors, and introduces the concept of backpropagation for updating parameters in a neural network.', 'duration': 393.351, 'highlights': ['The model emphasizes the use of extra layers to calculate non-linear interactions between input word vectors.', 'Introduction of backpropagation for updating parameters in a neural network.']}], 'duration': 638.411, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo2779630.jpg', 'highlights': ['Method of classifying a word in its context using word vectors and window classification', 'The need for contextual word classification in NLP', 'Challenges of classifying words outside a context', 'The model emphasizes the use of extra layers to calculate non-linear interactions between input word vectors', 'Introduction of backpropagation for updating parameters in a neural network', 'Use of window classification in a famous NLP paper by Collobert and Weston']}, {'end': 4724.566, 'segs': [{'end': 3473.688, 'src': 'embed', 'start': 3442.493, 'weight': 3, 'content': [{'end': 3454.385, 'text': "um, I sort of hope that even if you never did multivariable calculus or you can't remember any of it, it's sort of for what we have to do here,", 'start': 3442.493, 'duration': 11.892}, {'end': 3457.208, 'text': 'not that hard, and you can do it.', 'start': 3454.385, 'duration': 2.823}, {'end': 3459.07, 'text': "So here's what you do.", 'start': 3457.628, 'duration': 1.442}, {'end': 3467.744, 'text': "Right, so if we have a simple function, f of x equals x cubed, right, it's gradient.", 'start': 3461.96, 'duration': 5.784}, {'end': 3473.688, 'text': "And so the gradient is the slope, right? 
It's saying how steep or shallow is the slope of something.", 'start': 3468.725, 'duration': 4.963}], 'summary': 'Understanding multivariable calculus helps grasp the concept of gradient in simple functions.', 'duration': 31.195, 'max_score': 3442.493, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo3442493.jpg'}, {'end': 3678.321, 'src': 'embed', 'start': 3653.015, 'weight': 2, 'content': [{'end': 3660.418, 'text': 'So, if we have one variable function, so we have, um, z equals 3y and y equals x squared.', 'start': 3653.015, 'duration': 7.403}, {'end': 3670.84, 'text': "If we want to work out, um, the derivative of z with respect to x, we say, aha, that's a composition of two functions.", 'start': 3660.798, 'duration': 10.042}, {'end': 3678.321, 'text': 'So I use the chain rule, and so that means what I do is I multiply, um, the derivative.', 'start': 3671.24, 'duration': 7.081}], 'summary': 'Using chain rule to find derivative of z with respect to x in a composition of two functions', 'duration': 25.306, 'max_score': 3653.015, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo3653015.jpg'}, {'end': 4399.335, 'src': 'embed', 'start': 4360.897, 'weight': 1, 'content': [{'end': 4363.419, 'text': 'Note delta which is different from partial derivative d.', 'start': 4360.897, 'duration': 2.522}, {'end': 4369.144, 'text': 'Um, and so delta is referred to as the error signal in neural network talk.', 'start': 4364.219, 'duration': 4.925}, {'end': 4379.153, 'text': "So it's the- what you're calculating as the partial derivatives above the parameters that you're working out the partial derivatives with respect to.", 'start': 4369.464, 'duration': 9.689}, {'end': 4384.117, 'text': "Um, so a lot of the secret, as we'll see next time.", 'start': 4379.173, 'duration': 4.944}, {'end': 4399.335, 'text': "a lot of the secret of what happens with backpropagation is just 
we want to do efficient computation in the sort of way that computer science people like to do efficient computation.", 'start': 4384.117, 'duration': 15.218}], 'summary': 'Delta represents error signal in neural network talk.', 'duration': 38.438, 'max_score': 4360.897, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo4360897.jpg'}, {'end': 4724.566, 'src': 'embed', 'start': 4706.65, 'weight': 0, 'content': [{'end': 4712.055, 'text': 'And essentially what we wanna do for backpropagation is to say how can we do, uh,', 'start': 4706.65, 'duration': 5.405}, {'end': 4716.979, 'text': 'get a computer to do this automatically for us and to do it efficiently?', 'start': 4712.055, 'duration': 4.924}, {'end': 4722.884, 'text': "And that's what sort of the deep learning frameworks like TensorFlow and PyTorch do, and how you can do", 'start': 4716.999, 'duration': 5.885}, {'end': 4724.566, 'text': "that, we'll look at more next time.", 'start': 4722.884, 'duration': 1.682}], 'summary': 'Automate backpropagation for efficiency using TensorFlow and PyTorch.', 'duration': 17.916, 'max_score': 4706.65, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo4706650.jpg'}], 'start': 3419.255, 'title': 'Neural network calculus', 'summary': 'Covers gradients in single and multivariable calculus, partial derivatives, chain rule in neural network modeling, Jacobians, activation functions, error signals, and shape convention for representing Jacobians. 
It explains the computation of partial derivatives with respect to weight and bias and their relevance in deep learning frameworks.', 'chapters': [{'end': 3713.661, 'start': 3419.255, 'title': 'Multidimensional calculus and neural networks', 'summary': 'The chapter introduces the concept of gradients in single and multivariable calculus, including the calculation of partial derivatives and the use of the chain rule in neural network modeling.', 'duration': 294.406, 'highlights': ['The chapter explains the concept of gradients in single and multivariable calculus, emphasizing the calculation of partial derivatives and their role in determining slope and direction in multi-dimensional spaces.', 'The chapter delves into the application of gradients in neural network modeling, particularly in the context of a function with n inputs and m outputs, leading to the creation of a Jacobian matrix of partial derivatives.', 'The chapter covers the use of the chain rule in composing functions and calculating derivatives, exemplifying the process with a composition of two functions and the application of the chain rule to find the derivative of z with respect to x.']}, {'end': 4008.466, 'start': 3714.222, 'title': 'Neural network Jacobians', 'summary': 'The chapter explains how to calculate Jacobians in a neural network, including the special case of activation functions and the main cases for partial derivatives of Wx+b with respect to x and b.', 'duration': 294.244, 'highlights': ['The chapter explains the process of calculating Jacobians in a neural network, emphasizing the multiplication of Jacobians to obtain the right answer.', 'It details the special case of the Jacobian for activation functions, demonstrating how it forms a diagonal matrix with the derivatives of the activation function on the diagonal and zeros elsewhere.', 'The main cases for partial derivatives of Wx+b with respect to x and b are discussed: the Jacobian with respect to x is W, and the Jacobian with respect to b is the identity matrix.', 'The partial derivatives of 
vector dot product of u and h with respect to u result in h transpose, as mentioned in the lecture notes.']}, {'end': 4406.456, 'start': 4008.846, 'title': 'Neural network partial derivatives', 'summary': 'The chapter explains the process of computing partial derivatives for a neural network, emphasizing the use of the chain rule and Jacobians. It outlines how to compute partial derivatives with respect to the parameters of the model and highlights the relevance of error signals in neural network computations.', 'duration': 397.61, 'highlights': ['The process of computing partial derivatives for a neural network involves breaking down the equations into simple pieces, applying the chain rule, and utilizing Jacobians for efficient computation.', 'The relevance of error signals, referred to as delta, in neural network computations and their correlation to the partial derivatives above the parameters of the neural network.', 'The explanation of the process of computing partial derivatives with respect to the parameters of the model, emphasizing the similarities and differences between the partial derivatives with respect to different parameters.']}, {'end': 4724.566, 'start': 4406.456, 'title': 'Compute partial derivatives and shape convention', 'summary': 'The chapter discusses the computation of partial derivatives with respect to weight and bias, emphasizing the shape convention for representing Jacobians and the use of the chain rule to automate backpropagation in deep learning frameworks like TensorFlow and PyTorch.', 'duration': 318.11, 'highlights': ['The end result of the partial of S with respect to W is a function with n times m inputs and one output, necessitating a Jacobian matching the shape of the input for ease of weight updates.', 'The process involves utilizing the chain rule to automatically compute derivatives in terms of vector and matrix derivatives for efficient backpropagation in deep learning frameworks like TensorFlow and PyTorch.', 'The computation requires 
understanding the shape of the partial derivative of S with respect to W and the need to match the shape of the inputs for the Jacobian to facilitate weight updates in stochastic gradient descent.']}], 'duration': 1305.311, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8CWyBNX6eDo/pics/8CWyBNX6eDo3419255.jpg', 'highlights': ['The process involves utilizing the chain rule to automatically compute derivatives in terms of vector and matrix derivatives for efficient backpropagation in deep learning frameworks like TensorFlow and PyTorch.', 'The relevance of error signals, referred to as delta, in neural network computations and their correlation to the partial derivatives above the parameters of the neural network.', 'The chapter covers the use of the chain rule in composing functions and calculating derivatives, exemplifying the process with a composition of two functions and the application of the chain rule to find the derivative of z with respect to x.', 'The chapter explains the concept of gradients in single and multivariable calculus, emphasizing the calculation of partial derivatives and their role in determining slope and direction in multi-dimensional spaces.']}], 'highlights': ['The lecture series covers fundamental topics in neural networks and machine learning, including backpropagation, classification, softmax regression, cross-entropy loss, and their applications in natural language processing.', 'The chapter introduces the classification setup with d-dimensional vector inputs, yi labels, and c classes, aiming to bring everyone up to speed in week two.', 'Neural networks learn complex classifiers for datasets with natural signals and large data.', 'The process involves utilizing the chain rule to automatically compute derivatives in terms of vector and matrix derivatives for efficient backpropagation in deep learning frameworks like TensorFlow and PyTorch.', 'Non-linearity in deep networks enables effective function 
approximation and pattern matching.']}