title
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers
description
For more information about Stanford's Artificial Intelligence professional and graduate programs visit: https://stanford.io/2ZB72nu
Lecture 2: Word Vectors, Word Senses, and Neural Network Classifiers
1. Course organization (2 mins)
2. Finish looking at word vectors and word2vec (13 mins)
3. Can we capture the essence of word meaning more effectively by counting? (8 mins)
4. The GloVe model of word vectors (8 mins)
5. Evaluating word vectors (14 mins)
6. Word senses (8 mins)
7. Review of classification and how neural nets differ (8 mins)
8. Introducing neural networks (14 mins)
To learn more about this course visit: https://online.stanford.edu/courses/cs224n-natural-language-processing-deep-learning
To follow along with the course schedule and syllabus visit: http://web.stanford.edu/class/cs224n/
Professor Christopher Manning
Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science
Director, Stanford Artificial Intelligence Laboratory (SAIL)
detail
{'title': 'Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers', 'heatmap': [{'end': 1131.922, 'start': 1079.426, 'weight': 0.71}, {'end': 2668.614, 'start': 2523.49, 'weight': 0.789}], 'summary': 'Covers topics on word vectors, neural network classifiers, learning word vectors using gradient descent, word2vec model, negative sampling, challenges in count word vectors, glove model, evaluating word vectors, comparing word embedding models, word similarity models, and word sense disambiguation, aiming to enable students to understand word embeddings papers and their applications in nlp, with an emphasis on effective learning methods and model evaluations.', 'chapters': [{'end': 290.672, 'segs': [{'end': 49.566, 'src': 'embed', 'start': 5.325, 'weight': 0, 'content': [{'end': 7.066, 'text': 'OK, so what are we going to do for today?', 'start': 5.325, 'duration': 1.741}, {'end': 16.728, 'text': 'So the main content for today is to go through more stuff about word vectors,', 'start': 7.086, 'duration': 9.642}, {'end': 22.47, 'text': 'including touching on word senses and then introducing the notion of neural network classifiers.', 'start': 16.728, 'duration': 5.742}, {'end': 26.831, 'text': "So our biggest goal is that by the end of today's class,", 'start': 23.15, 'duration': 3.681}, {'end': 39.417, 'text': "you should feel like you could confidently look at one of the word embeddings papers such as the Google Word2Vec paper or the GloVe paper or Sanjeev Arora's paper that we'll come to later and feel like yeah,", 'start': 26.831, 'duration': 12.586}, {'end': 40.538, 'text': 'I can understand this.', 'start': 39.417, 'duration': 1.121}, {'end': 43.16, 'text': "I know what they're doing, and it makes sense.", 'start': 40.979, 'duration': 2.181}, {'end': 45.022, 'text': "So let's go back to where we were.", 'start': 43.401, 'duration': 1.621}, {'end': 49.566, 'text': 'So this was sort of introducing this model of Word2Vec.', 'start': 
45.042, 'duration': 4.524}], 'summary': "Today's goal: understand word vectors, neural network classifiers, and feel confident with word2vec, glove, and arora's papers.", 'duration': 44.241, 'max_score': 5.325, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew5325.jpg'}, {'end': 187.789, 'src': 'embed', 'start': 115.918, 'weight': 2, 'content': [{'end': 123.3, 'text': 'this allows us to learn word vectors that capture well word similarity and meaningful directions in a word space.', 'start': 115.918, 'duration': 7.382}, {'end': 132.584, 'text': 'So more precisely, for this model, the only parameters of this model are the word vectors.', 'start': 125.014, 'duration': 7.57}, {'end': 136.89, 'text': 'So we have outside word vectors and center word vectors for each word.', 'start': 132.664, 'duration': 4.226}, {'end': 141.696, 'text': "And then we're taking their dot product to get a probability.", 'start': 137.31, 'duration': 4.386}, {'end': 149.203, 'text': 'taking a dot product to get a score of how likely a particular outside word is to occur with the center word.', 'start': 143.018, 'duration': 6.185}, {'end': 154.828, 'text': "And then we're using the softmax transformation to convert those scores into probabilities,", 'start': 149.503, 'duration': 5.325}, {'end': 159.251, 'text': 'as I discussed last time and I kind of come back to at the end this time.', 'start': 154.828, 'duration': 4.423}, {'end': 161.653, 'text': 'A couple of things to note.', 'start': 159.271, 'duration': 2.382}, {'end': 167.598, 'text': 'This model is what we call in NLP a bag of words model.', 'start': 162.514, 'duration': 5.084}, {'end': 168.679, 'text': 'So, bag of words.', 'start': 167.678, 'duration': 1.001}, {'end': 173.963, 'text': "models are models that don't actually pay any attention to word order or position.", 'start': 168.679, 'duration': 5.284}, {'end': 178.765, 'text': "it doesn't matter if you're next to the center 
word or a bit further away on the left or right.", 'start': 173.963, 'duration': 4.802}, {'end': 187.789, 'text': 'the probability estimate would be the same and that seems like a very crude model of language that will offend any linguist.', 'start': 178.765, 'duration': 9.024}], 'summary': 'Word2vec model uses word vectors to capture word similarity and probabilities, ignoring word order or position.', 'duration': 71.871, 'max_score': 115.918, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew115918.jpg'}], 'start': 5.325, 'title': 'Word vectors and language learning', 'summary': "Covers word vectors, word senses, and neural network classifiers, aiming for students to confidently understand word embeddings papers like google word2vec, glove, and sanjeev arora's paper. it also discusses how word vectors are learned to capture word similarity and meaningful directions in a word space, with a bag of words model using only word vectors as parameters, and a dot product to determine the likelihood of outside words occurring with the center word, ultimately transformed into probabilities through softmax. 
additionally, the chapter delves into the word2vec model, a crude yet effective model of language that learns about word probabilities, achieves high probabilities for contextually occurring words, and places similar meaning words close together in a high dimensional vector space, enabling language learning.", 'chapters': [{'end': 115.918, 'start': 5.325, 'title': 'Word vectors and neural network classifiers', 'summary': "Covers word vectors, word senses, and neural network classifiers, aiming for students to confidently understand word embeddings papers like google word2vec, glove, and sanjeev arora's paper.", 'duration': 110.593, 'highlights': ["The goal is for students to confidently understand word embeddings papers such as Google Word2Vec, GloVe, and Sanjeev Arora's paper.", 'The chapter covers word vectors, word senses, and neural network classifiers.', 'Explanation of the Word2Vec model, which iterates through a corpus of text to predict surrounding words for each position using a probability distribution defined by the dot product between word vectors.']}, {'end': 168.679, 'start': 115.918, 'title': 'Word vectors and bag of words model', 'summary': 'Discusses how word vectors are learned to capture word similarity and meaningful directions in a word space, with a bag of words model using only word vectors as parameters, and a dot product to determine the likelihood of outside words occurring with the center word, ultimately transformed into probabilities through softmax.', 'duration': 52.761, 'highlights': ['The model uses word vectors as the only parameters, with outside and center word vectors for each word, utilizing a dot product to determine the likelihood of outside words occurring with the center word.', 'The softmax transformation is employed to convert the dot product scores into probabilities; the model is what NLP calls a bag of words model, ignoring word order and position.']}, {'end': 290.672, 'start': 168.679, 'title': 'Word2vec model and language learning', 'summary': 
'Discusses the word2vec model, a crude yet effective model of language that learns about word probabilities, achieves high probabilities for contextually occurring words, and places similar meaning words close together in a high dimensional vector space, enabling language learning.', 'duration': 121.993, 'highlights': ['The word2vec model is a crude yet effective model of language that can learn quite a lot about word probabilities, even though it may offend linguists (e.g., probability estimates regardless of word position).', 'The model aims to give reasonably high probabilities to words that occur in the context of the center word, achieving probabilities in the range of 0.01, and places similar meaning words close together in a high dimensional vector space during the learning phase.', 'The high dimensional vector space created by the word2vec model groups words with similar meanings close together, such as days of the week, cell phone makers, and fields like mathematics and economics.']}], 'duration': 285.347, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew5325.jpg', 'highlights': ['The chapter covers word vectors, word senses, and neural network classifiers.', "The goal is for students to confidently understand word embeddings papers such as Google Word2Vec, GloVe, and Sanjeev Arora's paper.", 'The model uses word vectors as the only parameters, with outside and center word vectors for each word, utilizing a dot product to determine the likelihood of outside words occurring with the center word.', 'The softmax transformation is employed to convert the dot product scores into probabilities; the model is what NLP calls a bag of words model, ignoring word order and position.', 'The word2vec model is a crude yet effective model of language that can learn quite a lot about word probabilities, even though it may offend linguists (e.g., probability estimates regardless of word position).']}, {'end': 915.961, 'segs': [{'end': 388.675, 'src': 
'embed', 'start': 362.197, 'weight': 1, 'content': [{'end': 368.682, 'text': "So what we're going to do is we start off with random word vectors.", 'start': 362.197, 'duration': 6.485}, {'end': 372.744, 'text': 'We initialize them to small numbers near 0 in each dimension.', 'start': 368.742, 'duration': 4.002}, {'end': 374.486, 'text': "We've defined our", 'start': 373.085, 'duration': 1.401}, {'end': 378.288, 'text': 'loss function J, which we looked at last time.', 'start': 375.166, 'duration': 3.122}, {'end': 388.675, 'text': "And then we're going to use a gradient descent algorithm, which is an iterative algorithm that learns to minimize J of theta by changing theta.", 'start': 378.689, 'duration': 9.986}], 'summary': 'Using gradient descent to minimize j of theta for word vectors.', 'duration': 26.478, 'max_score': 362.197, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew362197.jpg'}, {'end': 442.926, 'src': 'embed', 'start': 415.157, 'weight': 3, 'content': [{'end': 422.547, 'text': 'And so one of the parameters of neural nets that you can fiddle in your software package is what is the step size?', 'start': 415.157, 'duration': 7.39}, {'end': 429.536, 'text': 'So if you take a really, really itsy-bitsy step, it might take you a long time to minimize the function.', 'start': 422.847, 'duration': 6.689}, {'end': 432.759, 'text': 'wasted computation.', 'start': 431.438, 'duration': 1.321}, {'end': 442.926, 'text': 'On the other hand, if your step size is much too big, well, then you can actually diverge and start going to worse places.', 'start': 432.779, 'duration': 10.147}], 'summary': 'Adjusting the step size in neural nets affects computation time and convergence.', 'duration': 27.769, 'max_score': 415.157, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew415157.jpg'}, {'end': 574.575, 'src': 'embed', 'start': 541.929, 'weight': 2, 'content': 
[{'end': 544.211, 'text': 'because we have to iterate over our entire corpus.', 'start': 541.929, 'duration': 2.282}, {'end': 548.694, 'text': "So you'd wait a very long time before you made a single gradient update.", 'start': 544.551, 'duration': 4.143}, {'end': 551.317, 'text': 'And so optimization would be extremely slow.', 'start': 549.015, 'duration': 2.302}, {'end': 558.923, 'text': "basically 100% of the time in neural network land, we don't use gradient descent.", 'start': 553.038, 'duration': 5.885}, {'end': 562.045, 'text': "We instead use what's called stochastic gradient descent.", 'start': 559.223, 'duration': 2.822}, {'end': 566.929, 'text': 'And stochastic gradient descent is a very simple modification of this.', 'start': 562.505, 'duration': 4.424}, {'end': 574.575, 'text': 'So, rather than working out an estimate of the gradient based on the entire corpus,', 'start': 567.329, 'duration': 7.246}], 'summary': 'In neural network land, stochastic gradient descent is used 100% of the time to avoid slow optimization.', 'duration': 32.646, 'max_score': 541.929, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew541929.jpg'}, {'end': 703.573, 'src': 'embed', 'start': 647.009, 'weight': 0, 'content': [{'end': 657.951, 'text': 'I present this as the gradient descent is a sort of performance hack that lets you learn much more quickly.', 'start': 647.009, 'duration': 10.942}, {'end': 661.072, 'text': "It turns out it's not only a performance hack.", 'start': 657.971, 'duration': 3.101}, {'end': 665.233, 'text': 'Neural nets have some quite counterintuitive properties.', 'start': 661.372, 'duration': 3.861}, {'end': 673.737, 'text': 'And actually the fact that stochastic gradient descent is kind of noisy and bounces around as it does its thing.', 'start': 665.774, 'duration': 7.963}, {'end': 684.082, 'text': 'it actually means that in complex networks it learns better solutions than if you were to run plain 
gradient descent very slowly.', 'start': 673.737, 'duration': 10.345}, {'end': 688.064, 'text': 'So you can both compute much more quickly and do a better job.', 'start': 684.482, 'duration': 3.582}, {'end': 694.567, 'text': 'OK, one final note on running stochastic gradients with word vectors.', 'start': 690.464, 'duration': 4.103}, {'end': 696.128, 'text': 'This is kind of an aside.', 'start': 694.607, 'duration': 1.521}, {'end': 703.573, 'text': "But something to note is that if we're doing a stochastic gradient update based on one window,", 'start': 697.049, 'duration': 6.524}], 'summary': 'Stochastic gradient descent enables faster learning and better solutions for complex neural networks.', 'duration': 56.564, 'max_score': 647.009, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew647009.jpg'}], 'start': 292.119, 'title': 'Learning word vectors and stochastic gradient descent', 'summary': 'Covers learning word vectors using gradient descent, initializing word vectors, defining loss function, and using stochastic gradient descent, enabling faster learning in neural networks with sparse gradient updates, resulting in several orders of magnitude faster learning.', 'chapters': [{'end': 514.827, 'start': 292.119, 'title': 'Learning word vectors with gradient descent', 'summary': 'Explains the process of learning good word vectors using gradient descent, initializing word vectors, defining the loss function, and using a gradient descent algorithm to gradually move towards the minimum by making small steps in the direction of the negative gradient.', 'duration': 222.708, 'highlights': ['The process of learning good word vectors involves initializing them to small numbers near 0 in each dimension and using a gradient descent algorithm to minimize J of theta by changing theta.', 'Using a gradient descent algorithm involves calculating the gradient of J of theta from the current values of theta and making a small step 
in the direction of the negative gradient to gradually move down towards the minimum.', 'The step size in the gradient descent algorithm is an important parameter that affects the time taken to minimize the function and the possibility of divergence.']}, {'end': 646.989, 'start': 515.267, 'title': 'Stochastic gradient descent in neural networks', 'summary': 'Explains the inefficiency of using gradient descent in neural networks due to the large corpus size, and instead advocates for stochastic gradient descent, which allows for faster learning by making updates to the parameters based on a small batch of center words, resulting in several orders of magnitude faster learning.', 'duration': 131.722, 'highlights': ['Stochastic gradient descent is used in neural networks to make updates to the parameters based on a small batch of center words, leading to several orders of magnitude faster learning.', 'The inefficiency of using gradient descent in neural networks is due to the need to iterate over the entire corpus, resulting in extremely slow optimization.', 'The estimate of the gradient in stochastic gradient descent is noisy and bad due to only looking at a small fraction of the corpus rather than the whole corpus.']}, {'end': 915.961, 'start': 647.009, 'title': 'Neural nets and stochastic gradient descent', 'summary': 'Discusses the counterintuitive properties of neural nets and the efficiency of stochastic gradient descent, which enables faster learning and better solutions, as well as the sparse gradient update in stochastic gradients with word vectors.', 'duration': 268.952, 'highlights': ['Stochastic gradient descent learns better solutions than plain gradient descent in complex networks, enabling quicker computation and improved performance.', 'Stochastic gradient updates with word vectors result in very sparse gradient update information for most of the vocabulary, requiring efficient parameter updates for a few words.', 'Word vectors are represented as row 
vectors in common deep learning packages, such as PyTorch, for efficient memory access.']}], 'duration': 623.842, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew292119.jpg', 'highlights': ['Stochastic gradient descent enables several orders of magnitude faster learning in neural networks', 'Initializing word vectors involves setting them to small numbers near 0 in each dimension', 'Using stochastic gradient descent results in very sparse gradient update information for most of the vocabulary', 'The step size in the gradient descent algorithm significantly affects the time taken to minimize the function', 'Stochastic gradient descent learns better solutions than plain gradient descent in complex networks']}, {'end': 1546.369, 'segs': [{'end': 998.76, 'src': 'embed', 'start': 969.725, 'weight': 0, 'content': [{'end': 973.186, 'text': 'The skip gram one is more natural in various ways.', 'start': 969.725, 'duration': 3.461}, {'end': 978.409, 'text': "So it's sort of normally the one that people have gravitated to in subsequent work.", 'start': 973.246, 'duration': 5.163}, {'end': 988.234, 'text': "But then as to how you train this model, what I've presented so far is the naive softmax equation,", 'start': 979.53, 'duration': 8.704}, {'end': 993.297, 'text': 'which is a simple but relatively expensive training method.', 'start': 988.234, 'duration': 5.063}, {'end': 998.76, 'text': "And so that isn't really what they suggest using in your paper, in the paper.", 'start': 993.817, 'duration': 4.943}], 'summary': 'Skip gram model is preferred due to its naturalness, but the presented naive softmax equation for training is not recommended in the paper.', 'duration': 29.035, 'max_score': 969.725, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew969725.jpg'}, {'end': 1079.426, 'src': 'embed', 'start': 1042.885, 'weight': 1, 'content': [{'end': 1052.05, 'text': 'So 
if you have a 100,000 word vocabulary, you have to do 100,000 dot products to work out the denominator.', 'start': 1042.885, 'duration': 9.165}, {'end': 1054.172, 'text': 'And that seems a little bit of a shame.', 'start': 1052.19, 'duration': 1.982}, {'end': 1062.616, 'text': 'And so, instead of that, the idea of negative sampling is where, instead of using this softmax,', 'start': 1054.792, 'duration': 7.824}, {'end': 1079.426, 'text': "we're going to train binary logistic regression models for both the true pair of center word and the context word versus noise pairs,", 'start': 1062.616, 'duration': 16.81}], 'summary': 'Proposes using negative sampling to train binary logistic regression models for word pairs.', 'duration': 36.541, 'max_score': 1042.885, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew1042885.jpg'}, {'end': 1131.922, 'src': 'heatmap', 'start': 1079.426, 'weight': 0.71, 'content': [{'end': 1085.069, 'text': 'where we keep the true center word and we just randomly sample words from the vocabulary.', 'start': 1079.426, 'duration': 5.643}, {'end': 1090.667, 'text': 'So as presented in the paper, the idea is like this.', 'start': 1086.724, 'duration': 3.943}, {'end': 1100.773, 'text': 'So overall, what we want to optimize is still an average of the loss for each particular center word.', 'start': 1090.747, 'duration': 10.026}, {'end': 1108.298, 'text': "But for when we're working out the loss for each particular center word, we're going to work out, sorry,", 'start': 1101.174, 'duration': 7.124}, {'end': 1111.259, 'text': 'the loss for each particular center word and each particular window.', 'start': 1108.298, 'duration': 2.961}, {'end': 1119.44, 'text': "we're going to take the dot product as before of the center word and the outside word.", 'start': 1111.259, 'duration': 8.181}, {'end': 1121.741, 'text': "And that's the main quantity.", 'start': 1119.86, 'duration': 1.881}, 
{'end': 1130.562, 'text': "But now, instead of using that inside the softmax, we're going to put it through the logistic function, which is sometimes also, or often also,", 'start': 1122.101, 'duration': 8.461}, {'end': 1131.922, 'text': 'called the sigmoid function.', 'start': 1130.562, 'duration': 1.36}], 'summary': 'Proposed method involves sampling words, optimizing loss with dot product, and using logistic function instead of softmax.', 'duration': 52.496, 'max_score': 1079.426, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew1079426.jpg'}, {'end': 1325.863, 'src': 'embed', 'start': 1299.617, 'weight': 2, 'content': [{'end': 1303.682, 'text': 'What they do is they start with what we call the unigram distribution of words.', 'start': 1299.617, 'duration': 4.065}, {'end': 1309.208, 'text': 'So that is how often words actually occur in our big corpus.', 'start': 1304.082, 'duration': 5.126}, {'end': 1317.897, 'text': "So if you have a billion word corpus and a particular word occurred 90 times in it, you're taking 90 divided by a billion.", 'start': 1309.588, 'duration': 8.309}, {'end': 1320.759, 'text': "And so that's the unigram probability of the word.", 'start': 1318.197, 'duration': 2.562}, {'end': 1325.863, 'text': 'But what they then do is that they take that to the 3 quarters power.', 'start': 1321.22, 'duration': 4.643}], 'summary': 'Analyzing word occurrence in a billion word corpus, applying unigram distribution to the 3 quarters power.', 'duration': 26.246, 'max_score': 1299.617, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew1299617.jpg'}], 'start': 916.221, 'title': 'Word2vec and negative sampling', 'summary': 'Covers word2vec model, skip gram and continuous bag of words variants, and emphasizes the use of negative sampling for training, which is more efficient than naive softmax equation. 
it also discusses the inefficiency of naive softmax, introduces negative sampling, and explains the optimization process for training the word2vec model.', 'chapters': [{'end': 1021.292, 'start': 916.221, 'title': 'Word2vec model and training methods', 'summary': 'Covers the introduction of word2vec model, including skip gram and continuous bag of words variants, and emphasizes the use of negative sampling for training, which is more efficient than the naive softmax equation.', 'duration': 105.071, 'highlights': ['The word2vec model includes two basic variants: skip gram and continuous bag of words model, both giving similar results.', 'The skip gram model, which is position independent for outside words, is more commonly used due to its natural approach.', 'The suggested training method for the word2vec model is negative sampling, which is more efficient than the naive softmax equation.']}, {'end': 1546.369, 'start': 1021.673, 'title': 'Word2vec and negative sampling', 'summary': 'Discusses the inefficiency of naive softmax due to the expensive computation of the denominator and introduces the concept of negative sampling, which involves training binary logistic regression models for true and noise pairs, optimizing the loss function using the negated dot product through the sigmoid, and sampling words based on the unigram distribution to train the word2vec model.', 'duration': 524.696, 'highlights': ['Negative sampling is introduced to address the inefficiency of the naive softmax, which requires expensive computation of the denominator, involving iterating over every word in the vocabulary, resulting in a large number of dot products to be calculated.', 'The concept of negative sampling involves training binary logistic regression models for both the true pair of center word and the context word versus noise pairs, aiming to optimize the loss function using the negated dot product through the sigmoid function, which maps any real number to a probability between 
0 and 1, and sampling words based on the unigram distribution, which is then raised to the 3/4th power to dampen the difference between common and rare words.', 'The unigram distribution of words is utilized to sample words for training the Word2Vec model, where the probability of sampling a word depends on its occurrence in the corpus, and the 3/4th power transformation dampens the difference between common and rare words, resulting in a more balanced sampling approach.']}], 'duration': 630.148, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew916221.jpg', 'highlights': ['The skip gram model is more commonly used due to its natural approach', 'Negative sampling is more efficient than the naive softmax equation', 'The unigram distribution of words is utilized to sample words for training the Word2Vec model']}, {'end': 2363.463, 'segs': [{'end': 1612.281, 'src': 'embed', 'start': 1588.014, 'weight': 1, 'content': [{'end': 1594.095, 'text': "Because in the denominator, you're also working out the dot product with every other word in the vocabulary.", 'start': 1588.014, 'duration': 6.081}, {'end': 1600.577, 'text': 'So, as well as wanting the dot product with the actual word that you see in the context to be big,', 'start': 1594.395, 'duration': 6.182}, {'end': 1612.281, 'text': "you maximize your likelihood by making the dot products of other words that weren't in the context smaller, because that's shrinking your denominator.", 'start': 1600.577, 'duration': 11.704}], 'summary': 'Maximize dot product with actual word and shrink dot products of other words to maximize likelihood.', 'duration': 24.267, 'max_score': 1588.014, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew1588014.jpg'}, {'end': 1736.767, 'src': 'embed', 'start': 1709.31, 'weight': 0, 'content': [{'end': 1719.837, 'text': "But if you want a slightly better, more stable sense of OK, we'd like 
to in general have other words have low probability.", 'start': 1709.31, 'duration': 10.527}, {'end': 1728.303, 'text': "it seems like you might be able to get better, more stable results if you instead say let's have 10 or 15 sample negative words.", 'start': 1719.837, 'duration': 8.466}, {'end': 1730.645, 'text': "And indeed, that's been found to be true.", 'start': 1728.404, 'duration': 2.241}, {'end': 1736.767, 'text': "And for the negative words, well, it's easy to sample any number of random words you want.", 'start': 1732.066, 'duration': 4.701}], 'summary': 'Sampling 10 or 15 negative words can give better, more stable results.', 'duration': 27.457, 'max_score': 1709.31, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew1709310.jpg'}, {'end': 1820.06, 'src': 'embed', 'start': 1786.527, 'weight': 2, 'content': [{'end': 1788.909, 'text': 'And just one more note on that.', 'start': 1786.527, 'duration': 2.382}, {'end': 1794.013, 'text': 'I mean, there are actually two ways that people have commonly made these co-occurrence matrices.', 'start': 1788.929, 'duration': 5.084}, {'end': 1801.699, 'text': "One corresponds to what we've seen already, that you use a window around a word, which is similar to word2vec.", 'start': 1794.673, 'duration': 7.026}, {'end': 1810.216, 'text': "And that allows you to capture some locality and some of the sort of syntactic and semantic proximity that's more fine grained.", 'start': 1802.7, 'duration': 7.516}, {'end': 1820.06, 'text': 'The other way these matrices are often made is that normally documents have some structure,', 'start': 1810.677, 'duration': 9.383}], 'summary': 'Two common ways to make co-occurrence matrices: a window around each word, as in word2vec, which captures locality and fine-grained syntactic and semantic proximity; or larger document structures such as paragraphs or web pages.', 'duration': 33.533, 'max_score': 1786.527, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew1786527.jpg'}, {'end': 1949.601, 'src': 'embed', 'start': 1868.257, 'weight': 3, 'content': [{'end': 1878.582, 'text': 'when then we have a half a million dimensional vector for each word, which is much, much bigger than the word vectors that we typically use?', 'start': 1868.257, 'duration': 10.325}, {'end': 1890.433, 'text': 'And it also means that because we have these very high dimensional vectors that we have a lot of sparsity and a lot of randomness.', 'start': 1880.023, 'duration': 10.41}, {'end': 1897.839, 'text': 'So the results that you get tend to be noisier and less robust depending on what particular stuff was in the corpus.', 'start': 1890.553, 'duration': 7.286}, {'end': 1906.343, 'text': 'And so in general, people have found that you can get much better results by working with low dimensional vectors.', 'start': 1898.78, 'duration': 7.563}, {'end': 1916.387, 'text': 'So, then, the idea is, we can store most of the important information about the distribution of words in the context of other words,', 'start': 1906.463, 'duration': 9.924}, {'end': 1920.369, 'text': 'in a fixed small number of dimensions, giving a dense vector.', 'start': 1916.387, 'duration': 3.982}, {'end': 1927.037, 'text': 'And in practice, the dimensionality of the vectors that are used is normally somewhere between 25 and 1,000.', 'start': 1921.109, 'duration': 5.928}, {'end': 1937.172, 'text': 'And so at that point, we need to use some way to reduce the dimensionality of our count co-occurrence vectors.', 'start': 1927.037, 'duration': 10.135}, {'end': 1949.601, 'text': 'So if you have a good memory from a linear algebra class, you hopefully saw singular value decomposition.', 'start': 1939.813, 'duration': 9.788}], 'summary': 'High-dimensional word vectors lead to noisier results. 
using low-dimensional vectors (25-1000) gives better results.', 'duration': 81.344, 'max_score': 1868.257, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew1868257.jpg'}, {'end': 2336.495, 'src': 'embed', 'start': 2306.318, 'weight': 6, 'content': [{'end': 2315.844, 'text': 'So on the one hand, the linear algebra methods actually seemed like they had advantages for fast training and efficient usage of statistics.', 'start': 2306.318, 'duration': 9.526}, {'end': 2325.773, 'text': "Although there had been work on capturing word similarities with them, by and large the results weren't as good,", 'start': 2317.946, 'duration': 7.827}, {'end': 2329.577, 'text': 'perhaps because of disproportionate importance given to large counts in the main.', 'start': 2325.773, 'duration': 3.804}, {'end': 2336.495, 'text': 'Conversely the models, the neural models.', 'start': 2330.058, 'duration': 6.437}], 'summary': "Linear algebra methods show advantages for fast training and efficient usage of statistics, but word similarity results weren't as good due to disproportionate importance given to large counts.", 'duration': 30.177, 'max_score': 2306.318, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2306318.jpg'}], 'start': 1546.85, 'title': 'Word embedding and challenges in count word vectors', 'summary': 'Delves into the concept of negative sampling in word embedding, emphasizing the importance of maximizing dot product with context words and minimizing it with non-context words, and also discusses the challenges of high-dimensional count word vectors, the application of singular value decomposition to reduce dimensionality, and the comparison of linear algebra-based methods with neural updating algorithms.', 'chapters': [{'end': 1838.736, 'start': 1546.85, 'title': 'Word embedding with negative sampling', 'summary': 'Discusses the concept of negative sampling in word 
embedding, emphasizing the need to maximize the dot product with words in the context and minimize it with words not in the context, and the advantages of sampling multiple negative words for more stable results.', 'duration': 291.886, 'highlights': ['The need to maximize the dot product with words in the context and minimize it with words not in the context is crucial for minimizing the loss in word embedding.', 'Sampling multiple negative words, such as 10 or 15, has been found to provide better and more stable results in word embedding.', 'There are two common ways to create co-occurrence matrices: using a window around a word or using larger structures like paragraphs or web pages.']}, {'end': 2363.463, 'start': 1840.888, 'title': 'Challenges in count word vectors', 'summary': 'The chapter discusses the challenges of using high-dimensional count word vectors, the need for low-dimensional vectors, and the application of singular value decomposition to reduce dimensionality, with a focus on improving word vector representations and the comparison of linear algebra-based methods with neural updating algorithms.', 'duration': 522.575, 'highlights': ['The need for low-dimensional vectors', 'Application of singular value decomposition (SVD) to reduce dimensionality', 'Challenges of using high-dimensional count word vectors', 'Comparison of linear algebra-based methods with neural updating algorithms']}], 'duration': 816.613, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew1546850.jpg', 'highlights': ['Sampling 10 or 15 negative words provides better results in word embedding.', 'Maximizing dot product with context words and minimizing with non-context words is crucial.', 'Two common ways to create co-occurrence matrices: using a window or larger structures.', 'Application of singular value decomposition (SVD) to reduce dimensionality is necessary.', 'The need for low-dimensional vectors in word embedding.', 'Challenges of 
using high-dimensional count word vectors in word embedding.', 'Comparison of linear algebra-based methods with neural updating algorithms.']}, {'end': 2674.937, 'segs': [{'end': 2428.25, 'src': 'embed', 'start': 2398.495, 'weight': 1, 'content': [{'end': 2404.359, 'text': 'the property that you want is for meaning components.', 'start': 2398.495, 'duration': 5.864}, {'end': 2417.317, 'text': 'So a meaning component is something like going from male to female, queen to king, or going from verb to its agent,', 'start': 2404.38, 'duration': 12.937}, {'end': 2425.467, 'text': 'truck to driver that those meaning components should be represented as ratios of co-occurrence probabilities.', 'start': 2417.317, 'duration': 8.15}, {'end': 2428.25, 'text': "So here's an example that shows that.", 'start': 2426.088, 'duration': 2.162}], 'summary': 'Property sought: meaning components represented as ratios of co-occurrence probabilities.', 'duration': 29.755, 'max_score': 2398.495, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2398495.jpg'}, {'end': 2668.614, 'src': 'heatmap', 'start': 2480.888, 'weight': 2, 'content': [{'end': 2486.071, 'text': 'So, to get out the meaning component we want of going from gas to solid,', 'start': 2480.888, 'duration': 5.183}, {'end': 2491.534, 'text': "what's actually really useful is to look at the ratio of these co-occurrence probabilities.", 'start': 2486.071, 'duration': 5.463}, {'end': 2501.52, 'text': 'Because then we get a spectrum from large to small, between solid and gas, whereas for water, in a random word,', 'start': 2492.615, 'duration': 8.905}, {'end': 2505.922, 'text': 'it basically cancels out and gives you 1..', 'start': 2501.52, 'duration': 4.402}, {'end': 2512.885, 'text': 'I just wrote these numbers in, but if you count them up in a large corpus, it is basically what you get.', 'start': 2505.922, 'duration': 6.963}, {'end': 2523.49, 'text': 'So here are actual 
co-occurrence probabilities, and that for water and my random word which was fashion here, these are approximately 1,', 'start': 2512.986, 'duration': 10.504}, {'end': 2532.836, 'text': 'whereas for the ratio of probability of co-occurrence of solid with ice or steam is about 10.', 'start': 2523.49, 'duration': 9.346}, {'end': 2535.718, 'text': "And for gas, it's about a tenth.", 'start': 2532.836, 'duration': 2.882}, {'end': 2546.285, 'text': 'So how can we capture these ratios of co-occurrence probabilities as linear meaning components,', 'start': 2536.518, 'duration': 9.767}, {'end': 2551.829, 'text': 'so that in our word vector space we can just add and subtract linear meaning components?', 'start': 2546.285, 'duration': 5.544}, {'end': 2568.174, 'text': 'Well, it seems like the way we can achieve that is if we build a log bilinear model so that the dot product between two word vectors attempts to approximate the log of the probability of co-occurrence.', 'start': 2552.549, 'duration': 15.625}, {'end': 2577.279, 'text': 'So if you do that, you then get this property that the difference between two vectors,', 'start': 2568.774, 'duration': 8.505}, {'end': 2586.212, 'text': 'its similarity to another word corresponds to the log of the probability ratio shown on the previous slide.', 'start': 2578.541, 'duration': 7.671}, {'end': 2608.253, 'text': 'So the GloVe model wanted to try and unify the thinking between the co-occurrence matrix models and the neural models by being in some way similar to a neural model but actually calculated on top of a co-occurrence matrix count.', 'start': 2587.033, 'duration': 21.22}, {'end': 2612.134, 'text': 'So we had an explicit loss function.', 'start': 2609.353, 'duration': 2.781}, {'end': 2621.138, 'text': 'And our explicit loss function is that we wanted the dot product to be similar to the log of the co-occurrence.', 'start': 2612.154, 'duration': 8.984}, {'end': 2625.86, 'text': "We actually added in some bias terms here, 
but I'll ignore those for the moment.", 'start': 2621.978, 'duration': 3.882}, {'end': 2629.821, 'text': 'And we wanted to not have very common words dominate.', 'start': 2626.3, 'duration': 3.521}, {'end': 2637.284, 'text': "And so we capped the effect of high word counts using this f function that's shown here.", 'start': 2630.202, 'duration': 7.082}, {'end': 2645.809, 'text': 'optimize this g function directly on the co-occurrence count matrix.', 'start': 2639.005, 'duration': 6.804}, {'end': 2649.191, 'text': 'So that gave us fast training scalable to huge corpora.', 'start': 2645.849, 'duration': 3.342}, {'end': 2654.407, 'text': 'And so this algorithm worked very well.', 'start': 2651.265, 'duration': 3.142}, {'end': 2664.032, 'text': 'So if you run this algorithm and ask what are the nearest words to frog, you get frogs, toad, and then you get some complicated words.', 'start': 2655.187, 'duration': 8.845}, {'end': 2668.614, 'text': 'But it turns out they are all frogs until you get down to lizards.', 'start': 2664.132, 'duration': 4.482}], 'summary': 'Glove model uses co-occurrence probabilities to unify co-occurrence matrix and neural models, achieving fast and scalable training.', 'duration': 54.83, 'max_score': 2480.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2480888.jpg'}, {'end': 2621.138, 'src': 'embed', 'start': 2587.033, 'weight': 0, 'content': [{'end': 2608.253, 'text': 'So the GloVe model wanted to try and unify the thinking between the co-occurrence matrix models and the neural models by being in some way similar to a neural model but actually calculated on top of a co-occurrence matrix count.', 'start': 2587.033, 'duration': 21.22}, {'end': 2612.134, 'text': 'So we had an explicit loss function.', 'start': 2609.353, 'duration': 2.781}, {'end': 2621.138, 'text': 'And our explicit loss function is that we wanted the dot product to be similar to the log of the co-occurrence.', 
'start': 2612.154, 'duration': 8.984}], 'summary': 'GloVe model unifies co-occurrence matrix and neural models by explicit loss function.', 'duration': 34.105, 'max_score': 2587.033, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2587033.jpg'}, {'end': 2683.782, 'src': 'embed', 'start': 2655.187, 'weight': 5, 'content': [{'end': 2664.032, 'text': 'So if you run this algorithm and ask what are the nearest words to frog, you get frogs, toad, and then you get some complicated words.', 'start': 2655.187, 'duration': 8.845}, {'end': 2668.614, 'text': 'But it turns out they are all frogs until you get down to lizards.', 'start': 2664.132, 'duration': 4.482}, {'end': 2670.855, 'text': 'So Litoria is that lovely tree frog there.', 'start': 2668.674, 'duration': 2.181}, {'end': 2674.937, 'text': 'And so this actually seemed to work out pretty well.', 'start': 2672.256, 'duration': 2.681}, {'end': 2683.782, 'text': 'How well did it work out? To discuss that a bit more, I now want to say something about how do we evaluate word vectors.', 'start': 2675.998, 'duration': 7.784}], 'summary': "Nearest words to 'frog' include 'frogs' and 'toad', followed by rarer terms that turn out to be kinds of frogs; overall, the algorithm worked out pretty well.", 'duration': 28.595, 'max_score': 2655.187, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2655187.jpg'}], 'start': 2363.463, 'title': 'Word embeddings and GloVe model', 'summary': "Explores the use of analogies in word embeddings, emphasizing the representation of meaning components and their ratios of co-occurrence probabilities. 
it also discusses the glove model's use of log bilinear model to capture co-occurrence probabilities and its success in providing fast training scalable to huge corpora.", 'chapters': [{'end': 2535.718, 'start': 2363.463, 'title': 'Analogies in word embeddings', 'summary': 'Explores the properties required for vector subtractions and additions to work for analogies, emphasizing the representation of meaning components as ratios of co-occurrence probabilities and providing examples of how these ratios help capture the spectrum from gas to solid in physics.', 'duration': 172.255, 'highlights': ['The representation of meaning components as ratios of co-occurrence probabilities is crucial for capturing analogies', 'The use of ratios of co-occurrence probabilities helps in capturing the spectrum from large to small between solid and gas', 'Examples of actual co-occurrence probabilities are provided to support the concept']}, {'end': 2674.937, 'start': 2536.518, 'title': 'Glove model and linear meaning components', 'summary': "Discusses the glove model's use of log bilinear model to capture co-occurrence probabilities, unifying co-occurrence matrix models and neural models, and its success in providing fast training scalable to huge corpora.", 'duration': 138.419, 'highlights': ['The GloVe model unifies co-occurrence matrix models and neural models by using a log bilinear model to capture co-occurrence probabilities, providing a fast and scalable training algorithm.', "The algorithm provides accurate nearest word predictions, such as 'frog' returning 'frogs', 'toad', and 'lizards', showcasing its effectiveness in capturing semantic relationships.", "The dot product between two word vectors in the GloVe model attempts to approximate the log of the probability of co-occurrence, ensuring that the difference between two vectors' similarity to another word corresponds to the log of the probability ratio."]}], 'duration': 311.474, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2363463.jpg', 'highlights': ['The GloVe model unifies co-occurrence matrix models and neural models by using a log bilinear model to capture co-occurrence probabilities, providing a fast and scalable training algorithm.', 'The representation of meaning components as ratios of co-occurrence probabilities is crucial for capturing analogies', 'The use of ratios of co-occurrence probabilities helps in capturing the spectrum from large to small between solid and gas', "The dot product between two word vectors in the GloVe model attempts to approximate the log of the probability of co-occurrence, ensuring that the difference between two vectors' similarity to another word corresponds to the log of the probability ratio.", 'Examples of actual co-occurrence probabilities are provided to support the concept', "The algorithm provides accurate nearest word predictions, such as 'frog' returning 'frogs', 'toad', and 'lizards', showcasing its effectiveness in capturing semantic relationships."]}, {'end': 3148.598, 'segs': [{'end': 2735, 'src': 'embed', 'start': 2702.77, 'weight': 0, 'content': [{'end': 2710.972, 'text': "you're just looking at one center word at a time and generating a few negative samples.", 'start': 2702.77, 'duration': 8.202}, {'end': 2716.954, 'text': 'And so it sort of seems like doing something precise there.', 'start': 2711.452, 'duration': 5.502}, {'end': 2727.517, 'text': "Whereas if you're doing optimization algorithm on the whole matrix at once, well, you actually know everything about the matrix at once.", 'start': 2717.274, 'duration': 10.243}, {'end': 2735, 'text': "You're not just looking at what other words occurred in this one context of the center word.", 'start': 2727.537, 'duration': 7.463}], 'summary': 'The transcript discusses word centering and optimization algorithms.', 'duration': 32.23, 'max_score': 2702.77, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2702770.jpg'}, {'end': 2809.13, 'src': 'embed', 'start': 2787.264, 'weight': 1, 'content': [{'end': 2797.628, 'text': 'So how can we really evaluate word vectors? So in general for NLP evaluation, people talk about two ways of evaluation, intrinsic and extrinsic.', 'start': 2787.264, 'duration': 10.364}, {'end': 2809.13, 'text': "So an intrinsic evaluation means that you evaluate directly on the specific or intermediate subtasks that you've been working on.", 'start': 2797.968, 'duration': 11.162}], 'summary': 'Evaluate word vectors using intrinsic and extrinsic methods.', 'duration': 21.866, 'max_score': 2787.264, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2787264.jpg'}, {'end': 2944.704, 'src': 'embed', 'start': 2914.068, 'weight': 2, 'content': [{'end': 2924.271, 'text': 'So for intrinsic evaluation of word vectors, one way which we mentioned last time was this word vector analogy.', 'start': 2914.068, 'duration': 10.203}, {'end': 2929.973, 'text': 'So we could simply give our models a big collection of word vector analogy problems.', 'start': 2924.371, 'duration': 5.602}, {'end': 2933.354, 'text': 'So we could say, man is to woman as king is to what,', 'start': 2930.033, 'duration': 3.321}, {'end': 2944.704, 'text': 'and ask the model to find the word that is closest using that sort of word analogy computation and hope that what comes out there is queen.', 'start': 2933.934, 'duration': 10.77}], 'summary': 'Intrinsic evaluation of word vectors using word vector analogy to find word relationships.', 'duration': 30.636, 'max_score': 2914.068, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2914068.jpg'}, {'end': 3119.625, 'src': 'embed', 'start': 3089.334, 'weight': 3, 'content': [{'end': 3090.655, 'text': 'So here are vectors', 'start': 3089.334, 
'duration': 1.321}, {'end': 3095.056, 'text': 'for positive, comparative, and superlative forms of adjectives.', 'start': 3091.115, 'duration': 3.941}, {'end': 3099.978, 'text': 'And you can see those also move in roughly linear components.', 'start': 3095.397, 'duration': 4.581}, {'end': 3109.481, 'text': 'So the word2vec people built a data set of analogies so you could evaluate different models on the accuracy of their analogies.', 'start': 3100.778, 'duration': 8.703}, {'end': 3113.843, 'text': "And so here's how you can do this.", 'start': 3110.742, 'duration': 3.101}, {'end': 3115.183, 'text': 'And this gives some numbers.', 'start': 3113.943, 'duration': 1.24}, {'end': 3117.944, 'text': 'So there are semantic and syntactic analogies.', 'start': 3115.563, 'duration': 2.381}, {'end': 3119.625, 'text': "I'll just look at the totals.", 'start': 3118.304, 'duration': 1.321}], 'summary': 'Word2vec evaluates models on accuracy of analogies with semantic and syntactic categories.', 'duration': 30.291, 'max_score': 3089.334, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3089334.jpg'}], 'start': 2675.998, 'title': 'Evaluating word vectors', 'summary': 'Evaluates word vectors, highlighting inefficiencies of skip-gram and discussing intrinsic and extrinsic evaluations, word vector analogy computation, and the use of semantic and syntactic analogies.', 'chapters': [{'end': 2765.872, 'start': 2675.998, 'title': 'Word vector evaluation', 'summary': 'Discusses the evaluation of word vectors, highlighting the inefficiency of skip-gram due to its use of statistics and the potential benefits of optimizing the whole matrix at once for more efficient and less noisy loss minimization.', 'duration': 89.874, 'highlights': ['Optimizing the whole matrix at once allows for more efficient and less noisy work to minimize loss, contrasting the inefficient use of statistics in skip-gram.', 'In skip-gram, only one center word at a 
time is considered, leading to an imprecise approach, while optimizing the whole matrix provides a comprehensive understanding of all words at once.']}, {'end': 3148.598, 'start': 2765.912, 'title': 'Evaluating word vectors in nlp', 'summary': 'Discusses the evaluation of word vectors in natural language processing, focusing on intrinsic and extrinsic evaluations, word vector analogy computation, and the use of semantic and syntactic analogies to assess the accuracy of word vectors.', 'duration': 382.686, 'highlights': ['Intrinsic and extrinsic evaluations are discussed for evaluating word vectors, with intrinsic evaluations being fast to compute and helping to understand the component being worked on, while extrinsic evaluations focus on improving performance on real tasks of interest to human beings.', 'Word vector analogy computation is used as an intrinsic evaluation method, where models are given word vector analogy problems to solve, and the accuracy score of the model in solving these problems is measured.', 'Semantic and syntactic analogies are used to evaluate word vectors, with the example of semantic analogies such as company CEO and syntactic analogies like positive, comparative, and superlative forms of adjectives.', 'The accuracy of different models on semantic and syntactic analogies is evaluated, with unscaled co-occurrence counts performing poorly but showing improvement when scaled.']}], 'duration': 472.6, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew2675998.jpg', 'highlights': ['Optimizing the whole matrix at once minimizes loss and noise, contrasting inefficient skip-gram statistics.', 'Intrinsic evaluations are fast and aid in understanding components, while extrinsic evaluations focus on real tasks.', 'Word vector analogy computation measures model accuracy in solving analogy problems.', 'Semantic and syntactic analogies are used to evaluate word vectors, showing unscaled 
co-occurrence counts performing poorly.']}, {'end': 3413.683, 'segs': [{'end': 3178.675, 'src': 'embed', 'start': 3148.898, 'weight': 2, 'content': [{'end': 3152.401, 'text': "And now we're getting up to 60.1, which actually isn't a bad score.", 'start': 3148.898, 'duration': 3.503}, {'end': 3156.425, 'text': 'So you can actually do a decent job without a neural network.', 'start': 3152.681, 'duration': 3.744}, {'end': 3162.99, 'text': 'And then here are the two variants of the word2vec model.', 'start': 3156.785, 'duration': 6.205}, {'end': 3165.991, 'text': 'And here are our results from the GloVe model.', 'start': 3163.33, 'duration': 2.661}, {'end': 3167.371, 'text': 'And, of course, at the time, in 2014,', 'start': 3166.331, 'duration': 1.04}, {'end': 3178.675, 'text': 'we took this as absolute proof that our model was better and our more efficient use of statistics was really working in our favor.', 'start': 3167.371, 'duration': 11.304}], 'summary': 'A score of 60.1 is reachable without a neural network; GloVe results were initially read as evidence of more efficient use of statistics.', 'duration': 29.777, 'max_score': 3148.898, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3148898.jpg'}, {'end': 3224.651, 'src': 'embed', 'start': 3195.901, 'weight': 1, 'content': [{'end': 3208.366, 'text': 'So this looks at the semantic, syntactic, and overall performance on word analogies of GloVe models that were trained on different subsets of data.', 'start': 3195.901, 'duration': 12.465}, {'end': 3213.387, 'text': 'So in particular, the two on the left are trained on Wikipedia.', 'start': 3208.826, 'duration': 4.561}, {'end': 3221.93, 'text': 'And you can see that training on Wikipedia makes you do really well on semantic analogies, which maybe makes sense,', 'start': 3214.548, 'duration': 7.382}, {'end': 3224.651, 'text': 'because Wikipedia just tells you a lot of semantic facts.', 'start': 3221.93, 'duration': 2.721}], 'summary': 'GloVe models trained on Wikipedia 
excel in semantic analogies.', 'duration': 28.75, 'max_score': 3195.901, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3195901.jpg'}, {'end': 3277.67, 'src': 'embed', 'start': 3251.928, 'weight': 0, 'content': [{'end': 3261.536, 'text': "you can see that for the semantics it's just not as good as even and one quarter of the size amount of Wikipedia data.", 'start': 3251.928, 'duration': 9.608}, {'end': 3264.999, 'text': 'So if you get a lot of data, you can compensate for that.', 'start': 3261.556, 'duration': 3.443}, {'end': 3269.703, 'text': 'So here on the right end, did you then have common crawl web data?', 'start': 3265.079, 'duration': 4.624}, {'end': 3277.67, 'text': "And so once there's a lot of web data, so now 42 billion words you're then starting to get good scores again from the semantic side.", 'start': 3269.743, 'duration': 7.927}], 'summary': 'With 42 billion words of web data, semantic scores improved significantly.', 'duration': 25.742, 'max_score': 3251.928, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3251928.jpg'}, {'end': 3360.692, 'src': 'embed', 'start': 3306.525, 'weight': 3, 'content': [{'end': 3311.489, 'text': "today is the suite's too long load and working reasonably well.", 'start': 3306.525, 'duration': 4.964}, {'end': 3316.133, 'text': "But you still get significant gains for 200, and it's somewhat to 300.", 'start': 3311.749, 'duration': 4.384}, {'end': 3324.72, 'text': 'So at least back around 2013 to 15, everyone sort of gravitated to the fact that 300 dimensional vectors is the sweet spot.', 'start': 3316.133, 'duration': 8.587}, {'end': 3333.287, 'text': 'So almost frequently, if you look through the best known sets of word vectors that include the word2vec vectors and the glove vectors,', 'start': 3325.36, 'duration': 7.927}, {'end': 3337.251, 'text': 'that usually what you get is 300 dimensional word 
vectors.', 'start': 3333.687, 'duration': 3.564}, {'end': 3342.956, 'text': "That's not the only intrinsic evaluation you can do.", 'start': 3339.773, 'duration': 3.183}, {'end': 3351.985, 'text': 'Another intrinsic evaluation you can do is see how these models model human judgments of word similarity.', 'start': 3343.497, 'duration': 8.488}, {'end': 3360.692, 'text': 'So psychologists for several decades have actually taken human judgments of word similarity,', 'start': 3352.005, 'duration': 8.687}], 'summary': 'Word vectors of 300 dimensions are considered optimal, yielding significant gains for 200 and somewhat to 300, as observed from best-known word vector sets like word2vec and glove.', 'duration': 54.167, 'max_score': 3306.525, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3306525.jpg'}], 'start': 3148.898, 'title': 'Comparing and evaluating word embedding models', 'summary': 'Compares word2vec and glove models for nlp, highlighting the impact of training on different datasets, achieving up to 60.1 score without using a neural network, and emphasizing the significance of 300 dimensional word vectors and correlation with human judgments of word similarity.', 'chapters': [{'end': 3195.521, 'start': 3148.898, 'title': 'Comparing word embedding models', 'summary': 'Discusses the comparison of word2vec and glove models for nlp, with a realization that the improved performance was primarily due to better data rather than model architecture, scoring up to 60.1 without using a neural network.', 'duration': 46.623, 'highlights': ['The improved model performance, scoring up to 60.1, was attributed to better data rather than the model architecture.', 'The realization that the more efficient use of statistics was not the main contributor to the better model performance.', 'The comparison of two variants of the word2vec model and the results from the GloVE model.']}, {'end': 3413.683, 'start': 3195.901, 'title': 
'Word embedding model evaluation', 'summary': 'Discusses the performance of glove models trained on different datasets, highlighting the impact of training on wikipedia, newswire data, and web data on semantic analogies and vector dimensions, emphasizing the significance of 300 dimensional word vectors and correlation with human judgments of word similarity.', 'duration': 217.782, 'highlights': ['GloVe models trained on Wikipedia perform well on semantic analogies, with 300 dimensional word vectors being the sweet spot.', 'Training exclusively on Google News (Newswire data) results in poorer semantic performance compared to using a smaller amount of Wikipedia data.', 'Using common crawl web data significantly improves semantic performance, with 42 billion words leading to good scores in semantic analogies.', 'The chapter emphasizes the significance of 300 dimensional word vectors and their prevalence in the best-known sets of word vectors.', "Correlation with human judgments of word similarity is measured to evaluate the models, providing insights into the models' ability to capture word similarity judgments."]}], 'duration': 264.785, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3148898.jpg', 'highlights': ['Using common crawl web data significantly improves semantic performance, with 42 billion words leading to good scores in semantic analogies.', 'GloVe models trained on Wikipedia perform well on semantic analogies, with 300 dimensional word vectors being the sweet spot.', 'The improved model performance, scoring up to 60.1, was attributed to better data rather than the model architecture.', 'The chapter emphasizes the significance of 300 dimensional word vectors and their prevalence in the best-known sets of word vectors.', "Correlation with human judgments of word similarity is measured to evaluate the models, providing insights into the models' ability to capture word similarity judgments."]}, {'end': 
4108.596, 'segs': [{'end': 3444.558, 'src': 'embed', 'start': 3414.183, 'weight': 0, 'content': [{'end': 3416.544, 'text': 'And so then we can get data for that.', 'start': 3414.183, 'duration': 2.361}, {'end': 3420.826, 'text': 'And so there are various different data sets of word similarities.', 'start': 3416.964, 'duration': 3.862}, {'end': 3425.789, 'text': 'And we can score different models as to how well they do on similarities.', 'start': 3421.186, 'duration': 4.603}, {'end': 3436.114, 'text': 'You see here that plain SVDs works comparatively better here for similarities than it did for analogies.', 'start': 3427.609, 'duration': 8.505}, {'end': 3441.616, 'text': "It's not great, but it's now not completely terrible because we no longer need that linear property.", 'start': 3436.134, 'duration': 5.482}, {'end': 3444.558, 'text': 'But again, scaled SVDs work a lot better.', 'start': 3441.897, 'duration': 2.661}], 'summary': 'Various data sets of word similarities scored, svds work better than analogies, scaled svds work best.', 'duration': 30.375, 'max_score': 3414.183, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3414183.jpg'}, {'end': 3510.658, 'src': 'embed', 'start': 3478.026, 'weight': 1, 'content': [{'end': 3480.008, 'text': 'So one slide before that.', 'start': 3478.026, 'duration': 1.982}, {'end': 3493.706, 'text': 'So the property that we want is that we want the dot product to represent the log probability of co-occurrence.', 'start': 3480.369, 'duration': 13.337}, {'end': 3499.871, 'text': 'And that then gives me my tricky log bilinear.', 'start': 3494.587, 'duration': 5.284}, {'end': 3508.516, 'text': "So the bi is that there's sort of the wi and the wj, so that there are sort of two linear things.", 'start': 3500.151, 'duration': 8.365}, {'end': 3510.658, 'text': "And it's linear in each one of them.", 'start': 3508.917, 'duration': 1.741}], 'summary': 'Property: dot product represents 
log probability of co-occurrence for log bilinear', 'duration': 32.632, 'max_score': 3478.026, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3478026.jpg'}, {'end': 3596.829, 'src': 'embed', 'start': 3569.857, 'weight': 2, 'content': [{'end': 3577.741, 'text': "But the other bit that's in here is a lot of the time when you're building models,", 'start': 3569.857, 'duration': 7.884}, {'end': 3582.163, 'text': 'rather than simply having sort of an Ax model,', 'start': 3577.741, 'duration': 4.422}, {'end': 3590.707, 'text': 'it seems useful to have a bias term which can move things up and down for the word in general.', 'start': 3582.163, 'duration': 8.544}, {'end': 3596.829, 'text': "And so we added into the model bias terms so that there's a bias term for both words.", 'start': 3591.087, 'duration': 5.742}], 'summary': 'Bias terms are added for both words so each word can shift the prediction up or down.', 'duration': 26.972, 'max_score': 3569.857, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3569857.jpg'}, {'end': 3675.604, 'src': 'embed', 'start': 3638.25, 'weight': 4, 'content': [{'end': 3648.982, 'text': 'because you want to pay more attention to words that are more common or word pairs that are more common.', 'start': 3638.25, 'duration': 10.732}, {'end': 3660.827, 'text': "Because if you think about it in word2vec terms, you're seeing if things have a co-occurrence count of 50 versus 3,", 'start': 3649.002, 'duration': 11.825}, {'end': 3669.519, 'text': 'you want to do a better job at modeling the co-occurrence of the things that occurred together 50 times.', 'start': 3660.827, 'duration': 8.692}, {'end': 3675.604, 'text': 'And so you want to consider in the count of co-occurrence.', 'start': 3671.38, 'duration': 4.224}], 'summary': 'Improve modeling by focusing on word pairs with higher co-occurrence counts.', 'duration': 37.354, 'max_score': 3638.25, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3638250.jpg'}, {'end': 3818.481, 'src': 'embed', 'start': 3786.489, 'weight': 3, 'content': [{'end': 3796.177, 'text': 'But if you add into it word vectors, a better representation of the meaning of words, you can have the numbers go up quite a bit.', 'start': 3786.489, 'duration': 9.688}, {'end': 3803.305, 'text': 'And then you can compare different models to see how much gain they give you in terms of this extrinsic task.', 'start': 3796.518, 'duration': 6.787}, {'end': 3812.735, 'text': 'So skipping ahead, this was a question that I was asked after class, which was word senses.', 'start': 3804.707, 'duration': 8.028}, {'end': 3818.481, 'text': "Because so far, we've had just one word.", 'start': 3812.856, 'duration': 5.625}], 'summary': 'Using word vectors improves word meaning representation, enabling comparison of models for gain in extrinsic tasks.', 'duration': 31.992, 'max_score': 3786.489, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3786489.jpg'}, {'end': 4000.403, 'src': 'embed', 'start': 3976.621, 'weight': 5, 'content': [{'end': 3983.149, 'text': 'So maybe what we should do is have different word vectors for the different meanings of pike.', 'start': 3976.621, 'duration': 6.528}, {'end': 3984.43, 'text': "So we'd have one", 'start': 3983.229, 'duration': 1.201}, {'end': 3994.283, 'text': 'word vector for the medieval pointy weapon, another word vector for the kind of fish, another word vector for the kind of road.', 'start': 3985.191, 'duration': 9.092}, {'end': 3996.867, 'text': "So they'd then be word sense vectors.", 'start': 3994.363, 'duration': 2.504}, {'end': 4000.403, 'text': 'And you can do that.', 'start': 3999.442, 'duration': 0.961}], 'summary': "Proposing different word vectors for different meanings of 'pike' to create word sense vectors.", 'duration': 23.782, 'max_score': 
3976.621, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3976621.jpg'}], 'start': 3414.183, 'title': 'Word similarity models and model embedding', 'summary': "Evaluates word similarity models, highlighting svds' superior performance for similarities, and discusses model embedding, emphasizing the need to minimize the difference between word pairs and the use of word vectors in extrinsic tasks, with significant performance improvements.", 'chapters': [{'end': 3533.602, 'start': 3414.183, 'title': 'Word similarity models and objective functions', 'summary': "Discusses evaluating word similarity models, with svds showing better performance for similarities than analogies, and the glove model's objective function based on log bilinear representation of co-occurrence probabilities.", 'duration': 119.419, 'highlights': ['The plain SVDs work comparatively better for similarities than for analogies, with scaled SVDs and Word2vec showing even better performance.', "The objective function for the GloVe model aims to have the dot product represent the log probability of co-occurrence, using a log bilinear model that is linear in both 'wi' and 'wj'."]}, {'end': 4108.596, 'start': 3536.968, 'title': 'Word vectors and model embedding', 'summary': 'Discusses the concept of word vectors and model embedding, emphasizing the need to minimize the difference between word pairs, the inclusion of bias terms in models, and the use of word vectors in extrinsic tasks, such as named entity recognition, with significant improvements in performance.', 'duration': 571.628, 'highlights': ['The concept of minimizing the difference between word pairs by squaring the difference to make it positive and as small as possible is emphasized, with a focus on the inclusion of bias terms in models for better representation of word probabilities (90% of the concept).', 'The use of word vectors in extrinsic tasks, such as named entity recognition, is 
discussed, highlighting the significant improvement in performance when word vectors are incorporated, compared to models using only discrete features (significant performance gain).', 'The consideration of word frequency in word vectors is explained, with an emphasis on paying more attention to common word pairs and the limitations of paying excessive attention to extremely common words (important consideration for word vector evaluation).', 'The concept of word sense vectors is introduced, highlighting the use of different word vectors for different meanings of a word, and the successful application of clustering instances of a word to represent word senses (innovative approach to capturing multiple meanings of words).']}], 'duration': 694.413, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew3414183.jpg', 'highlights': ['Scaled SVDs and Word2vec show better performance for similarities than plain SVDs.', "GloVe model's objective function aims to have the dot product represent the log probability of co-occurrence.", 'Inclusion of bias terms in models improves representation of word probabilities.', 'Word vectors in extrinsic tasks, like named entity recognition, significantly improve performance.', 'Minimizing the difference between word pairs is emphasized for better word representation.', 'Word sense vectors use different word vectors for different meanings of a word.']}, {'end': 4512.297, 'segs': [{'end': 4155.399, 'src': 'embed', 'start': 4109.096, 'weight': 5, 'content': [{'end': 4110.698, 'text': "So that's then the computer sense.", 'start': 4109.096, 'duration': 1.602}, {'end': 4112.999, 'text': 'So basically, this does work.', 'start': 4111.098, 'duration': 1.901}, {'end': 4117.942, 'text': 'And we can learn word vectors for different senses of a word.', 'start': 4113.158, 'duration': 4.784}, {'end': 4123.524, 'text': "But actually, this isn't the majority way that things have then gone in 
practice.", 'start': 4118.402, 'duration': 5.122}, {'end': 4128.587, 'text': 'And there are a couple of reasons for that.', 'start': 4124.965, 'duration': 3.622}, {'end': 4131.028, 'text': 'I mean, one is just simplicity.', 'start': 4128.907, 'duration': 2.121}, {'end': 4135.868, 'text': "If you do this, it's kind of complex,", 'start': 4131.448, 'duration': 4.42}, {'end': 4141.572, 'text': 'because you first of all have to learn word senses and then start learning word vectors in terms of the word senses.', 'start': 4135.868, 'duration': 5.704}, {'end': 4151.756, 'text': "But the other reason is, although this model of having word senses is traditional, it's what you see in dictionaries,", 'start': 4142.551, 'duration': 9.205}, {'end': 4155.399, 'text': "it's commonly what's being used in natural language processing.", 'start': 4151.756, 'duration': 3.643}], 'summary': 'Learning separate sense vectors works but adds complexity; discrete word senses are the traditional view in dictionaries and nlp.', 'duration': 46.303, 'max_score': 4109.096, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew4109095.jpg'}, {'end': 4243.817, 'src': 'embed', 'start': 4201.704, 'weight': 1, 'content': [{'end': 4213.852, 'text': 'So it actually turns out that in practice you can do rather well by simply having one word vector per word type.', 'start': 4201.704, 'duration': 12.148}, {'end': 4216.994, 'text': 'And what happens if you do that?', 'start': 4214.452, 'duration': 2.542}, {'end': 4236.329, 'text': 'Well, what you find is that what you learn as a word vector is what gets referred to in fancy talk as a superposition of the word vectors', 'start': 4217.615, 'duration': 18.714}, {'end': 4243.817, 'text': 'for the different senses of a word, where the word superposition means no more or less than a weighted sum.', 'start': 4236.329, 'duration': 7.488}], 'summary': 'One word vector per word type can yield a superposition of word vectors for different senses.', 'duration': 42.113, 
'max_score': 4201.704, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew4201704.jpg'}, {'end': 4333.137, 'src': 'embed', 'start': 4299.296, 'weight': 2, 'content': [{'end': 4309.703, 'text': 'But actually, it turns out that if you use this average vector in applications, it tends to sort of self-disambiguate.', 'start': 4299.296, 'duration': 10.407}, {'end': 4320.269, 'text': 'Because if you say, is the word pike similar to the word for fish? Well, part of this vector represents fish.', 'start': 4310.103, 'duration': 10.166}, {'end': 4322.51, 'text': 'the fish sense of pike.', 'start': 4321.229, 'duration': 1.281}, {'end': 4327.293, 'text': "And so in those components, it'll be kind of similar to the fish vector.", 'start': 4322.89, 'duration': 4.403}, {'end': 4333.137, 'text': "And so yes, you'll say there's substantial similarity.", 'start': 4327.613, 'duration': 5.524}], 'summary': 'Using average vector in applications self-disambiguates, showing substantial similarity.', 'duration': 33.841, 'max_score': 4299.296, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew4299296.jpg'}, {'end': 4462.964, 'src': 'embed', 'start': 4436.141, 'weight': 0, 'content': [{'end': 4448.255, 'text': 'that things are so sparse in those high dimensional vector spaces that you can use ideas from sparse coding to actually separate out the different senses,', 'start': 4436.141, 'duration': 12.114}, {'end': 4450.558, 'text': "providing they're relatively common.", 'start': 4448.255, 'duration': 2.303}, {'end': 4455.62, 'text': 'So they show in their paper that you can start with the vector of, say,', 'start': 4451.478, 'duration': 4.142}, {'end': 4462.964, 'text': 'pike and actually separate out components of that vector that correspond to different senses of the word pike.', 'start': 4455.62, 'duration': 7.344}], 'summary': 'Sparse coding can separate senses in high-dimensional 
vector spaces.', 'duration': 26.823, 'max_score': 4436.141, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew4436141.jpg'}], 'start': 4109.096, 'title': 'Word vectors and word sense disambiguation', 'summary': "Discusses the challenges and concepts related to word vectors and word senses, including complexities, traditional use in natural language processing, word sense disambiguation, superposition of senses, and an example of separating different senses of the word 'pike'.", 'chapters': [{'end': 4155.399, 'start': 4109.096, 'title': 'Word vectors and word senses', 'summary': 'Discusses the concept of word vectors for different senses of a word and the challenges in implementing this method, highlighting complexity and traditional use in natural language processing.', 'duration': 46.303, 'highlights': ['Learning separate vectors for each sense of a word works, but it is not the approach most practice has followed, partly because of its complexity.', 'Learning word vectors for different senses of a word is complex and involves first learning word senses and then learning word vectors in terms of the word senses.']}, {'end': 4512.297, 'start': 4155.399, 'title': 'Word sense disambiguation', 'summary': "Discusses the challenges of word sense disambiguation, the concept of word vectors as a superposition of senses, and the surprising result that sense vectors can be reconstructed from a word vector, with an example of separating different senses of the word 'pike'.", 'duration': 356.898, 'highlights': ['The vector for a word can be a superposition of the word vectors for different senses, with the weighting corresponding to the frequencies of use of the different senses.', 'Using the average vector for a word tends to self-disambiguate in applications, as it represents different senses and shows similarity to relevant words in context.', "In high dimensional vector spaces, sparse coding 
can be used to separate out components of a word vector that correspond to different senses of the word, as demonstrated with the example of separating five different senses of the word 'pike'.", 'Different senses of a word can be identified by examining which components of the word vector are similar to other words used in the same context, leading to self-disambiguation.', 'A surprising result is that sense vectors can be reconstructed from a word vector using ideas from sparse coding, allowing the separation of components corresponding to different senses of the word.']}], 'duration': 403.201, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gqaHkPEZAew/pics/gqaHkPEZAew4109095.jpg', 'highlights': ["In high dimensional vector spaces, sparse coding can be used to separate out components of a word vector that correspond to different senses of the word, as demonstrated with the example of separating five different senses of the word 'pike'.", 'The vector for a word can be a superposition of the word vectors for different senses, with the weighting corresponding to the frequencies of use of the different senses.', 'Using the average vector for a word tends to self-disambiguate in applications, as it represents different senses and shows similarity to relevant words in context.', 'A surprising result is that sense vectors can be reconstructed from a word vector using ideas from sparse coding, allowing the separation of components corresponding to different senses of the word.', 'Different senses of a word can be identified by examining which components of the word vector are similar to other words used in the same context, leading to self-disambiguation.', 'Learning separate vectors for each sense of a word works, but it is not the approach most practice has followed, partly because of its complexity.', 'Learning word vectors for different senses of a word is complex and involves first learning word senses and then 
learning word vectors in terms of the word senses.']}], 'highlights': ['The GloVe model unifies co-occurrence matrix models and neural models by using a log bilinear model to capture co-occurrence probabilities, providing a fast and scalable training algorithm.', 'Using Common Crawl web data significantly improves semantic performance, with 42 billion words leading to good scores in semantic analogies.', 'The chapter emphasizes the significance of 300 dimensional word vectors and their prevalence in the best-known sets of word vectors.', 'Scaled SVDs and Word2vec show better performance for similarities than plain SVDs.', 'The model uses word vectors as the only parameters, with outside and center word vectors for each word, utilizing a dot product to determine the likelihood of outside words occurring with the center word.']}