title
Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 2 – Word Vectors and Word Senses
description
For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3qeGYcW
Professor Christopher Manning
Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science
Director, Stanford Artificial Intelligence Laboratory (SAIL)
To follow along with the course schedule and syllabus, visit: http://web.stanford.edu/class/cs224n/index.html#schedule
Chapters:
00:00 Intro
00:29 IPython Notebook
01:57 Analogy Problems
07:18 Principal components analysis scatter plot
09:56 Halt your IPython notebooks
24:19 Stochastic Gradients with Word Vectors
26:07 Two Word Vectors
29:49 Negative Sampling
30:46 Sigmoid Functions
33:04 Unigram Distribution
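The Analogy Problems chapter (01:57) works through examples like king − man + woman ≈ queen: subtract one vector, add another, then look up the nearest remaining word by cosine similarity. A minimal sketch of that lookup, using tiny hand-made vectors as stand-ins for real GloVe embeddings:

```python
import numpy as np

# Toy 3-d word vectors, hand-picked for illustration; real GloVe
# vectors are 50-300 dimensional and learned from co-occurrence data.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),   # distractor word
}

def analogy(a, b, c, vecs):
    """Return the word closest to vec(b) - vec(a) + vec(c) by cosine
    similarity, excluding the three query words themselves."""
    target = vecs[b] - vecs[a] + vecs[c]
    best, best_sim = None, -np.inf
    for word, v in vecs.items():
        if word in (a, b, c):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("man", "king", "woman", vectors))  # → queen
```

Excluding the query words matters in practice: with real embeddings the nearest neighbor of king − man + woman is usually king itself.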
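The 07:18 chapter shows a principal components analysis scatter plot of word vectors, with the caution that a 2-d projection of high-dimensional vectors discards most of the structure. A sketch of that projection via SVD, with random vectors standing in for real embeddings:

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their top two principal components."""
    Xc = X - X.mean(axis=0)                          # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                             # 2-d coordinates per row

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 100))   # six stand-in 100-d "word vectors"
coords = pca_2d(X)
print(coords.shape)             # (6, 2) — one point per word, ready to scatter-plot
```

The first column carries at least as much variance as the second, since SVD orders the singular values; plotting `coords` with word labels reproduces the kind of figure shown in lecture.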
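The Stochastic Gradients with Word Vectors segment (24:19) points out that the gradient from one sampled window is extremely sparse: only the rows of the embedding matrix for words in that window change. A schematic of that sparse row update (the word ids and gradient values here are random stand-ins, not a real skip-gram gradient):

```python
import numpy as np

np.random.seed(0)
vocab_size, dim, lr = 10_000, 50, 0.05

# Embedding matrix: one row per vocabulary word.
W = 0.01 * np.random.randn(vocab_size, dim)

# One window touches only a few word ids, so instead of updating the
# full (10,000 x 50) matrix we update just those rows.
window_ids = np.array([3, 17, 42, 99])                 # hypothetical word ids
grad_rows = np.random.randn(len(window_ids), dim)      # stand-in gradient rows

W_before = W[window_ids].copy()
untouched = W[0].copy()
W[window_ids] -= lr * grad_rows   # SGD step on 4 of 10,000 rows
```

All other rows (e.g. row 0) are left untouched, which is why sparse-update machinery pays off when distributing training across machines.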
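The Negative Sampling and Sigmoid Functions segments (29:49, 30:46) describe replacing the softmax with binary logistic regressions: push σ(u_o·v_c) toward 1 for the observed context word, and σ(u_k·v_c) toward 0 for k randomly sampled negatives, using the symmetry σ(−x) = 1 − σ(x). A sketch of that per-pair loss, with random vectors standing in for learned embeddings:

```python
import numpy as np

def sigmoid(x):
    # Maps any real number into (0, 1); sigmoid(-x) = 1 - sigmoid(x).
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, u_negs):
    """Skip-gram negative-sampling loss for one (center, outside) pair:
    -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c).
    Minimizing it pushes the observed pair's dot product up and the
    sampled negatives' dot products down."""
    loss = -np.log(sigmoid(u_o @ v_c))
    for u_k in u_negs:
        loss -= np.log(sigmoid(-u_k @ v_c))
    return loss

rng = np.random.default_rng(0)
v_c = rng.normal(size=8)                  # center word vector
u_o = v_c + 0.1 * rng.normal(size=8)      # similar vector: true context word
u_negs = rng.normal(size=(5, 8))          # k = 5 random negative samples
print(neg_sampling_loss(v_c, u_o, u_negs))
```

Note the sign flip on the negative samples' dot products: negating the argument of the symmetric sigmoid is how "1 minus the probability" enters the loss, matching the transcript's discussion.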
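The Unigram Distribution segment (33:04) explains that negative samples are drawn from unigram counts raised to the 3/4 power, which shrinks the share of very common words and boosts rarer ones. A small worked example (the counts are made up):

```python
import numpy as np

# Hypothetical unigram counts for a three-word vocabulary, common -> rare.
counts = np.array([1000.0, 100.0, 10.0])

p_unigram = counts / counts.sum()
p_smoothed = counts ** 0.75 / (counts ** 0.75).sum()

# The 3/4 power dampens the most frequent word's sampling probability
# and raises the rare words' probabilities, while keeping the order.
print(np.round(p_unigram, 3))
print(np.round(p_smoothed, 3))
```

Here the most common word drops from about 0.90 of the mass to about 0.83, while the rarest roughly triples its share; the exponent 3/4 is an empirical choice from the Word2Vec paper.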
detail
{'title': 'Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 2 – Word Vectors and Word Senses', 'heatmap': [{'end': 1841.95, 'start': 1788.828, 'weight': 0.705}, {'end': 3400.569, 'start': 3241.695, 'weight': 0.721}], 'summary': 'Covers word vectors, word embeddings, word2vec model, glove model, and word vector evaluations in nlp, including techniques, optimization, and multi-sense word models, offering insights into challenges, limitations, and practical implications.', 'chapters': [{'end': 900.465, 'segs': [{'end': 80.203, 'src': 'embed', 'start': 29.951, 'weight': 0, 'content': [{'end': 33.313, 'text': 'I stuck this IPython notebook up on the course page.', 'start': 29.951, 'duration': 3.362}, {'end': 37.335, 'text': 'So under Lecture 1, you can find a copy of it and you can download it.', 'start': 33.373, 'duration': 3.962}, {'end': 41.657, 'text': 'So I both stuck up just an HTML version of it and a zip file.', 'start': 37.635, 'duration': 4.022}, {'end': 45.419, 'text': "Like the HTML file is only good to look at, you can't do anything with it.", 'start': 41.937, 'duration': 3.482}, {'end': 51.602, 'text': 'So you wanna- if you wanna play with it via yourself, um, download the zip file and get the IPython notebook out of that.', 'start': 45.459, 'duration': 6.143}, {'end': 56.864, 'text': "Okay, so we were looking at these glove word vectors, which I'll talk about a bit more today.", 'start': 52.262, 'duration': 4.602}, {'end': 67.45, 'text': 'And so there were these sort of basic results of similarity in this vector space worked very nicely for discovering similar words.', 'start': 57.385, 'duration': 10.065}, {'end': 68.39, 'text': 'And then.', 'start': 68.09, 'duration': 0.3}, {'end': 70.512, 'text': 'going on from that.', 'start': 69.591, 'duration': 0.921}, {'end': 80.203, 'text': "there was this idea that we'll spend some more time on today, which was um, maybe this vector space is not only a similarity space where,", 'start': 70.512, 
'duration': 9.691}], 'summary': 'Ipython notebook available for download under lecture 1; exploring glove word vectors and similarity in vector space.', 'duration': 50.252, 'max_score': 29.951, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM29951.jpg'}, {'end': 224.864, 'src': 'embed', 'start': 153.858, 'weight': 4, 'content': [{'end': 156.959, 'text': "We're going to subtract the man vector from the king vector.", 'start': 153.858, 'duration': 3.101}, {'end': 167.002, 'text': "And the idea we have in our head then is if we do that, what will happen is we'll be left with the meaning of kingship without the man-ness.", 'start': 157.319, 'duration': 9.683}, {'end': 174.184, 'text': "Um, And so then there's also a director, a vector for woman.", 'start': 167.742, 'duration': 6.442}, {'end': 180.606, 'text': 'So we could add the woman vector to that resulting vector and then we could say well, in the vector,', 'start': 174.304, 'duration': 6.302}, {'end': 187.167, 'text': "we end up at some point in the vector space and then we're gonna say well, what's the closest word that you can find to here?", 'start': 180.606, 'duration': 6.561}, {'end': 189.848, 'text': "And it's gonna print out the closest word.", 'start': 187.708, 'duration': 2.14}, {'end': 201.812, 'text': "And as we saw, um, last time, um, lo and behold, if you do that, um, you get the answer I'm saying you get.", 'start': 190.248, 'duration': 11.564}, {'end': 206.115, 'text': 'Um, king, man, woman.', 'start': 202.453, 'duration': 3.662}, {'end': 211.177, 'text': 'No? 
Wait.', 'start': 208.056, 'duration': 3.121}, {'end': 222.163, 'text': 'I have to reverse king and, ah, sure, sure, sure.', 'start': 211.197, 'duration': 10.966}, {'end': 223.403, 'text': 'Sorry Whoops.', 'start': 222.203, 'duration': 1.2}, {'end': 224.864, 'text': 'Yeah Okay.', 'start': 223.643, 'duration': 1.221}], 'summary': 'Using vector operations to derive word meanings and find closest words.', 'duration': 71.006, 'max_score': 153.858, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM153858.jpg'}, {'end': 329.474, 'src': 'embed', 'start': 284.502, 'weight': 3, 'content': [{'end': 290.426, 'text': 'So I can say tallest to tallest as long as to longest, and it gets that.', 'start': 284.502, 'duration': 5.924}, {'end': 296.849, 'text': 'Um, if I say good is to fantastic, as bad is to terrible,', 'start': 291.307, 'duration': 5.542}, {'end': 304.833, 'text': "then it seems to get out that there's some kind of notion of make more extreme direction and get this direction out.", 'start': 296.849, 'duration': 7.984}, {'end': 305.973, 'text': 'I skipped over one.', 'start': 304.953, 'duration': 1.02}, {'end': 317.281, 'text': 'Obama is to Clinton as Reagan is to, You may or may not like the answer it gives for this one as Obama is to, as Reagan is to Nixon.', 'start': 305.993, 'duration': 11.288}, {'end': 323.968, 'text': 'Um, now one thing you might notice at this point, and this is something I actually want to come back to at the end.', 'start': 317.782, 'duration': 6.186}, {'end': 329.474, 'text': "Um, well, there's this problem because Clinton's ambiguous, right? 
There's Bill and there's Hillary.", 'start': 324.489, 'duration': 4.985}], 'summary': 'Testing ai understanding of word relationships and associations.', 'duration': 44.972, 'max_score': 284.502, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM284502.jpg'}, {'end': 377.518, 'src': 'embed', 'start': 355.65, 'weight': 7, 'content': [{'end': 367.676, 'text': "So I think what we're getting um out of this is that Clinton and Nixon are sort of similar, of people in dangers, um, of being impeached, um and uh,", 'start': 355.65, 'duration': 12.026}, {'end': 371.517, 'text': 'on both sides of the aisle, and is thinking primarily of Bill Clinton.', 'start': 367.676, 'duration': 3.841}, {'end': 377.518, 'text': "But um, if this sort of brings up something that I'll come back to right at the end of um,", 'start': 371.837, 'duration': 5.681}], 'summary': "Comparison of clinton and nixon's impeachment risk, with focus on bill clinton.", 'duration': 21.868, 'max_score': 355.65, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM355650.jpg'}, {'end': 452.146, 'src': 'embed', 'start': 420.232, 'weight': 8, 'content': [{'end': 424.636, 'text': 'Um, and so you can do that and it decides that serial is the odd one out of that set.', 'start': 420.232, 'duration': 4.404}, {'end': 425.978, 'text': 'Seems okay.', 'start': 425.397, 'duration': 0.581}, {'end': 433.445, 'text': "Um, and then one other thing I'll just show you is, so, um, it'd sort of be nice to look at these words as I've drawn them.", 'start': 426.638, 'duration': 6.807}, {'end': 436.008, 'text': 'in some of the slide pictures.', 'start': 434.386, 'duration': 1.622}, {'end': 441.975, 'text': 'So this is saying to put together a PCA principal components analysis, um, scatterplot.', 'start': 436.348, 'duration': 5.627}, {'end': 452.146, 'text': 'Um, so I can do that and then I can say um, give it a set of words and 
draw me these as a scatterplot.', 'start': 442.535, 'duration': 9.611}], 'summary': 'Demonstrating pca scatterplot for word analysis.', 'duration': 31.914, 'max_score': 420.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM420232.jpg'}], 'start': 5.24, 'title': 'Word vectors and analysis', 'summary': 'Discusses the application of word vectors, analysis using pca, and challenges in python plotting, highlighting limitations and potential issues, aiding in understanding word meanings and similarities.', 'chapters': [{'end': 355.41, 'start': 5.24, 'title': 'Cs224n class 2: word vectors and meanings', 'summary': 'Discusses the application of word vectors in capturing similarity and profound meanings, illustrated through analogy problems and various examples, with an emphasis on the potential limitations of the data used.', 'duration': 350.17, 'highlights': ['The word vectors in a vector space work effectively in discovering similar words.', 'The vector space may capture meanings in a deeper and more profound way, with directions in the space representing specific meanings.', 'Illustration of analogy problems using word vectors to demonstrate meaningful associations between words.', "The limitations of the data used, particularly in the case of ambiguous terms like 'Clinton.'"]}, {'end': 557.084, 'start': 355.65, 'title': 'Word vectors analysis', 'summary': 'Discusses the analysis of word vectors using pca to visualize the similarity of words, cautioning on the accuracy of the 2d projection and mentioning the potential issues of string ambiguity and the odd-one-out word tests.', 'duration': 201.434, 'highlights': ['The analysis of word vectors using PCA to visualize the similarity of words', 'Cautioning on the accuracy of the 2D projection', 'Mentioning the potential issues of string ambiguity and the odd-one-out word tests']}, {'end': 900.465, 'start': 557.504, 'title': 'Python plotting and word vectors', 
'summary': 'Discusses challenges in point labeling in python scatter plots and explores the process of learning word vectors through word2vec, matrices, and probability distributions for context prediction.', 'duration': 342.961, 'highlights': ['The chapter discusses the challenges in point labeling in scatter plots in Python and suggests the need for a better way to label points in Python plots.', 'The chapter delves into the process of learning word vectors through Word2vec, involving iterative updating algorithms and probability distributions for context prediction.', 'The chapter explains the concept of matrices representing word vectors and their representation as rows in major deep learning packages such as TensorFlow and PyTorch.']}], 'duration': 895.225, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM5240.jpg', 'highlights': ['The vector space captures meanings in a profound way, with directions representing specific meanings.', 'Illustration of analogy problems using word vectors to demonstrate meaningful associations between words.', 'The word vectors in a vector space effectively discover similar words.', 'The analysis of word vectors using PCA to visualize the similarity of words.', "The limitations of the data used, particularly in the case of ambiguous terms like 'Clinton.'", 'Mentioning the potential issues of string ambiguity and the odd-one-out word tests.', 'The chapter discusses the challenges in point labeling in scatter plots in Python.', 'The chapter delves into the process of learning word vectors through Word2vec.', 'The chapter explains the concept of matrices representing word vectors and their representation as rows in major deep learning packages.']}, {'end': 1852.03, 'segs': [{'end': 929.761, 'src': 'embed', 'start': 900.565, 'weight': 0, 'content': [{'end': 903.706, 'text': 'The most likely word two to the left is house, three to the left is house.', 'start': 900.565, 
'duration': 3.141}, {'end': 906.447, 'text': 'the one to the right should be house two right?', 'start': 904.266, 'duration': 2.181}, {'end': 909.307, 'text': "So it's sort of no sort of fineness of prediction.", 'start': 906.487, 'duration': 2.82}, {'end': 916.089, 'text': "it's just an overall kind of um probability distribution of words that are likely to occur in my context.", 'start': 909.307, 'duration': 6.782}, {'end': 927.292, 'text': "So all we're asking for is a model that gives reasonably high probability estimates to all words that occur in the context of this word relatively often.", 'start': 916.149, 'duration': 11.143}, {'end': 929.761, 'text': "There's nothing more to it than that.", 'start': 928.22, 'duration': 1.541}], 'summary': 'Model aims to predict words likely to occur in context, based on probability distribution.', 'duration': 29.196, 'max_score': 900.565, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM900565.jpg'}, {'end': 1018.363, 'src': 'embed', 'start': 991.151, 'weight': 3, 'content': [{'end': 1000.276, 'text': "And I mean, one of the things that some work has discussed, so on the, readings there are two papers from Sanjeev Arora's group in Princeton.", 'start': 991.151, 'duration': 9.125}, {'end': 1006.978, 'text': 'And one of those papers sort of discusses, um, this probability high frequency effect.', 'start': 1000.616, 'duration': 6.362}, {'end': 1018.363, 'text': 'And your crude way of actually fixing this high frequency effect is that normally um the first, um, the first, biggest component,', 'start': 1007.278, 'duration': 11.085}], 'summary': "Discussion on high frequency effect in sanjeev arora's papers", 'duration': 27.212, 'max_score': 991.151, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM991151.jpg'}, {'end': 1116.192, 'src': 'embed', 'start': 1089.932, 'weight': 4, 'content': [{'end': 1094.719, 'text': 'a 
word can be close to lots of other words in different directions.', 'start': 1089.932, 'duration': 4.787}, {'end': 1097.619, 'text': 'Um, Okay.', 'start': 1095.48, 'duration': 2.139}, {'end': 1104.664, 'text': 'So, um, we sort of started to talk, um, about how we went about learning these word vectors.', 'start': 1097.659, 'duration': 7.005}, {'end': 1111.528, 'text': "I'm sort of gonna take about a five-minute, um, detour into optimization.", 'start': 1105.104, 'duration': 6.424}, {'end': 1113.89, 'text': "Now, this isn't really an optimization class.", 'start': 1111.589, 'duration': 2.301}, {'end': 1116.192, 'text': 'If you wanna learn a lot about optimization.', 'start': 1113.95, 'duration': 2.242}], 'summary': 'Learning word vectors involves optimization, not an optimization class.', 'duration': 26.26, 'max_score': 1089.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM1089932.jpg'}, {'end': 1192.445, 'src': 'embed', 'start': 1162.337, 'weight': 2, 'content': [{'end': 1164.118, 'text': 'which were our variables theta.', 'start': 1162.337, 'duration': 1.781}, {'end': 1174.525, 'text': "And then what we want to do is say well, if we take a small step in the direction of the negative of the gradient, that'll be taking us down,", 'start': 1164.538, 'duration': 9.987}, {'end': 1182.133, 'text': 'say downhill in this space, and we want to keep on doing that and sort of head to the minimum of our space.', 'start': 1174.525, 'duration': 7.608}, {'end': 1188.08, 'text': 'I mean, of course, in our high multi-dimensional space, you know, it might not be a nice smooth curve like this.', 'start': 1182.634, 'duration': 5.446}, {'end': 1192.445, 'text': "It might be a horrible and non-convex curve, but that's just the idea.", 'start': 1188.16, 'duration': 4.285}], 'summary': 'Using gradient descent to find minimum in multi-dimensional space.', 'duration': 30.108, 'max_score': 1162.337, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM1162337.jpg'}, {'end': 1362.696, 'src': 'embed', 'start': 1334.065, 'weight': 1, 'content': [{'end': 1339.807, 'text': 'So this is sort of an amazingly, amazingly noisy estimate of the gradient.', 'start': 1334.065, 'duration': 5.742}, {'end': 1346.869, 'text': "But it sort of doesn't matter too much, because as soon as we've done it, we're gonna choose a different center word and do it again and again,", 'start': 1340.187, 'duration': 6.682}, {'end': 1354.671, 'text': "so that gradually we sort of approach what we would have gotten if we'd sort of looked at all of the center words before we took any steps.", 'start': 1346.869, 'duration': 7.802}, {'end': 1362.696, 'text': 'But because we take steps as we go, we get to the minimum of the function, orders of magnitude more quickly.', 'start': 1355.071, 'duration': 7.625}], 'summary': 'Noisy gradient estimate helps reach function minimum faster.', 'duration': 28.631, 'max_score': 1334.065, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM1334065.jpg'}, {'end': 1841.95, 'src': 'heatmap', 'start': 1788.828, 'weight': 0.705, 'content': [{'end': 1795.533, 'text': "And so the idea of negative sampling is we're going to train binary logistic regressions instead.", 'start': 1788.828, 'duration': 6.705}, {'end': 1802.078, 'text': "And so we're gonna train one binary logistic regression for the actual word observed.", 'start': 1795.894, 'duration': 6.184}, {'end': 1804.28, 'text': "what's in the numerator?", 'start': 1802.078, 'duration': 2.202}, {'end': 1808.804, 'text': 'and you want to give high probability to the word that was actually observed.', 'start': 1804.28, 'duration': 4.524}, {'end': 1821.691, 'text': "And then what we're gonna do is we're gonna sort of randomly sample a bunch of other words they're the negative samples and say they weren't the ones that were actually 
seen.", 'start': 1809.404, 'duration': 12.287}, {'end': 1826.034, 'text': 'So you should be trying to give them as low a probability as possible.', 'start': 1822.051, 'duration': 3.983}, {'end': 1834.907, 'text': "Okay So, um, the sort of notation that they use in the paper is sort of slightly different to the one I've used.", 'start': 1827.463, 'duration': 7.444}, {'end': 1841.95, 'text': "Um, they actually do maximization, not minimization, and that's their equation, which I'll come back to.", 'start': 1835.267, 'duration': 6.683}], 'summary': 'Using negative sampling to train binary logistic regressions for word observations, giving high probability to actual words and low probability to negative samples.', 'duration': 53.122, 'max_score': 1788.828, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM1788828.jpg'}], 'start': 900.565, 'title': 'Word embeddings and optimization techniques in nlp', 'summary': 'Discusses word embeddings, word vector optimization, gradient descent, mini-batch optimization, and word2vec models. 
it outlines the impact of simplistic models, the process of learning word vectors, advantages of mini-batch optimization, sparse matrix updates, and practical challenges in word2vec models.', 'chapters': [{'end': 1067.123, 'start': 900.565, 'title': 'Word embeddings and probability distributions', 'summary': 'Discusses the use of word embeddings to predict probable words in context and highlights the impact of simplistic models on capturing word meanings and addressing high frequency effects.', 'duration': 166.558, 'highlights': ['The model aims to provide high probability estimates for words that occur relatively often in a given context, contributing to the surprising ability of simplistic models to capture meanings of words.', "Word vectors have a strong word probability component reflecting high frequency effects, with potential solutions discussed in papers from Sanjeev Arora's group in Princeton.", 'Two-dimensional pictures of word spaces are misleading, as they fail to capture the true relationships between words and the effects of high frequency words on semantic similarities.']}, {'end': 1354.671, 'start': 1067.143, 'title': 'Word vector optimization', 'summary': 'Discusses the unintuitive properties of high-dimensional vector spaces, the process of learning word vectors through optimization, and the use of stochastic gradient descent to efficiently compute gradients in deep learning systems.', 'duration': 287.528, 'highlights': ['Stochastic gradient descent is used in deep learning systems to efficiently compute gradients by sampling a window and using the estimate of the gradient as a parameter update.', 'The process of learning word vectors involves minimizing a cost function by calculating the gradient of the cost function with respect to word vectors, and taking small steps in the direction of the negative gradient to approach the minimum of the space.', 'High-dimensional vector spaces have unintuitive properties, such as a word being close to 
multiple other words in different directions.']}, {'end': 1508.48, 'start': 1355.071, 'title': 'Gradient descent and mini-batch optimization', 'summary': 'Discusses the advantages of using mini-batch optimization, including faster computations due to parallelization and less noisy gradient estimates, with examples of using 32 or 64 examples in a mini-batch, and explains the sparsity of parameter updates in stochastic gradients with word vectors.', 'duration': 153.409, 'highlights': ['Using mini-batch optimization provides faster computations due to parallelization, gaining a lot by using a mini-batch of 64 examples, and less noisy estimates of the gradient compared to using just one example.', 'The sparsity of parameter updates in stochastic gradients with word vectors is highlighted, where a mini-batch with a relatively small number of words is used to build a model over a vocabulary of quarter of a million words, resulting in most elements in the vector being zero.', 'NVIDIA GPUs perform better with mini-batch sizes of 32 or 64 due to the hardware architecture, as it allows better speedups compared to using arbitrary batch sizes.']}, {'end': 1668.686, 'start': 1508.88, 'title': 'Word2vec: sparse matrix update', 'summary': 'Discusses the optimization of updating word vectors in the word2vec model, emphasizing the benefits of sparse matrix updates and the rationale behind using two word vectors instead of one for practical ease and better results.', 'duration': 159.806, 'highlights': ['Sparse matrix updates can significantly improve the speed of updating word vectors, especially when performing distributed computation over multiple computers.', 'Using two word vectors instead of one in the Word2Vec model simplifies the mathematical computations and results in better vector representations for words.', 'The choice of having two word vectors in Word2Vec is practical, as it simplifies the math for working out partial derivatives and avoids complex squared terms in the 
computations.']}, {'end': 1852.03, 'start': 1668.686, 'title': 'Understanding word2vec models', 'summary': 'Explains the two main parts of the word2vec family - continuous bag of words model and skip grams model, highlighting the practical challenges and the proposed solution of negative sampling for faster computation.', 'duration': 183.344, 'highlights': ['The chapter explains the two main parts of the Word2Vec family - continuous bag of words model and skip grams model, with a focus on the practical challenges of using naive softmax and the proposed solution of negative sampling for faster computation.', 'The skip grams model involves predicting all the words in context one at a time using one center word, while the continuous bag of words model aims to predict the center word using all the outside words, considered independently, like a naive Bayes model.', 'The proposed solution of negative sampling involves training binary logistic regressions to give high probability to the observed word and low probability to randomly sampled negative words, addressing the practical challenges of slow computation in the Word2Vec model.']}], 'duration': 951.465, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM900565.jpg', 'highlights': ['Using mini-batch optimization provides faster computations due to parallelization, gaining a lot by using a mini-batch of 64 examples, and less noisy estimates of the gradient compared to using just one example.', 'The process of learning word vectors involves minimizing a cost function by calculating the gradient of the cost function with respect to word vectors, and taking small steps in the direction of the negative gradient to approach the minimum of the space.', 'The model aims to provide high probability estimates for words that occur relatively often in a given context, contributing to the surprising ability of simplistic models to capture meanings of words.', 'The chapter 
explains the two main parts of the Word2Vec family - continuous bag of words model and skip grams model, with a focus on the practical challenges of using naive softmax and the proposed solution of negative sampling for faster computation.', 'Stochastic gradient descent is used in deep learning systems to efficiently compute gradients by sampling a window and using the estimate of the gradient as a parameter update.']}, {'end': 2216.497, 'segs': [{'end': 2028.919, 'src': 'embed', 'start': 1995.973, 'weight': 0, 'content': [{'end': 1997.815, 'text': 'So those are called unigram counts.', 'start': 1995.973, 'duration': 1.842}, {'end': 2003.339, 'text': 'And so you start off with unigram counts, but then you raise them to the three-quarters power.', 'start': 1998.255, 'duration': 5.084}, {'end': 2014.086, 'text': 'And raising to the three-quarters power has the effect of um decreasing how often you sample very common words and increasing how often you sample rarer words.', 'start': 2003.739, 'duration': 10.347}, {'end': 2019.089, 'text': "Okay Um, and that's that.", 'start': 2016.087, 'duration': 3.002}, {'end': 2023.532, 'text': "Okay So that's everything about Word2Vec I'm going to say.", 'start': 2019.53, 'duration': 4.002}, {'end': 2028.919, 'text': 'Anyone have any? 
Last thing, yes.', 'start': 2023.552, 'duration': 5.367}], 'summary': 'Word2vec uses unigram counts, raised to three-quarters power, to adjust sampling frequency.', 'duration': 32.946, 'max_score': 1995.973, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM1995973.jpg'}, {'end': 2092.217, 'src': 'embed', 'start': 2063.759, 'weight': 2, 'content': [{'end': 2074.282, 'text': 'it normally means I am a normalization term to turn things into probabilities and you sort of iterate over the numerator term and summing them and divide through.', 'start': 2063.759, 'duration': 10.523}, {'end': 2080.123, 'text': "Any other questions of things I haven't explained or otherwise? Yes.", 'start': 2075.282, 'duration': 4.841}, {'end': 2086.232, 'text': "So the window length, that's the, Yes.", 'start': 2080.524, 'duration': 5.708}, {'end': 2092.217, 'text': "So, what size window do you use? I'll actually come back to that in a bit and show a little bit of data on that.", 'start': 2086.331, 'duration': 5.886}], 'summary': 'Explaining the use of normalization term and iterating over the numerator term in data analysis', 'duration': 28.458, 'max_score': 2063.759, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM2063759.jpg'}, {'end': 2172.007, 'src': 'embed', 'start': 2143.5, 'weight': 4, 'content': [{'end': 2146.801, 'text': 'um, the model looks very- fairly clean.', 'start': 2143.5, 'duration': 3.301}, {'end': 2148.581, 'text': 'But what people discovered,', 'start': 2147.141, 'duration': 1.44}, {'end': 2159.544, 'text': 'um when they started digging through the code which to- to their credit they did make available reproducible research that there are actually a whole bunch of tricks,', 'start': 2148.581, 'duration': 10.963}, {'end': 2168.306, 'text': 'of different things, like these hyperparameters of um, how you sample and how you weight windows, and various things to 
make the numbers better.', 'start': 2159.544, 'duration': 8.762}, {'end': 2172.007, 'text': 'So, you know, people play quite a few tricks to make the numbers go up.', 'start': 2168.626, 'duration': 3.381}], 'summary': "Model's code revealed use of tricks to boost numbers.", 'duration': 28.507, 'max_score': 2143.5, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM2143500.jpg'}], 'start': 1852.03, 'title': 'Understanding word2vec model', 'summary': 'Delves into the word2vec model, encompassing the sigmoid function, negative sampling, and sampling distribution, highlighting the significance of hyperparameters and techniques for enhanced performance.', 'chapters': [{'end': 2216.497, 'start': 1852.03, 'title': 'Understanding word2vec model', 'summary': 'Discusses the word2vec model, covering topics such as the sigmoid function, negative sampling, and sampling distribution, emphasizing the use of hyperparameters and tricks for improved performance.', 'duration': 364.467, 'highlights': ["The sigmoid function maps any real number onto a probability distribution between 0 and 1, representing binary outcomes of 'yes' and 'no'.", 'The process involves taking the dot product of two vectors, using a sigmoid function, and aiming for a high probability estimate.', 'The chapter discusses negative sampling, emphasizing the objective function, and the importance of choosing random k words to minimize their dot products with the center word.', 'The sampling distribution involves using the unigram distribution, raising the counts to the three-quarters power to favor rarer words over common ones.', 'The discussion touches on the selection of hyperparameters and various tricks used to enhance model performance.']}], 'duration': 364.467, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM1852030.jpg', 'highlights': ['The process involves taking the dot product of two vectors, 
using a sigmoid function, and aiming for a high probability estimate.', 'The chapter discusses negative sampling, emphasizing the objective function, and the importance of choosing random k words to minimize their dot products with the center word.', 'The sampling distribution involves using the unigram distribution, raising the counts to the three-quarters power to favor rarer words over common ones.', "The sigmoid function maps any real number onto a probability distribution between 0 and 1, representing binary outcomes of 'yes' and 'no'.", 'The discussion touches on the selection of hyperparameters and various tricks used to enhance model performance.']}, {'end': 2907.186, 'segs': [{'end': 2322.082, 'src': 'embed', 'start': 2260.827, 'weight': 1, 'content': [{'end': 2281.805, 'text': 'Do you have a question? So you could argue whether or not this was written in the clearest way.', 'start': 2260.827, 'duration': 20.978}, {'end': 2291.476, 'text': "So we're making this dot product and then we're negating it, which is then flipping which side of the space we're on right?", 'start': 2282.986, 'duration': 8.49}, {'end': 2296.057, 'text': 'Because the sigmoid is symmetric around 0..', 'start': 2291.536, 'duration': 4.521}, {'end': 2303.879, 'text': "So if we've got some dot product, um, and then we negate it, we're sort of working out a 1 minus probability.", 'start': 2296.057, 'duration': 7.822}, {'end': 2310.44, 'text': "And so that's the way in which we're actually for the first term.", 'start': 2303.899, 'duration': 6.541}, {'end': 2317.322, 'text': "for the first term, we're wanting the probability to be high, and then, for the negative samples, we're wanting their probability to be low.", 'start': 2310.44, 'duration': 6.882}, {'end': 2322.082, 'text': "Okay, I'll maybe run ahead now.", 'start': 2319.529, 'duration': 2.553}], 'summary': 'Discussing dot product, negation, and sigmoid symmetry in probability calculations.', 'duration': 61.255, 'max_score': 
2260.827, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM2260827.jpg'}, {'end': 2435.747, 'src': 'embed', 'start': 2411.15, 'weight': 5, 'content': [{'end': 2421.375, 'text': 'So in NLP we often want to distinguish between a particular kind of type, like banana or apple, versus particular instances of it in the text,', 'start': 2411.15, 'duration': 10.225}, {'end': 2423.676, 'text': "and that's referred to as sort of a type token distinction.", 'start': 2421.375, 'duration': 2.301}, {'end': 2431.363, 'text': 'So we could um look at each um token of a word and the words five around that,', 'start': 2424.096, 'duration': 7.267}, {'end': 2435.747, 'text': 'and then we should- could sort of start counting up which words occur- occur with it.', 'start': 2431.363, 'duration': 4.384}], 'summary': 'Nlp distinguishes between type and token, counting word occurrences.', 'duration': 24.597, 'max_score': 2411.15, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM2411150.jpg'}, {'end': 2728.858, 'src': 'embed', 'start': 2699.136, 'weight': 0, 'content': [{'end': 2715.788, 'text': "So I can um make use of um NumPy's SVD function and I can throw into it um matrices and um I can make word vectors, and these ones look really bad.", 'start': 2699.136, 'duration': 16.652}, {'end': 2717.729, 'text': 'But hey, I give it a dataset of three sentences.', 'start': 2715.848, 'duration': 1.881}, {'end': 2720.632, 'text': 'So this was exactly a fair comparison.', 'start': 2718.01, 'duration': 2.622}, {'end': 2728.858, 'text': 'But so this technique was in, um, popularized, around, um, the term, the turn of the millennium.', 'start': 2720.952, 'duration': 7.906}], 'summary': "Using numpy's svd function to create word vectors from matrices, with a dataset of three sentences. 
technique popularized around the turn of the millennium.", 'duration': 29.722, 'max_score': 2699.136, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM2699136.jpg'}, {'end': 2854.03, 'src': 'embed', 'start': 2829.806, 'weight': 2, 'content': [{'end': 2836.872, 'text': 'Um he had- he used the idea, which was also another of the hacks that was put into the Word2Vec was,', 'start': 2829.806, 'duration': 7.066}, {'end': 2843.597, 'text': 'rather than just you treating the whole window the same, that you should, um count words that are closer more.', 'start': 2836.872, 'duration': 6.725}, {'end': 2849.385, 'text': 'So in Word2Vec, they sample closer words more commonly than further away words.', 'start': 2844.358, 'duration': 5.027}, {'end': 2854.03, 'text': "Um, in his system, you're sort of having to have a differential count for closer words, etc.", 'start': 2849.825, 'duration': 4.205}], 'summary': 'Word2vec prioritizes sampling closer words more commonly than further away words.', 'duration': 24.224, 'max_score': 2829.806, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM2829806.jpg'}], 'start': 2216.778, 'title': 'Word embedding techniques and dimensionality reduction', 'summary': 'Discusses common techniques used in word embedding models, such as shuffling data for faster computation and reducing the dimensionality of word co-occurrence matrices to improve storage and model robustness, along with the application of singular value decomposition (svd) for dimensionality reduction and latent semantic analysis (lsa) for word vectors, including the challenges and improvements made by doug rohde using pearson correlations to produce more useful word vectors.', 'chapters': [{'end': 2578.908, 'start': 2216.778, 'title': 'Word embedding techniques and dimensionality reduction', 'summary': 'Discusses common techniques used in word embedding models such as 
shuffling data for faster computation, predicting word context through co-occurrence counts, and reducing the dimensionality of word co-occurrence matrices to improve storage and model robustness.', 'duration': 362.13, 'highlights': ['The technique of shuffling data at the beginning of each epoch is used for faster computation and locality benefits, resulting in different outcomes for each epoch.', 'Predicting word context through co-occurrence counts involves creating a matrix of word co-occurrence counts and measuring the similarity of vectors directly based on these counts.', 'Reducing the dimensionality of the word co-occurrence matrix, typically to a dimensionality of 25 to 1,000 as done in Word2Vec, helps address storage and sparsity issues in classification models.']}, {'end': 2907.186, 'start': 2579.348, 'title': 'Dimensionality reduction and latent semantic analysis', 'summary': "Discusses singular value decomposition (svd) for dimensionality reduction, reducing matrices to a two-dimensional representation, and the application of latent semantic analysis (lsa) for word vectors, popularized around the turn of the millennium, and the challenges with the technique. it also highlights doug rohde's improvements by manipulating word counts and using pearson correlations to produce more useful word vectors.", 'duration': 327.838, 'highlights': ['Singular value decomposition (SVD) allows for dimensionality reduction by discarding the smallest singular values, effectively reducing the representation to a lower dimension.', 'Latent Semantic Analysis (LSA) was popularized around the turn of the millennium for word applications, but faced challenges in information retrieval and did not gain widespread adoption.', "Doug Rohde's improvements to word vectors involved manipulating word counts by log scaling high-frequency words, using a ceiling function, and incorporating differential counts for closer words. 
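The count-based route described here can be sketched by building a symmetric-window co-occurrence matrix over a tiny corpus and reducing it with NumPy's SVD. The three-sentence corpus is a stand-in; real systems use far larger corpora and vocabularies, and typically keep 25-1,000 dimensions rather than 2:

```python
import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]  # toy corpus
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window of one word on each side.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 1), min(len(words), i + 2)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1

# Full SVD; keeping only the top-k singular dimensions gives dense word vectors.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * S[:k]   # one k-dimensional row per vocabulary word
print(word_vectors.shape)
```

Discarding the smallest singular values is what "reducing the representation to a lower dimension" means concretely: the retained components capture the directions of greatest variance in the count matrix.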
Additionally, he utilized Pearson correlations to transform counts, resulting in more useful word vectors."]}], 'duration': 690.408, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM2216778.jpg', 'highlights': ["Doug Rohde's improvements to word vectors involved manipulating word counts by log scaling high-frequency words, using a ceiling function, and incorporating differential counts for closer words. Additionally, he utilized Pearson correlations to transform counts, resulting in more useful word vectors.", 'Reducing the dimensionality of the word co-occurrence matrix, typically to a dimensionality of 25 to 1,000 as done in Word2Vec, helps address storage and sparsity issues in classification models.', 'The technique of shuffling data at the beginning of each epoch is used for faster computation and locality benefits, resulting in different outcomes for each epoch.', 'Predicting word context through co-occurrence counts involves creating a matrix of word co-occurrence counts and measuring the similarity of vectors directly based on these counts.', 'Singular value decomposition (SVD) allows for dimensionality reduction by discarding the smallest singular values, effectively reducing the representation to a lower dimension.', 'Latent Semantic Analysis (LSA) was popularized around the turn of the millennium for word applications, but faced challenges in information retrieval and did not gain widespread adoption.']}, {'end': 3595.724, 'segs': [{'end': 3207.229, 'src': 'embed', 'start': 3163.021, 'weight': 2, 'content': [{'end': 3166.865, 'text': 'um, that Jeffrey Pennington, Richard Socher, and I asked:', 'start': 3163.021, 'duration': 3.844}, {'end': 3175.853, 'text': 'can we sort of combine these ideas and sort of have some of the goodness of the neural net methods, um,', 'start': 3166.865, 'duration': 8.988}, {'end': 3180.217, 'text': 'while trying to do things with some kind of count matrix?', 'start':

3175.853, 'duration': 4.364}, {'end': 3194.411, 'text': 'And so in particular um we wanted to get the result in a slightly less hacky way that you want to have components of meaning being linear opera- linear operations in the vector space,', 'start': 3180.697, 'duration': 13.714}, {'end': 3197.736, 'text': "that they're just some vector you're adding, or something like this.", 'start': 3194.411, 'duration': 3.325}, {'end': 3207.229, 'text': 'And so the crucial observation of this model was that we could use ratios of co-occurrence probabilities to encode meaning components.', 'start': 3198.236, 'duration': 8.993}], 'summary': 'Exploring combining neural net methods with count matrix for linear operations in vector space using co-occurrence probabilities.', 'duration': 44.208, 'max_score': 3163.021, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM3163021.jpg'}, {'end': 3400.569, 'src': 'heatmap', 'start': 3241.695, 'weight': 0.721, 'content': [{'end': 3246.077, 'text': 'because large appears both here and here or small appears there and there.', 'start': 3241.695, 'duration': 4.382}, {'end': 3252.7, 'text': "The thing that's interesting is sort of the difference between these components and they're indicating a meaning component.", 'start': 3246.437, 'duration': 6.263}, {'end': 3259.958, 'text': 'And so we can get at that, if we look at the ratio of co-occurrence probabilities.', 'start': 3253.06, 'duration': 6.898}, {'end': 3264.101, 'text': 'And so for the ratio of co-occurrence probabilities.', 'start': 3260.398, 'duration': 3.703}, {'end': 3272.787, 'text': 'this is a dimension of meaning and where, for other words, um, this sort of ratio cancels out to about one.', 'start': 3264.101, 'duration': 8.686}, {'end': 3280.778, 'text': "And so in this slide I've moved so it's not my small and large, but these are actually actual counts from a corpus.", 'start': 3273.467, 'duration': 7.311}, {'end': 3284.804, 
'text': 'So we roughly get dimension of meaning between solid and gas.', 'start': 3280.818, 'duration': 3.986}, {'end': 3288.79, 'text': "Other ones coming out as about one because they're not the dimension of meaning.", 'start': 3285.184, 'duration': 3.606}, {'end': 3298.767, 'text': 'And so it seems like what we want is we want to have ratio of co-occurrence probabilities become linear in our space,', 'start': 3290.185, 'duration': 8.582}, {'end': 3300.087, 'text': "and then we're in a good business.", 'start': 3298.767, 'duration': 1.32}, {'end': 3303.288, 'text': "And so that's what we want to set about doing.", 'start': 3300.687, 'duration': 2.601}, {'end': 3305.109, 'text': 'Well, how can you do that?', 'start': 3303.328, 'duration': 1.781}, {'end': 3315.872, 'text': 'Well, the way you can do that is by if you can make the dot products equal to the log of the co-occurrence probability.', 'start': 3305.569, 'duration': 10.303}, {'end': 3325.615, 'text': 'then immediately you get the fact that when you have a vector difference, it turns into a ratio of the co-occurrence probabilities.', 'start': 3315.872, 'duration': 9.743}, {'end': 3334.06, 'text': 'And so essentially, the whole of the model is that we want to have dot products be logs of co-occurrence probabilities.', 'start': 3327.338, 'duration': 6.722}, {'end': 3336.88, 'text': "And so that's what we do.", 'start': 3334.98, 'duration': 1.9}, {'end': 3343.042, 'text': "So here is our objective function here, and it's made to look a little bit more complicated.", 'start': 3337, 'duration': 6.042}, {'end': 3360.411, 'text': "But essentially we've got the squared loss here and then we're wanting to say the dot product should be as similar as possible to the log of the co-occurrence probability, and so there'll be loss to the extent that they're not the same.", 'start': 3343.442, 'duration': 16.969}, {'end': 3367.24, 'text': 'But we kind of complexify it a little by putting in bias terms for both of
the two words,', 'start': 3360.892, 'duration': 6.348}, {'end': 3373.327, 'text': "because maybe the word is just overall common and likes to co-occur things or uncommon, or doesn't?", 'start': 3367.24, 'duration': 6.087}, {'end': 3383.356, 'text': 'And then we do one more little trick because everyone does tricks to make the performance better is that we also use this f function in front,', 'start': 3373.827, 'duration': 9.529}, {'end': 3389.662, 'text': "so that we're sort of capping the effect that very common word pairs can have on the performance of the system.", 'start': 3383.356, 'duration': 6.306}, {'end': 3400.569, 'text': 'Okay. And so that gave us the glove model of word vectors and, Theoretically, the interest of this was you know,', 'start': 3389.682, 'duration': 10.887}], 'summary': 'Ratio of co-occurrence probabilities determines dimension of meaning in word vectors.', 'duration': 158.874, 'max_score': 3241.695, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM3241695.jpg'}, {'end': 3512.129, 'src': 'embed', 'start': 3482.856, 'weight': 0, 'content': [{'end': 3484.838, 'text': 'Are you guessing the right part of speech??', 'start': 3482.856, 'duration': 1.982}, {'end': 3487.22, 'text': 'Are you putting synonyms close together?', 'start': 3485.238, 'duration': 1.982}, {'end': 3491.964, 'text': "And that's sort of normally very easy to do and fast to compute.", 'start': 3487.64, 'duration': 4.324}, {'end': 3496.245, 'text': "and it's useful to do because it helps us understand the system.", 'start': 3492.384, 'duration': 3.861}, {'end': 3498.425, 'text': 'On the other hand, a lot of the time,', 'start': 3496.825, 'duration': 1.6}, {'end': 3512.129, 'text': "those intrinsic evaluations it's not very clear where whether having done well on that task is really going to help us build the amazing natural language understanding robots that we so ardently desire.", 'start': 3498.425, 'duration': 
13.704}], 'summary': 'NLP analysis raises questions about the effectiveness of certain language tasks.', 'duration': 29.273, 'max_score': 3482.856, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM3482856.jpg'}, {'end': 3582.92, 'src': 'embed', 'start': 3553.126, 'weight': 4, 'content': [{'end': 3555.347, 'text': 'You wanna have stuff that works in real tasks.', 'start': 3553.126, 'duration': 2.221}, {'end': 3560.231, 'text': 'Of course, there are sort of, on the other hand, a lot of things are a lot harder then.', 'start': 3555.708, 'duration': 4.523}, {'end': 3572.437, 'text': "So it's much more work to do such an evaluation and to run different variants of a system, and even when the results are poor or great,", 'start': 3560.251, 'duration': 12.186}, {'end': 3574.358, 'text': "sometimes it's hard to diagnose.", 'start': 3572.437, 'duration': 1.921}, {'end': 3574.898, 'text': 'You know.', 'start': 3574.678, 'duration': 0.22}, {'end': 3578.779, 'text': "if it- if your great new word vectors don't work better in the system,", 'start': 3574.898, 'duration': 3.881}, {'end': 3582.92, 'text': 'you know it might be for sort of some extraneous reason about how the system was built.', 'start': 3578.779, 'duration': 4.141}], 'summary': 'Evaluating system variants for real tasks can be challenging and may require extensive work and diagnosis.', 'duration': 29.794, 'max_score': 3553.126, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM3553126.jpg'}], 'start': 2907.566, 'title': 'GloVe model and linear properties', 'summary': 'Discusses discovering linear properties in vector space, emergence of semantic vectors as linear components, invention of a vector space for analogies, and development of the GloVe model unifying count and prediction methods for word vectors.', 'chapters': [{'end': 3097.573, 'start': 2907.566, 'title': 'Discovering linear properties in vector space',
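The GloVe objective described in the transcript — a squared loss pushing each dot product plus two bias terms toward the log co-occurrence count, with a weighting function f that caps very common pairs — can be sketched directly. The constants x_max = 100 and alpha = 3/4 are the commonly cited GloVe defaults, assumed here rather than stated in the lecture:

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    # Caps the influence of very common co-occurrence pairs on the loss.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    # Sum of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over nonzero counts.
    loss = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            if X[i, j] > 0:
                diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
                loss += f_weight(X[i, j]) * diff ** 2
    return loss

rng = np.random.default_rng(0)
X = np.array([[0.0, 4.0], [4.0, 0.0]])          # toy co-occurrence counts
W, W_ctx = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
b, b_ctx = np.zeros(2), np.zeros(2)
print(glove_loss(W, W_ctx, b, b_ctx, X))        # non-negative scalar
```

Because the loss drives dot products toward logs of co-occurrence counts, vector differences approximate log ratios of co-occurrence probabilities, which is exactly the linearity property the transcript motivates.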
'summary': 'Discusses the discovery of linear properties in vector space, highlighting the emergence of semantic vectors as linear components and the invention of a vector space that performs well in analogies, leading to the development of the glove model.', 'duration': 190.007, 'highlights': ['The discovery of semantic vectors as linear components in the vector space, such as the direction from a verb to the doer of the verb, contributes to understanding the linearity property of the space.', 'The invention of a vector space with linearity property led to performing well in analogies tests, demonstrating the effectiveness of carefully constructed spaces in generating good word vector spaces.', 'The observation of linear properties in the vector space was a starting point for the development of the GloVe model, which offered an alternative approach to word vector space construction based on efficient use of global statistics.']}, {'end': 3595.724, 'start': 3098.213, 'title': 'Glove model: unifying count and prediction methods', 'summary': "Delves into the development and application of the glove model, which unifies count and prediction methods for word vectors, demonstrating how ratios of co-occurrence probabilities encode meaning components and how the model's objective function aims to make dot products similar to log of co-occurrence probabilities, leading to good word vectors.", 'duration': 497.511, 'highlights': ['The GloVe model unifies count and prediction methods for word vectors, using ratios of co-occurrence probabilities to encode meaning components, and its objective function aims to make dot products similar to log of co-occurrence probabilities, leading to good word vectors.', "The importance of intrinsic and extrinsic evaluations for word vector models is emphasized, where intrinsic evaluations assess the system's performance on specific tasks, while extrinsic evaluations measure the impact on real-world applications.", "Intrinsic evaluations 
provide insights into the system's performance, while extrinsic evaluations measure the impact on real-world applications, such as web search or question answering systems."]}], 'duration': 688.158, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM2907566.jpg', 'highlights': ['The GloVe model unifies count and prediction methods for word vectors, using ratios of co-occurrence probabilities to encode meaning components, and its objective function aims to make dot products similar to log of co-occurrence probabilities, leading to good word vectors.', 'The observation of linear properties in the vector space was a starting point for the development of the GloVe model, which offered an alternative approach to word vector space construction based on efficient use of global statistics.', 'The invention of a vector space with linearity property led to performing well in analogies tests, demonstrating the effectiveness of carefully constructed spaces in generating good word vector spaces.', 'The discovery of semantic vectors as linear components in the vector space, such as the direction from a verb to the doer of the verb, contributes to understanding the linearity property of the space.', "The importance of intrinsic and extrinsic evaluations for word vector models is emphasized, where intrinsic evaluations assess the system's performance on specific tasks, while extrinsic evaluations measure the impact on real-world applications.", "Intrinsic evaluations provide insights into the system's performance, while extrinsic evaluations measure the impact on real-world applications, such as web search or question answering systems."]}, {'end': 4427.501, 'segs': [{'end': 3671.279, 'src': 'embed', 'start': 3639.592, 'weight': 6, 'content': [{'end': 3641.424, 'text': 'Um, Okay.', 'start': 3639.592, 'duration': 1.832}, {'end': 3645.105, 'text': 'But nevertheless, um, so this is something that you can evaluate.', 'start': 
3641.744, 'duration': 3.361}, {'end': 3648.446, 'text': 'Here are now some GloVe visualizations.', 'start': 3645.345, 'duration': 3.101}, {'end': 3658.07, 'text': 'And so these GloVe visualizations show exactly the same kind of linearity property that Doug Rohde had discovered, which means that analogies work,', 'start': 3648.506, 'duration': 9.564}, {'end': 3663.111, 'text': 'sort of by construction, because our vector space wanted to make meaning components um linear.', 'start': 3658.07, 'duration': 5.041}, {'end': 3671.279, 'text': 'So this is then, um, showing a gender display, This is showing one between companies and their CEOs.', 'start': 3663.451, 'duration': 7.828}], 'summary': 'Glove visualizations exhibit linearity properties, enabling analogies and gender displays.', 'duration': 31.687, 'max_score': 3639.592, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM3639592.jpg'}, {'end': 3781.403, 'src': 'embed', 'start': 3692.97, 'weight': 0, 'content': [{'end': 3695.652, 'text': "It's sort of a-, it's a bit of a weirdo dataset,", 'start': 3692.97, 'duration': 2.682}, {'end': 3701.497, 'text': 'because it sort of tests a few random different things which may have been things that his system worked well on.', 'start': 3695.652, 'duration': 5.845}, {'end': 3712.786, 'text': 'Um, but you know, it tests countries and capitals, um country, um, you know cities and states, countries and currencies.', 'start': 3702.017, 'duration': 10.769}, {'end': 3720.212, 'text': 'So there are a bunch of semantic things that tests and then there are some um syntactic things at tests.', 'start': 3712.806, 'duration': 7.406}, {'end': 3724.476, 'text': 'so bad, worst, fast, fastest um for superlatives.', 'start': 3720.212, 'duration': 4.264}, {'end': 3727.499, 'text': 'But you know even some of the ones I was showing before.', 'start': 3724.576, 'duration': 2.923}, {'end': 3729.401, 'text': "you know there's no, there's no.", 
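The analogy evaluation being described (a : b :: c : ?) is standardly scored by vector arithmetic plus a cosine-similarity nearest-neighbor search, excluding the three query words. The tiny hand-built 2-d vectors below are purely illustrative, not real GloVe vectors:

```python
import numpy as np

def analogy(a, b, c, vectors, exclude=True):
    # Solve a : b :: c : ? by nearest cosine neighbor to (b - a + c).
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if exclude and word in (a, b, c):
            continue  # the query words themselves are excluded by convention
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Hand-crafted vectors where one axis behaves like a "gender" direction.
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}
print(analogy("man", "woman", "king", vecs))  # → queen
```

In a space where meaning components are linear, the offset woman − man carries the same direction as queen − king, which is why the addition-and-nearest-neighbor recipe recovers the answer.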
'start': 3727.499, 'duration': 1.902}, {'end': 3734.467, 'text': 'Obama is to Clinton, um, kind of ones that are actually in this evaluation set.', 'start': 3729.401, 'duration': 5.066}, {'end': 3740.782, 'text': "Um, Here's a big table of results that comes from our GloVe paper.", 'start': 3735.087, 'duration': 5.695}, {'end': 3746.684, 'text': 'So not surprisingly, the GloVe paper performed best in this evaluation because it was our paper.', 'start': 3740.842, 'duration': 5.842}, {'end': 3758.927, 'text': 'But I mean, perhaps the thing to start to notice is, yeah, if you just do a plain SVD on counts, that works', 'start': 3749.164, 'duration': 9.763}, {'end': 3764.069, 'text': 'abominably badly for these, um, analogy tasks.', 'start': 3760.026, 'duration': 4.043}, {'end': 3773.957, 'text': 'But you know kind of, as Doug Rohde showed, if you start then doing manipulations of the count matrix before you do an SVD,', 'start': 3764.189, 'duration': 9.768}, {'end': 3781.403, 'text': 'you can actually start to produce an SVD-based system that actually performs quite well on these tasks.', 'start': 3773.957, 'duration': 7.446}], 'summary': 'Dataset evaluates system on semantic and syntactic tasks. GloVe paper performed best in evaluation. SVD on counts works badly, but manipulations improve performance.', 'duration': 88.433, 'max_score': 3692.97, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM3692970.jpg'}, {'end': 3869.272, 'src': 'embed', 'start': 3825.665, 'weight': 1, 'content': [{'end': 3829.849, 'text': 'So this is a graph of dimensionality and what the performance is.', 'start': 3825.665, 'duration': 4.184}, {'end': 3837.531, 'text': "So for the three lines, the green one's semantic, the blue one's the syntactic analogies, and so red's the overall score.", 'start': 3830.209, 'duration': 7.322}, {'end': 3846.173, 'text': 'So sort of what you see is up to dimensionality 300, things are clearly increasing quite a bit and then it gets fairly flat,', 'start': 3837.831, 'duration': 8.342}, {'end': 3850.654, 'text': 'which is precisely why you find a lot of word vectors, um, that are of dimensionality 300.', 'start': 3846.173, 'duration': 4.481}, {'end': 3855.277, 'text': "Um, This one's showing what window size.", 'start': 3850.654, 'duration': 4.623}, {'end': 3860.403, 'text': 'So this is sort of what we talked about symmetric on both sides window size.', 'start': 3855.617, 'duration': 4.786}, {'end': 3863.604, 'text': 'And as it goes from 2, 4, 6, 8, 10.', 'start': 3860.843, 'duration': 2.761}, {'end': 3869.272, 'text': 'And sort of what you see is if you use a very small window like 2, that actually works.', 'start': 3863.606, 'duration': 5.666}], 'summary': 'Word vector performance peaks at dimensionality 300, small window size works well.', 'duration': 43.607, 'max_score': 3825.665, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM3825665.jpg'}, {'end': 4149.287, 'src': 'embed', 'start': 4122.197, 'weight': 4, 'content': [{'end': 4131.84, 'text': "But there's sort of something else that's interesting in this graph, which is, um, that using Wikipedia works frequently well,"
'start': 4122.197, 'duration': 9.643}, {'end': 4143.323, 'text': 'so that you actually find that 1.6 billion tokens of Wikipedia works better than 4.3 billion tokens of Newswire, newspaper article data.', 'start': 4131.84, 'duration': 11.483}, {'end': 4149.287, 'text': 'And so I think that sort of actually makes sense, which is well you know,', 'start': 4143.344, 'duration': 5.943}], 'summary': 'Using 1.6 billion tokens of wikipedia works better than 4.3 billion tokens of newswire data.', 'duration': 27.09, 'max_score': 4122.197, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM4122197.jpg'}, {'end': 4220.292, 'src': 'embed', 'start': 4194.295, 'weight': 7, 'content': [{'end': 4203.985, 'text': 'I think actually one of the reasons why they work so well is that the original word-to-vec vectors that Google distributes are built only on Google News data,', 'start': 4194.295, 'duration': 9.69}, {'end': 4207.57, 'text': 'where ours sort of have this Wikipedia data inside them.', 'start': 4203.985, 'duration': 3.585}, {'end': 4211.324, 'text': 'Okay Um, rushing ahead.', 'start': 4208.231, 'duration': 3.093}, {'end': 4220.292, 'text': "Um, yeah, so that there's all of the work on analogy, but the other more basic evaluation is this one of capturing similarity judgments.", 'start': 4211.744, 'duration': 8.548}], 'summary': 'Word2vec vectors from google use only google news data, while ours include wikipedia data, enhancing similarity capture.', 'duration': 25.997, 'max_score': 4194.295, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM4194295.jpg'}], 'start': 3597.768, 'title': 'Word vector evaluations and performance insights', 'summary': 'Discusses intrinsic word vector evaluations, linearity properties, and performance in various tasks, along with insights on the impact of dimensions and training data size, including key findings on count matrix manipulations 
and the usefulness of Wikipedia data.', 'chapters': [{'end': 3746.684, 'start': 3597.768, 'title': 'Word vector evaluations', 'summary': 'The chapter discusses intrinsic word vector evaluations, including analogies and GloVe visualizations, showcasing their linearity properties and performance in various tasks, such as gender display and syntactic facts.', 'duration': 148.916, 'highlights': ['GloVe visualizations show linearity properties discovered by Doug Rohde and demonstrate analogies work by construction.', 'Tomas Mikolov built a dataset with a variety of analogies, including countries and capitals, cities and states, countries and currencies, as well as syntactic tests for superlatives.', 'GloVe paper performed best in the evaluation of word vector tasks.']}, {'end': 4427.501, 'start': 3749.164, 'title': 'Word vector analysis and performance insights', 'summary': 'The chapter discusses the impact of dimensions and training data size on performance of word vectors in semantic and syntactic analogies, with key findings including the significance of manipulations of the count matrix before SVD and the differential usefulness of Wikipedia data in making word vectors.', 'duration': 678.337, 'highlights': ['The significance of manipulations of the count matrix before SVD in producing an SVD-based system that performs well on analogy tasks, as shown by Doug Rohde.', 'The impact of dimensions and training data size on the performance of word vectors in semantic and syntactic analogies, with larger dimensionality and training on 42 billion words of text resulting in better performance.', 'The optimization of performance at dimensionality 300 and the impact of window size on syntactic and semantic prediction.', 'The differential usefulness of Wikipedia data in making word vectors, with 1.6 billion tokens of Wikipedia working better than 4.3 billion tokens of Newswire, and the explanation for this based on the nature of the text.', "The use of similarity judgments to evaluate word vectors, as
well as the issue of word ambiguity and multiple meanings, illustrated through the example of the word 'pike'."]}], 'duration': 829.733, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM3597768.jpg', 'highlights': ['GloVe paper performed best in the evaluation of word vector tasks.', 'The impact of dimensions and training data size on the performance of word vectors in semantic and syntactic analogies, with larger dimensionality and training on 42 billion words of text resulting in better performance.', 'The significance of manipulations of the count matrix before SVD in producing an SVD-based system that performs well on analogy tasks, as shown by Doug Rohde.', 'The optimization of performance at dimensionality 300 and the impact of window size on syntactic and semantic prediction.', 'The differential usefulness of Wikipedia data in making word vectors, with 1.6 billion tokens of Wikipedia working better than 4.3 billion tokens of Newswire, and the explanation for this based on the nature of the text.', 'Tomas Mikolov built a dataset with a variety of analogies, including countries and capitals, cities and states, countries and currencies, as well as syntactic tests for superlatives.', 'GloVe visualizations show linearity properties discovered by Doug Rohde and demonstrate analogies work by construction.', "The use of similarity judgments to evaluate word vectors, as well as the issue of word ambiguity and multiple meanings, illustrated through the example of the word 'pike'."]}, {'end': 4833.489, 'segs': [{'end': 4505.142, 'src': 'embed', 'start': 4480.29, 'weight': 3, 'content': [{'end': 4491.815, 'text': "let's cluster all the contexts in which it occurs and then we'll see if there seem to be multiple clear clusters by some criterion for that word.", 'start': 4480.29, 'duration': 11.525}, {'end': 4496.177, 'text': "And if so, we'll just sort of, split the word into pseudo-words.", 'start': 4492.135,
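The procedure being described — cluster all the context windows in which a word occurs, then treat each cluster as a pseudo-word sense (jaguar_1, ..., jaguar_k) — can be sketched with a small k-means. The farthest-point seeding and the toy Gaussian "contexts" are my assumptions for the sketch, not details from the lecture:

```python
import numpy as np

def split_into_senses(context_vecs, k, iters=20):
    # Crude k-means over context vectors: each cluster becomes one
    # pseudo-word sense. Seed greedily with far-apart points, then iterate.
    centers = [context_vecs[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(context_vecs - c, axis=1) for c in centers],
                   axis=0)
        centers.append(context_vecs[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        d = np.linalg.norm(context_vecs[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = context_vecs[labels == c].mean(axis=0)
    return labels, centers

# Two well-separated toy "senses" of a word's contexts.
rng = np.random.default_rng(1)
contexts = np.vstack([rng.normal(0, 0.1, (10, 5)),
                      rng.normal(5, 0.1, (10, 5))])
labels, centers = split_into_senses(contexts, k=2)
print(labels)
```

After the split, an ordinary word-vector algorithm can be run over the relabeled corpus, giving one vector per pseudo-word; the known weakness, as the transcript notes, is that real sense divisions are often unclear and overlapping.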
'duration': 4.042}, {'end': 4503.281, 'text': 'So if it seems like that there are five clusters, um, for the word, the example I meant to use here is Jaguar.', 'start': 4496.237, 'duration': 7.044}, {'end': 4505.142, 'text': 'five clusters for the word Jaguar.', 'start': 4503.281, 'duration': 1.861}], 'summary': "Cluster contexts to identify multiple clear clusters for the word 'jaguar', resulting in 5 clusters.", 'duration': 24.852, 'max_score': 4480.29, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM4480290.jpg'}, {'end': 4654.493, 'src': 'embed', 'start': 4622.573, 'weight': 2, 'content': [{'end': 4626.014, 'text': 'vectors of the different sentences- different senses.', 'start': 4622.573, 'duration': 3.441}, {'end': 4629.775, 'text': 'um, where superposition- superposition just means a weighted average.', 'start': 4626.014, 'duration': 3.761}, {'end': 4641.146, 'text': 'Um, um, So that, effectively, my meaning of pike is sort of a weighted average of the vectors for the different senses of pike,', 'start': 4630.295, 'duration': 10.851}, {'end': 4644.308, 'text': 'and the components are just weighted by their frequency.', 'start': 4641.146, 'duration': 3.162}, {'end': 4654.493, 'text': "Um, so that part maybe is perhaps not too surprising, but the part that's really surprising is well, if we're just averaging these word vectors,", 'start': 4644.328, 'duration': 10.165}], 'summary': "Analyzing different senses of 'pike' using weighted averages of word vectors.", 'duration': 31.92, 'max_score': 4622.573, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM4622573.jpg'}, {'end': 4720.021, 'src': 'embed', 'start': 4693.743, 'weight': 1, 'content': [{'end': 4700.788, 'text': "And so it turns out that there's this whole literature on um sparse coding, compressed sensing, um,", 'start': 4693.743, 'duration': 7.045}, {'end': 4703.59, 'text': 'some of which is 
actually done by people in the stats department here,', 'start': 4700.788, 'duration': 2.802}, {'end': 4712.215, 'text': 'um which shows that in these cases where you have these sort of sparse um codes in these high-dimensional spaces,', 'start': 4703.59, 'duration': 8.625}, {'end': 4720.021, 'text': "you can actually commonly reconstruct out the components of a superposition, even though all you've done is sort of done this weighted average.", 'start': 4712.215, 'duration': 7.806}], 'summary': 'Sparse coding and compressed sensing literature shows components can be reconstructed in high-dimensional spaces.', 'duration': 26.278, 'max_score': 4693.743, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM4693743.jpg'}, {'end': 4833.489, 'src': 'embed', 'start': 4815.618, 'weight': 0, 'content': [{'end': 4824.163, 'text': 'And so the word vectors were just sort of this useful source that you could throw into any NLP system that you built and your numbers went up.', 'start': 4815.618, 'duration': 8.545}, {'end': 4831.968, 'text': 'So they were just a very effective technology which actually did work in basically any extrinsic task you tried it on.', 'start': 4824.263, 'duration': 7.705}, {'end': 4833.489, 'text': 'Okay Thanks a lot.', 'start': 4832.528, 'duration': 0.961}], 'summary': 'Word vectors were an effective technology, improving numbers in any nlp system.', 'duration': 17.871, 'max_score': 4815.618, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM4815618.jpg'}], 'start': 4427.521, 'title': 'Developing multi-sense word models', 'summary': 'Discusses the development of a model with multiple senses for words, using crude context clustering and pseudo-words to create word vectors, highlighting limitations. 
It also explores how word vectors with multiple senses and sparse coding in high-dimensional spaces improve NLP tasks.', 'chapters': [{'end': 4570.042, 'start': 4427.521, 'title': 'Multi-sense word model', 'summary': 'The chapter discusses the development of a model with multiple senses for words, using a crude method of clustering contexts and dividing words into pseudo-words, such as Jaguar 1, 2, 3, 4, 5, and running a word vectoring algorithm to represent each sense, which works but has clear limitations.', 'duration': 142.521, 'highlights': ['The model was developed to have multiple senses for a word by clustering contexts and dividing words into pseudo-words, such as Jaguar 1, 2, 3, 4, 5, and running a word vectoring algorithm to represent each sense, which was found to work but has clear limitations.', 'The crude method involved clustering all the contexts in which a common word occurs and then splitting the word into pseudo-words, resulting in representations for each sense of the word, such as Jaguar 1, Jaguar 2, Jaguar 3, 4, 5.', "The divisions between senses are often unclear and overlapping, posing a limitation to the model's effectiveness in capturing distinct word senses."]}, {'end': 4833.489, 'start': 4570.042, 'title': 'Word vectors and their multiple senses', 'summary': 'The chapter explores the concept of word vectors with multiple senses, and discusses how sparse coding in high-dimensional spaces enables the separation of different sense meanings, leading to significant improvements in NLP tasks.', 'duration': 263.447, 'highlights': ['The concept of word vectors with multiple senses is explored, where the meaning of a word is a weighted average of the vectors for its different senses, and the components are weighted by their frequency.', 'Sparse coding in high-dimensional spaces enables the separation of different sense meanings, leading to the extraction of various meanings associated with a word.', 'The use of word vectors significantly improves NLP tasks, with models 
incorporating word representations showing a couple of percent or more increase in performance across various extrinsic tasks.']}], 'duration': 405.968, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kEMJRjEdNzM/pics/kEMJRjEdNzM4427521.jpg', 'highlights': ['The use of word vectors significantly improves NLP tasks, with models showing a couple of percent or more increase in performance', 'Sparse coding in high-dimensional spaces enables the separation of different sense meanings', 'The concept of word vectors with multiple senses is explored, where the meaning of a word is a weighted average of the vectors for its different senses', 'The model was developed to have multiple senses for a word by clustering contexts and dividing words into pseudo-words', 'The crude method involved clustering all the contexts in which a common word occurs and then splitting the word into pseudo-words']}], 'highlights': ['GloVe paper performed best in the evaluation of word vector tasks.', 'The vector space captures meanings in a profound way, with directions representing specific meanings.', 'Using mini-batch optimization provides faster computations due to parallelization, gaining a lot by using a mini-batch of 64 examples, and less noisy estimates of the gradient compared to using just one example.', 'The process involves taking the dot product of two vectors, using a sigmoid function, and aiming for a high probability estimate.', 'The GloVe model unifies count and prediction methods for word vectors, using ratios of co-occurrence probabilities to encode meaning components, and its objective function aims to make dot products similar to log of co-occurrence probabilities, leading to good word vectors.']}
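The "superposition" idea in the transcript (the vector for an ambiguous word like "pike" is a frequency-weighted average of vectors for its individual senses) can be sketched in a few lines of NumPy. All sense vectors and counts below are made up for illustration; they are not trained embeddings from the lecture.

```python
import numpy as np

# Sketch of superposition: the single vector for an ambiguous word is the
# frequency-weighted average of its sense vectors. Hypothetical values only.
rng = np.random.default_rng(0)
dim = 50
sense_vectors = rng.normal(size=(3, dim))      # stand-ins for pike_1, pike_2, pike_3
sense_counts = np.array([120.0, 40.0, 40.0])   # hypothetical corpus frequencies per sense

weights = sense_counts / sense_counts.sum()    # f_i / sum_j f_j
pike = weights @ sense_vectors                 # weighted average, shape (dim,)
```

The sparse-coding result Manning mentions is what makes this average usable: because the sense vectors are close to random directions in a high-dimensional space, the individual components can often be approximately recovered from the single averaged vector.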
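The negative-sampling highlight ("taking the dot product of two vectors, using a sigmoid function, and aiming for a high probability estimate") and the unigram-distribution chapter can be sketched together. The vectors here are random stand-ins, and the counts are invented; the 3/4 power on the unigram counts is the exponent described in the lecture.

```python
import numpy as np

# Sketch of skip-gram with negative sampling for one (center, context) pair:
# push sigmoid(u_o . v_c) toward 1 for the true context word, and
# sigmoid(-u_k . v_c) toward 1 for k sampled negative words.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
dim = 100
v_c = rng.normal(scale=0.1, size=dim)          # center-word vector
u_o = rng.normal(scale=0.1, size=dim)          # true context ("outside") vector
u_neg = rng.normal(scale=0.1, size=(5, dim))   # 5 negative-sample vectors

# Per-pair objective to maximize during training (always negative, since it is
# a sum of log-probabilities).
objective = np.log(sigmoid(u_o @ v_c)) + np.sum(np.log(sigmoid(-(u_neg @ v_c))))

# Negatives are drawn from the unigram distribution raised to the 3/4 power,
# which flattens the gap between frequent and rare words.
unigram_counts = np.array([100.0, 10.0, 1.0])  # hypothetical word counts
p_negative = unigram_counts ** 0.75 / np.sum(unigram_counts ** 0.75)
```

Note how the 3/4 power compresses the ratio between the most and least frequent word from 100:1 to roughly 32:1, so rare words get sampled more often than their raw frequency would suggest.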
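The final highlight summarizes the GloVe objective: make dot products of word vectors similar to the log of co-occurrence counts, under a weighting function. A minimal sketch of that weighted least-squares loss, with toy co-occurrence counts and untrained random vectors (the `x_max = 100`, `alpha = 0.75` settings follow the GloVe paper's reported defaults):

```python
import numpy as np

# Sketch of the GloVe loss: J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
def glove_weight(x, x_max=100.0, alpha=0.75):
    # Weighting f(x): grows as (x / x_max)^alpha, capped at 1 for frequent pairs.
    return np.minimum((np.asarray(x, dtype=float) / x_max) ** alpha, 1.0)

rng = np.random.default_rng(2)
V, dim = 10, 25
X = rng.integers(1, 50, size=(V, V)).astype(float)  # toy co-occurrence counts (all > 0)
W = rng.normal(scale=0.1, size=(V, dim))            # center-word vectors
W_tilde = rng.normal(scale=0.1, size=(V, dim))      # context-word vectors
b = np.zeros(V)                                     # center biases
b_tilde = np.zeros(V)                               # context biases

diff = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - np.log(X)
J = np.sum(glove_weight(X) * diff ** 2)             # scalar loss to minimize
```

In real training the sum runs only over nonzero entries of X (zero counts are skipped entirely); the toy matrix above is dense, so no masking is needed.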