title
Word2Vec (tutorial)
description
In this video, we'll use a Game of Thrones dataset to create word vectors. Then we'll map those word vectors out on a graph and use them to find words related to any word we input. We'll learn how to process a dataset from scratch and cover the word vectorization process and visualization techniques, all in one session.
Code for this video:
https://github.com/llSourcell/word_vectors_game_of_thrones-LIVE
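For orientation, here is a minimal sketch of the pipeline the session walks through (not the exact repo code): read the book text files, clean and tokenize them, and train a gensim Word2Vec model. It assumes gensim 4.x (older releases spell vector_size as size) and NLTK are installed; the data/*.txt pattern and the saved file name are illustrative, not taken from the repo.

import codecs
import glob
import multiprocessing
import re

import nltk
from gensim.models import Word2Vec

nltk.download("punkt")  # pre-trained sentence tokenizer

# Step 1: combine every book into one big UTF-8 corpus string.
corpus_raw = ""
for book_filename in sorted(glob.glob("data/*.txt")):  # hypothetical layout
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()

# Step 2: split the corpus into sentences, then each sentence into a list
# of words, stripping anything that isn't a letter.
sentences = []
for raw in nltk.sent_tokenize(corpus_raw):
    words = re.sub("[^a-zA-Z]", " ", raw).split()
    if words:
        sentences.append(words)

# Step 3: build and train the model, mirroring the hyperparameters
# discussed in the video.
model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of each word vector
    min_count=3,      # ignore words rarer than this threshold
    window=7,         # context window around each target word
    workers=multiprocessing.cpu_count(),  # concurrent training threads
    seed=1,           # deterministic-ish; fully reproducible only with workers=1
)
model.save("thrones2vec.w2v")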
Join us in our Slack channel:
http://wizards.herokuapp.com/
More learning resources:
https://www.tensorflow.org/tutorials/word2vec/
https://radimrehurek.com/gensim/models/word2vec.html
https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
http://sebastianruder.com/word-embeddings-1/
http://natureofcode.com/book/chapter-1-vectors/
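The "related words" lookups in the video come down to cosine similarity between word vectors. Here is a hedged sketch of querying a trained model and computing that measure by hand with NumPy; the model file carries over from the sketch above, and "Stark"/"Winterfell" are just example tokens that must survive the min_count cutoff.

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("thrones2vec.w2v")

# Nearest neighbours ranked by cosine similarity (gensim does the work).
print(model.wv.most_similar("Stark", topn=5))

# The same measure written out: cos(a, b) = a.b / (|a| |b|).
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(model.wv["Stark"], model.wv["Winterfell"]))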
Please subscribe. And like. And comment. That's what keeps me going.
And please support me on Patreon:
https://www.patreon.com/user?u=3191693
Follow me:
Twitter: https://twitter.com/sirajraval
Facebook: https://www.facebook.com/sirajology
Instagram: https://www.instagram.com/sirajraval/
Signup for my newsletter for exciting updates in the field of AI:
https://goo.gl/FZzJ5w
Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
Join my AI community: http://chatgptschool.io/
Sign up for my AI Sports betting Bot, WagerGPT! (500 spots available):
https://www.wagergpt.co
detail
{'title': 'Word2Vec (tutorial)', 'heatmap': [{'end': 1307.568, 'start': 1248.314, 'weight': 0.709}, {'end': 1429.92, 'start': 1331.351, 'weight': 0.763}, {'end': 2234.037, 'start': 1986.657, 'weight': 0.749}, {'end': 2374.664, 'start': 2337.393, 'weight': 0.787}], 'summary': 'The tutorial covers creating word vectors from Game of Thrones data, Python dependencies, combining the book files into one corpus, training a Word2Vec model, and visualizing and analyzing word vectors for semantic similarity using the t-SNE method and cosine similarity, with real-life applications in legal, medical, and scientific fields.', 'chapters': [{'end': 423.327, 'segs': [{'end': 67.157, 'src': 'embed', 'start': 30.965, 'weight': 0, 'content': [{'end': 34.146, 'text': 'If you know Game of Thrones, give it a shout out in the comments.', 'start': 30.965, 'duration': 3.181}, {'end': 35.547, 'text': "Let's see who knows what this is.", 'start': 34.226, 'duration': 1.321}, {'end': 36.647, 'text': "It doesn't matter if you don't.", 'start': 35.747, 'duration': 0.9}, {'end': 40.349, 'text': 'The point is we are learning about the concept of word vectors.', 'start': 36.667, 'duration': 3.682}, {'end': 44.771, 'text': 'And we want to take some books and make them into vectors.', 'start': 40.489, 'duration': 4.282}, {'end': 51.293, 'text': "And once we have these vectors, we're going to do a bunch of really cool stuff with it, all right? So who's in the house? Let me name some names.", 'start': 44.811, 'duration': 6.482}, {'end': 57.536, 'text': 'We got Jake, Akash, Party, Tahir, Teddy, Reki, Party, Ricardo, Angel.', 'start': 51.313, 'duration': 6.223}, {'end': 67.157, 'text': "Got a lot of people in the house, all right? So okay, let's go ahead and do a five minute Q&A, and then we're gonna get into the code.", 'start': 58.888, 'duration': 8.269}], 'summary': 'Learning about word vectors and applying them to books, followed by a Q&A session.', 'duration': 36.192, 'max_score': 30.965, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU30965.jpg'}, {'end': 137.392, 'src': 'embed', 'start': 109.284, 'weight': 1, 'content': [{'end': 112.147, 'text': 'I just picked it because I think the idea is cool.', 'start': 109.284, 'duration': 2.863}, {'end': 112.968, 'text': "I don't have time to..", 'start': 112.187, 'duration': 0.781}, {'end': 116.551, 'text': 'read anything or watch TV shows anymore.', 'start': 113.587, 'duration': 2.964}, {'end': 117.732, 'text': "I'm just focused on content.", 'start': 116.591, 'duration': 1.141}, {'end': 120.956, 'text': "Any maths? We are going to do, yes, we're going to do some maths.", 'start': 118.673, 'duration': 2.283}, {'end': 126.022, 'text': "We're going to use the cosine similarity as a measure of distance between word vectors.", 'start': 120.976, 'duration': 5.046}, {'end': 127.023, 'text': "It's a corpus.", 'start': 126.363, 'duration': 0.66}, {'end': 130.247, 'text': "Yes, it's five different books, but we're going to treat it as one big corpus.", 'start': 127.063, 'duration': 3.184}, {'end': 131.789, 'text': 'Please say my name, Piyush.', 'start': 130.708, 'duration': 1.081}, {'end': 137.392, 'text': 'OK Okay, doing the deep learning foundation and feeling utterly lost.', 'start': 132.31, 'duration': 5.082}], 'summary': 'Focused on content, using cosine similarity for word vectors in a corpus.', 'duration': 28.108, 'max_score': 109.284, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU109284.jpg'}, {'end': 185.006, 'src': 'embed', 'start': 155.749, 'weight': 2, 'content': [{'end': 158.63, 'text': "we're going to go so deep into the math in the next video.", 'start': 155.749, 'duration': 2.881}, {'end': 160.851, 'text': 'okay, so the next weekly video, get ready.', 'start': 158.63, 'duration': 2.221}, {'end': 163.092, 'text': 'we are going to really dive into the math.', 'start': 160.851, 'duration': 2.241}, {'end': 164.013, 'text': 'please say my name, Colin.', 'start': 163.092, 'duration': 0.921}, {'end': 164.753, 'text': "what's a word vector?", 'start': 164.013, 'duration': 0.74}, {'end': 166.094, 'text': "i'll explain that in a second.", 'start': 164.753, 'duration': 1.341}, {'end': 167.434, 'text': "clean the camera, there's moisture.", 'start': 166.094, 'duration': 1.34}, {'end': 173.838, 'text': "okay, okay, um, Video isn't clear, bro.", 'start': 167.434, 'duration': 6.404}, {'end': 175.879, 'text': "OK, can't help that right now.", 'start': 173.938, 'duration': 1.941}, {'end': 177.861, 'text': 'Is OpenAI worth to explore? Yes.', 'start': 176.26, 'duration': 1.601}, {'end': 179.882, 'text': 'One minute rap before this? I love your rap.', 'start': 178.341, 'duration': 1.541}, {'end': 180.523, 'text': "I'll do that.", 'start': 180.003, 'duration': 0.52}, {'end': 182.364, 'text': "Yeah, I'll do that.", 'start': 181.744, 'duration': 0.62}, {'end': 185.006, 'text': 'Let me answer some more questions.', 'start': 184.186, 'duration': 0.82}], 'summary': 'Upcoming video to deeply explore math; OpenAI worth exploring, confirmed.', 'duration': 29.257, 'max_score': 155.749, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU155749.jpg'}, {'end': 233.147, 'src': 'embed', 'start': 202.078, 'weight': 5, 'content': [{'end': 204.339, 'text': "So you'd want to use probably JavaScript for that.", 'start': 202.078, 'duration': 2.261}, {'end': 205.359, 'text': 'JavaScript would be easy.', 'start': 204.379, 'duration': 0.98}, {'end': 210.122, 'text': "ConvNetJS, Andrej Karpathy's library, would be great for that.", 'start': 206.04, 'duration': 4.082}, {'end': 211.383, 'text': 'The cam looks dirty.', 'start': 210.463, 'duration': 0.92}, {'end': 215.165, 'text': 'Hey man, I was just at the beach recording some cool stuff.', 'start': 211.683, 'duration': 3.482}, {'end': 219.276, 'text': "using word vectors, that's going to be possible.", 'start': 217.314, 'duration': 1.962}, {'end': 220.757, 'text': 'Are you going to use TensorFlow for this?', 'start': 219.596, 'duration': 1.161}, {'end': 227.562, 'text': "No, we're going to use Word2Vec and we're going to use a bunch of other, smaller libraries, all right?", 'start': 220.877, 'duration': 6.685}, {'end': 233.147, 'text': "Okay, so, one more question, and then we're going to get started.", 'start': 229.164, 'duration': 3.983}], 'summary': 'Recommends JavaScript and ConvNetJS for browser work; the session itself uses Word2Vec and smaller libraries.', 'duration': 31.069, 'max_score': 202.078, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU202078.jpg'}, {'end': 274.924, 'src': 'embed', 'start': 250.445, 'weight': 3, 'content': [{'end': 263.233, 'text': "It helps if the words are relevant to the problem we're trying to solve, if they're relevant to the story, which is Game of Thrones in this case.", 'start': 250.445, 'duration': 12.788}, {'end': 265.255, 'text': "Okay, so that's it for the questions.", 'start': 263.453, 'duration': 1.802}, {'end': 266.817, 'text': "Let's get started with this.", 'start': 265.876, 'duration': 0.941}, {'end': 268.138, 'text': "I'm going to start screen sharing.", 'start': 266.897, 'duration': 1.241}, {'end': 270.02, 'text': "It's going to be an IPython notebook.", 'start': 268.438, 'duration': 1.582}, {'end': 272.742, 'text': "And then we're going to, that's right, a rap.", 'start': 270.88, 'duration': 1.862}, {'end': 274.924, 'text': "Let's do a little rap for a second.", 'start': 273.503, 'duration': 1.421}], 'summary': 'Discussion on the relevance of words to the problem, a Game of Thrones reference, an IPython notebook, and a quick rap.', 'duration': 24.479, 'max_score': 250.445, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU250445.jpg'}, {'end': 436.65, 'src': 'embed', 'start': 404.742, 'weight': 4, 'content': [{'end': 406.905, 'text': 'They are these huge text files for the books.', 'start': 404.742, 'duration': 2.163}, {'end': 411.197, 'text': "That's all they are, okay? I just took five of them.", 'start': 407.025, 'duration': 4.172}, {'end': 412.678, 'text': 'I downloaded them from Pirate Bay.', 'start': 411.217, 'duration': 1.461}, {'end': 413.679, 'text': 'No regrets.', 'start': 413.058, 'duration': 0.621}, {'end': 418.423, 'text': "You know what I'm saying? So that's what this is, and that's it.", 'start': 414.6, 'duration': 3.823}, {'end': 420.425, 'text': "We've just got five books in the series.", 'start': 418.603, 'duration': 1.822}, {'end': 423.327, 'text': "We're going to take all these books, and we're going to create word vectors from them.", 'start': 420.705, 'duration': 2.622}, {'end': 429.313, 'text': "We're going to treat it as one big corpus, okay? So that's what we're going to do.", 'start': 423.347, 'duration': 5.966}, {'end': 436.65, 'text': "So let's just go dive right into this baby, all right? So the first thing we want to do is we want to import our dependencies.", 'start': 430.342, 'duration': 6.308}], 'summary': 'The transcript discusses downloading the five books from Pirate Bay and creating word vectors from them as one big corpus.', 'duration': 31.908, 'max_score': 404.742, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU404742.jpg'}], 'start': 4.757, 'title': 'Creating word vectors from Game of Thrones data', 'summary': 'Involves creating word vectors using a Word2Vec model from the Game of Thrones books, emphasizing math, cosine similarity, JavaScript, ConvNetJS, and analyzing semantic similarity.', 'chapters': [{'end': 202.057, 'start': 4.757, 'title': 'Creating word vectors from Game of Thrones books', 'summary': 'Involves a live session where word vectors are created using the Word2Vec model from the Game of Thrones book series, with a focus on math and cosine similarity as a measure of distance between word vectors.', 'duration': 197.3, 'highlights': ['Word vectors are created using the Word2Vec model from a series of Game of Thrones books. The session involves creating word vectors from the Game of Thrones books, showcasing the practical application of the Word2Vec model.', 'Math and cosine similarity are utilized as a measure of distance between word vectors. The use of cosine similarity as a measure of distance between word vectors is emphasized, highlighting the incorporation of mathematical concepts in the session.', "Focus on deep learning and math in the upcoming videos. The speaker acknowledges the need to delve deeply into math and deep learning in the next weekly video, addressing the audience's feedback and concerns."]}, {'end': 423.327, 'start': 202.078, 'title': 'Creating word vectors from Game of Thrones data', 'summary': 'Discusses using JavaScript, ConvNetJS, and Word2Vec to create word vectors from the Game of Thrones dataset, emphasizing the importance of relevant vocabulary and the intention to analyze semantic similarity.', 'duration': 221.249, 'highlights': ['The importance of using relevant words for the problem is stressed, with a focus on the Game of Thrones story.', 'The speaker plans to create word vectors from the five books of the Game of Thrones series, obtained from Pirate Bay.', 'The tools JavaScript, ConvNetJS, and Word2Vec are recommended for creating word vectors.']}], 'duration': 418.57, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU4757.jpg', 'highlights': ['Word vectors are created using the Word2Vec model from a series of Game of Thrones books, showcasing practical application.', 'Math and cosine similarity are utilized as a measure of distance between word vectors, emphasizing the incorporation of mathematical concepts.', "The speaker acknowledges the need to delve deeply into math and deep learning in the next weekly video, addressing the audience's feedback and concerns.", 'The importance of using relevant words for the problem is stressed, with a focus on the Game of Thrones story.', 'The speaker plans to create word vectors from the five books of the Game of Thrones series, obtained from Pirate Bay.', 'The tools JavaScript, ConvNetJS, and Word2Vec are recommended for creating word vectors.']}, {'end': 1257.358, 'segs': [{'end': 464.613, 'src': 'embed', 'start': 423.347, 'weight': 0, 'content': [{'end': 429.313, 'text': "We're going to treat it as one big corpus, okay? So that's what we're going to do.", 'start': 423.347, 'duration': 5.966}, {'end': 436.65, 'text': "So let's just go dive right into this baby, all right? So the first thing we want to do is we want to import our dependencies.", 'start': 430.342, 'duration': 6.308}, {'end': 438.993, 'text': "Now, we've actually got a lot of dependencies for this.", 'start': 436.71, 'duration': 2.283}, {'end': 441.015, 'text': "So I'm going to explain every single one.", 'start': 439.013, 'duration': 2.002}, {'end': 444.079, 'text': 'So the first one we want to do is import future.', 'start': 441.035, 'duration': 3.044}, {'end': 445.741, 'text': 'And why do we want to import future?', 'start': 444.339, 'duration': 1.402}, {'end': 447.963, 'text': 'Can anyone tell me why in the comments?', 'start': 445.761, 'duration': 2.202}, {'end': 455.448, 'text': "As I type this, the reason we want to import future is because it's the missing link between Python 2 and Python 3..", 'start': 448.644, 'duration': 6.804}, {'end': 457.95, 'text': 'It allows us to use the syntax from both.', 'start': 455.448, 'duration': 2.502}, {'end': 460.511, 'text': "It's kind of like a bridge between the two languages.", 'start': 458.25, 'duration': 2.261}, {'end': 464.613, 'text': "And we're going to import three functions that we're going to use for this, okay? 
So that's the first step.", 'start': 460.531, 'duration': 4.082}], 'summary': 'Importing dependencies including future for bridging python 2 and 3.', 'duration': 41.266, 'max_score': 423.347, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU423347.jpg'}, {'end': 521.357, 'src': 'embed', 'start': 493.092, 'weight': 3, 'content': [{'end': 501.668, 'text': "It's a way of like, quickly and efficiently searching through a large text or number database for what you need.", 'start': 493.092, 'duration': 8.576}, {'end': 503.469, 'text': 'The next one is for logging.', 'start': 502.428, 'duration': 1.041}, {'end': 505.51, 'text': "So actually, we don't need to log.", 'start': 503.529, 'duration': 1.981}, {'end': 509.653, 'text': "Now, I actually haven't talked about concurrency before.", 'start': 506.611, 'duration': 3.042}, {'end': 510.713, 'text': 'So this is going to be interesting.', 'start': 509.673, 'duration': 1.04}, {'end': 514.155, 'text': "We're going to import this multiprocessing library to perform concurrency.", 'start': 510.733, 'duration': 3.422}, {'end': 521.357, 'text': "And if you don't know, concurrency is a way of running multiple threads and having each thread run a different process.", 'start': 514.195, 'duration': 7.162}], 'summary': 'Efficiently search large databases; implement concurrency with multiprocessing library.', 'duration': 28.265, 'max_score': 493.092, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU493092.jpg'}, {'end': 621.145, 'src': 'embed', 'start': 592.016, 'weight': 4, 'content': [{'end': 594.059, 'text': 'okay, nltk is awesome.', 'start': 592.016, 'duration': 2.043}, {'end': 595.68, 'text': 'it is so easy to use.', 'start': 594.059, 'duration': 1.621}, {'end': 596.681, 'text': 'let me zoom in on this thing.', 'start': 595.68, 'duration': 1.001}, {'end': 601.646, 'text': 'okay, Literally, it can tokenize sentences in single lines of code.', 'start': 596.681, 'duration': 4.965}, {'end': 607.872, 'text': "So if you have a sentence like at 8 o'clock on Thursday morning, Arthur didn't feel very good, you feed that to NLTK and boom,", 'start': 601.686, 'duration': 6.186}, {'end': 610.055, 'text': "it'll give you the tokens for each word.", 'start': 607.872, 'duration': 2.183}, {'end': 611.015, 'text': 'Why is this useful??', 'start': 610.155, 'duration': 0.86}, {'end': 616.12, 'text': 'Well, you can have part of speech tagging, POS tagging, which means like oh, is this a noun??', 'start': 611.296, 'duration': 4.824}, {'end': 616.861, 'text': 'Is this a verb??', 'start': 616.221, 'duration': 0.64}, {'end': 617.562, 'text': 'Is this a CD??', 'start': 616.941, 'duration': 0.621}, {'end': 619.183, 'text': 'How does it know these things??', 'start': 617.942, 'duration': 1.241}, {'end': 621.145, 'text': 'Because it has a pre-trained.', 'start': 619.263, 'duration': 1.882}], 'summary': 'Nltk can tokenize sentences with single lines of code, providing part of speech tagging and pre-trained knowledge.', 'duration': 29.129, 'max_score': 592.016, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU592016.jpg'}, {'end': 681.013, 'src': 'embed', 'start': 653.425, 'weight': 5, 'content': [{'end': 656.447, 'text': 'At a high level right now, Word2Vec is what Google created.', 'start': 653.425, 'duration': 3.022}, {'end': 663.693, 'text': 'So basically they trained a neural network on a huge data set of word 
vectors.', 'start': 657.308, 'duration': 6.385}, {'end': 668.663, 'text': 'And it created vectors, and we can use these vectors in other ways.', 'start': 665.26, 'duration': 3.403}, {'end': 674.008, 'text': "So it's like a generalized collection of word vectors, okay? So I'm gonna talk about all this in a second.", 'start': 668.683, 'duration': 5.325}, {'end': 676.39, 'text': 'Let me just keep typing out these dependencies.', 'start': 674.028, 'duration': 2.362}, {'end': 678.011, 'text': 'The next one is dimensionality reduction.', 'start': 676.41, 'duration': 1.601}, {'end': 681.013, 'text': "Once we have our word vectors, they're gonna be multidimensional.", 'start': 678.071, 'duration': 2.942}], 'summary': 'Word2Vec is a collection of word vectors created by Google using a neural network trained on a large dataset, allowing for multidimensional word representation.', 'duration': 27.588, 'max_score': 653.425, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU653425.jpg'}, {'end': 806.985, 'src': 'embed', 'start': 776.192, 'weight': 6, 'content': [{'end': 782.396, 'text': 'Now our next step is to process our data, okay? So step one is to process our data.', 'start': 776.192, 'duration': 6.204}, {'end': 787.259, 'text': 'What does this look like? Well, before we do anything, before we do anything, we want to clean our data.', 'start': 782.696, 'duration': 4.563}, {'end': 792.722, 'text': 'So how do we clean our data? Well, NLTK has a really handy function for this.', 'start': 787.279, 'duration': 5.443}, {'end': 797.466, 'text': 'Well, the first one is called punkt, and the next one is called stopwords.', 'start': 792.782, 'duration': 4.684}, {'end': 806.985, 'text': "So what does this do? What this does is it downloads punkt, which is a tokenizer, a pre-trained tokenizer.", 'start': 797.486, 'duration': 9.499}], 'summary': "Data processing involves cleaning using NLTK's functions like punkt and stopwords.", 'duration': 30.793, 'max_score': 776.192, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU776192.jpg'}, {'end': 926.173, 'src': 'embed', 'start': 886.492, 'weight': 7, 'content': [{'end': 891.794, 'text': 'make sure that we actually printed them all right?', 'start': 886.492, 'duration': 5.302}, {'end': 894.735, 'text': 'It ends there and it starts here.', 'start': 893.115, 'duration': 1.62}, {'end': 898.216, 'text': 'okay? 
Boom, boom.', 'start': 894.735, 'duration': 3.481}, {'end': 899.957, 'text': 'Okay, no, no, no, no, no.', 'start': 898.877, 'duration': 1.08}, {'end': 901.768, 'text': 'There we go.', 'start': 901.408, 'duration': 0.36}, {'end': 904.489, 'text': "Okay, so that's for our text file.", 'start': 901.848, 'duration': 2.641}, {'end': 907.569, 'text': 'And let me print out the books.', 'start': 905.349, 'duration': 2.22}, {'end': 908.469, 'text': "Let's print them out.", 'start': 907.889, 'duration': 0.58}, {'end': 909.85, 'text': "Let's print them out and make sure that we got them.", 'start': 908.489, 'duration': 1.361}, {'end': 911.15, 'text': 'File names.', 'start': 910.55, 'duration': 0.6}, {'end': 917.751, 'text': 'Sorted glob.glob.', 'start': 915.911, 'duration': 1.84}, {'end': 919.352, 'text': "Let's see what we got here.", 'start': 918.472, 'duration': 0.88}, {'end': 926.173, 'text': ".txt Let's see.", 'start': 920.392, 'duration': 5.781}], 'summary': 'Printing and verifying all files, including text and books, using sorted glob.glob.', 'duration': 39.681, 'max_score': 886.492, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU886492.jpg'}], 'start': 423.347, 'title': 'Python dependencies and data processing', 'summary': "Covers importing dependencies in python, explaining the 'future' import, and processing dependencies such as word encoding, regex, logging, concurrency, os module, pretty print, natural language toolkit, word2vec, dimensionality reduction, and data visualization. it also includes cleaning data using nltk and glob to get text file names.", 'chapters': [{'end': 464.613, 'start': 423.347, 'title': 'Importing dependencies and explaining future in python', 'summary': "Discusses the process of importing dependencies in python, particularly focusing on explaining the 'future' import, which acts as a bridge between python 2 and python 3, allowing the usage of syntax from both versions.", 'duration': 41.266, 'highlights': ["The 'future' import serves as a bridge between Python 2 and Python 3, enabling the use of syntax from both versions.", 'Importing dependencies is crucial for the functionality of the code.', "The chapter emphasizes the significance of importing 'future' as the missing link between Python 2 and Python 3."]}, {'end': 1257.358, 'start': 465.374, 'title': 'Processing dependencies and data', 'summary': 'Covers processing dependencies including word encoding, regex, logging, concurrency, os module, pretty print, natural language toolkit, word2vec, dimensionality reduction, and data visualization. it also includes cleaning data using nltk and glob to get text file names.', 'duration': 791.984, 'highlights': ['The chapter covers various dependencies such as importing codecs, performing regex for fast file searching, logging, and importing the multiprocessing library for concurrency to run multiple threads and processes. Various dependencies covered, including word encoding, regex, logging, and concurrency.', 'The chapter delves into the usage of NLTK for tokenizing sentences and part of speech tagging, demonstrating the ease of use and usefulness of NLTK in natural language processing. Demonstrates tokenization of sentences and part of speech tagging using NLTK.', 'The chapter explains the significance of Word2Vec, created by Google, and its application in creating generalized word vectors through a neural network, which can be used for various purposes. 
Significance of Word2Vec and its application in creating generalized word vectors.', "The chapter demonstrates the process of cleaning data using NLTK's functions such as punkt for tokenization and stop-word removal to enhance the accuracy of created vectors. Demonstrates the process of cleaning data using NLTK's functions.", 'The chapter utilizes glob to retrieve text file names and addresses the issue of locating the file names by using sorted glob.glob. Utilizes glob to retrieve text file names and addresses the issue of locating the file names.']}], 'duration': 834.011, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU423347.jpg'}, {'end': 2220.491, 'segs': [{'end': 1429.92, 'src': 'heatmap', 'start': 1283.074, 'weight': 0, 'content': [{'end': 1289.56, 'text': "OK, so we've got our book file names, right? And our next step is to combine the books into one string.", 'start': 1283.074, 'duration': 6.486}, {'end': 1293.765, 'text': 'And why do we want to do this? Because we want to have one corpus for all of those books.', 'start': 1289.62, 'duration': 4.145}, {'end': 1294.926, 'text': "And that's what this does.", 'start': 1293.885, 'duration': 1.041}, {'end': 1296.768, 'text': 'We initialize a raw corpus.', 'start': 1295.266, 'duration': 1.502}, {'end': 1299.911, 'text': 'We say u, let me make this bigger, because we really want to.', 'start': 1297.088, 'duration': 2.823}, {'end': 1307.568, 'text': "We start with u because it's Unicode, right? It's a Unicode string, and we want to convert it into a format that we can read easily.", 'start': 1300.606, 'duration': 6.962}, {'end': 1311.63, 'text': 'And what is that format? UTF-8 right here, okay? So UTF-8.', 'start': 1307.648, 'duration': 3.982}, {'end': 1314.391, 'text': 'So this is where the codecs library comes into play.', 'start': 1312.01, 'duration': 2.381}, {'end': 1320.253, 'text': 'We are using the codecs library to read in the book file name and convert it into UTF-8 format.', 'start': 1314.751, 'duration': 5.502}, {'end': 1327.415, 'text': 'Now, remember that corpus raw variable we just initialized up here? Well, now we want to add all the books that we see to that corpus.', 'start': 1320.573, 'duration': 6.842}, {'end': 1329.316, 'text': "And the way we're going to do that.", 'start': 1327.955, 'duration': 1.361}, {'end': 1341.036, 'text': "The way we're going to do that is, we're going to add, We're going to add it all to this corpus raw and at the end of it,", 'start': 1331.351, 'duration': 9.685}, {'end': 1346.197, 'text': "it's going to have all of those books in one variable in memory corpus raw, okay?", 'start': 1341.036, 'duration': 5.161}, {'end': 1349.718, 'text': 'Which is going to be a very, very, very big variable, okay?', 'start': 1346.497, 'duration': 3.221}, {'end': 1352.879, 'text': "That's what we're going to do, and so that's the first step.", 'start': 1350.218, 'duration': 2.661}, {'end': 1356.34, 'text': "And once we have that, then we're going to split the corpus into sentences.", 'start': 1352.939, 'duration': 3.401}, {'end': 1361.902, 'text': 'Now, remember when I said we downloaded that punkt model right up here? Let me show you guys.', 'start': 1356.7, 'duration': 5.202}, {'end': 1363.063, 'text': "And I'll take it out.", 'start': 1362.462, 'duration': 0.601}, {'end': 1364.003, 'text': 'download punkt.', 'start': 1363.063, 'duration': 0.94}, {'end': 1366.645, 'text': "well, now we're going to actually load that into memory.", 'start': 1364.003, 'duration': 2.642}, {'end': 1369.688, 'text': "that it's a trained model And it's loaded in.", 'start': 1366.645, 'duration': 3.043}, {'end': 1372.25, 'text': "it's in a byte stream, and that's what pickle is.", 'start': 1369.688, 'duration': 2.562}, {'end': 1372.71, 'text': "it's a.", 'start': 1372.25, 'duration': 0.46}, {'end': 1375.972, 'text': "it's that file format that we can load as a byte stream.", 'start': 1372.71, 'duration': 3.262}, {'end': 1380.936, 'text': "now That's what that does and it's going to load it up into this tokenizer variable.", 'start': 1375.972, 'duration': 4.964}, {'end': 1382.617, 'text': 'this tokenizer is pre-trained.', 'start': 1380.936, 'duration': 1.681}, {'end': 1388.702, 'text': 'it turns Words into tokens, and the type of tokens we want are sentences, in our case, right.', 'start': 1382.617, 'duration': 6.085}, {'end': 1396.74, 'text': "so we'll use a tokenizer and We'll use a tokenizer to tokenize that corpus, which is every single word we have, right?", 'start': 1388.702, 'duration': 8.038}, {'end': 1398.802, 'text': 'And let me open this.', 'start': 1396.92, 'duration': 1.882}, {'end': 1401.806, 'text': 'So every single word we have, and this could be anything, guys.', 'start': 1398.963, 'duration': 2.843}, {'end': 1407.533, 'text': 'This could be any piece of text you want, any book, anything you download, any big piece of text.', 'start': 1401.826, 'duration': 5.707}, {'end': 1415.098, 'text': "The same principles apply, okay? We're going to put those all into this raw sentences variable.", 'start': 1407.713, 'duration': 7.385}, {'end': 1419.738, 'text': "Once we have that raw sentences variable, we're going to convert it into a word list.", 'start': 1415.738, 'duration': 4}, {'end': 1429.92, 'text': 'So what do I mean by a word list? 
Well, I also want to make sure.', 'start': 1420.038, 'duration': 9.882}], 'summary': 'Combine book files into one utf-8 formatted corpus, then tokenize into sentences for further processing.', 'duration': 63.123, 'max_score': 1283.074, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU1283074.jpg'}, {'end': 1628.168, 'src': 'embed', 'start': 1602.213, 'weight': 4, 'content': [{'end': 1606.516, 'text': "Now we're going to train word2vec, okay? These are our hyperparameters.", 'start': 1602.213, 'duration': 4.303}, {'end': 1608.277, 'text': "Let's talk about vectors for a second.", 'start': 1606.596, 'duration': 1.681}, {'end': 1612.26, 'text': "Okay, so I'm sure I can find a great image for this in a second.", 'start': 1608.817, 'duration': 3.443}, {'end': 1614.662, 'text': 'So word embeddings are here.', 'start': 1612.28, 'duration': 2.382}, {'end': 1619.485, 'text': 'So TensorFlow probably has a great image for this.', 'start': 1614.882, 'duration': 4.603}, {'end': 1623.308, 'text': "Okay, so here's a great one.", 'start': 1619.725, 'duration': 3.583}, {'end': 1623.828, 'text': "Here's a great one.", 'start': 1623.328, 'duration': 0.5}, {'end': 1626.767, 'text': "Copy image address, let's blow this image up.", 'start': 1625.086, 'duration': 1.681}, {'end': 1628.168, 'text': "Let's get it really big.", 'start': 1627.367, 'duration': 0.801}], 'summary': 'Training word2vec with hyperparameters for word embeddings in tensorflow.', 'duration': 25.955, 'max_score': 1602.213, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU1602213.jpg'}, {'end': 2093.138, 'src': 'embed', 'start': 2064.902, 'weight': 5, 'content': [{'end': 2066.681, 'text': "And the seed makes sure that it's deterministic.", 'start': 2064.902, 'duration': 1.779}, {'end': 2067.862, 'text': 'This is good for debugging.', 'start': 2066.722, 'duration': 1.14}, {'end': 2071.444, 'text': 'Deterministic, good for debugging.', 'start': 2069.283, 'duration': 2.161}, {'end': 2075.952, 'text': 'Okay, so this is our actual model right here.', 'start': 2073.351, 'duration': 2.601}, {'end': 2079.213, 'text': 'A Word2Vec model we imported from the GenSim library.', 'start': 2076.652, 'duration': 2.561}, {'end': 2081.014, 'text': 'And let me show you guys GenSim for a second.', 'start': 2079.253, 'duration': 1.761}, {'end': 2082.853, 'text': 'GenSim is super useful.', 'start': 2081.654, 'duration': 1.199}, {'end': 2086.916, 'text': "It's for topic modeling.", 'start': 2085.955, 'duration': 0.961}, {'end': 2089.637, 'text': 'Basically, you give it any kind of corpus like this.', 'start': 2087.216, 'duration': 2.421}, {'end': 2090.757, 'text': "It'll create a model.", 'start': 2090.036, 'duration': 0.721}, {'end': 2091.337, 'text': "It'll train it.", 'start': 2090.797, 'duration': 0.54}, {'end': 2091.956, 'text': 'You can save it.', 'start': 2091.357, 'duration': 0.599}, {'end': 2093.138, 'text': 'You can load it later on.', 'start': 2092.197, 'duration': 0.941}], 'summary': 'Word2vec model from gensim library for topic modeling is deterministic and good for debugging.', 'duration': 28.236, 'max_score': 2064.902, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU2064902.jpg'}, {'end': 2191.859, 'src': 'embed', 'start': 2167.499, 'weight': 3, 'content': [{'end': 2174.388, 'text': "We haven't actually trained it, we've built our model, right? 
This is step three, build our model, which I should have written up here.", 'start': 2167.499, 'duration': 6.889}, {'end': 2183.532, 'text': 'So step three is build model, build model, okay? Once we built our model, we have loaded our corpus that we cleaned into memory.', 'start': 2174.528, 'duration': 9.004}, {'end': 2185.294, 'text': 'And we printed out the size of it.', 'start': 2184.013, 'duration': 1.281}, {'end': 2186.835, 'text': 'Now we can start training.', 'start': 2185.494, 'duration': 1.341}, {'end': 2189.718, 'text': "And it's going to train on all of those sentences we gave it.", 'start': 2187.276, 'duration': 2.442}, {'end': 2191.859, 'text': "It's going to take 30 or 40 seconds.", 'start': 2190.198, 'duration': 1.661}], 'summary': 'Model training on cleaned corpus takes 30-40 seconds', 'duration': 24.36, 'max_score': 2167.499, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU2167499.jpg'}], 'start': 1259.459, 'title': 'Combining the book files into one corpus and training a Word2Vec model', 'summary': 'Discusses combining the book files into one corpus using the codecs library to convert Unicode strings into UTF-8 format, aiming to create a single corpus for all the books. Additionally, it covers the process of preparing a large text corpus, tokenizing it into sentences and words, training a Word2Vec model with hyperparameters such as dimensionality and minimum word count threshold, and finally building and training the model to create word vectors for semantic similarity and ranking.', 'chapters': [{'end': 1320.253, 'start': 1259.459, 'title': 'Combining the book files into one corpus', 'summary': 'Discusses combining the book files into one corpus using the codecs library to convert Unicode strings into UTF-8 format, aiming to create a single corpus for all the books.', 'duration': 60.794, 'highlights': ['The chapter emphasizes the importance of combining the book files into one string to create a single corpus for all the books.', 'It discusses the process of using the codecs library to convert Unicode strings into UTF-8 format for easy reading.', 'The chapter mentions the need for a raw corpus and the utilization of the codecs library for the conversion process.']}, {'end': 2220.491, 'start': 1320.573, 'title': 'Training the Word2Vec model', 'summary': 'Covers the process of preparing a large text corpus, tokenizing it into sentences and words, training a Word2Vec model with hyperparameters such as dimensionality and minimum word count threshold, and finally building and training the model to create word vectors for semantic similarity and ranking.', 'duration': 899.918, 'highlights': ['The chapter illustrates the process of preparing a large text corpus, tokenizing it into sentences and words, and training a Word2Vec model. The process involves adding all the books to a corpus, loading a trained model into memory, tokenizing the corpus into sentences, and converting them into a word list.', 'The chapter explains the hyperparameters for training a Word2Vec model, such as dimensionality and minimum word count threshold. Hyperparameters like numFeatures (dimensionality) and minimum word count threshold are discussed, emphasizing the trade-off between complexity and accuracy.', 'The chapter discusses the process of building and training the Word2Vec model to create word vectors for semantic similarity and ranking. The process includes building the vocabulary using the sentences, training the model on the given sentences, and saving the trained model for future use.']}], 'duration': 961.032, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU1259459.jpg', 'highlights': ['The chapter emphasizes the importance of combining the book files into one string to create a single corpus for all the books.', 'It discusses the process of using the codecs library to convert Unicode strings into UTF-8 format for easy reading.', 'The chapter mentions the need for a raw corpus and the utilization of the codecs library for the conversion process.', 'The chapter illustrates the process of preparing a large text corpus, tokenizing it into sentences and words, and training a Word2Vec model.', 'The chapter explains the hyperparameters for training a Word2Vec model, such as dimensionality and minimum word count threshold.', 'The chapter discusses the process of building and training the Word2Vec model to create word vectors for semantic similarity and ranking.']}, {'end': 2463.192, 'segs': [{'end': 2374.664, 'src': 'heatmap', 'start': 2322.69, 'weight': 1, 'content': [{'end': 2326.611, 'text': "right, we've initialized t-SNE here, but we haven't trained t-SNE right.", 'start': 2322.69, 'duration': 3.921}, {'end': 2331.071, 'text': "so t-SNE is a model, it's a machine learning model and we have to train it.", 'start': 2326.611, 'duration': 4.46}, {'end': 2337.393, 'text': "okay, so we'll train it on that word vector matrix, and this is gonna take a minute or two, like it says.", 'start': 2331.071, 'duration': 6.322}, {'end': 2345.698, 'text': "and uh, So it's going to create this word vector.", 'start': 2337.393, 'duration': 8.305}, {'end': 2352.467, 'text': "It's a 2D matrix, right? So this is one gigantic matrix, and it's got the plots on the points with it.", 'start': 2345.738, 'duration': 6.729}, {'end': 2360.195, 'text': "Okay, so then we're going to plot what we've got, okay? So what do I mean by plot? We want to plot it in 2D space.", 'start': 2352.848, 'duration': 7.347}, {'end': 2367.88, 'text': 'So for every word we have in that vocab, we want to have three columns.', 'start': 2360.275, 'duration': 7.605}, {'end': 2371.442, 'text': 'The word, the X coordinate, and the Y coordinate.', 'start': 2367.92, 'duration': 3.522}, {'end': 2374.664, 'text': "Now, how does it get these coordinates? Well, that's what t-SNE does.", 'start': 2371.842, 'duration': 2.822}], 'summary': 'Training the t-SNE model to create a 2D word vector matrix.', 'duration': 67.959, 'max_score': 2322.69, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU2322690.jpg'}, {'end': 2463.192, 'src': 'embed', 'start': 2371.842, 'weight': 0, 'content': [{'end': 2374.664, 'text': "Now, how does it get these coordinates? Well, that's what t-SNE does.", 'start': 2371.842, 'duration': 2.822}, {'end': 2390.649, 'text': "Not only is it squashing these vectors into two-dimensional vectors but it's also giving us the x and y coordinates of those vectors in two-dimensional space.", 'start': 2374.744, 'duration': 15.905}, {'end': 2393.212, 'text': 'okay?. So these are all words from that corpus, right?', 'start': 2390.649, 'duration': 2.563}, {'end': 2395.493, 'text': 'These are all Game of Thrones-y words, right?', 'start': 2393.232, 'duration': 2.261}, {'end': 2407.532, 'text': "So that's what that does, and once we've got that, then we're going to plot them on a graph.", 'start': 2397.195, 'duration': 10.337}, {'end': 2413.016, 'text': "So this is where matplotlib comes into play, right? We're going to plot these points, and we're going to plot them on a graph.", 'start': 2407.612, 'duration': 5.404}, {'end': 2414.197, 'text': "And it's a lot.", 'start': 2413.537, 'duration': 0.66}, {'end': 2416.099, 'text': 'These are our word vectors.', 'start': 2414.538, 'duration': 1.561}, {'end': 2421.504, 'text': "There's a lot of them here, right? And we brought it down to scale so we could see a lot of them.", 'start': 2416.76, 'duration': 4.744}, {'end': 2426.908, 'text': 'But all of our word vectors or word embeddings, whatever you want to call them, are here in 2D space.', 'start': 2421.544, 'duration': 5.364}, {'end': 2431.352, 'text': 'Now, what are we going to do with them? Well, we could see what vectors are close to each other.', 'start': 2427.308, 'duration': 4.044}, {'end': 2434.351, 'text': "Let's start with that.", 'start': 2433.73, 'duration': 0.621}, {'end': 2439.515, 'text': "Let's see what vectors are close to each other and what that tells us about the data.", 'start': 2434.611, 'duration': 4.904}, {'end': 2446.124, 'text': "okay?. The first thing we want to do is zoom in on this right, and that's what this function does.", 'start': 2439.515, 'duration': 6.609}, {'end': 2452.707, 'text': 'it creates a bounding box of x and y coordinates in that graph that we have, and it shows just that bounding box.', 'start': 2446.124, 'duration': 6.583}, {'end': 2454.768, 'text': "That's what this function does, okay?", 'start': 2452.927, 'duration': 1.841}, {'end': 2463.192, 'text': "And so then we'll use that, we'll use that to, we'll say okay, so in the bounds of this and the xy bounds of these coordinates that we give it,", 'start': 2454.828, 'duration': 8.364}], 'summary': 't-SNE squashes word vectors into 2D space, then plots them on a graph to analyze proximity and relationships.', 'duration': 91.35, 'max_score': 2371.842, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU2371842.jpg'}], 'start': 2220.491, 'title': 'Visualizing word vectors', 'summary': 'Discusses visualizing 300-dimensional word vectors using the t-SNE method to squash vectors into two dimensions for easy plotting and viewing. It also explains visualizing word vectors in 2D space, analyzing the proximity of vectors and gaining insights from the data.', 'chapters': [{'end': 2345.698, 'start': 2220.491, 'title': 'Visualizing 300-dimensional word vectors', 'summary': 'Discusses the challenge of visualizing 300-dimensional word vectors and introduces the t-SNE method as a solution, which squashes the vectors into two dimensions for easy plotting and viewing.', 'duration': 125.207, 'highlights': ['The t-SNE method is used to squash 300-dimensional word vectors into two dimensions, making it possible for humans to visualize the dataset easily.', 'The chapter emphasizes the need to train the t-SNE model on the word vector matrix, which takes a minute or two.', 'The significance of t-SNE in creating vectors for visualization is highlighted, with a recommendation to check out a video for a detailed explanation.']}, {'end': 2463.192, 'start': 2345.738, 'title': 'Visualizing word vectors in 2D space', 'summary': 'Explains the process of visualizing word vectors in 2D space using t-SNE to plot Game of Thrones-related words in a graph and analyze the proximity of vectors to each other, ultimately gaining insights from the data.', 'duration': 117.454, 'highlights': ['t-SNE squashes word vectors into 2D space and provides their x and y coordinates, allowing for the plotting of Game of Thrones-related words in a graph.', 'The visualization using matplotlib helps in analyzing the proximity of word vectors, offering insights into the relationships between different words.', 'The process also involves creating a bounding box in the graph to focus on specific coordinates and analyze the closeness of vectors, aiding in understanding the data.']}], 'duration': 242.701, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU2220491.jpg'}, {'end': 2960.426, 'segs': [{'end': 2564.34, 'src': 'embed', 'start': 2534.402, 'weight': 0, 'content': [{'end': 2546.573, 'text': 'okay, not the brevity, the wrong word, wrong word, vector, the, the, this, The enormous awesomeness of vectors.', 'start': 2534.402, 'duration': 12.171}, {'end': 2548.354, 'text': 'OK, word clusters are related.', 'start': 2546.653, 'duration': 1.701}, {'end': 2550.575, 'text': "There's so much we can do with this.", 'start': 2549.054, 'duration': 1.521}, {'end': 2558.958, 'text': 'In every field, in legal, in law, we can train an AI judge using this thing, using semantic similarity,', 'start': 2551.155, 'duration': 7.803}, {'end': 2562.039, 'text': 'to see the differences between different case data.', 'start': 2558.958, 'duration': 3.081}, {'end': 2564.34, 'text': 'Doctors, we could use this to find new drugs.', 'start': 2562.059, 'duration': 2.281}], 'summary': 'Vectors enable diverse applications, from training AI judges to discovering new drugs.', 'duration': 29.938, 'max_score': 2534.402, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU2534402.jpg'}, {'end': 2654.118, 'src': 'embed', 'start': 2625.403, 'weight': 1, 'content': [{'end': 2630.725, 'text': 'Because turning these words into vectors, turning our videos into vectors, turning our images into vectors,', 'start': 2625.403, 'duration': 5.322}, {'end': 2633.706, 'text': 'gives us a way to mathematically reason about these things.', 'start': 2630.725, 'duration': 2.981}, {'end': 2643.43, 'text': 'We can reason about them, just like we reason about numbers in a mathematical way, right?', 'start': 2633.766, 'duration': 9.664}, {'end': 2647.792, 'text': 'So this is the formula for the cosine similarity, right?', 'start': 2643.75, 'duration': 4.042}, {'end': 2654.118, 'text': 'So, given two vectors, We can use the dot product and the magnitude of those vectors to calculate them.', 'start': 2647.872, 'duration': 6.246}], 'summary': 'Vectorizing words, videos, and images enables mathematical reasoning and calculation using cosine similarity formula.', 'duration': 28.715, 'max_score': 2625.403, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU2625403.jpg'}, {'end': 2888.409, 'src': 'embed', 'start': 2802.204, 'weight': 3, 'content': [{'end': 2804.585, 'text': 'Take any piece of text.', 'start': 2802.204, 'duration': 2.381}, {'end': 2805.085, 'text': 'Take a book.', 'start': 2804.605, 'duration': 0.48}, {'end': 2807.806, 'text': 'After this livestream, download a book.', 'start': 2805.765, 'duration': 2.041}, {'end': 2810.467, 'text': 'Download an e-book and convert it to a text format.', 'start': 2808.226, 'duration': 2.241}, {'end': 2811.687, 'text': 'And then use the code that I give you.', 'start': 2810.547, 'duration': 1.14}, {'end': 2814.328, 'text': 'And you can easily feed it to Word2Vec and create vectors.', 'start': 2812.187, 'duration': 2.141}, {'end': 2815.828, 'text': 'What do you do with these vectors?', 'start': 2814.708, 'duration': 1.12}, {'end': 2818.869, 'text': 'Well, then you can, besides the similarity in the distance.', 'start': 2816.028, 'duration': 2.841}, {'end': 2824.393, 'text': "What's a good application for a vector?", 'start': 2821.029, 'duration': 3.364}, {'end': 2827.716, 'text': 'Download a corpus of what your friends are saying.', 'start': 2825.954, 'duration': 1.762}, {'end': 2841.63, 'text': "You could rank personalities, like chats, like this guy's chats versus this guy's chat, or this guy's what he said in a speech versus what he said.", 'start': 2827.736, 'duration': 13.894}, {'end': 2844.973, 'text': 'If you want to compare Hitler to Trump, I just went political.', 'start': 2841.65, 'duration': 3.323}, {'end': 2846.504, 'text': 'Not trying to go political.', 'start': 2845.763, 'duration': 0.741}, {'end': 2847.884, 'text': 'Anyway, but I just did.', 'start': 2846.564, 'duration': 1.32}, {'end': 2851.827, 'text': 'If you want to compare anything, word vectors are good for that.', 'start': 2848.565, 'duration': 3.262}, {'end': 2858.531, 'text': 'Ranking Words are everywhere, guys.', 'start': 2853.208, 'duration': 5.323}, {'end': 2861.133, 'text': "Any kind of similarity or ranking, that's what it's for.", 'start': 2859.072, 'duration': 2.061}, {'end': 2864.075, 'text': "And there's a lot of possibilities.", 'start': 2861.553, 'duration': 2.522}, {'end': 2872.277, 'text': 'OK?. 
How are the words assembled into vectors?', 'start': 2864.075, 'duration': 8.202}, {'end': 2873.178, 'text': 'Is it just context??', 'start': 2872.317, 'duration': 0.861}, {'end': 2875.619, 'text': 'Are there any ontological network being built?', 'start': 2873.218, 'duration': 2.401}, {'end': 2877.141, 'text': 'If so, how?', 'start': 2875.82, 'duration': 1.321}, {'end': 2887.348, 'text': 'Right so when Google released Word2Vec, OK, so it trained a neural network on these vectors.', 'start': 2878.762, 'duration': 8.586}, {'end': 2888.409, 'text': 'And these are labeled.', 'start': 2887.408, 'duration': 1.001}], 'summary': 'Convert e-book to text, use code to create and compare word vectors for diverse applications.', 'duration': 86.205, 'max_score': 2802.204, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU2802204.jpg'}], 'start': 2463.192, 'title': 'Word vectors and semantic similarity', 'summary': "Delves into training vectors for creating semantic similarity using cosine similarity, with potential applications in legal, medical, and scientific fields. It also covers real-life applications of word vectors, such as converting text to vectors for similarity and ranking, and the use of Google's Word2Vec trained neural networks on a labeled corpus of words to create generalized vectors.", 'chapters': [{'end': 2722.509, 'start': 2463.192, 'title': 'Vector training and semantic similarity', 'summary': 'Explores training vectors on a model to create semantic similarity using cosine similarity and other metrics, emphasizing the potential applications in legal, medical, and scientific fields.', 'duration': 259.317, 'highlights': ['Vectors trained on the model created semantic similarity using cosine similarity and other metrics, presenting potential applications in legal, medical, and scientific fields.', 'Using cosine similarity, the chapter demonstrates how turning words into vectors enables mathematical reasoning and measurement of semantic similarity.', 'The method of measuring semantic similarity using cosine similarity and other metrics is explained, showcasing the mathematical reasoning enabled by turning words into vectors.']}, {'end': 2960.426, 'start': 2722.509, 'title': 'Word vectors and their applications', 'summary': "Discusses the real-life applications of word vectors, including the ability to convert text to vectors for similarity and ranking, as well as how Google's Word2Vec trained neural networks on a labeled corpus of words to create generalized vectors.", 'duration': 237.917, 'highlights': ['The chapter explains how to use word vectors to convert text to vectors for similarity and ranking, with the ability to download a book, convert it to text format, and use the provided code to create vectors.', 'It details the real-life applications of word vectors, such as ranking personalities in chats or speeches, and comparing different entities, with the potential for various similarities or ranking applications.', "The chapter discusses Google's Word2Vec, which trained neural networks on a labeled corpus of words to create generalized vectors by converting words to vectors and single words to vectors, allowing for the generation of more generalized vectors based on similarity."]}], 'duration': 497.234, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/pY9EwZ02sXU/pics/pY9EwZ02sXU2463192.jpg'}], 'highlights': ['The tutorial covers creating word vectors from Game of Thrones data, Python dependencies, combining the book files into one corpus, training a Word2Vec model, and visualizing and analyzing word vectors for semantic similarity using the t-SNE method and cosine similarity, with real-life applications in legal, medical, and scientific fields.', 'Vectors trained on the model created semantic similarity using cosine similarity and other metrics, presenting potential applications in legal, medical, and scientific fields.', 'Using cosine similarity, the chapter demonstrates how turning words into vectors enables mathematical reasoning and measurement of semantic similarity.', 'The method of measuring semantic similarity using cosine similarity and other metrics is explained, showcasing the mathematical reasoning enabled by turning words into vectors.', 'Visualizing word vectors in 2D space helps in analyzing the proximity of vectors and gaining insights from the data.', "The 'future' import serves as a bridge between Python 2 and Python 3.", 'Importing dependencies is crucial for the functionality of the code.', "The chapter emphasizes the significance of importing 'future' as the missing link between Python 2 and Python 3.", 'The chapter covers various dependencies such as importing codecs, performing regex for fast file searching, logging, and importing the multiprocessing library for concurrency.', 'The chapter delves into the usage of NLTK for tokenizing sentences and part of speech tagging, demonstrating the ease of use and usefulness of NLTK in natural language processing.', 'The chapter explains the significance of Word2Vec, created by Google, and its application in creating generalized word vectors through a neural network, which can be used for various purposes.', "The chapter demonstrates the process of cleaning data using NLTK's functions such as punkt for tokenization and stop-word removal to enhance the accuracy of created vectors.", 'The chapter utilizes glob to retrieve text file names and addresses the issue of locating the file names by using sorted glob.glob.', 'The chapter emphasizes the importance of combining the book files into one string to create a single corpus for all the books.', 'It discusses the process of using the codecs library to convert Unicode strings into UTF-8 format for easy reading.', 'The chapter mentions the need for a raw corpus and the utilization of the codecs library for the conversion process.', 'The chapter illustrates the process of preparing a large text corpus, tokenizing it into sentences and words, and training a Word2Vec model.', 'The chapter explains the hyperparameters for training a Word2Vec model, such as dimensionality and minimum word count threshold.', 'The chapter discusses the process of building and training the Word2Vec model to create word vectors for semantic similarity and ranking.', 'The t-SNE method squashes 300-dimensional word vectors into two dimensions for easy visualization.', 'Training the t-SNE model on the word vector matrix takes a minute or two.', 'matplotlib aids in analyzing the proximity of word vectors and understanding the relationships between different words.', 'Creating a bounding box in the graph helps in focusing on specific coordinates and analyzing the closeness of vectors.', "The speaker acknowledges the need to delve deeply into math and deep learning in the next weekly video, addressing the audience's feedback and concerns.", 'The importance of using relevant words for the problem is stressed, with a focus on the Game of Thrones story.', 'The speaker plans to create word vectors from the five books of the Game of Thrones series, obtained from Pirate Bay.', 'The tools JavaScript, ConvNetJS, and Word2Vec are recommended for creating word vectors.']}
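The visualization chapters above squash the 300-dimensional vectors down to two dimensions with t-SNE and scatter them with matplotlib. Here is a rough sketch of that step, assuming scikit-learn and matplotlib are installed (index_to_key is the gensim 4.x vocabulary accessor; only a sample is plotted because t-SNE over the full vocabulary is slow):

import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

model = Word2Vec.load("thrones2vec.w2v")

words = model.wv.index_to_key[:500]  # a sample of the vocabulary
vectors = model.wv[words]            # shape: (len(words), 300)

# t-SNE assigns each word an (x, y) coordinate in 2D space.
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=4)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=6)
plt.show()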