title
Geoff Hinton: Latest AI Research & The Future of AI
description
Geoff Hinton is one of the pioneers of deep learning and shared the 2018 Turing Award with colleagues Yoshua Bengio and Yann LeCun. In 2017, he introduced capsule networks, an alternative to convolutional neural networks that take into account the pose of objects in a 3D world, solving a problem in computer vision in which elements of an object change their position when viewed from different angles.
detail
{'title': 'Geoff Hinton: Latest AI Research & The Future of AI', 'heatmap': [{'end': 593.746, 'start': 558.281, 'weight': 0.783}, {'end': 1149.665, 'start': 1119.527, 'weight': 0.81}, {'end': 2242.563, 'start': 2172.055, 'weight': 0.701}, {'end': 2889.004, 'start': 2851.429, 'weight': 1}], 'summary': 'Geoff hinton, a pioneer of deep learning who won the 2018 turing award, is discussed in this video along with the evolution and significance of capsule networks, unsupervised learning, and contrastive learning, highlighting their applications in computer vision and image recognition as well as their potential impact on machine learning advances.', 'chapters': [{'end': 46.688, 'segs': [{'end': 46.688, 'src': 'embed', 'start': 0.129, 'weight': 0, 'content': [{'end': 4.073, 'text': "Hi, I'm Craig Smith and this is Eye on AI.", 'start': 0.129, 'duration': 3.944}, {'end': 10.233, 'text': 'This week I speak to Geoff Hinton,', 'start': 8.071, 'duration': 2.162}, {'end': 18.278, 'text': 'who has lived at the outer reaches of machine learning research since an aborted attempt at a carpentry career a half century ago.', 'start': 10.233, 'duration': 8.045}, {'end': 28.303, 'text': 'After that brief dogleg, he came back into line with his illustrious ancestors, George Boole, the father of Boolean logic, and George Everest,', 'start': 18.818, 'duration': 9.485}, {'end': 33.206, 'text': "British Surveyor General of India and eponym of the world's tallest mountain.", 'start': 28.303, 'duration': 4.903}, {'end': 42.608, 'text': 'Geoff is one of the pioneers of deep learning and shared the 2018 Turing Award with colleagues Yoshua Bengio and Yann LeCun.', 'start': 33.866, 'duration': 8.742}, {'end': 46.688, 'text': 'A year earlier, he had introduced capsule networks,', 'start': 43.188, 'duration': 3.5}], 'summary': 'Interview with geoff hinton, a pioneer of deep learning and 2018 turing award recipient.', 'duration': 46.559, 'max_score': 0.129, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM129.jpg'}], 'start': 0.129, 'title': 'Geoff hinton: pioneer of deep learning', 'summary': 'Explores the life and achievements of geoff hinton, a pioneer of deep learning, who shared the 2018 turing award with colleagues yoshua bengio and yann lecun and introduced capsule networks a year earlier.', 'chapters': [{'end': 46.688, 'start': 0.129, 'title': 'Geoff hinton: pioneer of deep learning', 'summary': 'Explores the life and achievements of geoff hinton, a pioneer of deep learning, who shared the 2018 turing award with colleagues yoshua bengio and yann lecun and introduced capsule networks a year earlier.', 'duration': 46.559, 'highlights': ['Geoff Hinton is a pioneer of deep learning and shared the 2018 Turing Award with colleagues Yoshua Bengio and Yann LeCun.', 'Hinton introduced capsule networks a year before sharing the 2018 Turing Award.', "Hinton's illustrious ancestors include George Boole, the father of Boolean logic, and George Everest, British Surveyor General of India and eponym of the world's tallest mountain.", "Hinton's career trajectory includes an aborted attempt at a carpentry career before diving into machine learning research."]}], 'duration': 46.559, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM129.jpg', 'highlights': ['Geoff Hinton shared the 2018 Turing Award with colleagues Yoshua Bengio and Yann LeCun.', 'Hinton introduced capsule networks a year before sharing the 2018 Turing Award.',
"Hinton's illustrious ancestors include George Boole, the father of Boolean logic, and George Everest, British Surveyor General of India.", 'Hinton had an aborted attempt at a carpentry career before diving into machine learning research.']}, {'end': 562.504, 'segs': [{'end': 124.976, 'src': 'embed', 'start': 46.688, 'weight': 0, 'content': [{'end': 56.47, 'text': 'an alternative to convolutional neural networks that take into account the pose of objects in a 3D world, solving the problem in computer vision,', 'start': 46.688, 'duration': 9.782}, {'end': 61.451, 'text': 'in which elements of an object change their position when viewed from different angles.', 'start': 56.47, 'duration': 4.981}, {'end': 67.275, 'text': "He has been largely silent since then and I'm delighted to have him on the podcast.", 'start': 62.091, 'duration': 5.184}, {'end': 73.699, 'text': 'We began, like so many of us do today, trying to get the teleconferencing system to work.', 'start': 68.215, 'duration': 5.484}, {'end': 78.403, 'text': 'I hope you find the conversation as engrossing as I did.', 'start': 74.42, 'duration': 3.983}, {'end': 88.259, 'text': "I don't think I need to introduce you or that you need to introduce yourself.", 'start': 83.517, 'duration': 4.742}, {'end': 92.661, 'text': "I do want to sort of recap what's gone on in the last year.", 'start': 88.859, 'duration': 3.802}, {'end': 94.341, 'text': "It's been quite a year.", 'start': 92.761, 'duration': 1.58}, {'end': 101.684, 'text': "Capsule networks had sort of faded from view, at least from the layman's point of view,", 'start': 95.402, 'duration': 6.282}, {'end': 107.807, 'text': 'and resurfaced at NeurIPS last December with your introduction of stopped capsule autoencoders.', 'start': 101.684, 'duration': 6.123}, {'end': 118.591, 'text': 'Then in February at the AAAI or AAAI conference, you talked about capsule networks as key to unsupervised learning.', 'start': 109.144, 'duration': 9.447}, {'end': 124.976, 'text': 'And in April you revived the idea of backpropagation as a learning function in the brain,', 'start': 119.612, 'duration': 5.364}], 'summary': 'Capsule networks address 3d object recognition in computer vision and made significant appearances in neurips and aaai conferences.', 'duration': 78.288, 'max_score': 46.688, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM46688.jpg'}, {'end': 207.716, 'src': 'embed', 'start': 181.068, 'weight': 2, 'content': [{'end': 188.132, 'text': 'So what capsules are trying to do is recognise whole objects by recognising their parts and the relationships between the parts.', 'start': 181.068, 'duration': 7.064}, {'end': 195.45, 'text': 'So if you see something that might be an eye and you see something that might be a nose,', 'start': 189.767, 'duration': 5.683}, {'end': 200.793, 'text': 'the possible eye could say where the face should be and the possible nose could say where the face should be.', 'start': 195.45, 'duration': 5.343}, {'end': 207.716, 'text': "And if they agree on where the face should be, then you say, hey, they're in the right relation to make a face, so we'll instantiate a face.", 'start': 200.813, 'duration': 6.903}], 'summary': 'Capsules recognize whole objects by recognizing parts and their relationships.', 'duration': 26.648, 'max_score': 181.068, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM181068.jpg'}, {'end': 245.03, 'src': 'embed', 'start': 223.365, 
'weight': 4, 'content': [{'end': 233.812, 'text': "But the other problem which we overcame with stacked capsule autoencoders is that if you've seen, say, a circle in a line drawing,", 'start': 223.365, 'duration': 10.447}, {'end': 238.434, 'text': "you don't know whether it's a left eye or a right eye, or the front wheel of a car or the back wheel of a car.", 'start': 233.812, 'duration': 4.622}, {'end': 245.03, 'text': 'And so it has to vote for all sorts of objects it might be a part of.', 'start': 239.675, 'duration': 5.355}], 'summary': 'Overcame problem with stacked capsule autoencoders by enabling objects to vote for all possible parts they might be a part of.', 'duration': 21.665, 'max_score': 223.365, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM223365.jpg'}, {'end': 340.455, 'src': 'embed', 'start': 318.156, 'weight': 7, 'content': [{'end': 327.683, 'text': 'So what you ought to do is have them interact with each other a bit and use the spatial relations between them to allow each part to become more confident about what kind of part it is.', 'start': 318.156, 'duration': 9.527}, {'end': 335.472, 'text': "So if you're a circle and there's a triangle at the right relative position to be a nose, if you're a left eye,", 'start': 329.229, 'duration': 6.243}, {'end': 337.433, 'text': "then you get more confident that you're a left eye.", 'start': 335.472, 'duration': 1.961}, {'end': 340.455, 'text': "And that's what transformers are very good at.", 'start': 338.734, 'duration': 1.721}], 'summary': 'Transformers use spatial relations for confidence in part identification.', 'duration': 22.299, 'max_score': 318.156, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM318156.jpg'}, {'end': 484.478, 'src': 'embed', 'start': 456.246, 'weight': 3, 'content': [{'end': 467.695, 'text': 'We made it so instead of trying to learn by supervision, by giving it labels, it learned to create a whole that was good at reconstructing the parts.', 'start': 456.246, 'duration': 11.449}, {'end': 470.177, 'text': "And so that's unsupervised learning.", 'start': 468.916, 'duration': 1.261}, {'end': 476.562, 'text': 'At some point, you need to connect it to language.', 'start': 471.858, 'duration': 4.704}, {'end': 484.478, 'text': 'Yeah, so all the learning in Stacked Capsule Autoencoders, almost all the learning, is unsupervised.', 'start': 477.611, 'duration': 6.867}], 'summary': 'Unsupervised learning in stacked capsule autoencoders focuses on creating a good whole from parts.', 'duration': 28.232, 'max_score': 456.246, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM456246.jpg'}], 'start': 46.688, 'title': '3d object pose and capsule networks in computer vision', 'summary': 'Discusses an alternative to convolutional neural networks for solving the problem of object pose in a 3d world in computer vision, emphasizing the significance of the conversation with the guest.
it also highlights the evolution of capsule networks, from their resurgence at neurips and aaai conferences to the use of unsupervised learning and stacked capsule autoencoders, with a shift from supervised to unsupervised learning and the use of set transformers.', 'chapters': [{'end': 94.341, 'start': 46.688, 'title': '3d object pose in computer vision', 'summary': "Discusses an alternative to convolutional neural networks for solving the problem of object pose in a 3d world in computer vision, and highlights the significance of the conversation with the guest in the context of the past year's events and accomplishments.", 'duration': 47.653, 'highlights': ['An alternative to convolutional neural networks addressing the pose of objects in a 3D world in computer vision is discussed.', "The significance of the conversation with the guest in the context of the past year's events and accomplishments is highlighted.", "The challenges of teleconferencing and its relevance in today's world are mentioned."]}, {'end': 562.504, 'start': 95.402, 'title': 'Revolution of capsule networks', 'summary': 'Highlights the evolution of capsule networks, from their resurgence at neurips and aaai conferences to the use of unsupervised learning and stacked capsule autoencoders, emphasizing the recognition of whole objects by recognizing their parts and the relationships between them, with a shift from supervised to unsupervised learning and the use of set transformers.', 'duration': 467.102, 'highlights': ['The chapter emphasizes the evolution of capsule networks, from their resurgence at NeurIPS and AAAI conferences to the use of unsupervised learning and stacked capsule autoencoders. The resurgence of capsule networks at NeurIPS and AAAI conferences, the shift from supervised to unsupervised learning, and the introduction of stacked capsule autoencoders.', 'The chapter focuses on the recognition of whole objects by recognizing their parts and the relationships between them, with the use of set transformers. Capsules aim to recognize whole objects by recognizing parts and their relationships, using set transformers and emphasizing the importance of unsupervised learning.', 'The chapter discusses the shift from supervised to unsupervised learning and the challenges associated with recognizing parts and their relationships. The shift from supervised to unsupervised learning in capsules, the challenges of recognizing parts and relationships, and the use of unsupervised learning for better training.', 'The chapter explains the use of stacked capsule autoencoders and the concept of dynamic routing to rectify the issue of wrong votes in higher-level capsules. The use of stacked capsule autoencoders, the concept of dynamic routing to rectify wrong votes, and the challenges associated with rectifying wrong votes in higher-level capsules.', 'The chapter highlights the use of transformers to allow parts to interact with each other and become more confident about their identity, aiding in disambiguation and specific confident votes.
The use of transformers to aid parts in becoming more confident about their identity, disambiguation, and the generation of specific confident votes.']}], 'duration': 515.816, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM46688.jpg', 'highlights': ['An alternative to convolutional neural networks addressing the pose of objects in a 3D world in computer vision is discussed.', 'The chapter emphasizes the evolution of capsule networks, from their resurgence at NeurIPS and AAAI conferences to the use of unsupervised learning and stacked capsule autoencoders.', 'The chapter focuses on the recognition of whole objects by recognizing their parts and the relationships between them, with the use of set transformers.', 'The chapter discusses the shift from supervised to unsupervised learning and the challenges associated with recognizing parts and their relationships.', 'The chapter explains the use of stacked capsule autoencoders and the concept of dynamic routing to rectify the issue of wrong votes in higher-level capsules.', "The challenges of teleconferencing and its relevance in today's world are mentioned.", "The significance of the conversation with the guest in the context of the past year's events and accomplishments is highlighted.", 'The chapter highlights the use of transformers to allow parts to interact with each other and become more confident about their identity, aiding in disambiguation and specific confident votes.']}, {'end': 858.892, 'segs': [{'end': 622.693, 'src': 'embed', 'start': 564.308, 'weight': 0, 'content': [{'end': 581.824, 'text': 'Would this kind of unsupervised learning in a larger system also be able to make assumptions or inferences about relationships between objects or the laws of physics,', 'start': 564.308, 'duration': 17.516}, {'end': 582.365, 'text': 'for example?', 'start': 581.824, 'duration': 0.541}, {'end': 585.508, 'text': 'Those are two somewhat different questions.', 'start': 583.606, 'duration': 1.902}, {'end': 593.746, 'text': "In the long run, we'd like it to do that, but let's return to the laws of physics later on when we talk about SimCLR.", 'start': 588.382, 'duration': 5.364}, {'end': 601.931, 'text': 'For now, it recognizes objects by seeing parts in the correct relationships.', 'start': 595.487, 'duration': 6.444}, {'end': 606.014, 'text': 'And you recognize scenes by seeing objects in the correct relationships.', 'start': 602.711, 'duration': 3.303}, {'end': 610.136, 'text': 'In a scene, the relationships between objects are typically somewhat looser.', 'start': 607.034, 'duration': 3.102}, {'end': 613.078, 'text': 'But yes, it can do that.', 'start': 611.497, 'duration': 1.581}, {'end': 617.021, 'text': 'It can recognize that objects are related in the right way to make a particular kind of a scene.', 'start': 613.118, 'duration': 3.903}, {'end': 622.693, 'text': 'SimCLR came up later in the year.', 'start': 618.25, 'duration': 4.443}], 'summary': 'Unsupervised learning can recognize object and scene relationships, aims to infer laws of physics.', 'duration': 58.385, 'max_score': 564.308, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM564308.jpg'}, {'end': 712.676, 'src': 'embed', 'start': 678.188, 'weight': 1, 'content': [{'end':
688.161, 'text': "If you just say make them similar, that's easy.", 'start': 686.42, 'duration': 1.741}, {'end': 689.922, 'text': 'You just make all of the vectors be identical.', 'start': 688.181, 'duration': 1.741}, {'end': 696.606, 'text': 'The trick is you have to make them similar if they came from the same image and different if they came from different images.', 'start': 692.123, 'duration': 4.483}, {'end': 700.188, 'text': "And so that's called contrastive learning.", 'start': 698.227, 'duration': 1.961}, {'end': 712.676, 'text': 'And Ting Chen in the Google Lab in Toronto, with some help from others of us, made that work extremely well.', 'start': 701.749, 'duration': 10.927}], 'summary': 'Contrastive learning aims to make patterns similar within the same image and different across different images, demonstrated effectively by ting chen in google lab, toronto.', 'duration': 34.488, 'max_score': 678.188, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM678188.jpg'}, {'end': 831.143, 'src': 'embed', 'start': 804.415, 'weight': 2, 'content': [{'end': 814.339, 'text': "After you've done that, you then just directly learn to turn those representations with no extra hidden layers into class labels.", 'start': 804.415, 'duration': 9.924}, {'end': 818.279, 'text': "So that's called a linear classifier.", 'start': 816.178, 'duration': 2.101}, {'end': 819.679, 'text': "It doesn't have hidden ones in it.", 'start': 818.599, 'duration': 1.08}, {'end': 822.28, 'text': 'And it does remarkably well.', 'start': 820.72, 'duration': 1.56}, {'end': 831.143, 'text': "So a linear classifier based on those representations that we've got by pure unsupervised learning, with no knowledge of the labels, can do as well.", 'start': 822.6, 'duration': 8.543}], 'summary': 'Unsupervised learning produces linear classifier, achieving remarkable results.', 'duration': 26.728, 'max_score': 804.415, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM804415.jpg'}], 'start': 564.308, 'title': 'Unsupervised learning and contrastive learning', 'summary': "Explores unsupervised learning's ability to recognize relationships in objects and scenes, making assumptions about object relationships and laws of physics, and discusses simclr. 
additionally, it delves into the development of contrastive learning, emphasizing its application in representing image patches and its impressive performance on imagenet without labels.", 'chapters': [{'end': 622.693, 'start': 564.308, 'title': 'Unsupervised learning and recognizing relationships', 'summary': 'Discusses the ability of unsupervised learning to recognize objects and scenes based on correct relationships, while also aiming to make assumptions about relationships between objects and the laws of physics, with a mention of simclr.', 'duration': 58.385, 'highlights': ['Unsupervised learning recognizes objects by seeing parts in the correct relationships and scenes by seeing objects in the correct relationships.', 'It can recognize that objects are related in the right way to make a particular kind of a scene, with somewhat looser relationships between objects in scenes.', "In the long run, the goal is for unsupervised learning to make assumptions or inferences about relationships between objects or the laws of physics, with further discussion on SimCLR's role in this."]}, {'end': 858.892, 'start': 623.954, 'title': 'Contrastive learning for image representation', 'summary': 'Discusses the development of contrastive learning, emphasizing its application in representing image patches and subsequent use in unsupervised learning, leading to impressive performance on imagenet without the use of labels.', 'duration': 234.938, 'highlights': ['Contrastive learning aims to create similar vector representations for patches of the same image and different representations for patches of different images, leading to the development of unsupervised learning models. The development of contrastive learning focuses on creating similar vector representations for patches from the same image and different representations for patches from different images, ultimately leading to the creation of unsupervised learning models.', "Ting Chen's work on contrastive learning significantly improved its performance, leading to remarkable results in unsupervised learning and image recognition tasks. Ting Chen's work significantly improved the performance of contrastive learning, particularly in unsupervised learning and image recognition, demonstrating remarkable results.", 'The use of a linear classifier based on unsupervised learning representations achieved comparable performance to supervised methods on ImageNet, showcasing the effectiveness of the unsupervised approach. 
The application of a linear classifier based on unsupervised learning representations demonstrated comparable performance to supervised methods on ImageNet, highlighting the effectiveness of the unsupervised approach.']}], 'duration': 294.584, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM564308.jpg', 'highlights': ['Unsupervised learning recognizes objects by seeing parts in the correct relationships and scenes by seeing objects in the correct relationships.', 'Contrastive learning aims to create similar vector representations for patches of the same image and different representations for patches of different images, leading to the development of unsupervised learning models.', 'The use of a linear classifier based on unsupervised learning representations achieved comparable performance to supervised methods on ImageNet, showcasing the effectiveness of the unsupervised approach.', "In the long run, the goal is for unsupervised learning to make assumptions or inferences about relationships between objects or the laws of physics, with further discussion on SimCLR's role in this."]}, {'end': 1250.448, 'segs': [{'end': 886.446, 'src': 'embed', 'start': 859.993, 'weight': 0, 'content': [{'end': 866.34, 'text': 'And in that training, in one of the things I read, you talked about using augmented data.', 'start': 859.993, 'duration': 6.347}, {'end': 868.821, 'text': "Yes, it's very important when you do this.", 'start': 867, 'duration': 1.821}, {'end': 874.743, 'text': 'You can think of the two different crops as different ways of getting representations of the same image.', 'start': 869.641, 'duration': 5.102}, {'end': 877.523, 'text': "But that's the major thing you do.", 'start': 875.623, 'duration': 1.9}, {'end': 880.644, 'text': 'But you also have to do things like mess with the colour balance.', 'start': 877.964, 'duration': 2.68}, {'end': 886.446, 'text': 'So, for example, if I give you two different crops from the same image,', 'start': 881.765, 'duration': 4.681}], 'summary': 'Augmented data is important for getting representations of the same image and involves adjusting color balance.', 'duration': 26.453, 'max_score': 859.993, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM859993.jpg'}, {'end': 957.601, 'src': 'embed', 'start': 932.038, 'weight': 1, 'content': [{'end': 940.883, 'text': "As you're training on the data, you'll get an image, you'll take two different crops of the image, and then you will augment those crops.", 'start': 932.038, 'duration': 8.845}, {'end': 942.144, 'text': "You'll change the color balance.", 'start': 940.903, 'duration': 1.241}, {'end': 949.978, 'text': "Right. 
So you can't really think of it as modifying the data so much as, given an image,", 'start': 942.484, 'duration': 7.494}, {'end': 957.601, 'text': 'you then get these crops with modified color balance and you can modify all sorts of other things like orientation and stuff like that.', 'start': 949.978, 'duration': 7.623}], 'summary': 'Training involves getting an image, taking two crops, and augmenting with color balance and other modifications.', 'duration': 25.563, 'max_score': 932.038, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM932038.jpg'}, {'end': 1039.54, 'src': 'embed', 'start': 1008.198, 'weight': 3, 'content': [{'end': 1010.519, 'text': 'And you can use the same contrastive learning technique for that.', 'start': 1008.198, 'duration': 2.321}, {'end': 1021.705, 'text': 'Yeah. And then at AAAI, when you were talking on the stage with Yann and Yoshua Bengio,', 'start': 1011.175, 'duration': 10.53}, {'end': 1032.835, 'text': 'you talked about capsule networks as a form of unsupervised learning that has promise going forward.', 'start': 1021.705, 'duration': 11.13}, {'end': 1039.54, 'text': 'SimCLR is another method.', 'start': 1036.578, 'duration': 2.962}], 'summary': 'Contrastive learning and capsule networks discussed at aaai with yann and yoshua bengio.', 'duration': 31.342, 'max_score': 1008.198, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1008198.jpg'}, {'end': 1100.59, 'src': 'embed', 'start': 1073.813, 'weight': 5, 'content': [{'end': 1081.196, 'text': 'And you introduced this idea of NGRADs, Neural Gradient Representation by Activity Differences.', 'start': 1073.813, 'duration': 7.383}, {'end': 1087.839, 'text': 'Can you talk about that?
Neuroscientists have been very skeptical about whether the brain can do anything like backpropagation.', 'start': 1082.396, 'duration': 5.443}, {'end': 1093.385, 'text': 'Yeah, And one of the big problems has been how does the brain communicate gradients?', 'start': 1088.099, 'duration': 5.286}, {'end': 1100.59, 'text': 'Because in backpropagation you need to change your weight in proportion to the gradient of the error with respect to that weight,', 'start': 1094.326, 'duration': 6.264}], 'summary': "NGRADs, a neural representation concept, addresses brain's gradient communication, crucial for backpropagation.", 'duration': 26.777, 'max_score': 1073.813, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1073813.jpg'}, {'end': 1153.466, 'src': 'heatmap', 'start': 1117.005, 'weight': 0.81, 'content': [{'end': 1118.807, 'text': 'So you can represent both signs of error.', 'start': 1117.005, 'duration': 1.802}, {'end': 1128.675, 'text': 'And it also implies that the learning rule, which uses a gradient, is going to be something called spike time dependent plasticity.', 'start': 1119.527, 'duration': 9.148}, {'end': 1135.761, 'text': "That is, when you change your synapse strength, you're going to change it in proportion to the error derivative.", 'start': 1129.696, 'duration': 6.065}, {'end': 1144.743, 'text': "And that means you're going to want to change it in proportion to the rate of change of the postsynaptic activity.", 'start': 1137.2, 'duration': 7.543}, {'end': 1149.665, 'text': "It's going to be the presynaptic activity times the rate of change of the postsynaptic activity.", 'start': 1144.763, 'duration': 4.902}, {'end': 1153.466, 'text': "And that's called spike time-dependent plasticity, which they found in the brain.", 'start': 1150.445, 'duration': 3.021}], 'summary': 'Learning rule uses spike time-dependent plasticity to adjust synapse strength in proportion to error derivative and rate of change of postsynaptic activity.', 'duration': 36.461, 'max_score': 1117.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1117005.jpg'}, {'end': 1229.433, 'src': 'embed', 'start': 1206.826, 'weight': 2, 'content': [{'end': 1215.969, 'text': 'Most neural nets want to get a lot of knowledge represented in a modest number of parameters, like only a billion parameters, for example.', 'start': 1206.826, 'duration': 9.143}, {'end': 1218.269, 'text': "For a brain, that's a tiny number of parameters.", 'start': 1216.689, 'duration': 1.58}, {'end': 1222.111, 'text': "That's the number of parameters you have in a cubic millimetre of brain, roughly.", 'start': 1218.289, 'duration': 3.822}, {'end': 1229.433, 'text': "So we have trillions and trillions of parameters, but we don't have many training examples.", 'start': 1224.251, 'duration': 5.182}], 'summary': "Neural nets aim for a billion parameters, a tiny number for the brain's trillions.", 'duration': 22.607, 'max_score': 1206.826, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1206826.jpg'}], 'start': 859.993, 'title': 'Image recognition and unsupervised learning', 'summary': 'Covers the significance of augmented data in image recognition, emphasizing the use of different crops and color balance adjustments.
additionally, it discusses unsupervised learning methods like contrastive learning and capsule networks, and explores the potential use of backpropagation in brain activity, highlighting differences in parameter size and training examples between neural nets and the brain.', 'chapters': [{'end': 957.601, 'start': 859.993, 'title': 'Augmented data for image recognition', 'summary': 'Discusses the importance of using augmented data in image recognition, particularly focusing on the significance of using different crops and adjusting color balance to prevent cheating in the recognition process.', 'duration': 97.608, 'highlights': ['Using different crops and adjusting the color balance are the two most important aspects of data augmentation in image recognition.', 'Data augmentation involves modifying the color balance and other features of different crops to prevent the model from recognizing images based solely on color distribution.']}, {'end': 1250.448, 'start': 959.009, 'title': 'Unsupervised learning and brain activity', 'summary': 'Discusses unsupervised learning methods such as contrastive learning and capsule networks, along with the potential use of backpropagation in brain activity, highlighting the challenges and differences in parameter size and training examples between neural nets and the brain.', 'duration': 291.439, 'highlights': ['The brain deals with a different problem from most neural nets, as it has trillions of parameters but limited training examples, while neural nets have a modest number of parameters and ample training.', 'The chapter discusses the use of contrastive learning for videos and images, along with the potential combination of different unsupervised learning methods like capsule networks and SimCLR.', 'The concept of NGRADs, which represents an error by the rate of change of neural activity, is introduced, implying the use of spike time-dependent plasticity in brain learning.', 'The potential use of backpropagation in brain activity is discussed, highlighting the challenges and skepticism from neuroscientists.', 'The chapter addresses the limitations and challenges in implementing backpropagation in brain activity, emphasizing the difference in problem-solving between the brain and neural nets.']}], 'duration': 390.455, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM859993.jpg', 'highlights': ['Using different crops and adjusting the color balance are crucial for data augmentation in image recognition.', 'Data augmentation involves modifying color balance and other features of different crops to prevent the model from recognizing images based solely on color distribution.', 'The brain has trillions of parameters but limited training examples, while neural nets have a modest number of parameters and ample training.', 'The chapter discusses the use of contrastive learning for videos and images, along with the potential combination of different unsupervised learning methods like capsule networks and SimCLR.', 'The concept of NGRADs, representing an error by the rate of change of neural activity, is introduced, implying the use of spike time-dependent plasticity in brain learning.', 'The potential use of backpropagation in brain activity is discussed, highlighting the challenges and skepticism from neuroscientists.', 'The chapter addresses the limitations and challenges in implementing backpropagation in brain activity, emphasizing the difference in problem-solving between the brain and neural
nets.']}, {'end': 1997.912, 'segs': [{'end': 1308.11, 'src': 'embed', 'start': 1280.789, 'weight': 3, 'content': [{'end': 1284.271, 'text': 'So the idea is you have, say, some hierarchy of parts.', 'start': 1280.789, 'duration': 3.482}, {'end': 1292.535, 'text': 'You look at an image, you instantiate parts at different levels, and then from the high-level parts, you top-down predict the low-level parts.', 'start': 1285.231, 'duration': 7.304}, {'end': 1303.466, 'text': "And what you'd like to see is agreement between the top-down prediction, which depends on a larger context, and the bottom-up extraction of a part,", 'start': 1293.959, 'duration': 9.507}, {'end': 1304.927, 'text': 'which depends on a smaller context.', 'start': 1303.466, 'duration': 1.461}, {'end': 1308.11, 'text': 'So from some local region of the image, you extract a part.', 'start': 1305.608, 'duration': 2.502}], 'summary': 'The approach involves hierarchical part instantiation and top-down prediction for image analysis.', 'duration': 27.321, 'max_score': 1280.789, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1280789.jpg'}, {'end': 1401.345, 'src': 'embed', 'start': 1371.459, 'weight': 5, 'content': [{'end': 1372.721, 'text': 'I call it back relaxation.', 'start': 1371.459, 'duration': 1.262}, {'end': 1379.297, 'text': 'And over many time steps, it will get information backwards.', 'start': 1375.516, 'duration': 3.781}, {'end': 1382.058, 'text': "But it won't get information backwards in one trial.", 'start': 1379.737, 'duration': 2.321}, {'end': 1388.32, 'text': 'And back propagation sends information all the way backwards through a multi-layer net on a single presentation of an image.', 'start': 1382.558, 'duration': 5.762}, {'end': 1391.942, 'text': 'And back relaxation just gets it back one layer each time.', 'start': 1389.161, 'duration': 2.781}, {'end': 1396.243, 'text': 'And it needs multiple presentations of the same image to get it back all the way.', 'start': 1392.782, 'duration': 3.461}, {'end': 1401.345, 'text': 'So I got really interested in back relaxation.', 'start': 1399.004, 'duration': 2.341}], 'summary': 'Back relaxation gradually retrieves information backward, requiring multiple image presentations.', 'duration': 29.886, 'max_score': 1371.459, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1371459.jpg'}, {'end': 1449.755, 'src': 'embed', 'start': 1415.494, 'weight': 4, 'content': [{'end': 1421.819, 'text': 'The greedy bottom-up algorithm that I introduced in 2006 actually worked as well as this back relaxation.', 'start': 1415.494, 'duration': 6.325}, {'end': 1423.8, 'text': 'And that was a huge disappointment to me.', 'start': 1422.299, 'duration': 1.501}, {'end': 1428.924, 'text': 'I still want to go back and see if I can make back relaxation work better than greedy bottom-up.', 'start': 1425.201, 'duration': 3.723}, {'end': 1433.703, 'text': 'I see, and thus the June tweet that you had done.', 'start': 1430.14, 'duration': 3.563}, {'end': 1437.426, 'text': "It's when I discovered that back relaxation doesn't work any better than greedy bottom-up learning.", 'start': 1433.743, 'duration': 3.683}, {'end': 1449.755, 'text': "It's the assumption that the brain is so efficient that, even if greedy bottom-up can do it on its own, there wouldn't be this top-down function?", 'start': 1439.287, 'duration': 10.468}], 'summary': "In 2006, the newly introduced greedy bottom-up algorithm worked
as well as back relaxation, leading to disappointment. The discovery that back relaxation doesn't work better than greedy bottom-up learning raises questions about the efficiency of the brain.", 'duration': 34.261, 'max_score': 1415.494, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1415494.jpg'}, {'end': 1545.622, 'src': 'embed', 'start': 1494.383, 'weight': 1, 'content': [{'end': 1506.366, 'text': 'when we discovered that if you train stacks of autoencoders or restricted Boltzmann machines one hidden layer at a time and then you fine-tune it,', 'start': 1494.383, 'duration': 11.983}, {'end': 1507.046, 'text': 'it works very well.', 'start': 1506.366, 'duration': 0.68}, {'end': 1512.065, 'text': 'And that got neural nets going again.', 'start': 1508.983, 'duration': 3.082}, {'end': 1520.75, 'text': "People then did things like speech and vision on ImageNet where they would, they said, you don't need the pre-training.", 'start': 1512.625, 'duration': 8.125}, {'end': 1522.571, 'text': "You don't need to train these stacks of autoencoders.", 'start': 1520.79, 'duration': 1.781}, {'end': 1524.092, 'text': 'You can just train the whole thing supervised.', 'start': 1522.611, 'duration': 1.481}, {'end': 1526.694, 'text': 'And that was fine for a while.', 'start': 1525.593, 'duration': 1.101}, {'end': 1535.3, 'text': 'But then when they got even bigger data sets and even bigger networks, people have gone back to this unsupervised pre-training.', 'start': 1526.754, 'duration': 8.546}, {'end': 1536.44, 'text': "So that's what BERT's doing.", 'start': 1535.36, 'duration': 1.08}, {'end': 1537.96, 'text': "BERT's unsupervised pre-training.", 'start': 1536.48, 'duration': 1.48}, {'end': 1541.181, 'text': 'And GPT-3 uses unsupervised pre-training.', 'start': 1538.74, 'duration': 2.441}, {'end': 1545.622, 'text': 'And that is important now.', 'start': 1542.221, 'duration': 3.401}], 'summary': 'Unsupervised pre-training revived neural nets, used in gpt-3 and bert.', 'duration': 51.239, 'max_score': 1494.383, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1494383.jpg'}, {'end': 1836.315, 'src': 'embed', 'start': 1799.281, 'weight': 0, 'content': [{'end': 1805.167, 'text': "There's kind of a convergence between computer vision and natural language processing.", 'start': 1799.281, 'duration': 5.886}, {'end': 1816.338, 'text': "How do you see that convergence progressing?
And those are the two principal components of consciousness, if I'm not wrong.", 'start': 1806.128, 'duration': 10.21}, {'end': 1818.94, 'text': 'So I mean,', 'start': 1817.299, 'duration': 1.641}, {'end': 1836.315, 'text': "are we working toward an AI model that can perceive the world in a way that's closer to human perception, in that it blends?", 'start': 1818.94, 'duration': 17.375}], 'summary': 'Convergence of computer vision and nlp advancing towards ai model closer to human perception.', 'duration': 37.034, 'max_score': 1799.281, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1799281.jpg'}], 'start': 1252.189, 'title': 'Back relaxation and unsupervised pre-training', 'summary': 'Covers back relaxation and contrastive learning, proposing a brain learning algorithm different from backpropagation and the resurgence of unsupervised pre-training in neural networks since 2006, converging computer vision and natural language processing.', 'chapters': [{'end': 1467.273, 'start': 1252.189, 'title': 'Back relaxation and bottom-up learning', 'summary': 'Discusses the concept of back relaxation and contrastive learning as in simclr, proposing a learning algorithm for the brain that is somewhat different from backpropagation, which is not as efficient but easier to implement, and the disappointment of discovering that greedy bottom-up learning worked as well as back relaxation in a multilayer net.', 'duration': 215.084, 'highlights': ['The chapter discusses the concept of back relaxation, proposing a learning algorithm for the brain that is somewhat different from backpropagation, which is not as efficient but easier to implement. Back relaxation is a proposed learning algorithm for the brain, different from backpropagation, easier to implement, and requires multiple presentations of the same image to get information back all the way.', 'The disappointment of discovering that greedy bottom-up learning worked as well as back relaxation in a multilayer net. The greedy bottom-up algorithm introduced in 2006 worked as well as back relaxation, leading to a huge disappointment and the desire to make back relaxation work better than greedy bottom-up.', 'The idea of generating agreement between a top-down representation and a bottom-up representation in a hierarchy of parts by comparing a top-down prediction with a bottom-up extraction.
The concept involves instantiating parts at different levels in an image, top-down predicting low-level parts from high-level parts, and aiming for significant agreement between top-down prediction and bottom-up extraction, where they should agree on the same image but disagree on different images.']}, {'end': 1997.912, 'start': 1467.273, 'title': 'Unsupervised pre-training and capsule networks in ai', 'summary': 'Discusses the resurgence of unsupervised pre-training in neural networks, the significance of deep learning since 2006, the convergence of computer vision and natural language processing, and the motivation behind capsule networks to achieve representations more akin to human perception.', 'duration': 530.639, 'highlights': ['The significance of deep learning since 2006 The discovery of training stacks of autoencoders or restricted Boltzmann machines one hidden layer at a time and then fine-tuning it led to the resurgence of neural networks, proving effective in tasks such as speech and vision on ImageNet.', "The resurgence of unsupervised pre-training in neural networks While initially there was a shift towards supervised learning, the need for unsupervised learning has resurfaced, especially with bigger datasets and networks, as seen in the case of BERT's unsupervised pre-training and GPT-3's use of unsupervised pre-training.", 'Motivation behind capsule networks to achieve representations more akin to human perception Capsule networks aim to provide representations similar to human perception, incorporating multiple ways of perceiving objects and imposing frames of reference, distinguishing them from convolutional nets and enabling neural nets to align more with human representation of objects.', 'The convergence of computer vision and natural language processing The rise of transformers in models like GPT-3 and capsule networks signifies a convergence between computer vision and natural language processing, potentially leading to AI models that perceive the world closer to human perception.']}], 'duration': 745.723, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1252189.jpg', 'highlights': ['The convergence of computer vision and natural language processing, as seen in the rise of transformers in models like GPT-3 and capsule networks, signifies a potential shift towards AI models that perceive the world closer to human perception.', "The resurgence of unsupervised pre-training in neural networks, especially with bigger datasets and networks, as seen in the case of BERT's unsupervised pre-training and GPT-3's use of unsupervised pre-training, has proven effective in tasks such as speech and vision on ImageNet.", 'The significance of deep learning since 2006, with the discovery of training stacks of autoencoders or restricted Boltzmann machines one hidden layer at a time and then fine-tuning it, has led to the resurgence of neural networks, proving effective in tasks such as speech and vision on ImageNet.', 'The idea of generating agreement between a top-down representation and a bottom-up representation in a hierarchy of parts by comparing a top-down prediction with a bottom-up extraction involves instantiating parts at different levels in an image, top-down predicting low-level parts from high-level parts, and aiming for significant agreement between top-down prediction and bottom-up extraction.', 'The discovery that greedy bottom-up learning worked as well as back relaxation in a multilayer
net, leading to a huge disappointment and the desire to make back relaxation work better than greedy bottom-up.', 'The concept of back relaxation, proposing a learning algorithm for the brain that is somewhat different from backpropagation, which is not as efficient but easier to implement, requires multiple presentations of the same image to get information back all the way.']}, {'end': 2927.248, 'segs': [{'end': 2049.221, 'src': 'embed', 'start': 2025.643, 'weight': 0, 'content': [{'end': 2033.971, 'text': "So what it's showing is that the kind of interactions between parts that works so well in things like BERT for words,", 'start': 2025.643, 'duration': 8.328}, {'end': 2040.417, 'text': "where you're getting word fragments to interact, also works when you're getting representations of patches of images to interact.", 'start': 2033.971, 'duration': 6.446}, {'end': 2049.221, 'text': "And it's also what's happening in stacked capsule autoencoders, where we have a set transformer.", 'start': 2041.597, 'duration': 7.624}], 'summary': 'Interactions between image patches work like bert for words.', 'duration': 23.578, 'max_score': 2025.643, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM2025643.jpg'}, {'end': 2245.407, 'src': 'heatmap', 'start': 2172.055, 'weight': 1, 'content': [{'end': 2182.278, 'text': "Basically, it's taking all this information in this data it's observed and it's boiling it down into these parameters that allow it to produce similar stuff,", 'start': 2172.055, 'duration': 10.223}, {'end': 2184.559, 'text': 'but not by matching to particular instances.', 'start': 2182.278, 'duration': 2.281}, {'end': 2185.219, 'text': "it's seen already.", 'start': 2184.559, 'duration': 0.66}, {'end': 2192.543, 'text': 'And in the same way, capsule networks are creating new representations.', 'start': 2186.279, 'duration': 6.264}, {'end': 2197.767, 'text': 'Yes And capsule networks should be able to deal with a new view of the same object.', 'start': 2193.144, 'duration': 4.623}, {'end': 2201.289, 'text': 'Yeah, So where is your research going now?', 'start': 2198.347, 'duration': 2.942}, {'end': 2204.411, 'text': 'I mean on these three streams, or are there other streams?', 'start': 2201.369, 'duration': 3.042}, {'end': 2209.714, 'text': "My main interest has always been unsupervised learning, because I think that's what most human learning is.", 'start': 2204.511, 'duration': 5.203}, {'end': 2215.572, 'text': "I'm interested in developing capsules further and in things like SimCLR.", 'start': 2210.848, 'duration': 4.724}, {'end': 2219.094, 'text': "I'm also interested in making distillation work better.", 'start': 2216.032, 'duration': 3.062}, {'end': 2231.804, 'text': "So the idea of distillation is you have a great big model and you've trained it on data and it's extracted the regular patterns in the data and got them into its parameters.", 'start': 2220.055, 'duration': 11.749}, {'end': 2236.635, 'text': 'And now you want to train a much smaller model.', 'start': 2233.331, 'duration': 3.304}, {'end': 2242.563, 'text': "that would be as good as the big model, or almost as good as a big model, but you couldn't have trained directly on the data.", 'start': 2236.635, 'duration': 5.928}, {'end': 2245.407, 'text': 'And so we see this all over the place.', 'start': 2243.885, 'duration': 1.522}], 'summary': 'Research focuses on unsupervised learning, capsule networks, and distillation for model compression.', 'duration':
25.352, 'max_score': 2172.055, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM2172055.jpg'}, {'end': 2300.227, 'src': 'embed', 'start': 2273.678, 'weight': 2, 'content': [{'end': 2283.237, 'text': 'And then it basically gets turned into a soup, and out of that soup, you build the adult, which may look nothing like the larva.', 'start': 2273.678, 'duration': 9.559}, {'end': 2286.419, 'text': 'I mean, a caterpillar and a butterfly are very different things.', 'start': 2284.037, 'duration': 2.382}, {'end': 2290.061, 'text': "And they're optimized for different things.", 'start': 2287.96, 'duration': 2.101}, {'end': 2293.983, 'text': 'So the larva is optimized for sucking nutrients out of the environment.', 'start': 2290.101, 'duration': 3.882}, {'end': 2300.227, 'text': 'And then the butterfly is optimized for traveling around and mating.', 'start': 2295.484, 'duration': 4.743}], 'summary': 'Larva optimized for nutrient extraction, butterfly for travel and mating.', 'duration': 26.549, 'max_score': 2273.678, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM2273678.jpg'}, {'end': 2539.149, 'src': 'embed', 'start': 2504.644, 'weight': 3, 'content': [{'end': 2509.867, 'text': "So this idea of contrastive representation learning seems to be very powerful, and Yann's exploiting it.", 'start': 2504.644, 'duration': 5.223}, {'end': 2517.35, 'text': "Ting Chen made it work really well for static images, and we're now trying to extend that to video.", 'start': 2511.307, 'duration': 6.043}, {'end': 2522.332, 'text': "But we're trying to extend it using attention, which is going to be very important for video,", 'start': 2518.19, 'duration': 4.142}, {'end': 2525.673, 'text': "because you can't possibly process everything in a video at a high resolution.", 'start': 2522.332, 'duration': 3.341}, {'end': 2528.252, 'text': 'Yeah, yeah.', 'start': 2526.67, 'duration': 1.582}, {'end': 2539.149, 'text': "And that's interesting when you relate this machine learning to learning in the brain, and certainly attention is critical.", 'start': 2528.453, 'duration': 10.696}], 'summary': 'Contrastive representation learning being extended to video using attention for efficient processing.', 'duration': 34.505, 'max_score': 2504.644, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM2504644.jpg'}, {'end': 2630.165, 'src': 'embed', 'start': 2601.29, 'weight': 4, 'content': [{'end': 2603.271, 'text': 'I mean, obviously, you want to connect to language.', 'start': 2601.29, 'duration': 1.981}, {'end': 2610.575, 'text': "There's very nice work going on at Google now in robotics where they're using deep learning for", 'start': 2603.671, 'duration': 6.904}, {'end': 2617.316, 'text': "getting robot arms to do things, to manipulate things, but they're also interfacing it with language.", 'start': 2611.892, 'duration': 5.424}, {'end': 2630.165, 'text': "So Pierre Sermanet and Vincent Vanhoucke and others now have things where you can tell a robot what to do and the robot can also tell you what it's doing.", 'start': 2619.177, 'duration': 10.988}], 'summary': 'Google is using deep learning in robotics to connect language and manipulate robot arms.', 'duration': 28.875, 'max_score': 2601.29, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM2601290.jpg'}, {'end': 2807.598, 'src': 'embed', 'start': 2753.872,
'weight': 5, 'content': [{'end': 2762.045, 'text': "So if you're just taking images or videos and just passively processing them, it doesn't make you think about attention.", 'start': 2753.872, 'duration': 8.173}, {'end': 2766.066, 'text': "But as soon as you have a robot that's moving around in the world, it's got to decide what to look at.", 'start': 2762.605, 'duration': 3.461}, {'end': 2770.568, 'text': 'And the sort of primary question in vision is where should I look next?', 'start': 2766.906, 'duration': 3.662}, {'end': 2776.17, 'text': "And that's been sort of widely ignored by people who just process static images.", 'start': 2771.768, 'duration': 4.402}, {'end': 2781.432, 'text': "Attention is crucial when it's sort of central to how human vision works.", 'start': 2777.41, 'duration': 4.022}, {'end': 2784.551, 'text': 'Can you sum up a little bit?', 'start': 2783.049, 'duration': 1.502}, {'end': 2795.146, 'text': 'Everyone likes to hear about convergence of all of these things: convergence of computer vision with natural language processing,', 'start': 2784.551, 'duration': 10.595}, {'end': 2801.513, 'text': 'convergence of unsupervised learning with supervised learning and reinforcement learning.', 'start': 2795.146, 'duration': 6.367}, {'end': 2807.598, 'text': "Is that beyond what you're really focused on?", 'start': 2803.394, 'duration': 4.204}], 'summary': "Robot's active vision requires attention to decide what to look at next, crucial in human vision. Convergence of computer vision, nlp, unsupervised, supervised, and reinforcement learning is emphasized.", 'duration': 53.726, 'max_score': 2753.872, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM2753872.jpg'}, {'end': 2878.8, 'src': 'embed', 'start': 2851.429, 'weight': 7, 'content': [{'end': 2856.671, 'text': 'the auditory thing gives you the word cow and the visual thing gives you whatever your visual representation of a cow is.', 'start': 2851.429, 'duration': 5.242}, {'end': 2857.791, 'text': 'then you learn they go together.', 'start': 2856.671, 'duration': 1.12}, {'end': 2865.615, 'text': "But actually supervision, when you actually get it in reality, it's just another correlation.", 'start': 2859.753, 'duration': 5.862}, {'end': 2873.618, 'text': "So it's all about complex correlations in the sensory input, both supervised and unsupervised learning.", 'start': 2866.836, 'duration': 6.782}, {'end': 2878.8, 'text': "And then there's correlations with payoffs, and that's reinforcement learning.", 'start': 2875.019, 'duration': 3.781}], 'summary': 'Auditory and visual inputs form complex correlations in supervised and unsupervised learning, leading to reinforcement learning with payoffs.', 'duration': 27.371, 'max_score': 2851.429, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM2851429.jpg'}, {'end': 2889.004, 'src': 'heatmap', 'start': 2851.429, 'weight': 1, 'content': [{'end': 2856.671, 'text': 'the auditory thing gives you the word cow and the visual thing gives you whatever your visual representation of a cow is.', 'start': 2851.429, 'duration': 5.242}, {'end': 2857.791, 'text': 'then you learn they go together.', 'start': 2856.671, 'duration': 1.12}, {'end': 2865.615, 'text': "But actually supervision, when you actually get it in reality, it's just another correlation.", 'start': 2859.753, 'duration': 5.862}, {'end': 2873.618, 'text': "So it's all about complex correlations in the sensory input, both
supervised and unsupervised learning.", 'start': 2866.836, 'duration': 6.782}, {'end': 2878.8, 'text': "And then there's correlations with payoffs, and that's reinforcement learning.", 'start': 2875.019, 'duration': 3.781}, {'end': 2887.383, 'text': "But I think the correlations with payoffs don't have enough structure in them for you to do most of the learning.", 'start': 2881.061, 'duration': 6.322}, {'end': 2889.004, 'text': 'So most of the learning is unsupervised.', 'start': 2887.423, 'duration': 1.581}], 'summary': 'Learning involves complex correlations in sensory input, both supervised and unsupervised, with reinforcement learning playing a role in correlations with payoffs.', 'duration': 37.575, 'max_score': 2851.429, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM2851429.jpg'}], 'start': 1998.813, 'title': 'Advances in machine learning', 'summary': "Discusses transformers in image recognition, stacked capsule autoencoders, insects' life cycle paralleling data distillation, unsupervised learning in video processing, and the significance of action in perception.", 'chapters': [{'end': 2245.407, 'start': 1998.813, 'title': 'Transformers in image recognition', 'summary': 'Discusses the use of transformers in image recognition and stacked capsule autoencoders, highlighting the trend of interactions between parts and the interest in unsupervised learning and distillation.', 'duration': 246.594, 'highlights': ['The trend of interactions between parts, shown in the use of transformers in image recognition and stacked capsule autoencoders, is a promising approach to building layers of representation.', 'The main interest of the researcher is in unsupervised learning, specifically in developing capsules further and making distillation work better.', "The paper 'Transformers for Image Recognition at Scale' discusses the interaction of representations of patches of images using transformers, and the training process involves supervised classification with 16 by 16 patches.", 'The concept of distillation involves training a much smaller model to be as good as a big model, which has extracted regular patterns in the data and got them into its parameters.']}, {'end': 2417.234, 'start': 2245.447, 'title': 'Insects and data distillation', 'summary': 'Discusses the life cycle of insects, drawing parallels to data mining, and emphasizes the concept of using large models to train smaller ones, likening it to an apprenticeship in science.', 'duration': 171.787, 'highlights': ['Insects have a life cycle consisting of a larva stage for extracting nutrients and an adult stage for mating and traveling. Insects have a two-stage life cycle, with the larva stage dedicated to extracting nutrients and the adult stage optimized for mating and traveling.', 'Large models are used to train smaller models in data distillation, akin to the way scientists teach school kids after conducting research. The concept of using large models to train smaller ones in data distillation is compared to scientists teaching school kids after conducting research, emphasizing the effectiveness of this approach.', 'The larva stage of insects is optimized for nutrient extraction, while the adult stage is optimized for other activities.
The larva stage of insects is optimized for extracting nutrients, while the adult stage is optimized for activities such as mating and traveling.']}, {'end': 2734.361, 'start': 2417.234, 'title': 'Unsupervised learning and video processing', 'summary': 'Discusses the convergence of unsupervised learning methods in video processing, emphasizing the importance of contrastive representation learning and attention mechanisms. it also explores the intersection of machine learning with human learning and the significance of language interface in robotics.', 'duration': 317.127, 'highlights': ['The convergence of unsupervised learning methods in video processing, emphasizing contrastive representation learning and attention mechanisms The chapter discusses the convergence of unsupervised learning methods in video processing, highlighting the importance of contrastive representation learning and attention mechanisms for processing video data.', "The significance of language interface in robotics and machine learning's intersection with human learning The chapter explores the significance of language interface in robotics and the intersection of machine learning with human learning, emphasizing the importance of language interface in enabling robots to understand and communicate actions.", 'The role of unsupervised learning in understanding the laws of physics and learning through observation The chapter discusses the role of unsupervised learning in understanding the laws of physics and emphasizes the learning process through observation, highlighting the acquisition of skills such as throwing a basketball through trial and error and understanding the world without explicit language instruction.']}, {'end': 2927.248, 'start': 2735.58, 'title': 'Importance of action in perception', 'summary': 'Discusses the significance of active engagement in understanding the world, emphasizing the role of attention in vision and the convergence of computer vision with natural language processing and unsupervised learning with supervised learning and reinforcement learning.', 'duration': 191.668, 'highlights': ['The importance of active engagement in understanding the world is emphasized, with a focus on the role of attention in vision for robots and the need to decide where to look next. Emphasizes the significance of active engagement in understanding the world and the need for attention in vision for robots.', 'The convergence of computer vision with natural language processing and unsupervised learning with supervised learning and reinforcement learning is discussed. Discusses the convergence of computer vision with natural language processing and unsupervised learning with supervised learning and reinforcement learning.', 'The distinction between supervised learning and unsupervised learning is explained, highlighting the concept of complex correlations in sensory input for both types of learning. 
'duration': 928.435, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/N0ER1MC9cqM/pics/N0ER1MC9cqM1998813.jpg', 'highlights': ['The trend of interactions between parts, shown in the use of transformers in image recognition and stack capsule autoencoders, is a promising approach to building layers of representation.', 'The concept of distillation involves training a much smaller model to be as good as a big model, which has extracted regular patterns in the data and got them into its parameters.', 'The larva stage of insects is optimized for extracting nutrients, while the adult stage is optimized for activities such as mating and traveling.', 'The chapter discusses the convergence of unsupervised learning methods in video processing, highlighting the importance of contrastive representation learning and attention mechanisms for processing video data.', 'The chapter explores the significance of a language interface in robotics and the intersection of machine learning with human learning, emphasizing how a language interface enables robots to understand and communicate actions.', 'Emphasizes the significance of active engagement in understanding the world and the need for attention in vision for robots.', 'Discusses the convergence of computer vision with natural language processing and of unsupervised learning with supervised learning and reinforcement learning.', 'Explains the distinction between supervised learning and unsupervised learning, emphasizing the concept of complex correlations in sensory input for both types of learning.']}], 'highlights': ['Hinton introduced capsule networks a year before sharing the 2018 Turing Award.', 'The convergence of computer vision and natural language processing, as seen in the rise of transformers in models like GPT-3 and capsule networks, signifies a potential shift towards AI models that perceive the world closer to human perception.', 'The chapter emphasizes the evolution of capsule networks, from their resurgence at NeurIPS and AAAI conferences to the use of unsupervised learning and stack capsule autoencoders.', 'Unsupervised learning recognizes objects by seeing parts in the correct relationships and scenes by seeing objects in the correct relationships.', 'Contrastive learning aims to create similar vector representations for patches of the same image and different representations for patches of different images, leading to the development of unsupervised learning models (sketched in code below).', 'Using different crops and adjusting the color balance are crucial for data augmentation in image recognition (sketched in code below).', 'The trend of interactions between parts, shown in the use of transformers in image recognition and stack capsule autoencoders, is a promising approach to building layers of representation.', 'The larva stage of insects is optimized for extracting nutrients, while the adult stage is optimized for activities such as mating and traveling.']}
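The discussion treats attention as the primary question in vision ("where should I look next?"). As a concrete reference point, here is a minimal sketch of the scaled dot-product attention that transformers use to make that soft choice over inputs; the function name and shapes are illustrative and not from the interview.

```python
# Minimal sketch of the attention operation that lets a model softly decide
# "where to look next": each query assigns weights over the inputs and takes
# a weighted average. Standard scaled dot-product attention; shapes are
# illustrative.
import math
import torch

def attention(q, k, v):
    # q: (B, Nq, D) queries; k, v: (B, Nk, D) keys and values.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = scores.softmax(dim=-1)   # soft choice of where to look
    return weights @ v                 # attended summary of the inputs
```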
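The 'Transformers for Image Recognition at Scale' highlight describes cutting images into 16 by 16 patches whose representations interact through a transformer, trained by supervised classification. A minimal PyTorch sketch of that idea follows; the patch size 16 comes from the text, while the embedding width, depth, head count, and class count are illustrative placeholders rather than the paper's actual configuration.

```python
# Minimal sketch: split an image into 16x16 patches and let a transformer
# encoder model the interactions between patch representations, in the
# spirit of ViT ("Transformers for Image Recognition at Scale").
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, embed_dim=192,
                 depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided conv is the usual trick for embedding non-overlapping patches.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        p = self.patch_embed(x)                # (B, D, H/16, W/16)
        p = p.flatten(2).transpose(1, 2)       # (B, N, D) sequence of patches
        cls = self.cls_token.expand(x.size(0), -1, -1)
        h = torch.cat([cls, p], dim=1) + self.pos_embed
        h = self.encoder(h)                    # patches interact via attention
        return self.head(h[:, 0])              # classify from the CLS token
```

Training this with an ordinary cross-entropy loss on labeled images matches the "supervised classification" described in the highlight.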
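Distillation, as described above, trains a much smaller model to be as good as a big model that has already got the regular patterns of the data into its parameters. A minimal sketch using the usual soft-target recipe (temperature-scaled teacher probabilities mixed with ordinary cross-entropy); the temperature and mixing weight are illustrative defaults, not values from the interview.

```python
# Minimal sketch of distillation: a small "student" is trained to match the
# temperature-softened outputs of a big "teacher", plus the true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: KL divergence to the teacher's temperature-scaled
    # distribution; the T*T factor keeps gradient magnitudes comparable.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```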
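The contrastive-learning highlight says representations of patches (or crops) from the same image should be similar, and different for patches from different images. Here is a minimal InfoNCE-style sketch of that objective, assuming some encoder has already produced one embedding per view; the temperature value is an illustrative placeholder.

```python
# Minimal sketch of the contrastive objective: embeddings of two views of
# the SAME image are pulled together, embeddings from DIFFERENT images in
# the batch are pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: (B, D) embeddings of two augmented views of the same B images.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                     # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives on diagonal
    # Cross-entropy in both directions: each view must pick its partner.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```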
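Finally, the data-augmentation highlight singles out different crops and color-balance adjustments. A minimal torchvision sketch producing two randomized views of one image, e.g. as input to the contrastive loss above; the jitter strengths are illustrative, not values stated in the interview.

```python
# Minimal sketch of the augmentations named in the discussion: random crops
# plus color-balance jitter, yielding differently-augmented "views".
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                  # different crops
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),    # color balance
    transforms.ToTensor(),
])

# Two independent draws give two views of the same PIL image `img`:
# view1, view2 = augment(img), augment(img)
```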