title
Stanford CS230: Deep Learning | Autumn 2018 | Lecture 2 - Deep Learning Intuition

description
Andrew Ng, Adjunct Professor & Kian Katanforoosh, Lecturer - Stanford University https://stanford.io/3eJW8yT Andrew Ng Adjunct Professor, Computer Science Kian Katanforoosh Lecturer, Computer Science To follow along with the course schedule and syllabus, visit: http://cs230.stanford.edu/

detail
{'title': 'Stanford CS230: Deep Learning | Autumn 2018 | Lecture 2 - Deep Learning Intuition', 'heatmap': [{'end': 404.092, 'start': 243.477, 'weight': 0.726}, {'end': 599.653, 'start': 493.035, 'weight': 0.769}, {'end': 747.343, 'start': 693.584, 'weight': 0.716}, {'end': 1341.228, 'start': 1282.24, 'weight': 0.833}, {'end': 2089.641, 'start': 2033.26, 'weight': 0.766}, {'end': 2344.803, 'start': 2282.776, 'weight': 0.952}, {'end': 3332.84, 'start': 3275.335, 'weight': 1}, {'end': 3576.418, 'start': 3524.674, 'weight': 0.724}], 'summary': 'The lecture provides a systematic approach for deep learning projects, covering neural network encoding, image classification, face verification considerations, image comparison, advanced image processing techniques, style transfer, and speech recognition implementation, emphasizing the importance of dataset size and architecture selection for successful project implementation.', 'chapters': [{'end': 566.663, 'segs': [{'end': 62.023, 'src': 'embed', 'start': 34.386, 'weight': 0, 'content': [{'end': 37.548, 'text': "Let's start the lecture while you guys are doing that.", 'start': 34.386, 'duration': 3.162}, {'end': 51.116, 'text': "Okay, So today's lecture is going to be about deep learning, intuition, and the goal is to give you a systematic way to think about projects,", 'start': 40.75, 'duration': 10.366}, {'end': 52.337, 'text': 'everything related to deep learning.', 'start': 51.116, 'duration': 1.221}, {'end': 58.801, 'text': 'It includes how to collect your data, how to label your data, how to choose an architecture,', 'start': 52.797, 'duration': 6.004}, {'end': 62.023, 'text': 'but also how to design a proper loss function to optimize.', 'start': 58.801, 'duration': 3.222}], 'summary': 'Lecture on deep learning, covering data collection, labeling, architecture, and loss function design.', 'duration': 27.637, 'max_score': 34.386, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU34386.jpg'}, {'end': 200.685, 'src': 'embed', 'start': 170.062, 'weight': 1, 'content': [{'end': 173.663, 'text': 'Uh, the function used to do so is called the loss function.', 'start': 170.062, 'duration': 3.601}, {'end': 177.444, 'text': "You've seen an example of a loss function this week that is a logistic loss function.", 'start': 173.943, 'duration': 3.501}, {'end': 180.664, 'text': 'Uh, we will see more loss functions, uh, later on.', 'start': 178.164, 'duration': 2.5}, {'end': 190.527, 'text': 'Uh, computing the gradient of this loss function is going to tell you how much should I move my parameters in order to update uh in or- in order to make the loss go down.', 'start': 181.565, 'duration': 8.962}, {'end': 194.508, 'text': 'So in order to make this function recognize cats better than before.', 'start': 191.127, 'duration': 3.381}, {'end': 200.685, 'text': 'You do that many, many times until you find the right parameters to plug in your architecture.', 'start': 195.376, 'duration': 5.309}], 'summary': 'Loss function calculates parameter updates to minimize loss for cat recognition.', 'duration': 30.623, 'max_score': 170.062, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU170062.jpg'}, {'end': 404.092, 'src': 'heatmap', 'start': 243.477, 'weight': 0.726, 'content': [{'end': 250.963, 'text': 'I think the loss function is something that, that people struggle with to understand what loss function to, to choose, uh, for a specific project.', 'start': 243.477, 'duration': 7.486}, {'end': 253.566, 'text': "And we're going to put a huge emphasis on that today.", 'start': 251.003, 'duration': 2.563}, {'end': 260.231, 'text': 'Okay And of course, in the architecture, you can change the activation functions.', 'start': 254.867, 'duration': 5.364}, {'end': 262.932, 'text': 'In this optimization loop, you can choose a specific optimizers.', 'start': 260.331, 'duration': 2.601}, {'end': 271.138, 'text': "We're going to see in about three weeks all the optimizers that can be Adam, stochastic, gradient descent, batch, gradient descent, RMS,", 'start': 263.373, 'duration': 7.765}, {'end': 271.999, 'text': 'prop and momentum.', 'start': 271.138, 'duration': 0.861}, {'end': 274.56, 'text': 'And finally, all the hyper-parameters.', 'start': 272.959, 'duration': 1.601}, {'end': 276.161, 'text': 'What is the learning rate of this loop?', 'start': 274.62, 'duration': 1.541}, {'end': 278.383, 'text': "What is the batch that I'm using for my optimization?", 'start': 276.322, 'duration': 2.061}, {'end': 282.526, 'text': "We're going to see all that together, but there's a bunch of things that can change in this scheme.", 'start': 278.423, 'duration': 4.103}, {'end': 290.319, 'text': 'Any questions on that in general? So far so good? Okay.', 'start': 284.067, 'duration': 6.252}, {'end': 296.304, 'text': "So let's take the first architecture that we've seen together, logistic regression.", 'start': 292.361, 'duration': 3.943}, {'end': 301.769, 'text': 'As you know, an image in computer science can be represented by a 3D matrix.', 'start': 297.085, 'duration': 4.684}, {'end': 307.093, 'text': 'Each matrix represent a certain color, RGB, red, green, blue.', 'start': 302.549, 'duration': 4.544}, {'end': 311.557, 'text': 'We can take all these numbers from these 3D metrics and put it in a vector.', 'start': 307.914, 'duration': 3.643}, {'end': 312.418, 'text': 'We flatten it.', 'start': 311.837, 'duration': 0.581}, {'end': 314.859, 'text': 'in order to give it to our logistic regression.', 'start': 313.078, 'duration': 1.781}, {'end': 323.022, 'text': 'We forward propagate it, we multiply it by w which is our parameter and b which is our bias, give it to a sigmoid function, get an output.', 'start': 315.299, 'duration': 7.723}, {'end': 330.026, 'text': 'If the network is trained properly, we should get a number that is more than 0.5 here to tell us that there is a cat in this image.', 'start': 323.463, 'duration': 6.563}, {'end': 332.106, 'text': 'So this is the basic scheme.', 'start': 330.866, 'duration': 1.24}, {'end': 342.887, 'text': 'Now, uh, my question for you is, if I want to do the same thing but, uh, I want to have a classifier that can classify several animals.', 'start': 333.447, 'duration': 9.44}, {'end': 347.771, 'text': 'So, in the image, there could be a giraffe, there could be an elephant, or there could be a cat.', 'start': 343.447, 'duration': 4.324}, {'end': 365.063, 'text': 'How would you modify this architecture? Yes.', 'start': 348.512, 'duration': 16.551}, {'end': 365.824, 'text': 'Yes, exactly.', 'start': 365.063, 'duration': 0.761}, {'end': 366.584, 'text': "So that's a good point.", 'start': 365.864, 'duration': 0.72}, {'end': 367.805, 'text': 'We could add several units.', 'start': 366.624, 'duration': 1.181}, {'end': 373.27, 'text': 'So several neurons, one for each animal and we will call it multi-logistic regression.', 'start': 368.006, 'duration': 5.264}, {'end': 374.832, 'text': 'So it could be something like that.', 'start': 373.791, 'duration': 1.041}, {'end': 378.055, 'text': 'So we have a fully connection here.', 'start': 375.973, 'duration': 2.082}, {'end': 382.599, 'text': 'Before we were all, all the inputs were connected to this neuron and now we added two neurons.', 'start': 378.555, 'duration': 4.044}, {'end': 386.482, 'text': 'And each neuron is going to be responsible for one animal.', 'start': 383.019, 'duration': 3.463}, {'end': 404.092, 'text': 'How do we know which neuron is responsible for which animal? Is the network going to figure it out on its own or do we have to help it? Exactly.', 'start': 386.702, 'duration': 17.39}], 'summary': 'The transcript covers topics such as loss function, activation functions, optimizers, hyper-parameters, and modifying architecture for multi-class classification.', 'duration': 160.615, 'max_score': 243.477, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU243477.jpg'}, {'end': 301.769, 'src': 'embed', 'start': 254.867, 'weight': 2, 'content': [{'end': 260.231, 'text': 'Okay And of course, in the architecture, you can change the activation functions.', 'start': 254.867, 'duration': 5.364}, {'end': 262.932, 'text': 'In this optimization loop, you can choose a specific optimizers.', 'start': 260.331, 'duration': 2.601}, {'end': 271.138, 'text': "We're going to see in about three weeks all the optimizers that can be Adam, stochastic, gradient descent, batch, gradient descent, RMS,", 'start': 263.373, 'duration': 7.765}, {'end': 271.999, 'text': 'prop and momentum.', 'start': 271.138, 'duration': 0.861}, {'end': 274.56, 'text': 'And finally, all the hyper-parameters.', 'start': 272.959, 'duration': 1.601}, {'end': 276.161, 'text': 'What is the learning rate of this loop?', 'start': 274.62, 'duration': 1.541}, {'end': 278.383, 'text': "What is the batch that I'm using for my optimization?", 'start': 276.322, 'duration': 2.061}, {'end': 282.526, 'text': "We're going to see all that together, but there's a bunch of things that can change in this scheme.", 'start': 278.423, 'duration': 4.103}, {'end': 290.319, 'text': 'Any questions on that in general? So far so good? Okay.', 'start': 284.067, 'duration': 6.252}, {'end': 296.304, 'text': "So let's take the first architecture that we've seen together, logistic regression.", 'start': 292.361, 'duration': 3.943}, {'end': 301.769, 'text': 'As you know, an image in computer science can be represented by a 3D matrix.', 'start': 297.085, 'duration': 4.684}], 'summary': 'The transcript covers the ability to change activation functions, choose specific optimizers, and adjust hyper-parameters in the context of machine learning.', 'duration': 46.902, 'max_score': 254.867, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU254867.jpg'}], 'start': 6.499, 'title': 'Deep learning project approach', 'summary': 'Provides a high-level overview for deep learning projects, emphasizing systematic thinking in data collection, architecture selection, and loss function design. it also discusses the flexibility in neural network architecture and the importance of choosing the right loss function, particularly focusing on multi-logistic regression and label encoding methods for image classification.', 'chapters': [{'end': 234.369, 'start': 6.499, 'title': 'Deep learning intuition', 'summary': 'Discusses the systematic way to think about projects related to deep learning, including how to collect and label data, choose an architecture, and design a proper loss function, aiming to provide a high-level overview and good start for projects.', 'duration': 227.87, 'highlights': ['Deep learning involves modeling a function that takes an input and gives an output, with the architecture being the design and the parameters being the core parts involving millions of numbers. Deep learning involves modeling a function that takes an input and gives an output, with the architecture being the design and the parameters being the core parts involving millions of numbers.', 'The lecture aims to provide a systematic way of thinking for different projects, covering decisions related to data collection, data labeling, architecture selection, and loss function design. The lecture aims to provide a systematic way of thinking for different projects, covering decisions related to data collection, data labeling, architecture selection, and loss function design.', "The function used to compare the model's output to the ground truth is called the loss function, and computing its gradient guides parameter updates to minimize the loss and improve model performance. The function used to compare the model's output to the ground truth is called the loss function, and computing its gradient guides parameter updates to minimize the loss and improve model performance."]}, {'end': 566.663, 'start': 235.59, 'title': 'Neural network architecture and loss function', 'summary': 'Discusses the flexibility in changing the neural network architecture, including activation functions, optimizers, and hyper-parameters, and emphasizes the importance of choosing the right loss function, with a focus on multi-logistic regression and the implications of different label encoding methods in image classification.', 'duration': 331.073, 'highlights': ['The chapter emphasizes the importance of choosing the right loss function, with a focus on multi-logistic regression and the implications of different label encoding methods in image classification. The loss function is highlighted as something people struggle to understand, and the importance of choosing the right loss function for a specific project is emphasized. The discussion delves into multi-logistic regression, highlighting the implications of different label encoding methods in image classification, such as one-hot encoding and multi-hot encoding.', 'The flexibility in changing the neural network architecture, including activation functions, optimizers, and hyper-parameters, is discussed. The transcript covers the various elements of neural network architecture that can be adjusted, such as activation functions, optimizers (Adam, stochastic gradient descent, batch gradient descent, RMSprop, momentum), and hyper-parameters (learning rate, batch size).']}], 'duration': 560.164, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU6499.jpg', 'highlights': ['The lecture aims to provide a systematic way of thinking for different projects, covering decisions related to data collection, data labeling, architecture selection, and loss function design.', "The function used to compare the model's output to the ground truth is called the loss function, and computing its gradient guides parameter updates to minimize the loss and improve model performance.", 'The chapter emphasizes the importance of choosing the right loss function, with a focus on multi-logistic regression and the implications of different label encoding methods in image classification.', 'The flexibility in changing the neural network architecture, including activation functions, optimizers, and hyper-parameters, is discussed.']}, {'end': 1445.021, 'segs': [{'end': 614.416, 'src': 'embed', 'start': 588.568, 'weight': 1, 'content': [{'end': 595.051, 'text': 'Okay Now, the concept I wanted to introduce in this recap was the concept of encoding.', 'start': 588.568, 'duration': 6.483}, {'end': 599.653, 'text': 'Uh, you probably- some of you have probably seen this image before.', 'start': 595.911, 'duration': 3.742}, {'end': 612.413, 'text': 'If you have a- a network that is not too shallow, you would notice that what the first neurons see, are very, uh, precise representation of the data.', 'start': 600.554, 'duration': 11.859}, {'end': 614.416, 'text': "So they're pixel level representation of the data.", 'start': 612.433, 'duration': 1.983}], 'summary': 'Introducing the concept of encoding in neural networks for precise pixel-level data representation.', 'duration': 25.848, 'max_score': 588.568, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU588568.jpg'}, {'end': 669.161, 'src': 'embed', 'start': 638.942, 'weight': 0, 'content': [{'end': 643.165, 'text': "Because the first neurons will see pixels, they're gonna output a little more detailed information.", 'start': 638.942, 'duration': 4.223}, {'end': 645.927, 'text': 'like I found an edge here, I found an edge there, and so on.', 'start': 643.165, 'duration': 2.762}, {'end': 647.268, 'text': 'Give it to the second layer.', 'start': 646.248, 'duration': 1.02}, {'end': 649.851, 'text': 'The second layer is going to see more complex information.', 'start': 647.629, 'duration': 2.222}, {'end': 658.917, 'text': "It's going to give it to the third layer, which is going to assemble some high-level, complex features that could be eyes, nose, mouth,", 'start': 650.031, 'duration': 8.886}, {'end': 660.598, 'text': "depending on what network you've been training.", 'start': 658.917, 'duration': 1.681}, {'end': 669.161, 'text': "So this is an extraction of what's happening in each layer, uh, when the network was trained on, uh, face recognition.", 'start': 661.298, 'duration': 7.863}], 'summary': 'Neural network layers process data incrementally, leading to high-level features like eyes, nose, and mouth in face recognition.', 'duration': 30.219, 'max_score': 638.942, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU638942.jpg'}, {'end': 747.343, 'src': 'heatmap', 'start': 693.584, 'weight': 0.716, 'content': [{'end': 695.686, 'text': 'Okay So this is what we call an encoding.', 'start': 693.584, 'duration': 2.102}, {'end': 704.212, 'text': 'It means if I extract the information from this layer, so all the numbers that are coming out of these edges,', 'start': 696.286, 'duration': 7.926}, {'end': 708.936, 'text': 'I extract them I will have a complex representation of my input data.', 'start': 704.212, 'duration': 4.724}, {'end': 715.521, 'text': 'If I extract the numbers that are at the end of the first layer, I will have a lower-level representation of my data that might be edges.', 'start': 709.416, 'duration': 6.105}, {'end': 719.865, 'text': "Okay We're going to use these encoding, uh, throughout this lecture.", 'start': 716.842, 'duration': 3.023}, {'end': 726.372, 'text': 'Any questions on that? Okay.', 'start': 720.825, 'duration': 5.547}, {'end': 729.614, 'text': "So, let's build intuition on concrete applications.", 'start': 727.813, 'duration': 1.801}, {'end': 736.657, 'text': "We're going to start, uh, with a short warm-up with the day and night classification and then quickly move to face verification and face recognition.", 'start': 730.034, 'duration': 6.623}, {'end': 741.4, 'text': "And after that, we'll do some art generation and finish with a trigger word detection.", 'start': 737.398, 'duration': 4.002}, {'end': 747.343, 'text': "If we have time, we- we'll talk about how to ship a model, which is shipping architecture plus parameters.", 'start': 741.76, 'duration': 5.583}], 'summary': 'The lecture covers encoding, applications like day-night classification, face recognition, and art generation.', 'duration': 53.759, 'max_score': 693.584, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU693584.jpg'}, {'end': 915.389, 'src': 'embed', 'start': 886.921, 'weight': 3, 'content': [{'end': 889.524, 'text': "So, let's say we did a problem that was cat recognition.", 'start': 886.921, 'duration': 2.603}, {'end': 892.146, 'text': 'Detect if there is a cat on an image or not.', 'start': 890.465, 'duration': 1.681}, {'end': 899.478, 'text': 'In this problem, we remember that with 10, 000 images, we managed to train a pretty good classifier.', 'start': 893.194, 'duration': 6.284}, {'end': 908.084, 'text': "How do you compare this problem to the CAT problem? You think it's easier or harder? Easier.", 'start': 900.199, 'duration': 7.885}, {'end': 908.764, 'text': 'Yeah, I agree.', 'start': 908.264, 'duration': 0.5}, {'end': 909.585, 'text': "That's probably easier.", 'start': 908.804, 'duration': 0.781}, {'end': 915.389, 'text': 'So, in terms of complexity, this task looks less complex than the CAT recognition task.', 'start': 910.305, 'duration': 5.084}], 'summary': 'Trained classifier with 10,000 images for cat recognition, considered easier than a more complex cat recognition task.', 'duration': 28.468, 'max_score': 886.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU886921.jpg'}, {'end': 999.855, 'src': 'embed', 'start': 974.253, 'weight': 4, 'content': [{'end': 978.997, 'text': 'I think for this task, if you take outside pictures, 10, 000 images is going to be enough.', 'start': 974.253, 'duration': 4.744}, {'end': 985.625, 'text': 'But if you want the network to detect indoor as well, you probably need 100, 000 images or something.', 'start': 979.841, 'duration': 5.784}, {'end': 988.827, 'text': 'And this is based on comparing with projects you did in the past.', 'start': 986.226, 'duration': 2.601}, {'end': 990.108, 'text': "So it's gonna come with experience.", 'start': 989.007, 'duration': 1.101}, {'end': 996.933, 'text': 'Now, as you know, when you have a dataset, you need to split it between train, validation, and test sets.', 'start': 991.809, 'duration': 5.124}, {'end': 998.234, 'text': 'Some of you have heard that.', 'start': 997.313, 'duration': 0.921}, {'end': 999.855, 'text': "We're going to see it together even more.", 'start': 998.294, 'duration': 1.561}], 'summary': 'To train for outdoor detection, 10,000 images suffice; for indoor detection, 100,000 images may be needed, based on past projects.', 'duration': 25.602, 'max_score': 974.253, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU974253.jpg'}, {'end': 1045.021, 'src': 'embed', 'start': 1016.02, 'weight': 5, 'content': [{'end': 1022.406, 'text': 'I think we we would go more towards 80-20, because the test set is made for analyzed,', 'start': 1016.02, 'duration': 6.386}, {'end': 1025.569, 'text': 'to analyze if your network is doing well on real-world data or not.', 'start': 1022.406, 'duration': 3.163}, {'end': 1029.532, 'text': 'I think 2, 000 images is enough to get that sense, probably.', 'start': 1026.109, 'duration': 3.423}, {'end': 1033.135, 'text': 'and you want to put complicated examples in this dataset as well.', 'start': 1029.813, 'duration': 3.322}, {'end': 1037.837, 'text': 'So, I will go towards 80-20 and the bigger the dataset, the more I would put in the train set.', 'start': 1033.494, 'duration': 4.343}, {'end': 1045.021, 'text': 'So, if I have one million images, I would put even more like 98 percent maybe in the train set and two percent to test my model.', 'start': 1038.297, 'duration': 6.724}], 'summary': 'The speaker recommends an 80-20 split for a dataset of 2000 images, increasing the training set for larger datasets.', 'duration': 29.001, 'max_score': 1016.02, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1016020.jpg'}, {'end': 1146.916, 'src': 'embed', 'start': 1109.073, 'weight': 6, 'content': [{'end': 1112.814, 'text': 'Why do we want low resolution is because in terms of computation is going to be better.', 'start': 1109.073, 'duration': 3.741}, {'end': 1122.841, 'text': "Remember, if I have a 32 by 32 image, how many pixels there are? If it's color, I have 32 times 32 times 3.", 'start': 1113.919, 'duration': 8.922}, {'end': 1126.462, 'text': 'If I have 400 by 400, I have 400 by 400 by 3.', 'start': 1122.841, 'duration': 3.621}, {'end': 1127.082, 'text': "It's a lot more.", 'start': 1126.462, 'duration': 0.62}, {'end': 1132.103, 'text': 'So I want to minimize the resolution in order to still be able to achieve good performance.', 'start': 1127.882, 'duration': 4.221}, {'end': 1146.916, 'text': 'So what does it mean to still achieve good performance? How do I get this number? Okay.', 'start': 1133.244, 'duration': 13.672}], 'summary': 'Low resolution reduces computation; 32x32 image has 32x32x3 pixels, 400x400 has 400x400x3.', 'duration': 37.843, 'max_score': 1109.073, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1109073.jpg'}, {'end': 1214.348, 'src': 'embed', 'start': 1186.312, 'weight': 7, 'content': [{'end': 1190.875, 'text': 'classify those and classify those, and I would compare human performance on all these three types of resolution.', 'start': 1186.312, 'duration': 4.563}, {'end': 1197.139, 'text': "in order to decide what's the minimum resolution that I can use in order to get perfect human performance.", 'start': 1191.856, 'duration': 5.283}, {'end': 1208.685, 'text': 'So by doing that, I got that 64 by 64 by 3 was enough resolution for a human to detect if an image is taken during the day or during the night.', 'start': 1198.54, 'duration': 10.145}, {'end': 1211.267, 'text': 'And this is a pretty small resolution in imaging.', 'start': 1209.246, 'duration': 2.021}, {'end': 1214.348, 'text': 'but it seems like a small, like an easy task.', 'start': 1212.386, 'duration': 1.962}], 'summary': 'Research found 64x64x3 resolution sufficient for humans to detect day/night images.', 'duration': 28.036, 'max_score': 1186.312, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1186312.jpg'}, {'end': 1341.228, 'src': 'heatmap', 'start': 1282.24, 'weight': 0.833, 'content': [{'end': 1297.444, 'text': 'And what should be the loss function finally? Yeah.', 'start': 1282.24, 'duration': 15.204}, {'end': 1302.007, 'text': "So the log likelihood, so it's also called the logistic loss, that's the one you're talking about.", 'start': 1298.225, 'duration': 3.782}, {'end': 1307.77, 'text': "So the way you get this number and, and you will prove it in, in CS229, we're, we're not going to prove it here.", 'start': 1302.687, 'duration': 5.083}, {'end': 1316.094, 'text': 'But basically, you interpret your data in a probabilistic way and you take, uh, the maximum likelihood estimation of the data,', 'start': 1308.15, 'duration': 7.944}, {'end': 1318.795, 'text': 'which gives you this formula for those of you who did the math behind', 'start': 1316.094, 'duration': 2.701}, {'end': 1322.337, 'text': 'You can ask in office hours, TAs are gonna help you understand it more properly.', 'start': 1319.335, 'duration': 3.002}, {'end': 1329.316, 'text': 'Okay And of course, this means that if y equals 0, we want y hat, the prediction to be close to 0.', 'start': 1324.011, 'duration': 5.305}, {'end': 1332.74, 'text': 'If y equal 1, we want y hat, the prediction to be close to 1.', 'start': 1329.316, 'duration': 3.424}, {'end': 1334.441, 'text': 'Okay So this was the warm-up.', 'start': 1332.74, 'duration': 1.701}, {'end': 1337.865, 'text': "Now, we're going to delve into phase verification.", 'start': 1335.622, 'duration': 2.243}, {'end': 1341.228, 'text': 'Any question on DNA classification? Yes.', 'start': 1338.245, 'duration': 2.983}], 'summary': 'The loss function is the log likelihood, also known as logistic loss, interpreted in a probabilistic way for maximum likelihood estimation.', 'duration': 58.988, 'max_score': 1282.24, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1282240.jpg'}, {'end': 1415.475, 'src': 'embed', 'start': 1383.335, 'weight': 2, 'content': [{'end': 1384.576, 'text': 'This can depend on the task.', 'start': 1383.335, 'duration': 1.241}, {'end': 1392.223, 'text': 'Like if I talk about- if I- if I tell you about speech recognition, you want to figure out if your model is doing well for all accents in the world.', 'start': 1385.077, 'duration': 7.146}, {'end': 1395.507, 'text': 'So your test set might be very big and very distributed.', 'start': 1392.924, 'duration': 2.583}, {'end': 1403.357, 'text': 'In this case, you might have a few examples that are during the day, few during the night, and a few at dawn, sunset, sunrise, and also indoor.', 'start': 1396.148, 'duration': 7.209}, {'end': 1405.28, 'text': 'Few of those is going to give you a number.', 'start': 1403.858, 'duration': 1.422}, {'end': 1407.282, 'text': "So there's no good number.", 'start': 1405.86, 'duration': 1.422}, {'end': 1408.684, 'text': 'There is like, you have to gauge it.', 'start': 1407.422, 'duration': 1.262}, {'end': 1410.847, 'text': 'Okay One more question.', 'start': 1410.006, 'duration': 0.841}, {'end': 1415.475, 'text': "Yeah, that's a good question.", 'start': 1414.594, 'duration': 0.881}], 'summary': 'Evaluating speech recognition models across global accents with a diverse test set and various environmental conditions. no specific quantifiable data mentioned.', 'duration': 32.14, 'max_score': 1383.335, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1383335.jpg'}, {'end': 1456.11, 'src': 'embed', 'start': 1423.281, 'weight': 8, 'content': [{'end': 1429.585, 'text': "But for this one specifically, you choose this one because it, it, it's a, it's a convex function for classification prob- problem.", 'start': 1423.281, 'duration': 6.304}, {'end': 1432.227, 'text': "It's easier to optimize than other loss functions.", 'start': 1430.146, 'duration': 2.081}, {'end': 1435.59, 'text': 'So, there is a proof but, but I will not go over it here.', 'start': 1432.688, 'duration': 2.902}, {'end': 1443.48, 'text': 'if you know the L1 loss that compares y to y hat, this one is harder to optimize for a classification problem.', 'start': 1437.355, 'duration': 6.125}, {'end': 1445.021, 'text': 'We would use it for regression problems.', 'start': 1443.56, 'duration': 1.461}, {'end': 1456.11, 'text': 'Okay So our new game is, uh, the school wants to use face verification to validate student IDs in facilities like the gym.', 'start': 1447.043, 'duration': 9.067}], 'summary': 'Logistic loss function is chosen for classification due to its convexity and ease of optimization, while l1 loss is considered harder to optimize for classification problems.', 'duration': 32.829, 'max_score': 1423.281, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1423281.jpg'}], 'start': 567.623, 'title': 'Neural network encoding and image classification', 'summary': 'Introduces neural network encoding, explaining how it creates complex features in face recognition. it also covers image classification, dataset size estimation, and factors influencing dataset size, including insights into the number of images needed based on task complexity.', 'chapters': [{'end': 729.614, 'start': 567.623, 'title': 'Neural network encoding and representation', 'summary': 'Introduces the concept of encoding in neural networks, explaining how the neurons in each layer process and represent the input data, eventually creating high-level complex features, such as eyes, nose, and mouth, in face recognition networks.', 'duration': 161.991, 'highlights': ['Neurons in each layer process and represent input data The neurons in each layer process and represent the input data, with the first layer focusing on pixel-level representation and subsequent layers creating more high-level, complex features.', 'Creation of high-level complex features in face recognition networks The network creates high-level complex features, such as eyes, nose, and mouth, in face recognition networks, through the processing and representation of input data by the neurons in each layer.', 'Application of encoding in neural networks The concept of encoding is introduced, explaining how the information extracted from each layer forms a complex representation of the input data, which is utilized in various applications.']}, {'end': 974.173, 'start': 730.034, 'title': 'Image classification and dataset size estimation', 'summary': 'Covers the process of image classification, dataset size estimation, and factors influencing dataset size, emphasizing the complexity of the task and its impact on the amount of data required, including insights into the number of images needed based on task complexity and specific scenarios.', 'duration': 244.139, 'highlights': ['The complexity of the task determines the amount of data needed; simpler tasks may require less data. Simpler tasks like day vs. night classification may require less data, while more complex tasks, such as classifying complicated indoor or twilight images, may necessitate a larger dataset.', 'Dataset size estimation based on task complexity and specific scenarios. Factors like the difficulty of the classification task, such as distinguishing indoor or twilight images, impact the dataset size required for effective training.', 'Comparison of task complexity to estimate dataset size requirement. Comparing the complexity of the current classification task to a simpler task like cat recognition helps estimate the dataset size needed for effective training.']}, {'end': 1445.021, 'start': 974.253, 'title': 'Image dataset training and testing', 'summary': 'Discusses the importance of image dataset size, the split between train and test sets, resolution of input images, and selection of loss function for model training, emphasizing the need for balance, experience, and human performance comparison.', 'duration': 470.768, 'highlights': ['The importance of dataset size is emphasized, with 100,000 images suggested for detecting indoor scenes. The speaker suggests that 100,000 images may be needed to train a network to detect indoor scenes, based on past project comparisons.', 'The need for a balanced split between train and test sets is explained, with a preference for an 80-20 split and the rationale behind it. The speaker recommends an 80-20 split for train and test sets, stating that a test set is necessary to analyze real-world performance, with 2,000 images deemed sufficient for testing.', 'The importance of selecting the right resolution for input images and its impact on computational efficiency is highlighted. The discussion emphasizes the need for low resolution to improve computational efficiency, with the suggestion to minimize resolution while achieving good performance.', 'The recommendation to compare model performance with human performance for resolution selection is provided, with a practical approach outlined. A practical approach to resolution selection involves comparing model performance with human classification at different resolutions, aiming to achieve perfect human performance at the minimum resolution.', 'The rationale for choosing the loss function, specifically the log likelihood, is explained with a focus on its suitability for classification problems. The choice of the log likelihood as the loss function is justified by its suitability for classification problems and ease of optimization, compared to other loss functions.']}], 'duration': 877.398, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU567623.jpg', 'highlights': ['The network creates high-level complex features, such as eyes, nose, and mouth, in face recognition networks, through the processing and representation of input data by the neurons in each layer.', 'The concept of encoding is introduced, explaining how the information extracted from each layer forms a complex representation of the input data, which is utilized in various applications.', 'Factors like the difficulty of the classification task, such as distinguishing indoor or twilight images, impact the dataset size required for effective training.', 'Comparing the complexity of the current classification task to a simpler task like cat recognition helps estimate the dataset size needed for effective training.', 'The speaker suggests that 100,000 images may be needed to train a network to detect indoor scenes, based on past project comparisons.', 'The speaker recommends an 80-20 split for train and test sets, stating that a test set is necessary to analyze real-world performance, with 2,000 images deemed sufficient for testing.', 'The discussion emphasizes the need for low resolution to improve computational efficiency, with the suggestion to minimize resolution while achieving good performance.', 'A practical approach to resolution selection involves comparing model performance with human classification at different resolutions, aiming to achieve perfect human performance at the minimum resolution.', 'The choice of the log likelihood as the loss function is justified by its suitability for classification problems and ease of optimization, compared to other loss functions.']}, {'end': 1737.407, 'segs': [{'end': 1468.024, 'src': 'embed', 'start': 1447.043, 'weight': 0, 'content': [{'end': 1456.11, 'text': 'Okay So our new game is, uh, the school wants to use face verification to validate student IDs in facilities like the gym.', 'start': 1447.043, 'duration': 9.067}, {'end': 1461.339, 'text': 'So you know, when you enter the gym, you swipe your ID and then, uh, I guess,', 'start': 1456.591, 'duration': 4.748}, {'end': 1468.024, 'text': "the person sees your face on the screen based on this ID and looks at your face in real and compares, let's say", 'start': 1461.339, 'duration': 6.685}], 'summary': 'New game: school uses face verification to validate student ids in facilities like the gym.', 'duration': 20.981, 'max_score': 1447.043, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1447043.jpg'}, {'end': 1607.5, 'src': 'embed', 'start': 1544.962, 'weight': 1, 'content': [{'end': 1547.884, 'text': 'The person standing in front of the camera when entering the gym.', 'start': 1544.962, 'duration': 2.922}, {'end': 1549.526, 'text': 'So this is the entrance of the gym.', 'start': 1548.205, 'duration': 1.321}, {'end': 1552.486, 'text': 'and Bertrand is trying to enter the gym.', 'start': 1550.825, 'duration': 1.661}, {'end': 1554.166, 'text': "So, it's him.", 'start': 1553.786, 'duration': 0.38}, {'end': 1570.993, 'text': 'Okay What should be the resolution? Those of you who have done projects in imaging, what do you think should be the resolution? 256.', 'start': 1555.127, 'duration': 15.866}, {'end': 1579.718, 'text': '256 by 256? Any other idea more precise? I think in general, you will go over 400.', 'start': 1570.993, 'duration': 8.725}, {'end': 1580.759, 'text': 'So 400 by 400.', 'start': 1579.718, 'duration': 1.041}, {'end': 1592.269, 'text': "What's the reason? Why do we need 64 for, for day and night and, and 400 for face verification? Yeah.", 'start': 1580.759, 'duration': 11.51}, {'end': 1594.751, 'text': "Yeah, there's more details to detect.", 'start': 1593.35, 'duration': 1.401}, {'end': 1603.177, 'text': 'So like distance between the eyes, probably, size of the nose, mouth, uh, general, general features of the face.', 'start': 1594.791, 'duration': 8.386}, {'end': 1605.979, 'text': 'These are harder to detect for a 64 by 64 image.', 'start': 1603.297, 'duration': 2.682}, {'end': 1607.5, 'text': 'And you can test it.', 'start': 1606.8, 'duration': 0.7}], 'summary': 'Discussion on image resolution: 256x256 vs. 400x400 for face verification and details detection.', 'duration': 62.538, 'max_score': 1544.962, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1544962.jpg'}, {'end': 1749.174, 'src': 'embed', 'start': 1718.196, 'weight': 2, 'content': [{'end': 1721.298, 'text': 'But in general, uh, the more complex the task, the more data you will need.', 'start': 1718.196, 'duration': 3.102}, {'end': 1729.343, 'text': "And we will see something called error analysis in about four weeks, which is once your network works, you're going to give it a lot of examples.", 'start': 1721.758, 'duration': 7.585}, {'end': 1735.606, 'text': "detect which examples are misclassified by your network, and you're going to add more of these in the training set.", 'start': 1730.023, 'duration': 5.583}, {'end': 1737.407, 'text': "So, you're going to boost your dataset.", 'start': 1736.047, 'duration': 1.36}, {'end': 1740.509, 'text': 'Okay Talking about the architecture.', 'start': 1739.389, 'duration': 1.12}, {'end': 1749.174, 'text': "If I ask you, what's the easiest way to compare two images? What would you do? Like these two images, the database image and the input image.", 'start': 1740.749, 'duration': 8.425}], 'summary': 'Complex tasks require more data; error analysis identifies misclassifications and boosts dataset; architecture involves comparing images.', 'duration': 30.978, 'max_score': 1718.196, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1718196.jpg'}], 'start': 1447.043, 'title': 'Implementing face verification for gym access and its considerations', 'summary': 'Covers implementing face verification for gym access, requiring a dataset with id-image mapping and labeled photos of students, and considering a high resolution like 400 by 400 for the model input. it also discusses the importance of image resolution and color in face verification, the trade-off between resolution and computational power, and the impact of dataset size on training the network for face verification.', 'chapters': [{'end': 1580.759, 'start': 1447.043, 'title': 'Face verification for gym access', 'summary': 'Discusses implementing face verification for student ids in the gym, requiring a dataset with id-image mapping and labeled photos of students, and considering a high resolution like 400 by 400 for the model input.', 'duration': 133.716, 'highlights': ['Implementing face verification for student IDs in the gym, requiring a dataset with ID-image mapping and labeled photos of students.', 'Considering a high resolution like 400 by 400 for the model input.', 'Discussing the need for a mapping between the ID and the image, and the requirement for pictures of every student labeled with their names.']}, {'end': 1737.407, 'start': 1580.759, 'title': 'Face verification and image resolution', 'summary': 'Discusses the importance of image resolution and color in face verification, the trade-off between resolution and computational power, and the impact of dataset size on training the network for face verification.', 'duration': 156.648, 'highlights': ["The output for face verification is 1 if it's u and 0 if it's not u, impacting access to certain facilities.", 'The trade-off between computation and resolution is crucial, with more complex tasks requiring larger datasets for training the network.', 'Color is important for face verification as it enables detection in different settings such as day and night, impacting the number of pixels and the ability to differentiate people based on color.', 'The need for a larger dataset for more complex tasks and the concept of error analysis to improve network performance are highlighted.', 'The difficulty in determining the exact number of images needed to train the network is discussed, with the suggestion to add more misclassified examples to boost the dataset.']}], 'duration': 290.364, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1447043.jpg', 'highlights': ['Implementing face verification for student IDs in the gym, requiring a dataset with ID-image mapping and labeled photos of students.', 'Considering a high resolution like 400 by 400 for the model input.', 'The trade-off between computation and resolution is crucial, with more complex tasks requiring larger datasets for training the network.', 'Color is important for face verification as it enables detection in different settings such as day and night, impacting the number of pixels and the ability to differentiate people based on color.']}, {'end': 2826.576, 'segs': [{'end': 1780.536, 'src': 'embed', 'start': 1739.389, 'weight': 2, 'content': [{'end': 1740.509, 'text': 'Okay Talking about the architecture.', 'start': 1739.389, 'duration': 1.12}, {'end': 1749.174, 'text': "If I ask you, what's the easiest way to compare two images? What would you do? Like these two images, the database image and the input image.", 'start': 1740.749, 'duration': 8.425}, {'end': 1753.217, 'text': 'Some sort of hash? Some sort of hash.', 'start': 1751.095, 'duration': 2.122}, {'end': 1758.5, 'text': 'What do you mean by hash? Uh, taking the input runs, uh, standardized function on it and then compare.', 'start': 1753.237, 'duration': 5.263}, {'end': 1763.425, 'text': 'Take this, run it into a specific function.', 'start': 1760.082, 'duration': 3.343}, {'end': 1766.387, 'text': 'Take this, run it into a specific function and compare the two values.', 'start': 1763.545, 'duration': 2.842}, {'end': 1767.267, 'text': "That's correct.", 'start': 1766.847, 'duration': 0.42}, {'end': 1768.448, 'text': "That's a good idea.", 'start': 1767.768, 'duration': 0.68}, {'end': 1772.491, 'text': 'And the more basic one is just compute the distance between the pixels.', 'start': 1768.908, 'duration': 3.583}, {'end': 1776.393, 'text': "Just compute the distance between the pixels and you get if it's the same person or not.", 'start': 1773.171, 'duration': 3.222}, {'end': 1777.474, 'text': "Unfortunately, it doesn't work.", 'start': 1776.473, 'duration': 1.001}, {'end': 1780.536, 'text': 'And a few reasons are the background lighting can be different.', 'start': 1778.014, 'duration': 2.522}], 'summary': 'Comparing images using hash and pixel distance, but affected by lighting differences.', 'duration': 41.147, 'max_score': 1739.389, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1739389.jpg'}, {'end': 1834.592, 'src': 'embed', 'start': 1801.827, 'weight': 0, 'content': [{'end': 1804.428, 'text': "So, it doesn't work to just compare these two pictures together.", 'start': 1801.827, 'duration': 2.601}, {'end': 1813.774, 'text': 'We need to find a function that we will apply this, this, these two images to and will give us, a more- a better representation of the image.', 'start': 1804.808, 'duration': 8.966}, {'end': 1816.617, 'text': "So that's what we're going to do now.", 'start': 1815.476, 'duration': 1.141}, {'end': 1824.043, 'text': "What we're going to do is that we'll encode information, use the encoding that we talked about of the picture in a vector.", 'start': 1817.397, 'duration': 6.646}, {'end': 1834.592, 'text': 'So we want a vector that would represent features like distance between eyes, nose, mouth, color, or all these type of stuff, hair, uh, in a vector.', 'start': 1824.623, 'duration': 9.969}], 'summary': 'Finding a function to apply to two images for a better representation, encoding picture information into a vector with features like distance between eyes, nose, mouth, and color.', 'duration': 32.765, 'max_score': 1801.827, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1801827.jpg'}, {'end': 1920.611, 'src': 'embed', 'start': 1893.329, 'weight': 1, 'content': [{'end': 1896.791, 'text': 'Like right now, if I take a random network, I give my image to it.', 'start': 1893.329, 'duration': 3.462}, {'end': 1898.132, 'text': "it's gonna output a random vector.", 'start': 1896.791, 'duration': 1.341}, {'end': 1901.033, 'text': 'This vector is not gonna contain any useful information.', 'start': 1898.632, 'duration': 2.401}, {'end': 1906.757, 'text': "I wanna make sure that this information is useful, and that's how I will design my loss function.", 'start': 1901.534, 'duration': 5.223}, {'end': 1913.508, 'text': 'Okay, So, just to recap, we gather all student faces encoding in a database.', 'start': 1908.266, 'duration': 5.242}, {'end': 1920.611, 'text': 'once we have this and given a new picture, we compute the distance between- between the new picture and all the vectors in the database.', 'start': 1913.508, 'duration': 7.103}], 'summary': 'Designing a loss function to ensure useful information in image vectors, computing distance between new picture and database vectors.', 'duration': 27.282, 'max_score': 1893.329, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1893329.jpg'}, {'end': 2089.641, 'src': 'heatmap', 'start': 2033.26, 'weight': 0.766, 'content': [{'end': 2041.866, 'text': "And we're going to rely on these two assumptions and these two thoughts in order to generate, uh, our loss function by giving it triplets.", 'start': 2033.26, 'duration': 8.606}, {'end': 2047.87, 'text': 'Triplets means three pictures one that we call anchor that is the person- a person.', 'start': 2042.206, 'duration': 5.664}, {'end': 2052.273, 'text': 'one that we call positive, that is the same person as the anchor, but a different picture of that person.', 'start': 2047.87, 'duration': 4.403}, {'end': 2057.148, 'text': 'and the third one that we call negative, that is a picture of someone else.', 'start': 2053.464, 'duration': 3.684}, {'end': 2065.476, 'text': 'And now what we want to do is to minimize the encoding distance between the anchor and the positive and maximize the encoding distance between the anchor of the ne- and the negative.', 'start': 2058.007, 'duration': 7.469}, {'end': 2079.315, 'text': 'Does the- these two thoughts make sense? So now, my question for you is what should be the loss function what should be the loss function.', 'start': 2067.056, 'duration': 12.259}, {'end': 2085.518, 'text': 'So please go on Menti and enter the code and there are three options here, A, B, and C.', 'start': 2079.335, 'duration': 6.183}, {'end': 2089.641, 'text': 'Choose which of these you think should be the right loss function to use for this problem.', 'start': 2085.518, 'duration': 4.123}], 'summary': 'Using triplets to minimize encoding distance and maximize loss function. choose a, b, or c on menti.', 'duration': 56.381, 'max_score': 2033.26, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU2033260.jpg'}, {'end': 2254.621, 'src': 'embed', 'start': 2203.381, 'weight': 4, 'content': [{'end': 2204.662, 'text': 'Okay All right.', 'start': 2203.381, 'duration': 1.281}, {'end': 2205.202, 'text': '20 more seconds.', 'start': 2204.682, 'duration': 0.52}, {'end': 2213.127, 'text': "Okay Let's see what we have.", 'start': 2205.382, 'duration': 7.745}, {'end': 2241.853, 'text': "Okay So two-thirds of the people think that, that it's the first answer A.", 'start': 2232.907, 'duration': 8.946}, {'end': 2243.234, 'text': 'So I, I read it for everyone.', 'start': 2241.853, 'duration': 1.381}, {'end': 2249.418, 'text': 'The loss is equal to the L2 distance between the encoding of A and the encoding of B,', 'start': 2243.734, 'duration': 5.684}, {'end': 2254.621, 'text': 'minus the L2 distance between the encoding of A and the encoding of N.', 'start': 2249.418, 'duration': 5.203}], 'summary': 'Two-thirds of people think first answer is a.', 'duration': 51.24, 'max_score': 2203.381, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU2203381.jpg'}, {'end': 2344.803, 'src': 'heatmap', 'start': 2282.776, 'weight': 0.952, 'content': [{'end': 2287.58, 'text': 'We want to maximize the distance between the encoding of A and the encoding of the negative.', 'start': 2282.776, 'duration': 4.804}, {'end': 2291.542, 'text': "That's why we have a minus sign here, because we want the loss to go down.", 'start': 2287.94, 'duration': 3.602}, {'end': 2295.485, 'text': 'And to go down, we put a minus sign and we maximize this term.', 'start': 2292.083, 'duration': 3.402}, {'end': 2298.968, 'text': "And on the other hand, we want to minimize the other term because it's a positive term.", 'start': 2296.126, 'duration': 2.842}, {'end': 2301.692, 'text': 'Okay So I agree.', 'start': 2300.191, 'duration': 1.501}, {'end': 2302.712, 'text': 'Good answer.', 'start': 2302.312, 'duration': 0.4}, {'end': 2305.493, 'text': 'Okay That was the first time you used this tool.', 'start': 2303.872, 'duration': 1.621}, {'end': 2306.673, 'text': "It's gonna be quicker next time.", 'start': 2305.553, 'duration': 1.12}, {'end': 2312.475, 'text': 'Okay So we have, uh, we have, uh, figured out what the loss function should be.', 'start': 2308.014, 'duration': 4.461}, {'end': 2313.796, 'text': 'And now think about it.', 'start': 2312.935, 'duration': 0.861}, {'end': 2321.438, 'text': "Now that we designed our loss function, we're able to use an optimization algorithm, run an image in the network.", 'start': 2314.336, 'duration': 7.102}, {'end': 2325.82, 'text': 'Sorry, run, run three images in the network like that.', 'start': 2322.679, 'duration': 3.141}, {'end': 2333.217, 'text': 'get three outputs encoding of A, encoding of P, encoding of N, compute the loss,', 'start': 2327.054, 'duration': 6.163}, {'end': 2337.459, 'text': 'take the gradients of the loss and update the parameters in order to minimize the loss.', 'start': 2333.217, 'duration': 4.242}, {'end': 2344.803, 'text': 'Hopefully, after doing that many times, we would get an encoding that represents features of the face.', 'start': 2338.22, 'duration': 6.583}], 'summary': 'Maximize distance between encodings of a and negative, minimize positive term, update parameters to minimize loss.', 'duration': 62.027, 'max_score': 2282.776, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU2282776.jpg'}], 'start': 1739.389, 'title': 'Image comparison and facial identity verification', 'summary': 'Discusses challenges in image comparison, including limitations of pixel distance method, and explores facial image encoding for identity verification using deep networks and triplets, as well as the design of loss function for face recognition.', 'chapters': [{'end': 1801.346, 'start': 1739.389, 'title': 'Image comparison techniques', 'summary': 'Discusses the challenges of comparing images and the limitations of pixel distance method in determining image similarity, highlighting issues such as background lighting variation and physical appearance changes.', 'duration': 61.957, 'highlights': ['The limitations of comparing images using pixel distance method due to factors such as varying background lighting and physical appearance changes.', 'The suggestion of using a hash function to standardize the input image and compare the values as an alternative method for image comparison.', 'Challenges in image comparison due to factors such as makeup, beard growth, age difference, and outdated IDs affecting the accuracy of image similarity.']}, {'end': 2065.476, 'start': 1801.827, 'title': 'Facial image encoding for identity verification', 'summary': 'Discusses the process of encoding facial images into vectors for identity verification, including the use of deep networks, the training process, and the use of triplets to minimize encoding distance.', 'duration': 263.649, 'highlights': ['Facial image encoding involves finding a function to apply to two images to provide a better representation, such as encoding features like distance between eyes and color into a vector.', 'The vector encoding of images from the facility and the ID should be close to each other when run through a properly trained network, with a threshold distance for verification.', 'The process of designing a loss function involves gathering student face encodings in a database and computing the distance between the new picture and all the vectors in the database for identity verification.', 'To train the model to understand facial features, open datasets with millions of pictures of faces are used to help the model learn and generalize the features of the face.', 'The use of one-hot encoding for student identification presents challenges due to the need for network modification as new students enter the school, leading to the preference for similar encoding for images of the same person and different encoding for images of different persons.']}, {'end': 2826.576, 'start': 2067.056, 'title': 'Face recognition and loss function', 'summary': 'Discusses the design of a loss function for face recognition, the role of the alpha term in the loss function, and the use of different algorithms such as k-nearest neighbors and k-means for face recognition and clustering.', 'duration': 759.52, 'highlights': ['The loss function is designed as the L2 distance between the encoding of A and the encoding of P, minus the L2 distance between the encoding of A and the encoding of N, with 2/3 of the participants choosing option A as the right loss function. The loss function is determined as the L2 distance between encodings, with 2/3 participants choosing option A.', 'The alpha term in the loss function prevents the network from stabilizing at zero and encourages the network to learn meaningful features, acting as a margin to stabilize the network. The alpha term prevents network stabilization at zero and encourages meaningful feature learning.', 'For face recognition, the addition of a detection element to the pipeline and the use of a one-to-n comparison with the database pictures are considered, along with the potential use of algorithms like K-nearest neighbors and K-means for clustering. Face recognition may involve adding a detection element and using one-to-n comparison with the database, along with utilizing K-nearest neighbors and K-means for clustering.']}], 'duration': 1087.187, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU1739389.jpg', 'highlights': ['Facial image encoding involves finding a function to apply to two images to provide a better representation, such as encoding features like distance between eyes and color into a vector.', 'The process of designing a loss function involves gathering student face encodings in a database and computing the distance between the new picture and all the vectors in the database for identity verification.', 'The limitations of comparing images using pixel distance method due to factors such as varying background lighting and physical appearance changes.', 'The suggestion of using a hash function to standardize the input image and compare the values as an alternative method for image comparison.', 'The loss function is designed as the L2 distance between the encoding of A and the encoding of P, minus the L2 distance between the encoding of A and the encoding of N, with 2/3 of the participants choosing option A as the right loss function.']}, {'end': 3469.141, 'segs': [{'end': 2923.172, 'src': 'embed', 'start': 2876.806, 'weight': 0, 'content': [{'end': 2880.587, 'text': 'So you have a deep network and you wanna decide where should you take the encoding from.', 'start': 2876.806, 'duration': 3.781}, {'end': 2884.768, 'text': 'In this case, the more complex the task, the deeper you would go.', 'start': 2881.407, 'duration': 3.361}, {'end': 2892.55, 'text': 'But for face verification, what you want, and you know it as a human, you want to know features like, uh, distance between eyes, nose, and stuff.', 'start': 2885.148, 'duration': 7.402}, {'end': 2893.811, 'text': 'And so you have to go deeper.', 'start': 2892.811, 'duration': 1}, {'end': 2896.933, 'text': 'you need the first layers to figure out the edges.', 'start': 2894.491, 'duration': 2.442}, {'end': 2901.016, 'text': 'give the edges to the second layer, the second layer, to figure out the nose, the eyes.', 'start': 2896.933, 'duration': 4.083}, {'end': 2905.239, 'text': 'give it to the third layer, the third layer, to figure out the distances between the eyes, the distances between the ears.', 'start': 2901.016, 'duration': 4.223}, {'end': 2910.183, 'text': 'So you would go deeper and get the encoding deeper because you know that you want high-level features.', 'start': 2905.759, 'duration': 4.424}, {'end': 2915.266, 'text': 'Okay Art generation.', 'start': 2912.524, 'duration': 2.742}, {'end': 2919.149, 'text': 'Even a picture make it look beautiful.', 'start': 2917.668, 'duration': 1.481}, {'end': 2923.172, 'text': 'As usual, data.', 'start': 2922.132, 'duration': 1.04}], 'summary': 'For face verification, deeper network for high-level features. art generation involves making pictures beautiful.', 'duration': 46.366, 'max_score': 2876.806, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU2876806.jpg'}, {'end': 3109.308, 'src': 'embed', 'start': 3070.418, 'weight': 3, 'content': [{'end': 3071.559, 'text': 'We want to generate an image, yeah.', 'start': 3070.418, 'duration': 1.141}, {'end': 3082.564, 'text': 'Okay So given, uh, given, uh, given in this, this, uh, architecture, uh, generates a new in the style of.', 'start': 3071.839, 'duration': 10.725}, {'end': 3083.824, 'text': 'Okay Yeah.', 'start': 3083.124, 'duration': 0.7}, {'end': 3093.229, 'text': "Probably. So what you're proposing is we get an image that is the content image, and we have a network that is the style style network,", 'start': 3083.864, 'duration': 9.365}, {'end': 3097.991, 'text': 'which will style this image and we will get the content, but styled version of the content.', 'start': 3093.229, 'duration': 4.762}, {'end': 3109.308, 'text': 'Yes So use certain feature of a style and change this style according to what the network has learned.', 'start': 3104.126, 'duration': 5.182}], 'summary': 'Generate a styled image using a network to modify content.', 'duration': 38.89, 'max_score': 3070.418, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3070418.jpg'}, {'end': 3210.241, 'src': 'embed', 'start': 3184.71, 'weight': 5, 'content': [{'end': 3189.411, 'text': 'The edges are usually a good representation of the content of the image.', 'start': 3184.71, 'duration': 4.701}, {'end': 3195.532, 'text': 'So, I might have a very good network, give my content image, extract the information from the first layer.', 'start': 3189.891, 'duration': 5.641}, {'end': 3197.693, 'text': 'This information is going to be the content of the image.', 'start': 3195.732, 'duration': 1.961}, {'end': 3207.055, 'text': 'Now, the question is how do I get the style? I want to give my style image and find a way to extract the style.', 'start': 3199.233, 'duration': 7.822}, {'end': 3210.241, 'text': "That's what we're going to learn later in this course.", 'start': 3208.359, 'duration': 1.882}], 'summary': 'Neural network extracts content and style from images.', 'duration': 25.531, 'max_score': 3184.71, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3184710.jpg'}, {'end': 3282.16, 'src': 'embed', 'start': 3254.52, 'weight': 4, 'content': [{'end': 3262.366, 'text': 'You can find ImageNet classification- classification networks online that were trained to recognize more than thousand- thousands of objects.', 'start': 3254.52, 'duration': 7.846}, {'end': 3267.97, 'text': 'This network is going to understand basically anything you give it.', 'start': 3264.107, 'duration': 3.863}, {'end': 3272.113, 'text': "If I give it the Louvre Museum, it's going to find all the edges very easily.", 'start': 3268.53, 'duration': 3.583}, {'end': 3275.335, 'text': "It's going to figure out that there is- it's during the day.", 'start': 3272.553, 'duration': 2.782}, {'end': 3282.16, 'text': "it's going to figure out their buildings on the sides and all the features of the image, because it was trained for a month on thousands of classes.", 'start': 3275.335, 'duration': 6.825}], 'summary': 'Imagenet classification networks trained to recognize thousands of objects, understand diverse inputs due to extensive training.', 'duration': 27.64, 'max_score': 3254.52, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3254520.jpg'}, {'end': 3332.84, 'src': 'heatmap', 'start': 3275.335, 'weight': 1, 'content': [{'end': 3282.16, 'text': "it's going to figure out their buildings on the sides and all the features of the image, because it was trained for a month on thousands of classes.", 'start': 3275.335, 'duration': 6.825}, {'end': 3289.47, 'text': "Let's say we have this network, we give our content image to it, and we extract information from the first few layers.", 'start': 3283.208, 'duration': 6.262}, {'end': 3294.952, 'text': 'This information, we call it content C, content of the content image.', 'start': 3290.09, 'duration': 4.862}, {'end': 3297.132, 'text': 'Does that make sense?', 'start': 3296.512, 'duration': 0.62}, {'end': 3306.435, 'text': 'Now I give the style image and I will use another method, that is called the grain matrix, to extract style S style of the style image.', 'start': 3298.273, 'duration': 8.162}, {'end': 3314.861, 'text': "Okay? And now the question is what should be the loss function? So let's go on Menti.", 'start': 3308.616, 'duration': 6.245}, {'end': 3332.84, 'text': 'So same code as usual, just open it.', 'start': 3329.957, 'duration': 2.883}], 'summary': 'Neural network trained for a month on thousands of classes to extract content and style from images.', 'duration': 57.505, 'max_score': 3275.335, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3275335.jpg'}], 'start': 2826.576, 'title': 'Advanced image processing techniques', 'summary': 'Covers the significance of deep networks for face verification and the process of art generation and style extraction. it emphasizes the need for extracting high-level features in face verification and highlights techniques like backpropagation, encoding, and the use of gram matrix for style extraction.', 'chapters': [{'end': 2923.172, 'start': 2826.576, 'title': 'Deep network for face verification', 'summary': 'Discusses the importance of going deeper into the network for obtaining high-level features in face verification, emphasizing the need to extract features like distance between eyes and nose through multiple layers.', 'duration': 96.596, 'highlights': ['The more complex the task, the deeper you would go in the deep network for face verification, to obtain high-level features like distance between eyes and nose, and facial features.', 'The process involves using the first layers to figure out the edges, the second layer to identify the nose and eyes, and the third layer to determine the distances between facial features.', 'Art generation and making a picture look beautiful also require high-level features obtained by going deeper into the network.']}, {'end': 3469.141, 'start': 2924.414, 'title': 'Art generation and style extraction', 'summary': 'Discusses the process of generating an image that combines the content of one image with the style of another, using techniques such as backpropagation to the image, encoding for content extraction, and the use of a technique called gram matrix for style extraction from the image. it also emphasizes the importance of using imagenet classification networks for understanding and extracting image features.', 'duration': 544.727, 'highlights': ['The process of generating an image that combines the content of one image with the style of another using techniques such as backpropagation to the image, encoding for content extraction, and the use of a technique called Gram matrix for style extraction. This explains the core concept of the chapter and its focus on art generation and style extraction.', 'The importance of using ImageNet classification networks for understanding and extracting image features. Highlighting the significance of using ImageNet classification networks to extract image features effectively.', 'The discussion about the need for a network that understands pictures well to extract edges and features effectively. Emphasizing the importance of utilizing a network that comprehensively understands and processes image features for effective extraction.']}], 'duration': 642.565, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU2826576.jpg', 'highlights': ['The process involves using the first layers to figure out the edges, the second layer to identify the nose and eyes, and the third layer to determine the distances between facial features.', 'The more complex the task, the deeper you would go in the deep network for face verification, to obtain high-level features like distance between eyes and nose, and facial features.', 'Art generation and making a picture look beautiful also require high-level features obtained by going deeper into the network.', 'The process of generating an image that combines the content of one image with the style of another using techniques such as backpropagation to the image, encoding for content extraction, and the use of a technique called Gram matrix for style extraction.', 'The importance of using ImageNet classification networks for understanding and extracting image features.', 'The discussion about the need for a network that understands pictures well to extract edges and features effectively.']}, {'end': 3806.361, 'segs': [{'end': 3576.418, 'src': 'heatmap', 'start': 3469.161, 'weight': 0, 'content': [{'end': 3474.322, 'text': 'Okay Someone who has answered the second, uh, question and I, I will read it out loud.', 'start': 3469.161, 'duration': 5.161}, {'end': 3480.644, 'text': 'The loss is the L2 difference between the style of the style image and the generated style,', 'start': 3475.022, 'duration': 5.622}, {'end': 3485.726, 'text': "plus the L2 distance between the generate- the generator's content and the content's content.", 'start': 3480.644, 'duration': 5.082}, {'end': 3499.893, 'text': 'Yeah So yeah, we want to minimize both terms here.', 'start': 3489.627, 'duration': 10.266}, {'end': 3505.157, 'text': 'So we want the content of the content image to look like the content of the generated image.', 'start': 3501.054, 'duration': 4.103}, {'end': 3507.279, 'text': 'So we want to minimize the L2 distance of these two.', 'start': 3505.237, 'duration': 2.042}, {'end': 3513.444, 'text': 'And the reason we use a plus is because we also want to minimize the difference of styles between the generated and the style image.', 'start': 3507.739, 'duration': 5.705}, {'end': 3521.171, 'text': "So you see, we don't have any terms that says style of the content image minus style of the generated image is minimized.", 'start': 3514.145, 'duration': 7.026}, {'end': 3522.692, 'text': 'This is the loss we want.', 'start': 3521.811, 'duration': 0.881}, {'end': 3532.073, 'text': 'Okay Okay.', 'start': 3524.674, 'duration': 7.399}, {'end': 3535.334, 'text': 'So, just going over the architecture again.', 'start': 3532.413, 'duration': 2.921}, {'end': 3540.135, 'text': "So, the last function we're going to use will be the one we saw.", 'start': 3535.354, 'duration': 4.781}, {'end': 3544.916, 'text': "And so, one thing that I want to emphasize here is we're not training the network.", 'start': 3540.835, 'duration': 4.081}, {'end': 3546.796, 'text': "There's no parameter that we train.", 'start': 3545.356, 'duration': 1.44}, {'end': 3549.917, 'text': 'The parameters are in the ImageNet classification network.', 'start': 3547.116, 'duration': 2.801}, {'end': 3551.417, 'text': "We use them, we don't train them.", 'start': 3550.237, 'duration': 1.18}, {'end': 3553.657, 'text': 'What we will train is the image.', 'start': 3552.017, 'duration': 1.64}, {'end': 3556.878, 'text': 'So, you get an image and you start with white noise.', 'start': 3554.717, 'duration': 2.161}, {'end': 3563.203, 'text': "you run this image through the classification network, but you don't care about the classification of this image.", 'start': 3558.018, 'duration': 5.185}, {'end': 3567.426, 'text': 'ImageNet is going to give a random class to this image, totally random.', 'start': 3563.783, 'duration': 3.643}, {'end': 3573.252, 'text': 'Uh, instead, you will extract content G and style G.', 'start': 3568.688, 'duration': 4.564}, {'end': 3576.418, 'text': 'Okay, So from this image,', 'start': 3574.997, 'duration': 1.421}], 'summary': 'Minimize l2 difference between style and generated style, plus content similarity. parameters in imagenet are not trained, only the image.', 'duration': 77.635, 'max_score': 3469.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3469161.jpg'}, {'end': 3657.379, 'src': 'embed', 'start': 3594.71, 'weight': 1, 'content': [{'end': 3597.131, 'text': 'instead of stopping in the network,', 'start': 3594.71, 'duration': 2.421}, {'end': 3605.015, 'text': 'you go all the way back to the pixels of the image and you decide how much should I move the pixels in order to make this loss go down.', 'start': 3597.131, 'duration': 7.884}, {'end': 3607.857, 'text': 'And you do that many times, you do that many times.', 'start': 3605.976, 'duration': 1.881}, {'end': 3614, 'text': 'And the more you do that, the more this is going to look like the content of the content image and the style of the style image.', 'start': 3608.357, 'duration': 5.643}, {'end': 3615.221, 'text': 'Yeah, one question.', 'start': 3614.601, 'duration': 0.62}, {'end': 3621.161, 'text': 'New example of content style images you need to do a new training like this? Yeah.', 'start': 3615.976, 'duration': 5.185}, {'end': 3629.328, 'text': 'So the downside of this network is, although it has the flexibility to work with any style, any content, every time you want to generate an image,', 'start': 3621.321, 'duration': 8.007}, {'end': 3630.769, 'text': 'you have to do this training loop.', 'start': 3629.328, 'duration': 1.441}, {'end': 3636.875, 'text': "While the other network that you talked about doesn't need that, because the model is trained to to convert the content to a style.", 'start': 3631.37, 'duration': 5.505}, {'end': 3638.176, 'text': 'you just give it and it goes.', 'start': 3636.875, 'duration': 1.301}, {'end': 3646.195, 'text': 'train the network on many kinds of, like 1A images, or do you only need to do it on one kind of kind of image? Which network??', 'start': 3640.273, 'duration': 5.922}, {'end': 3647.456, 'text': 'You talk about this network? Yeah.', 'start': 3646.235, 'duration': 1.221}, {'end': 3651.897, 'text': 'Yeah So do we need to train this network on 1A images? Usually not.', 'start': 3647.916, 'duration': 3.981}, {'end': 3654.618, 'text': 'This network is trained on millions of images.', 'start': 3652.517, 'duration': 2.101}, {'end': 3657.379, 'text': "It's basically seen everything you can imagine.", 'start': 3654.838, 'duration': 2.541}], 'summary': 'Training loop required for each new image, but network trained on millions of images for flexibility.', 'duration': 62.669, 'max_score': 3594.71, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3594710.jpg'}, {'end': 3754.138, 'src': 'embed', 'start': 3731.331, 'weight': 6, 'content': [{'end': 3739.579, 'text': 'I think probably the content because, uh, the, the edges at least look like the content is going to, to help the, the network, uh, converge quicker.', 'start': 3731.331, 'duration': 8.248}, {'end': 3740.18, 'text': "Yeah, that's true.", 'start': 3739.679, 'duration': 0.501}, {'end': 3741.801, 'text': "You don't have to start with white noise.", 'start': 3740.44, 'duration': 1.361}, {'end': 3745.585, 'text': 'In generally, the baseline is start with white noise so that anything can happen.', 'start': 3742.382, 'duration': 3.203}, {'end': 3749.208, 'text': "If you give it the content to start with, it's going to have a bias towards the content.", 'start': 3745.925, 'duration': 3.283}, {'end': 3751.591, 'text': "Yeah But if you train longer, it's fine.", 'start': 3749.849, 'duration': 1.742}, {'end': 3754.138, 'text': 'Okay One more question and then we can move on.', 'start': 3752.478, 'duration': 1.66}], 'summary': 'Starting with content helps network converge quicker, reducing need for white noise.', 'duration': 22.807, 'max_score': 3731.331, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3731331.jpg'}, {'end': 3810.465, 'src': 'embed', 'start': 3785.792, 'weight': 7, 'content': [{'end': 3793.699, 'text': 'but the network finds all the features on the image and then we use a post-processing technique that is called the Gram matrix in order to extract what we call style.', 'start': 3785.792, 'duration': 7.907}, {'end': 3797.982, 'text': "It's basically a, a cross-correlation of all the features of the network.", 'start': 3794.579, 'duration': 3.403}, {'end': 3799.624, 'text': 'We will learn it together later on.', 'start': 3798.263, 'duration': 1.361}, {'end': 3804.96, 'text': "Okay Let's move on to the next application because we don't have too much time.", 'start': 3801.897, 'duration': 3.063}, {'end': 3806.361, 'text': 'So this is the one I prefer.', 'start': 3805.42, 'duration': 0.941}, {'end': 3810.465, 'text': 'Uh, given a 10-second audio speech, detect the word activate.', 'start': 3806.381, 'duration': 4.084}], 'summary': "Using gram matrix to extract style from image features; detecting 'activate' in 10-second audio speech", 'duration': 24.673, 'max_score': 3785.792, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3785792.jpg'}], 'start': 3469.161, 'title': 'Style transfer and neural network image extraction', 'summary': 'Discusses the loss function for style transfer, emphasizing minimization of l2 differences between style and content images without parameter training, and using neural networks to extract content and style images through backpropagation, updating the image to resemble the content and style, and the use of imagenet to extract features for content and style recognition.', 'chapters': [{'end': 3567.426, 'start': 3469.161, 'title': 'Style transfer loss function', 'summary': 'Discusses the loss function for style transfer, emphasizing the minimization of l2 differences between style and content images, with no parameter training in the network.', 'duration': 98.265, 'highlights': ["The loss is the L2 difference between the style of the style image and the generated style, plus the L2 distance between the generator's content and the content's content.", 'The goal is to minimize both the L2 distance between the content of the content image and the content of the generated image, and the difference of styles between the generated and the style image.', 'The network is not trained; only the image is trained to minimize the loss function.']}, {'end': 3806.361, 'start': 3568.688, 'title': 'Neural network image extraction', 'summary': 'Discusses using neural networks to extract content and style images through backpropagation, updating the image to resemble the content and style, and the use of imagenet to extract features for content and style recognition.', 'duration': 237.673, 'highlights': ['Neural network uses backpropagation to update the image and resemble the content and style The process involves computing the loss function and derivatives, then backpropagating to the pixels of the image to determine pixel movement, which results in the image resembling the content and style.', 'Network flexibility and training loop for image generation The discussed network offers flexibility to work with any style and content, but requires a training loop for every image generation, unlike other networks trained to convert content to a style without the need for retraining.', 'Training of the network on millions of images for flexibility The network is trained on millions of images, allowing it to work with various styles and contents without the need for retraining on specific images.', 'Starting with white noise as a baseline for image generation The baseline approach involves starting with white noise to allow for unbiased generation, although using content as a starting point can lead to a bias towards the content, which can be mitigated by longer training.', 'Use of ImageNet for feature extraction for content and style recognition ImageNet is utilized to extract features for content and style recognition, where the network finds edges on the image and a post-processing technique using the Gram matrix is used to extract style features.']}], 'duration': 337.2, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3469161.jpg', 'highlights': ["The loss is the L2 difference between the style of the style image and the generated style, plus the L2 distance between the generator's content and the content's content.", 'Neural network uses backpropagation to update the image and resemble the content and style The process involves computing the loss function and derivatives, then backpropagating to the pixels of the image to determine pixel movement, which results in the image resembling the content and style.', 'The goal is to minimize both the L2 distance between the content of the content image and the content of the generated image, and the difference of styles between the generated and the style image.', 'The network is not trained; only the image is trained to minimize the loss function.', 'Network flexibility and training loop for image generation The discussed network offers flexibility to work with any style and content, but requires a training loop for every image generation, unlike other networks trained to convert content to a style without the need for retraining.', 'Training of the network on millions of images for flexibility The network is trained on millions of images, allowing it to work with various styles and contents without the need for retraining on specific images.', 'Starting with white noise as a baseline for image generation The baseline approach involves starting with white noise to allow for unbiased generation, although using content as a starting point can lead to a bias towards the content, which can be mitigated by longer training.', 'Use of ImageNet for feature extraction for content and style recognition ImageNet is utilized to extract features for content and style recognition, where the network finds edges on the image and a post-processing technique using the Gram matrix is used to extract style features.']}, {'end': 4960.29, 'segs': [{'end': 3855.627, 'src': 'embed', 'start': 3822.335, 'weight': 0, 'content': [{'end': 3832.913, 'text': 'What data do we need? Do we need a lot or no? probably a lot because there are many accents.', 'start': 3822.335, 'duration': 10.578}, {'end': 3842.659, 'text': "And one thing that is counter-intuitive is that if two humans like let's say, let's say two, two women speak as a human,", 'start': 3833.253, 'duration': 9.406}, {'end': 3847.322, 'text': 'you would say these voices are are pretty similar, right?', 'start': 3842.659, 'duration': 4.663}, {'end': 3848.523, 'text': 'You can detect the word.', 'start': 3847.542, 'duration': 0.981}, {'end': 3855.627, 'text': 'What the network sees is a list of numbers that are totally different from one person to another,', 'start': 3849.844, 'duration': 5.783}], 'summary': 'Need a lot of data due to many accents, voices appear similar but network detects different numbers.', 'duration': 33.292, 'max_score': 3822.335, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3822335.jpg'}, {'end': 3912.849, 'src': 'embed', 'start': 3881.671, 'weight': 1, 'content': [{'end': 3886.715, 'text': 'What should be the input of the network? It should be a 10-second audio clip that we can represent like that.', 'start': 3881.671, 'duration': 5.044}, {'end': 3892.259, 'text': 'The 10-second audio clip is going to contain some positive words in green.', 'start': 3887.555, 'duration': 4.704}, {'end': 3903.544, 'text': "positive word is activate and it's also going to contain negative words in pink, like kitchen lion, whatever words that are not activated.", 'start': 3892.259, 'duration': 11.285}, {'end': 3906.386, 'text': 'And we want only to detect the positive word.', 'start': 3904.485, 'duration': 1.901}, {'end': 3912.849, 'text': 'What should be the sample rate? Again, same question you would test on humans.', 'start': 3907.667, 'duration': 5.182}], 'summary': 'Input: 10-sec audio clip with positive & negative words. task: detect positive word.', 'duration': 31.178, 'max_score': 3881.671, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3881671.jpg'}, {'end': 4112.02, 'src': 'embed', 'start': 4077.896, 'weight': 3, 'content': [{'end': 4082.308, 'text': 'Uh, the, the important thing is to know that the first one would also work.', 'start': 4077.896, 'duration': 4.412}, {'end': 4084.29, 'text': 'We just need a ton of data.', 'start': 4082.849, 'duration': 1.441}, {'end': 4088.374, 'text': 'We need a lot more data to make the first labeling scheme work than we need for the second one.', 'start': 4084.51, 'duration': 3.864}, {'end': 4093.999, 'text': 'Does that make sense? So yeah, we will use something like that.', 'start': 4089.275, 'duration': 4.724}, {'end': 4101.145, 'text': 'Um, would, do you guys have a one where it, uh, activation where it starts or would you have one for the entire activation? Good question.', 'start': 4094.019, 'duration': 7.126}, {'end': 4102.988, 'text': 'Actually, this is not the best labeling scheme.', 'start': 4101.446, 'duration': 1.542}, {'end': 4112.02, 'text': 'As you said, should the one come before or after the word was said? What do you guys think? Before? After.', 'start': 4104.255, 'duration': 7.765}], 'summary': 'To make the first labeling scheme work, we need a lot more data than for the second one.', 'duration': 34.124, 'max_score': 4077.896, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU4077896.jpg'}, {'end': 4239.149, 'src': 'embed', 'start': 4207.678, 'weight': 4, 'content': [{'end': 4211.581, 'text': 'I think there are two things that are really critical when you, when you build such a project.', 'start': 4207.678, 'duration': 3.903}, {'end': 4217.045, 'text': 'The first one is to have a straight strategic data acquisition pipeline.', 'start': 4212.361, 'duration': 4.684}, {'end': 4219.547, 'text': "So, let's talk more about that.", 'start': 4218.587, 'duration': 0.96}, {'end': 4226.253, 'text': 'We said that our data should be 10-second audio clips that contain positive and negative words from many different accents.', 'start': 4219.767, 'duration': 6.486}, {'end': 4239.149, 'text': 'How would you collect this data? Try.', 'start': 4228.413, 'duration': 10.736}], 'summary': 'Critical for project: strategic data pipeline for 10-second audio clips with positive and negative words from various accents.', 'duration': 31.471, 'max_score': 4207.678, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU4207678.jpg'}, {'end': 4428.929, 'src': 'embed', 'start': 4402.019, 'weight': 5, 'content': [{'end': 4405.542, 'text': 'take one second audio clips of negative words of the same people as well.', 'start': 4402.019, 'duration': 3.523}, {'end': 4410.305, 'text': 'put it in the pink database and get background noise from anywhere I can find it.', 'start': 4405.542, 'duration': 4.763}, {'end': 4410.946, 'text': "It's very cheap.", 'start': 4410.365, 'duration': 0.581}, {'end': 4414.288, 'text': 'And then create this synthetic data, label it automatically.', 'start': 4411.466, 'duration': 2.822}, {'end': 4421.977, 'text': 'And you know, with like five positive words, five negative words, five backgrounds, you can create a lot of data points.', 'start': 4416.288, 'duration': 5.689}, {'end': 4428.929, 'text': 'Okay So this is an important technique that you might wanna think about in your projects.', 'start': 4424.962, 'duration': 3.967}], 'summary': 'Generate synthetic data by combining negative words and background noise for cost-effective and efficient data creation.', 'duration': 26.91, 'max_score': 4402.019, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU4402019.jpg'}, {'end': 4552.522, 'src': 'embed', 'start': 4522.893, 'weight': 6, 'content': [{'end': 4525.074, 'text': 'This is a spectrogram of an audio speech.', 'start': 4522.893, 'duration': 2.181}, {'end': 4527.195, 'text': "You're going to learn a little bit more about that.", 'start': 4525.654, 'duration': 1.541}, {'end': 4535.177, 'text': 'So, after I got the spectrogram, which is better than the 1D signal for the network, I would use an LSTM, which is a recurrent neural network,', 'start': 4527.295, 'duration': 7.882}, {'end': 4540.252, 'text': 'and add a sigmoid layer after it to get probabilities between zero and one.', 'start': 4535.177, 'duration': 5.075}, {'end': 4547.098, 'text': "I would threshold them, everything be more than 0.5, I would consider that it's a one, everything less it's a zero.", 'start': 4540.793, 'duration': 6.305}, {'end': 4552.522, 'text': "I tried for a long time fitting this network on the data, it didn't work.", 'start': 4547.959, 'duration': 4.563}], 'summary': 'Using spectrogram for speech, lstm with sigmoid layer, thresholding, and unsuccessful fitting.', 'duration': 29.629, 'max_score': 4522.893, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU4522893.jpg'}, {'end': 4609.89, 'src': 'embed', 'start': 4565.053, 'weight': 7, 'content': [{'end': 4566.814, 'text': 'He could told me, he could have told me.', 'start': 4565.053, 'duration': 1.761}, {'end': 4571.135, 'text': 'So he told me there are several issues with this network.', 'start': 4567.694, 'duration': 3.441}, {'end': 4575.577, 'text': 'The first one is your hyperparameters in the Fourier transform.', 'start': 4571.875, 'duration': 3.702}, {'end': 4576.137, 'text': "They're wrong.", 'start': 4575.717, 'duration': 0.42}, {'end': 4581.099, 'text': 'Go on my GitHub, you will find what hyperparameters I use for this Fourier transform.', 'start': 4576.777, 'duration': 4.322}, {'end': 4586.02, 'text': 'You will find specifically what sample rate, what window size, what frequencies I used.', 'start': 4581.159, 'duration': 4.861}, {'end': 4587.641, 'text': 'So that was better.', 'start': 4586.941, 'duration': 0.7}, {'end': 4592.443, 'text': 'Then he said, one issue is that your recurrent neural network is too big.', 'start': 4588.521, 'duration': 3.922}, {'end': 4593.763, 'text': "It's super hard to train.", 'start': 4592.843, 'duration': 0.92}, {'end': 4595.404, 'text': 'Instead, you should reduce it.', 'start': 4594.323, 'duration': 1.081}, {'end': 4601.447, 'text': "So I've used, so he told me to use a convolution to reduce the number of time steps of my audio clip.", 'start': 4596.345, 'duration': 5.102}, {'end': 4603.728, 'text': 'You will learn about all these layers later.', 'start': 4601.867, 'duration': 1.861}, {'end': 4609.89, 'text': 'Uh, and also use batch norm, which is a specific type of layer that, that makes the training easier.', 'start': 4603.748, 'duration': 6.142}], 'summary': 'Issues: wrong hyperparameters, big rnn; solutions: specific hyperparameters, smaller rnn, convolution, batch norm', 'duration': 44.837, 'max_score': 4565.053, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU4565053.jpg'}, {'end': 4908.362, 'src': 'embed', 'start': 4879.083, 'weight': 9, 'content': [{'end': 4882.325, 'text': 'So, these are the predictions and the other ones are the ground truth.', 'start': 4879.083, 'duration': 3.242}, {'end': 4892.99, 'text': 'What is interesting is that a two-pixel error on this cat is much more important than a two-pixel error on this human because the box is smaller.', 'start': 4883.305, 'duration': 9.685}, {'end': 4900.013, 'text': "So, that's why you use a square root to penalize more the errors on small boxes than on big boxes.", 'start': 4893.53, 'duration': 6.483}, {'end': 4903.194, 'text': 'Okay And finally, the final slide.', 'start': 4900.993, 'duration': 2.201}, {'end': 4904.515, 'text': "Okay Let's go over that.", 'start': 4903.534, 'duration': 0.981}, {'end': 4908.362, 'text': 'So just recalling what we have for next week.', 'start': 4906.161, 'duration': 2.201}], 'summary': 'Using square root to penalize errors on small boxes more than big boxes in object detection.', 'duration': 29.279, 'max_score': 4879.083, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU4879083.jpg'}], 'start': 3806.381, 'title': 'Speech recognition implementation', 'summary': "Covers word 'activate' detection in a 10-second audio speech, emphasizing the need for a large dataset with various accents, gender voices, and age groups. it discusses challenges of speech labeling, comparison between human and computer labeling schemes, and the importance of architecture search, hyperparameter tuning, and seeking expert advice for successful project implementation.", 'chapters': [{'end': 3942.101, 'start': 3806.381, 'title': 'Word activate detection', 'summary': "Discusses the process of detecting the word 'activate' in a 10-second audio speech, emphasizing the need for a large dataset containing various accents, gender voices, and age groups, and the use of a classification model to detect the word with a 10-second audio clip.", 'duration': 135.72, 'highlights': ['The need for a large dataset containing various accents, gender voices, and age groups is emphasized, as different frequencies in voices can appear very similar to humans but are represented by totally different numbers for the network. Emphasizes the need for a large dataset with various accents, gender voices, and age groups due to the different representations of frequencies in voices, despite appearing similar to humans.', "The input of the network should be a 10-second audio clip containing positive words like 'activate' in green and negative words in pink, with the goal of detecting only the positive word. Specifies the requirement for the network's input to be a 10-second audio clip with positive and negative words, emphasizing the detection of the positive word 'activate'.", "The output of the network is suggested to be a classification model, providing a binary result of 'yes' or 'no' for the presence of the word 'activate' in the audio clip. Suggests the output of the network to be a classification model with a binary result, indicating the presence or absence of the word 'activate' in the audio clip."]}, {'end': 4421.977, 'start': 3943.822, 'title': 'Speech labeling and data acquisition', 'summary': 'Discusses the challenges of speech labeling, the comparison between human and computer labeling schemes, the need for ample data, and the criticality of a strategic data acquisition pipeline for success in a speech recognition project.', 'duration': 478.155, 'highlights': ['The comparison between human and computer labeling schemes reveals the superiority of the latter, requiring less data for effective results. The computer labeling scheme is found to be more effective than the human one, requiring less data for successful implementation.', 'The need for a strategic data acquisition pipeline is emphasized, with methods including manual labeling, collection from diverse accents, and programmatic generation of samples. The chapter stresses the importance of a strategic data acquisition pipeline, including methods such as manual labeling, collection from diverse accents, and programmatic generation of samples for efficient data acquisition.', 'The use of background noise and programmatic generation of samples is highlighted as a cost-effective and efficient method for creating a large volume of data points. The method of using background noise and programmatic generation of samples is emphasized as a cost-effective and efficient way to create a large volume of data points for training models.']}, {'end': 4960.29, 'start': 4424.962, 'title': 'Importance of architecture search and expert advice', 'summary': 'Emphasizes the importance of architecture search, hyperparameter tuning, and seeking expert advice for successful project implementation, as demonstrated through the use of fourier transform and lstm in speech recognition, with specific guidance on hyperparameters and network optimization, along with an overview of loss function in object detection.', 'duration': 535.328, 'highlights': ['The chapter emphasizes the importance of seeking expert advice for architecture search and hyperparameter tuning, as demonstrated through the successful implementation of Fourier transform and LSTM in speech recognition, with specific guidance on hyperparameters and network optimization. importance of seeking expert advice, successful implementation of Fourier transform and LSTM in speech recognition, specific guidance on hyperparameters and network optimization', 'Specific guidance is provided on hyperparameters for the Fourier transform, including sample rate, window size, and frequencies, leading to improved network performance. specific guidance on hyperparameters for the Fourier transform, improved network performance', 'Recommendations are given to optimize the network architecture, including reducing the size of the recurrent neural network and utilizing convolution and batch normalization layers, resulting in a successfully trained architecture within a day. recommendations to optimize the network architecture, reducing the size of the recurrent neural network, utilizing convolution and batch normalization layers, successfully trained architecture within a day', 'Insights are shared on the loss function in object detection, specifically discussing the components related to bounding boxes, objectness probability, and class identification, along with the rationale behind using a square root to penalize errors on small bounding boxes. insights on the loss function in object detection, components related to bounding boxes, objectness probability, class identification, rationale behind using a square root to penalize errors on small bounding boxes']}], 'duration': 1153.909, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AwQHqWyHRpU/pics/AwQHqWyHRpU3806381.jpg', 'highlights': ['Emphasizes the need for a large dataset with various accents, gender voices, and age groups due to the different representations of frequencies in voices, despite appearing similar to humans.', "Specifies the requirement for the network's input to be a 10-second audio clip with positive and negative words, emphasizing the detection of the positive word 'activate'.", "Suggests the output of the network to be a classification model with a binary result, indicating the presence or absence of the word 'activate' in the audio clip.", 'The computer labeling scheme is found to be more effective than the human one, requiring less data for successful implementation.', 'The chapter stresses the importance of a strategic data acquisition pipeline, including methods such as manual labeling, collection from diverse accents, and programmatic generation of samples for efficient data acquisition.', 'The method of using background noise and programmatic generation of samples is emphasized as a cost-effective and efficient way to create a large volume of data points for training models.', 'Importance of seeking expert advice, successful implementation of Fourier transform and LSTM in speech recognition, specific guidance on hyperparameters and network optimization', 'Specific guidance on hyperparameters for the Fourier transform, improved network performance', 'Recommendations to optimize the network architecture, reducing the size of the recurrent neural network, utilizing convolution and batch normalization layers, successfully trained architecture within a day', 'Insights on the loss function in object detection, components related to bounding boxes, objectness probability, class identification, rationale behind using a square root to penalize errors on small bounding boxes']}], 'highlights': ['The lecture provides a systematic approach for deep learning projects, covering neural network encoding, image classification, face verification considerations, image comparison, advanced image processing techniques, style transfer, and speech recognition implementation, emphasizing the importance of dataset size and architecture selection for successful project implementation.', "The function used to compare the model's output to the ground truth is called the loss function, and computing its gradient guides parameter updates to minimize the loss and improve model performance.", 'The chapter emphasizes the importance of choosing the right loss function, with a focus on multi-logistic regression and the implications of different label encoding methods in image classification.', 'The flexibility in changing the neural network architecture, including activation functions, optimizers, and hyper-parameters, is discussed.', 'The network creates high-level complex features, such as eyes, nose, and mouth, in face recognition networks, through the processing and representation of input data by the neurons in each layer.', 'Factors like the difficulty of the classification task, such as distinguishing indoor or twilight images, impact the dataset size required for effective training.', 'The speaker suggests that 100,000 images may be needed to train a network to detect indoor scenes, based on past project comparisons.', 'The discussion emphasizes the need for low resolution to improve computational efficiency, with the suggestion to minimize resolution while achieving good performance.', 'A practical approach to resolution selection involves comparing model performance with human classification at different resolutions, aiming to achieve perfect human performance at the minimum resolution.', 'Implementing face verification for student IDs in the gym, requiring a dataset with ID-image mapping and labeled photos of students.', 'The trade-off between computation and resolution is crucial, with more complex tasks requiring larger datasets for training the network.', 'Facial image encoding involves finding a function to apply to two images to provide a better representation, such as encoding features like distance between eyes and color into a vector.', 'The process of designing a loss function involves gathering student face encodings in a database and computing the distance between the new picture and all the vectors in the database for identity verification.', 'The process involves using the first layers to figure out the edges, the second layer to identify the nose and eyes, and the third layer to determine the distances between facial features.', 'The more complex the task, the deeper you would go in the deep network for face verification, to obtain high-level features like distance between eyes and nose, and facial features.', 'The process of generating an image that combines the content of one image with the style of another using techniques such as backpropagation to the image, encoding for content extraction, and the use of a technique called Gram matrix for style extraction.', "The loss is the L2 difference between the style of the style image and the generated style, plus the L2 distance between the generator's content and the content's content.", 'Emphasizes the need for a large dataset with various accents, gender voices, and age groups due to the different representations of frequencies in voices, despite appearing similar to humans.', "Specifies the requirement for the network's input to be a 10-second audio clip with positive and negative words, emphasizing the detection of the positive word 'activate'.", 'The computer labeling scheme is found to be more effective than the human one, requiring less data for successful implementation.', 'The chapter stresses the importance of a strategic data acquisition pipeline, including methods such as manual labeling, collection from diverse accents, and programmatic generation of samples for efficient data acquisition.']}