title
Computer Vision and Perception for Self-Driving Cars (Deep Learning Course)
description
Learn about computer vision and perception for self-driving cars. This series covers the tasks that a self-driving car's perception unit is required to perform.
✏️ Course by Robotics with Sakshay. https://www.youtube.com/channel/UC57lEMTXZzXYu_y0FKdW6xA
⭐️ Course Contents and Links ⭐️
⌨️ (0:00:00) Introduction
⌨️ (0:02:16) Fully Convolutional Network | Road Segmentation
🔗 Kaggle Dataset: https://www.kaggle.com/sakshaymahna/kittiroadsegmentation
🔗 Kaggle Notebook: https://www.kaggle.com/sakshaymahna/fully-convolutional-network
🔗 KITTI Dataset: http://www.cvlibs.net/datasets/kitti/
🔗 Fully Convolutional Network Paper: https://arxiv.org/abs/1411.4038
🔗 Hand Crafted Road Segmentation: https://www.youtube.com/watch?v=hrin-qTn4L4
🔗 Deep Learning and CNNs: https://www.youtube.com/watch?v=aircAruvnKk
⌨️ (0:20:45) YOLO | 2D Object Detection
🔗 Kaggle Competition/Dataset: https://www.kaggle.com/c/3d-object-detection-for-autonomous-vehicles
🔗 Visualization Notebook: https://www.kaggle.com/sakshaymahna/lyft-3d-object-detection-eda
🔗 YOLO Notebook: https://www.kaggle.com/sakshaymahna/yolov3-keras-2d-object-detection
🔗 Playlist on Fundamentals of Object Detection: https://www.youtube.com/playlist?list=PL_IHmaMAvkVxdDOBRg2CbcJBq9SY7ZUvs
🔗 Blog on YOLO: https://www.section.io/engineering-education/introduction-to-yolo-algorithm-for-object-detection/
🔗 YOLO Paper: https://arxiv.org/abs/1506.02640
⌨️ (0:35:51) Deep SORT | Object Tracking
🔗 Dataset: https://www.kaggle.com/sakshaymahna/kittiroadsegmentation
🔗 Notebook/Code: https://www.kaggle.com/sakshaymahna/deepsort/notebook
🔗 Blog on Deep SORT: https://medium.com/analytics-vidhya/object-tracking-using-deepsort-in-tensorflow-2-ec013a2eeb4f
🔗 Deep SORT Paper: https://arxiv.org/abs/1703.07402
🔗 Kalman Filter: https://www.youtube.com/playlist?list=PLn8PRpmsu08pzi6EMiYnR-076Mh-q3tWr
🔗 Hungarian Algorithm: https://www.geeksforgeeks.org/hungarian-algorithm-assignment-problem-set-1-introduction/
🔗 Cosine Distance Metric: https://www.machinelearningplus.com/nlp/cosine-similarity/
🔗 Mahalanobis Distance: https://www.machinelearningplus.com/statistics/mahalanobis-distance/
🔗 YOLO Algorithm: https://youtu.be/C3qmhPVUXiE
⌨️ (0:52:37) KITTI 3D Data Visualization | Homogeneous Transformations
🔗 Dataset: https://www.kaggle.com/garymk/kitti-3d-object-detection-dataset
🔗 Notebook/Code: https://www.kaggle.com/sakshaymahna/lidar-data-visualization/notebook
🔗 LIDAR: https://geoslam.com/what-is-lidar/
🔗 Tesla doesn't use LIDAR: https://towardsdatascience.com/why-tesla-wont-use-lidar-57c325ae2ed5
⌨️ (1:06:45) Multi Task Attention Network (MTAN) | Multi Task Learning
🔗 Dataset: https://www.kaggle.com/sakshaymahna/cityscapes-depth-and-segmentation
🔗 Notebook/Code: https://www.kaggle.com/sakshaymahna/mtan-multi-task-attention-network
🔗 Data Visualization: https://www.kaggle.com/sakshaymahna/exploratory-data-analysis
🔗 MTAN Paper: https://arxiv.org/abs/1803.10704
🔗 Blog on Multi Task Learning: https://ruder.io/multi-task/
🔗 Image Segmentation and FCN: https://youtu.be/U_v0Tovp4XQ
⌨️ (1:20:58) SFA 3D | 3D Object Detection
🔗 Dataset: https://www.kaggle.com/garymk/kitti-3d-object-detection-dataset
🔗 Notebook/Code: https://www.kaggle.com/sakshaymahna/sfa3d
🔗 Data Visualization: https://www.kaggle.com/sakshaymahna/l...
🔗 Data Visualization Video: https://youtu.be/tb1H42kE0eE
🔗 SFA3D GitHub Repository: https://github.com/maudzung/SFA3D
🔗 Feature Pyramid Networks: https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c
🔗 Keypoint Feature Pyramid Network: https://arxiv.org/pdf/2001.03343.pdf
🔗 Heat Maps: https://en.wikipedia.org/wiki/Heat_map
🔗 Focal Loss: https://medium.com/visionwizard/understanding-focal-loss-a-quick-read-b914422913e7
🔗 L1 Loss: https://afteracademy.com/blog/what-are-l1-and-l2-loss-functions
🔗 Balanced L1 Loss: https://paperswithcode.com/method/balanced-l1-loss
🔗 Learning Rate Decay: https://medium.com/analytics-vidhya/learning-rate-decay-and-methods-in-deep-learning-2cee564f910b
🔗 Cosine Annealing: https://paperswithcode.com/method/cosine-annealing
⌨️ (1:40:24) UNetXST | Camera to Bird's Eye View
🔗 Dataset: https://www.kaggle.com/sakshaymahna/semantic-segmentation-bev
🔗 Dataset Visualization: https://www.kaggle.com/sakshaymahna/data-visualization
🔗 Notebook/Code: https://www.kaggle.com/sakshaymahna/unetxst
🔗 UNetXST Paper: https://arxiv.org/pdf/2005.04078.pdf
🔗 UNetXST GitHub Repository: https://github.com/ika-rwth-aachen/Cam2BEV
🔗 UNet: https://towardsdatascience.com/understanding-semantic-segmentation-with-unet-6be4f42d4b47
🔗 Image Transformations: https://kevinzakka.github.io/2017/01/10/stn-part1/
🔗 Spatial Transformer Networks: https://kevinzakka.github.io/2017/01/18/stn-part2/
detail
{'title': 'Computer Vision and Perception for Self-Driving Cars (Deep Learning Course)', 'heatmap': [{'end': 648.067, 'start': 568.146, 'weight': 0.744}, {'end': 792.721, 'start': 713.668, 'weight': 1}, {'end': 939.175, 'start': 861.761, 'weight': 0.83}, {'end': 1303.921, 'start': 1225.578, 'weight': 0.734}], 'summary': "Course on computer vision and perception for self-driving cars covers techniques for perception, including road segmentation, 2d and 3d object detection, image upsampling, lyft's 3d object detection challenge, object tracking algorithms, multitask learning, sfa3d for 3d object detection, and unet and spatial transformers implementation, offering practical insights into various deep learning architectures and algorithms for self-driving cars.", 'chapters': [{'end': 370.336, 'segs': [{'end': 50.496, 'src': 'embed', 'start': 17.702, 'weight': 0, 'content': [{'end': 26.089, 'text': 'In this course, we will go over a number of different projects that are related to the perception module of the self-driving car.', 'start': 17.702, 'duration': 8.387}, {'end': 30.29, 'text': 'This course is meant for an intermediate level of audience.', 'start': 26.989, 'duration': 3.301}, {'end': 38.032, 'text': 'You are required to know the basic concepts of Python as well as some basic concepts regarding deep learning.', 'start': 31.01, 'duration': 7.022}, {'end': 50.496, 'text': 'This course is meant for you if you are looking to get into the field of self-driving cars or if you are interested to do projects that are related to computer vision using deep learning.', 'start': 38.893, 'duration': 11.603}], 'summary': 'Intermediate course on perception in self-driving cars using python and deep learning.', 'duration': 32.794, 'max_score': 17.702, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI17702.jpg'}, {'end': 102.847, 'src': 'embed', 'start': 68.809, 'weight': 1, 'content': [{'end': 71.05, 'text': 'Our first 
project would be road segmentation.', 'start': 68.809, 'duration': 2.241}, {'end': 79.654, 'text': 'In this project, we would be doing segmentation of road images using a deep learning technique called fully convolutional networks.', 'start': 71.43, 'duration': 8.224}, {'end': 89.759, 'text': 'In the next project we would be doing 2D object detection on various different traffic scenarios using a deep learning technique called the YOLO algorithm.', 'start': 80.394, 'duration': 9.365}, {'end': 95.862, 'text': 'Next, we will be tracking those object detections using a technique called deep sort.', 'start': 90.559, 'duration': 5.303}, {'end': 102.847, 'text': 'After we are done with the 2D part in these three projects, we will be moving on to the 3D part.', 'start': 96.98, 'duration': 5.867}], 'summary': 'Projects include road segmentation, 2d object detection using yolo, and object tracking using deep sort.', 'duration': 34.038, 'max_score': 68.809, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI68809.jpg'}, {'end': 156.239, 'src': 'embed', 'start': 129.888, 'weight': 4, 'content': [{'end': 135.394, 'text': 'All the data set and the code notebooks are provided so that you can follow along with this course.', 'start': 129.888, 'duration': 5.506}, {'end': 140.969, 'text': 'Alright. 
So what is the road segmentation problem?', 'start': 137.867, 'duration': 3.102}, {'end': 149.374, 'text': 'You would say in road segmentation, given an input image, we are required to identify where the roads are present in this image.', 'start': 140.989, 'duration': 8.385}, {'end': 156.239, 'text': 'Now we as humans can see that the road is present right over here, somewhere in the middle of the image.', 'start': 150.095, 'duration': 6.144}], 'summary': 'Identify the road in an image using provided data and code notebooks.', 'duration': 26.351, 'max_score': 129.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI129888.jpg'}, {'end': 258.634, 'src': 'embed', 'start': 231.735, 'weight': 2, 'content': [{'end': 239.439, 'text': 'If our model is able to perform well on these images, we can be kind of sure that it is going to perform well in the real world as well.', 'start': 231.735, 'duration': 7.704}, {'end': 242.36, 'text': 'But there are some complications that are required as well.', 'start': 239.679, 'duration': 2.681}, {'end': 249.522, 'text': 'Okay, so by now you would have understood what is the problem of road segmentation and the dataset that we are working on,', 'start': 243.194, 'duration': 6.328}, {'end': 251.705, 'text': 'which is the KITTI road dataset.', 'start': 249.522, 'duration': 2.183}, {'end': 258.634, 'text': 'Now, for this task of road segmentation, there exist a number of computer vision techniques that we can apply.', 'start': 252.206, 'duration': 6.428}], 'summary': 'Model must perform well on images for real-world application. 
Complications and techniques for road segmentation in the KITTI road dataset.', 'duration': 26.899, 'max_score': 231.735, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI231735.jpg'}, {'end': 320.211, 'src': 'embed', 'start': 295.757, 'weight': 3, 'content': [{'end': 305.144, 'text': 'Then, with the advent of deep learning, we got rid of such hand crafting techniques wherein the computer would automatically learn these parameters,', 'start': 295.757, 'duration': 9.387}, {'end': 307.186, 'text': 'given a diverse set of data.', 'start': 305.144, 'duration': 2.042}, {'end': 311.93, 'text': 'So one such technique that we are going to discuss today is fully convolutional networks.', 'start': 307.606, 'duration': 4.324}, {'end': 320.211, 'text': 'Now, fully convolutional networks were one of the first techniques that introduced a method of segmentation end-to-end.', 'start': 312.526, 'duration': 7.685}], 'summary': 'Deep learning eliminates hand-crafting, fully convolutional networks introduced end-to-end segmentation method.', 'duration': 24.454, 'max_score': 295.757, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI295757.jpg'}], 'start': 0.049, 'title': 'Perception for self-driving cars', 'summary': "Focuses on a course teaching techniques for perception in self-driving cars, including road segmentation, 2D and 3D object detection, multi-task learning, and bird's eye view visualization. 
It also discusses utilizing fully convolutional networks to solve road segmentation using the KITTI road dataset, eliminating the need for hand-crafted techniques.", 'chapters': [{'end': 129.287, 'start': 0.049, 'title': 'Course on perception for self-driving cars', 'summary': "Focuses on a course teaching various techniques related to perception for self-driving cars, including road segmentation, 2D object detection, 3D data visualization, multi-task learning, 3D object detection, and bird's eye view visualization.", 'duration': 129.238, 'highlights': ["The course covers projects related to perception for self-driving cars, including road segmentation, 2D object detection, 3D data visualization, multi-task learning, 3D object detection, and bird's eye view visualization.", 'The projects involve techniques such as fully convolutional networks for road segmentation and YOLO algorithm for 2D object detection.', 'The course is suitable for intermediate level individuals with knowledge of Python and basic concepts of deep learning.', 'The instructor, Sakshay, also promotes his robotics and artificial intelligence channel for further related content.', 'The chapter emphasizes the practical implementation of computer vision and deep learning techniques in the context of self-driving cars, catering to individuals interested in this field or related projects.']}, {'end': 370.336, 'start': 129.888, 'title': 'Road segmentation using fully convolutional networks', 'summary': 'Discusses the road segmentation problem, using the KITTI road dataset to train a fully convolutional network to output a binary image marking the presence of roads, eliminating the need for hand-crafted techniques and achieving end-to-end segmentation.', 'duration': 240.448, 'highlights': ['The chapter discusses the road segmentation problem and introduces the KITTI road dataset, consisting of input images and desired output images, used to evaluate model performance. 
The KITTI road dataset is used to evaluate model performance, ensuring the model can perform well in the real world.', 'The chapter explains how fully convolutional networks (FCN) have revolutionized road segmentation by enabling end-to-end segmentation without the need for hand-crafted techniques. Fully convolutional networks eliminate the need for hand-crafted techniques by automatically learning parameters, providing end-to-end segmentation.', "The chapter outlines the goal of the road segmentation problem, which is to identify the presence of roads in input images and output a binary image marking the road's presence."]}], 'duration': 370.287, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI49.jpg', 'highlights': ["The course covers projects related to perception for self-driving cars, including road segmentation, 2D object detection, 3D data visualization, multi-task learning, 3D object detection, and bird's eye view visualization.", 'The projects involve techniques such as fully convolutional networks for road segmentation and YOLO algorithm for 2D object detection.', 'The chapter discusses the road segmentation problem and introduces the KITTI road dataset, consisting of input images and desired output images, used to evaluate model performance.', 'The chapter explains how fully convolutional networks (FCN) have revolutionized road segmentation by enabling end-to-end segmentation without the need for hand-crafted techniques.', "The chapter outlines the goal of the road segmentation problem, which is to identify the presence of roads in input images and output a binary image marking the road's presence."]}, {'end': 1245.025, 'segs': [{'end': 514.9, 'src': 'embed', 'start': 486.193, 'weight': 2, 'content': [{'end': 489.114, 'text': 'And we do this similarly 
for the complete matrix over here.', 'start': 486.193, 'duration': 2.921}, {'end': 494.737, 'text': 'Now, an even better method than nearest neighbor is called interpolation.', 'start': 490.135, 'duration': 4.602}, {'end': 502.417, 'text': 'Now interpolation is more like an average finding technique where we assign values based on a weighted average.', 'start': 495.495, 'duration': 6.922}, {'end': 507.418, 'text': 'How we do that? In interpolation, we try to work our way backward.', 'start': 502.777, 'duration': 4.641}, {'end': 514.9, 'text': 'If we want to calculate the value of this cell over here, then we approximately try to overlap this matrix with this matrix.', 'start': 507.678, 'duration': 7.222}], 'summary': 'Interpolation assigns values based on a weighted average to calculate cell values in a matrix.', 'duration': 28.707, 'max_score': 486.193, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI486193.jpg'}, {'end': 592.476, 'src': 'embed', 'start': 568.146, 'weight': 0, 'content': [{'end': 575.788, 'text': 'Now, interpolation method is one of the best technique that we could apply, but there exists an even better technique that we can apply as well,', 'start': 568.146, 'duration': 7.642}, {'end': 577.769, 'text': 'which is transposed convolutions.', 'start': 575.788, 'duration': 1.981}, {'end': 585.893, 'text': 'In all the techniques that we discussed before, we were assigning the values based on some heuristic that we hard-coded.', 'start': 579.409, 'duration': 6.484}, {'end': 592.476, 'text': 'Instead, in transposed convolution, we assign the values based on a learnable weight filter.', 'start': 586.493, 'duration': 5.983}], 'summary': 'Transposed convolutions is a better technique than interpolation method, using learnable weight filter for value assignment.', 'duration': 24.33, 'max_score': 568.146, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI568146.jpg'}, {'end': 648.067, 'src': 'heatmap', 'start': 568.146, 'weight': 0.744, 'content': [{'end': 575.788, 'text': 'Now, interpolation method is one of the best technique that we could apply, but there exists an even better technique that we can apply as well,', 'start': 568.146, 'duration': 7.642}, {'end': 577.769, 'text': 'which is transposed convolutions.', 'start': 575.788, 'duration': 1.981}, {'end': 585.893, 'text': 'In all the techniques that we discussed before, we were assigning the values based on some heuristic that we hard-coded.', 'start': 579.409, 'duration': 6.484}, {'end': 592.476, 'text': 'Instead, in transposed convolution, we assign the values based on a learnable weight filter.', 'start': 586.493, 'duration': 5.983}, {'end': 596.779, 'text': 'We define a filter of some size with some random weight values.', 'start': 593.137, 'duration': 3.642}, {'end': 604.423, 'text': 'We then use those weight values to assign the value for all the cells that are present in this matrix and pass this to the network.', 'start': 597.339, 'duration': 7.084}, {'end': 612.688, 'text': 'When the network is going to do a backpropagation, we are going to backpropagate the loss of the weights of those filters as well,', 'start': 605.083, 'duration': 7.605}, {'end': 614.709, 'text': 'and they are going to be learned as well.', 'start': 612.688, 'duration': 2.021}, {'end': 621.214, 'text': 'For the purpose of this project, I tried two methods which were interpolation and transpose convolutions.', 'start': 615.21, 'duration': 6.004}, {'end': 629.578, 'text': 'From what I saw, transposed convolutions did not work very well but interpolation method worked really well for our case.', 'start': 621.974, 'duration': 7.604}, {'end': 636.381, 'text': "Now, since we have understood what the upsampling operation is all about, so let's now discuss the architecture of the fully 
convolutional network.", 'start': 630.038, 'duration': 6.343}, {'end': 644.265, 'text': 'So, for the fully convolutional network, we require a backbone network which is an image classification network called VGGNet.', 'start': 636.961, 'duration': 7.304}, {'end': 648.067, 'text': 'Now, what is image classification and what is VGGNet?', 'start': 644.785, 'duration': 3.282}], 'summary': 'Transposed convolutions and interpolation methods were tested for upsampling, with the latter being more effective. the fully convolutional network requires a vggnet backbone for image classification.', 'duration': 79.921, 'max_score': 568.146, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI568146.jpg'}, {'end': 792.721, 'src': 'heatmap', 'start': 713.668, 'weight': 1, 'content': [{'end': 719.652, 'text': 'In the segmentation problem, the input is an image and the output is also an image that is of the same dimensions.', 'start': 713.668, 'duration': 5.984}, {'end': 725.296, 'text': "So, let's see the changes that we require to make to this VGG net in order to output such images.", 'start': 720.152, 'duration': 5.144}, {'end': 732.741, 'text': 'Now, VGG network is composed of a number of convolutional blocks which are placed sequentially to each other.', 'start': 725.876, 'duration': 6.865}, {'end': 738.961, 'text': 'of two convolutional layers followed by a max pooling layer.', 'start': 735.279, 'duration': 3.682}, {'end': 746.706, 'text': 'Now what we do is we extract the output of last three convolutional blocks which are pool 3, pool 4 and pool 5.', 'start': 739.662, 'duration': 7.044}, {'end': 753.531, 'text': 'So what we do first we up sample this pool 5 times 2 so we have a bigger image over here.', 'start': 746.706, 'duration': 6.825}, {'end': 758.194, 'text': 'Then we add this output to the output of pool 4 then what we get over here.', 'start': 754.131, 'duration': 4.063}, {'end': 767.899, 'text': 'We again upsample 
this output 2 times to get an output over here and finally again add this pool 3 layer which then we get the output over here.', 'start': 758.994, 'duration': 8.905}, {'end': 775.644, 'text': 'Finally, we upsample this output 8 times and the final output that we get is the segmentation mask that we require.', 'start': 768.54, 'duration': 7.104}, {'end': 784.809, 'text': 'Now, this long sequence is what constitutes the FCN architecture or particularly this is the FCN 8 architecture.', 'start': 776.364, 'duration': 8.445}, {'end': 792.721, 'text': 'There exist two other FCN architectures as well, FCN32 and FCN16.', 'start': 787.116, 'duration': 5.605}], 'summary': 'Modified vgg net outputs segmentation mask using fcn architecture.', 'duration': 79.053, 'max_score': 713.668, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI713668.jpg'}, {'end': 861.081, 'src': 'embed', 'start': 832.56, 'weight': 3, 'content': [{'end': 839.245, 'text': 'What this means is that FCN32 identifies coarser structures in the image,', 'start': 832.56, 'duration': 6.685}, {'end': 847.111, 'text': 'FCN16 identifies more finer structures in the image and FCN8 even more finer structures in the image.', 'start': 839.685, 'duration': 7.426}, {'end': 854.936, 'text': 'Using these addition operations that we have over here, we are combining the knowledge of these three networks to get the final output.', 'start': 847.571, 'duration': 7.365}, {'end': 861.081, 'text': 'Now these are the complete details of the architecture and how this architecture is somewhat working on the inside.', 'start': 855.397, 'duration': 5.684}], 'summary': 'Fcn32 identifies coarser structures, fcn16 more finer, fcn8 even more finer, combining knowledge of three networks for final output.', 'duration': 28.521, 'max_score': 832.56, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI832560.jpg'}, {'end': 939.175, 'src': 
'heatmap', 'start': 861.761, 'weight': 0.83, 'content': [{'end': 864.362, 'text': 'Alright, so now our discussion for this is complete.', 'start': 861.761, 'duration': 2.601}, {'end': 870.345, 'text': "Now let's have a look at how this model was implemented and what are the different results that we got.", 'start': 864.783, 'duration': 5.562}, {'end': 871.586, 'text': "So let's have a look at that.", 'start': 870.405, 'duration': 1.181}, {'end': 875.148, 'text': 'The dataset and the code both are present on Kaggle.', 'start': 872.006, 'duration': 3.142}, {'end': 877.649, 'text': 'The link for both of these is in the description.', 'start': 875.588, 'duration': 2.061}, {'end': 879.49, 'text': 'You can have a look at them as well.', 'start': 877.949, 'duration': 1.541}, {'end': 880.735, 'text': 'R V.', 'start': 880.615, 'duration': 0.12}, {'end': 883.057, 'text': 'Okay, so this is the implementation.', 'start': 880.735, 'duration': 2.322}, {'end': 887.04, 'text': 'first we define an input layer with a parameter input shape.', 'start': 883.057, 'duration': 3.983}, {'end': 892.985, 'text': 'now input shape is a couple that contains image size, image size, common number of channels.', 'start': 887.04, 'duration': 5.945}, {'end': 899.911, 'text': 'image size we have set over here is 128, cross, 128, and number of channels is three, the rgb channels.', 'start': 892.985, 'duration': 6.926}, {'end': 905.018, 'text': 'Then we define the VGG network, but that is VGG16.', 'start': 901.616, 'duration': 3.402}, {'end': 913.363, 'text': 'TensorFlow or Keras already provides us with the implementation of VGG16, which is pre-trained on the ImageNet dataset.', 'start': 905.238, 'duration': 8.125}, {'end': 919.586, 'text': 'Then we extract these three blocks, which are the pool three, pool four, and pool five layers.', 'start': 914.383, 'duration': 5.203}, {'end': 925.709, 'text': 'For the first step, we upsample two times that we can see over here, the pool five layer.', 'start': 
920.146, 'duration': 5.563}, {'end': 933.934, 'text': 'Then we add this upsample layer to the pool four, and we apply a one cross one convolution over here, as you can see.', 'start': 926.33, 'duration': 7.604}, {'end': 939.175, 'text': 'Now, this convolution is more like a placeholder to get the dimensions in the right size.', 'start': 934.514, 'duration': 4.661}], 'summary': 'Implemented vgg16 model with 3 upsampling steps and 1x1 convolution layer for dimension adjustment.', 'duration': 77.414, 'max_score': 861.761, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI861761.jpg'}, {'end': 1108.074, 'src': 'embed', 'start': 1079.287, 'weight': 4, 'content': [{'end': 1084.632, 'text': "Now let's apply some changes to the baseline network that we have and see how the results change.", 'start': 1079.287, 'duration': 5.345}, {'end': 1087.008, 'text': 'Now this is one change that we do.', 'start': 1085.327, 'duration': 1.681}, {'end': 1092.329, 'text': 'Instead of the add function, we are using the concatenate function over here.', 'start': 1087.468, 'duration': 4.861}, {'end': 1098.911, 'text': 'In adding, what we do is we add the corresponding values that are present in the matrix of two images.', 'start': 1092.769, 'duration': 6.142}, {'end': 1108.074, 'text': 'However, in concatenate, we just add that image to the back of the image and increase the number of channels of the input.', 'start': 1099.992, 'duration': 8.082}], 'summary': 'Testing change from add to concatenate function in baseline network.', 'duration': 28.787, 'max_score': 1079.287, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI1079287.jpg'}, {'end': 1256.613, 'src': 'embed', 'start': 1225.578, 'weight': 5, 'content': [{'end': 1230.46, 'text': 'So we can clearly see that transpose convolution is not giving that good of results.', 'start': 1225.578, 'duration': 4.882}, {'end': 1239.583, 
'text': "Although it has learned the task, but the output is kind of pixelated and doesn't really look good on the segmentation as well.", 'start': 1231.7, 'duration': 7.883}, {'end': 1240.763, 'text': 'All right.', 'start': 1240.423, 'duration': 0.34}, {'end': 1245.025, 'text': 'So these are the results of the network and some of its extensions as well.', 'start': 1240.783, 'duration': 4.242}, {'end': 1256.613, 'text': 'So before discussing what is 2D object detection problem, let us first see what is the data set that is provided to us in this project.', 'start': 1247.406, 'duration': 9.207}], 'summary': 'Transpose convolution shows pixelated output, not good for segmentation.', 'duration': 31.035, 'max_score': 1225.578, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI1225578.jpg'}], 'start': 371.076, 'title': 'Image upsampling and fcn architecture', 'summary': 'Covers upsampling techniques including bed of nails, nearest neighbor, interpolation, and transposed convolutions, along with their applications in fully convolutional networks. 
additionally, it explains the transformation of vggnet into fcn8 architecture for image segmentation, discussing the differences between fcn32, fcn16, and fcn8, and presenting implementation details and results of the model.', 'chapters': [{'end': 636.381, 'start': 371.076, 'title': 'Understanding upsampling techniques', 'summary': 'Explains the upsampling operation and various techniques including bed of nails, nearest neighbor, interpolation, and transposed convolutions, highlighting the advantages of each method and their application in fully convolutional networks.', 'duration': 265.305, 'highlights': ['Interpolation method worked really well for the project, while transposed convolutions did not work very well, indicating the practical advantage of the interpolation method over transposed convolutions.', 'Transposed convolutions involve assigning values based on learnable weight filters, allowing the network to backpropagate the loss of the weights and learn the filters, highlighting the adaptability and learning capability of transposed convolutions.', 'The interpolation method assigns values based on a weighted average, providing a more refined and accurate approach compared to nearest neighbor and bed of nails, showcasing the superior performance of interpolation technique in upsampling.']}, {'end': 1245.025, 'start': 636.961, 'title': 'Fully convolutional network: fcn architecture', 'summary': 'Explains the transformation of vggnet for image classification into fcn8 architecture for image segmentation, detailing the operations and differences between fcn32, fcn16, and fcn8 as well as the implementation and results of the model.', 'duration': 608.064, 'highlights': ['The FCN architecture transforms VGGNet for image classification into FCN8 architecture for image segmentation, involving upsampling of pool layers, addition operations, and combining knowledge from FCN32 and FCN16 to identify finer structures in the image. 
', 'FCN32, FCN16, and FCN8 differ in the operations performed, with FCN8 combining the knowledge of FCN32 and FCN16 to identify coarser and finer structures in the image, resulting in the final output for segmentation.', 'The implementation of the FCN8 model involves defining the input layer, using pre-trained VGG16, extracting pool layers, upsampling, adding layers, and applying convolution operations to produce the final model for image segmentation.', 'Alterations in the baseline network, such as using the concatenate function instead of add and replacing upsampling 2D with convolutional 2D transpose, result in variations in the segmentation results, with add function performing better than concatenate and transpose convolution producing pixelated outputs.', "The model's performance and the impact of changes in the baseline network are evaluated based on the segmentation results, showcasing the effectiveness of different architectural choices in the FCN8 model. 
The model's performance and the impact of changes in the baseline network are evaluated based on the segmentation results, showcasing the effectiveness of different architectural choices in the FCN8 model."]}], 'duration': 873.949, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI371076.jpg', 'highlights': ['Interpolation method outperformed transposed convolutions, highlighting its practical advantage.', 'Transposed convolutions allow the network to backpropagate the loss of the weights and learn the filters.', 'Interpolation method provides a more refined and accurate approach compared to nearest neighbor and bed of nails.', 'FCN8 architecture combines knowledge from FCN32 and FCN16 to identify coarser and finer structures in the image.', 'Alterations in the baseline network result in variations in the segmentation results, with add function performing better than concatenate and transpose convolution producing pixelated outputs.', "The model's performance and the impact of changes in the baseline network are evaluated based on the segmentation results."]}, {'end': 2071.054, 'segs': [{'end': 1271.524, 'src': 'embed', 'start': 1247.406, 'weight': 3, 'content': [{'end': 1256.613, 'text': 'So before discussing what is 2D object detection problem, let us first see what is the data set that is provided to us in this project.', 'start': 1247.406, 'duration': 9.207}, {'end': 1260.215, 'text': 'So Lyft is a self-driving cars company.', 'start': 1257.333, 'duration': 2.882}, {'end': 1263.742, 'text': 'that develops self-driving car software.', 'start': 1261.181, 'duration': 2.561}, {'end': 1271.524, 'text': 'In order to further the research in self-driving cars, Lyft announced two competitions that were held previously.', 'start': 1264.062, 'duration': 7.462}], 'summary': 'Lyft announced two self-driving car competitions to further research in the field.', 'duration': 24.118, 'max_score': 1247.406, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI1247406.jpg'}, {'end': 1492.349, 'src': 'embed', 'start': 1461.272, 'weight': 1, 'content': [{'end': 1470.08, 'text': 'However, in 3D object detection, apart from a 2D image, we would also be given some other data representation as well.', 'start': 1461.272, 'duration': 8.808}, {'end': 1480.349, 'text': 'Using that image and the additional data, we are required to return a 3D bounding box of that object along with its label.', 'start': 1470.761, 'duration': 9.588}, {'end': 1485.786, 'text': 'A very good example is in the 3D object detection data as well.', 'start': 1481.403, 'duration': 4.383}, {'end': 1492.349, 'text': 'As we can see, we are given 3D bounding boxes of the various cars in this image.', 'start': 1487.647, 'duration': 4.702}], 'summary': 'In 3d object detection, additional data is used to return 3d bounding boxes and labels.', 'duration': 31.077, 'max_score': 1461.272, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI1461272.jpg'}, {'end': 1600.015, 'src': 'embed', 'start': 1573.205, 'weight': 0, 'content': [{'end': 1582.632, 'text': 'we can divide the YOLO algorithm into three stages, which are the anchor boxes, the intersection over union and the bounding box predictions.', 'start': 1573.205, 'duration': 9.427}, {'end': 1589.584, 'text': 'So how does this work? 
Now YOLO is also a deep learning technique that we use for object detection.', 'start': 1583.618, 'duration': 5.966}, {'end': 1592.147, 'text': 'In this we are given an input image.', 'start': 1589.905, 'duration': 2.242}, {'end': 1600.015, 'text': 'We pass this input image through a convolutional neural network and as an output we get another matrix.', 'start': 1592.507, 'duration': 7.508}], 'summary': 'Yolo algorithm has 3 stages: anchor boxes, iou, & bounding box predictions, used for object detection.', 'duration': 26.81, 'max_score': 1573.205, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI1573205.jpg'}, {'end': 2049.739, 'src': 'embed', 'start': 2022.239, 'weight': 2, 'content': [{'end': 2025.62, 'text': "Okay, so now let's see this YOLO algorithm in action.", 'start': 2022.239, 'duration': 3.381}, {'end': 2031.022, 'text': 'YOLO is a very complex algorithm that would take quite a lot of time to develop.', 'start': 2026.14, 'duration': 4.882}, {'end': 2037.973, 'text': 'Hence, it is neither feasible nor an intelligent way that we implement YOLO from scratch.', 'start': 2032.11, 'duration': 5.863}, {'end': 2044.257, 'text': 'Hence, we are going to use this Keras YOLO v3 implementation provided by experiencor.', 'start': 2038.634, 'duration': 5.623}, {'end': 2047.079, 'text': 'Apologies if I pronounced it incorrectly.', 'start': 2044.277, 'duration': 2.802}, {'end': 2049.739, 'text': 'This is an open source implementation.', 'start': 2047.639, 'duration': 2.1}], 'summary': 'Implementing yolo algorithm using keras yolo v3 for efficiency and time-saving.', 'duration': 27.5, 'max_score': 2022.239, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2022239.jpg'}], 'start': 1247.406, 'title': 'Lyft 3d object detection challenge', 'summary': "Introduces the dataset for lyft's 3d object detection challenge, encompassing camera and lidar sensor data from a 
self-driving car trip, with lidar data visualized through images and videos. it also discusses 2d and 3d object detection, emphasizing yolo algorithm's stages and techniques for object detection in self-driving cars.", 'chapters': [{'end': 1411.068, 'start': 1247.406, 'title': 'Lyft 3d object detection challenge', 'summary': 'Introduces the dataset provided by lyft for their 3d object detection challenge, which includes camera and lidar sensor data captured during a self-driving car trip, with the lidar data being visualized through images and videos.', 'duration': 163.662, 'highlights': ['Lyft provided a dataset for their 3D object detection challenge, containing camera and LiDAR sensor data from a self-driving car trip.', 'The LiDAR sensor works on the principle of reflection of laser lights to detect the distance of objects, with the dataset including camera and LiDAR data for the same trip.', 'The dataset includes sample pictures of the trip, showing the localization of the car and the presence of other cars and pedestrians, along with the ability to visualize the data in both image and video formats. 
']}, {'end': 2071.054, 'start': 1411.669, 'title': '2d and 3d object detection', 'summary': "Discusses the concepts of 2d and 3d object detection, highlighting the differences and the yolo algorithm's stages and techniques for object detection, emphasizing its speed and accuracy in self-driving cars.", 'duration': 659.385, 'highlights': ['YOLO Algorithm The YOLO algorithm for 2D object detection is explained, detailing its three stages - anchor boxes, intersection over union, and bounding box predictions, showcasing its ability to predict multiple objects in a single cell and its fast and accurate performance in self-driving cars.', 'Difference between 2D and 3D Object Detection The distinction between 2D and 3D object detection is highlighted, emphasizing the additional data representation and the ability of 3D object detection to predict 3D bounding boxes with labels, exemplifying its application in detecting occluded objects using LiDAR data.', 'Implementation of YOLO Algorithm The decision to use the Keras YOLO v3 implementation is mentioned, acknowledging the complexity of YOLO and its open-source nature, leading to a direct presentation of the results obtained from the dataset.']}], 'duration': 823.648, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI1247406.jpg', 'highlights': ['The YOLO algorithm for 2D object detection is explained, detailing its three stages - anchor boxes, intersection over union, and bounding box predictions, showcasing its ability to predict multiple objects in a single cell and its fast and accurate performance in self-driving cars.', 'The distinction between 2D and 3D object detection is highlighted, emphasizing the additional data representation and the ability of 3D object detection to predict 3D bounding boxes with labels, exemplifying its application in detecting occluded objects using LiDAR data.', 'The decision to use the Keras YOLO v3 implementation is mentioned, 
acknowledging the complexity of YOLO and its open-source nature, leading to a direct presentation of the results obtained from the dataset.', 'Lyft provided a dataset for their 3D object detection challenge, containing camera and LiDAR sensor data from a self-driving car trip.']}, {'end': 2574.215, 'segs': [{'end': 2164.649, 'src': 'embed', 'start': 2101.922, 'weight': 0, 'content': [{'end': 2120.131, 'text': 'we have the detection of a bus over here as well and we also have a truck that is much behind, Alright.', 'start': 2101.922, 'duration': 18.209}, {'end': 2127.259, 'text': 'so these were the results that we got on running the YOLOv3 algorithm on the Lyft 3D object detection dataset.', 'start': 2120.131, 'duration': 7.128}, {'end': 2137.092, 'text': 'Now 2D object detection is pretty much a solved problem and we already have very good algorithms like YOLO that do our tasks.', 'start': 2128.421, 'duration': 8.671}, {'end': 2143.281, 'text': 'However, there is some research that is still required on 3D object detection,', 'start': 2137.779, 'duration': 5.502}, {'end': 2148.863, 'text': 'which was also the motivation behind why Lyft conducted this competition as well.', 'start': 2143.281, 'duration': 5.582}, {'end': 2151.404, 'text': 'All right, so this is it for this video.', 'start': 2149.483, 'duration': 1.921}, {'end': 2162.388, 'text': 'For this particular project, we are going to use several videos that were recorded by a camera that was placed in front of an autonomous car.', 'start': 2153.505, 'duration': 8.883}, {'end': 2164.649, 'text': "So let's have a look at one of the videos.", 'start': 2162.948, 'duration': 1.701}], 'summary': 'Yolov3 algorithm detected bus and truck on lyft 3d object detection dataset.', 'duration': 62.727, 'max_score': 2101.922, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2101922.jpg'}, {'end': 2299.11, 'src': 'embed', 'start': 2273.935, 'weight': 4, 'content': [{'end': 
2278.94, 'text': 'In object tracking we are not only detecting where are the objects present in this image,', 'start': 2273.935, 'duration': 5.005}, {'end': 2284.92, 'text': "but we are also tracking where the objects ID'd 1 and 2 are going.", 'start': 2279.813, 'duration': 5.107}, {'end': 2288.384, 'text': 'So this is the whole problem of object tracking.', 'start': 2285.861, 'duration': 2.523}, {'end': 2291.528, 'text': 'In our case, it would be traffic tracking.', 'start': 2288.624, 'duration': 2.904}, {'end': 2299.11, 'text': 'We need to keep a track of the different cars that are moving around our autonomous vehicle.', 'start': 2291.848, 'duration': 7.262}], 'summary': 'Object tracking involves detecting and tracking the movement of objects, such as cars, around an autonomous vehicle.', 'duration': 25.175, 'max_score': 2273.935, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2273935.jpg'}, {'end': 2455.655, 'src': 'embed', 'start': 2428.277, 'weight': 3, 'content': [{'end': 2436.182, 'text': 'So how does Kalman filter works? 
Kalman filter assumes a linear velocity model and generates the predictions according to it.', 'start': 2428.277, 'duration': 7.905}, {'end': 2441.546, 'text': 'Once we get the real data where the object is, we input that again to the Kalman filter.', 'start': 2436.542, 'duration': 5.004}, {'end': 2449.711, 'text': 'The Kalman filter improves its predictions based on the real data that it got and again generates a new set of predictions.', 'start': 2442.306, 'duration': 7.405}, {'end': 2455.655, 'text': 'In this way, Kalman filter works in an iterative manner and keeps improving its predictions.', 'start': 2450.691, 'duration': 4.964}], 'summary': 'Kalman filter iteratively improves predictions based on real data.', 'duration': 27.378, 'max_score': 2428.277, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2428277.jpg'}, {'end': 2582.521, 'src': 'embed', 'start': 2552.16, 'weight': 2, 'content': [{'end': 2558.424, 'text': 'Solving this problem using brute force approach takes order complexity n factorial which is very large.', 'start': 2552.16, 'duration': 6.264}, {'end': 2562.808, 'text': 'In order to solve this, the SORT algorithm uses the Hungarian algorithm.', 'start': 2558.645, 'duration': 4.163}, {'end': 2570.933, 'text': 'The Hungarian algorithm solves this linear assignment problem in order complexity n cubed,', 'start': 2562.988, 'duration': 7.945}, {'end': 2574.215, 'text': 'which is a great improvement over the n factorial algorithm.', 'start': 2570.933, 'duration': 3.282}, {'end': 2582.521, 'text': 'Once we have assigned these n predictions to n IDs, we have in a sense solved the tracking problem for the ith frame.', 'start': 2574.876, 'duration': 7.645}], 'summary': 'Sort algorithm uses hungarian algorithm to solve tracking problem in o(n^3) time complexity, a great improvement over o(n!)', 'duration': 30.361, 'max_score': 2552.16, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2552160.jpg'}], 'start': 2071.395, 'title': '3d object detection and object tracking algorithms', 'summary': "Covers the results of yolov3 algorithm on lyft 3d object detection dataset, emphasizing the need for further research on 3d object detection. it also discusses the deep sort algorithm for object tracking, including the role of kalman filter in predicting object locations and the sort algorithm's use of the hungarian algorithm for solving the linear assignment problem.", 'chapters': [{'end': 2299.11, 'start': 2071.395, 'title': 'Yolov3 algorithm for 3d object detection', 'summary': 'Discusses the results of running the yolov3 algorithm on the lyft 3d object detection dataset, highlighting the need for further research on 3d object detection and the challenges in object tracking.', 'duration': 227.715, 'highlights': ['Results of using YOLOv3 algorithm on Lyft 3D object detection dataset The chapter discusses the results of running the YOLOv3 algorithm on the Lyft 3D object detection dataset, showcasing the detection of cars, buses, and trucks, while emphasizing the misclassification of a bus as a truck.', 'Importance of further research on 3D object detection The discussion emphasizes the need for further research on 3D object detection, highlighting it as a motivation behind the competition conducted by Lyft.', 'Challenges in object tracking The chapter outlines the challenges in object tracking, specifically the need to continuously detect and assign IDs to moving objects, using the example of tracking cars in a video dataset.']}, {'end': 2574.215, 'start': 2299.59, 'title': 'Deep sort algorithm for object tracking', 'summary': 'Discusses the deep sort algorithm for object tracking, covering the simple sort algorithm, yolov3 for bounding box predictions, the role of kalman filter in predicting object locations and the iou matching technique, with the sort algorithm 
utilizing the hungarian algorithm for solving the linear assignment problem.', 'duration': 274.625, 'highlights': ['The SORT algorithm uses the Hungarian algorithm to solve the linear assignment problem, reducing the order complexity from n factorial to n cubed. The Hungarian algorithm is utilized by the SORT algorithm to solve the linear assignment problem in order complexity n cubed, a significant improvement over the n factorial algorithm.', 'YOLOv3 algorithm is used for generating bounding box predictions, detecting objects in an image and determining their identity. YOLOv3 algorithm is employed to detect all objects in an image, drawing bounding boxes around them and identifying the objects, providing essential data for object tracking.', 'The Kalman filter is utilized to predict the future location of detected objects, iteratively improving its predictions by incorporating real data. Kalman filter predicts the future location of detected objects, addresses occlusion problems, and outputs a probability distribution of possible object locations based on iterative improvements using real data.']}], 'duration': 502.82, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2071395.jpg', 'highlights': ['Results of using YOLOv3 algorithm on Lyft 3D object detection dataset showcasing detection of cars, buses, and trucks.', 'Importance of further research on 3D object detection highlighted as a motivation behind the competition conducted by Lyft.', 'The SORT algorithm uses the Hungarian algorithm to solve the linear assignment problem, reducing the order complexity from n factorial to n cubed.', 'The Kalman filter is utilized to predict the future location of detected objects, iteratively improving its predictions by incorporating real data.', 'Challenges in object tracking outlined, specifically the need to continuously detect and assign IDs to moving objects.']}, {'end': 3146.001, 'segs': [{'end': 2628.876, 'src': 
'embed', 'start': 2598.11, 'weight': 0, 'content': [{'end': 2603.211, 'text': 'Now the deep in the deep sort comes from this step called the deep appearance descriptor step.', 'start': 2598.11, 'duration': 5.101}, {'end': 2609.692, 'text': 'Along with that we also have a cascade matching step that is added to the simple sort algorithm.', 'start': 2604.891, 'duration': 4.801}, {'end': 2618.314, 'text': 'So deep appearance descriptor is a convolutional neural network which is trained to detect a similar object in different images.', 'start': 2610.092, 'duration': 8.222}, {'end': 2628.876, 'text': 'What this means? This network can tell given a number of images of the same person in different views whether it is the same person or not.', 'start': 2618.714, 'duration': 10.162}], 'summary': 'Deep appearance descriptor and cascade matching enhance object detection in images.', 'duration': 30.766, 'max_score': 2598.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2598110.jpg'}, {'end': 2747.078, 'src': 'embed', 'start': 2713.558, 'weight': 2, 'content': [{'end': 2721.421, 'text': 'We can say one is going to have a negative value, the other one is going to have a positive value if they are lying on the two sides of the y-axis.', 'start': 2713.558, 'duration': 7.863}, {'end': 2726.278, 'text': 'In these two cases, we need to use another notion of distance metric.', 'start': 2722.014, 'duration': 4.264}, {'end': 2729.901, 'text': 'For the deep descriptors, we use the cosine distance metric.', 'start': 2726.798, 'duration': 3.103}, {'end': 2747.078, 'text': 'How we compute this? 
If we are given two entities A and B and from the And then we calculate the angle between these two vectors.', 'start': 2730.121, 'duration': 16.957}], 'summary': 'Using cosine distance metric for deep descriptors to calculate angle between two vectors.', 'duration': 33.52, 'max_score': 2713.558, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2713558.jpg'}, {'end': 2817.308, 'src': 'embed', 'start': 2784.887, 'weight': 3, 'content': [{'end': 2788.311, 'text': 'For the Kalman filter case, we are not using the cosine similarity.', 'start': 2784.887, 'duration': 3.424}, {'end': 2795.217, 'text': 'The reason for this is the Kalman filter is going to output not a single point but a probability distribution.', 'start': 2788.331, 'duration': 6.886}, {'end': 2802.465, 'text': 'In order to compare the similarity between a point and a probability distribution, we use Mahalanobis distance.', 'start': 2795.618, 'duration': 6.847}, {'end': 2809.198, 'text': 'If we consider that this is the distribution that is output by the Kalman filter,', 'start': 2804.352, 'duration': 4.846}, {'end': 2817.308, 'text': 'this implies that the probability of the object being over here is very high compared to the values outwards.', 'start': 2809.198, 'duration': 8.11}], 'summary': 'Kalman filter uses mahalanobis distance to compare a point with a probability distribution.', 'duration': 32.421, 'max_score': 2784.887, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2784887.jpg'}, {'end': 2933.694, 'src': 'embed', 'start': 2902.237, 'weight': 4, 'content': [{'end': 2907.959, 'text': 'For the assignment problem, the cascade matching takes into account the temporal dimension as well.', 'start': 2902.237, 'duration': 5.722}, {'end': 2912.442, 'text': 'In order to reduce the time complexity for the Hungarian algorithm,', 'start': 2908.559, 'duration': 3.883}, {'end': 2921.728, 'text': 'cascade matching 
tries and match the latest detections with the latest IDs and the later or old detections with the old IDs.', 'start': 2912.442, 'duration': 9.286}, {'end': 2926.431, 'text': 'All of this process combined gives us the deep sort algorithm.', 'start': 2922.408, 'duration': 4.023}, {'end': 2933.694, 'text': 'And we can again run this algorithm in a loop to track objects in a complete video.', 'start': 2927.331, 'duration': 6.363}], 'summary': 'Cascade matching in deep sort algorithm reduces time complexity for hungarian algorithm and enables object tracking in videos.', 'duration': 31.457, 'max_score': 2902.237, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2902237.jpg'}, {'end': 3078.152, 'src': 'embed', 'start': 3031.106, 'weight': 5, 'content': [{'end': 3039.21, 'text': 'So this is the first video, as we can see this one and two, these are the IDs that are assigned to these two cars respectively.', 'start': 3031.106, 'duration': 8.104}, {'end': 3050.576, 'text': 'This is the second video.', 'start': 3049.635, 'duration': 0.941}, {'end': 3066.269, 'text': 'As we can see the car over here was assigned the ID 9 and even in some frames the YOLO algorithm was not able to detect that car.', 'start': 3058.587, 'duration': 7.682}, {'end': 3078.152, 'text': 'Even after those missed detections the algorithm was able to determine that this was the car labeled ID 9.', 'start': 3069.81, 'duration': 8.342}], 'summary': 'Video analysis detected cars with ids 1, 2, and 9, despite some missed detections.', 'duration': 47.046, 'max_score': 3031.106, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI3031106.jpg'}, {'end': 3153.304, 'src': 'embed', 'start': 3124.613, 'weight': 6, 'content': [{'end': 3131.215, 'text': 'Even after checking different hyperparameter values, I was not able to get rid of this ID assignment problem.', 'start': 3124.613, 'duration': 6.602}, 
{'end': 3139.178, 'text': 'This I believe would be due to assigning a negligible amount of weight to the Kalman filter predictions.', 'start': 3131.835, 'duration': 7.343}, {'end': 3146.001, 'text': 'Possibly, if we increase the weight for the Kalman filter predictions, this problem would be solved.', 'start': 3140.018, 'duration': 5.983}, {'end': 3153.304, 'text': 'But that is an extension to the project that we developed in this video and will be looked as a future work to this project.', 'start': 3146.821, 'duration': 6.483}], 'summary': 'Adjusting hyperparameters did not resolve id assignment problem; increasing weight for kalman filter may solve it - future work.', 'duration': 28.691, 'max_score': 3124.613, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI3124613.jpg'}], 'start': 2574.876, 'title': 'Object tracking with deep sort algorithm', 'summary': 'Covers the deep sort algorithm, detailing its use of deep appearance descriptors, cosine and mahalanobis distance, kalman filter predictions, cascade matching, and object tracking, aiming for accurate and efficient similarity determination, and successful video analysis with id assignment to objects.', 'chapters': [{'end': 2858.19, 'start': 2574.876, 'title': 'Tracking objects with deep sort algorithm', 'summary': 'Explains the deep sort algorithm for object tracking, including the deep appearance descriptor step and the use of cosine distance and mahalanobis distance as similarity metrics, aiming to determine similarity between objects with a high level of accuracy and efficiency.', 'duration': 283.314, 'highlights': ['The deep SORT algorithm extends the simple sort algorithm and incorporates the deep appearance descriptor step and cascade matching step. 
The deep SORT algorithm builds upon the simple sort algorithm, adding the deep appearance descriptor step and cascade matching step to enhance object tracking.', 'The deep appearance descriptor is a convolutional neural network trained to detect similar objects in different images, enabling the comparison of different objects in a video. The deep appearance descriptor utilizes a convolutional neural network to identify similar objects in various frames of a video, allowing for efficient comparison of different objects.', 'The deep descriptor utilizes the cosine distance metric to determine the similarity between objects, providing a reliable measure of similarity. The deep descriptor employs the cosine distance metric to assess the similarity between objects, offering a dependable measure for determining object similarity.', 'The Kalman filter outputs a probability distribution, and the Mahalanobis distance is used to compare the similarity between a point and this distribution. The Kalman filter generates a probability distribution, and the Mahalanobis distance is utilized to evaluate the similarity between a point and this distribution, enabling effective comparison in object tracking.']}, {'end': 3146.001, 'start': 2859.271, 'title': 'Deepsort algorithm overview', 'summary': 'Explains the deepsort algorithm, covering the use of kalman filter predictions, cascade matching, and object tracking, and evaluates its performance in video analysis, including successful and challenging instances. the algorithm assigns ids to objects in videos, with examples of successful tracking and challenges due to occlusion.', 'duration': 286.73, 'highlights': ['The DeepSort algorithm combines Kalman filter predictions and cascade matching to assign IDs to objects in videos, with an evaluation of its performance in different scenarios. 
', 'The algorithm successfully assigns IDs to cars in videos, demonstrating effective tracking in multiple instances.', 'Challenges arise in the algorithm due to occlusion, leading to incorrect ID assignments for objects, possibly caused by assigning a negligible weight to the Kalman filter predictions.']}], 'duration': 571.125, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI2574876.jpg', 'highlights': ['The deep SORT algorithm extends the simple sort algorithm and incorporates the deep appearance descriptor step and cascade matching step.', 'The deep appearance descriptor is a convolutional neural network trained to detect similar objects in different images, enabling the comparison of different objects in a video.', 'The deep descriptor utilizes the cosine distance metric to determine the similarity between objects, providing a reliable measure of similarity.', 'The Kalman filter outputs a probability distribution, and the Mahalanobis distance is used to compare the similarity between a point and this distribution.', 'The DeepSort algorithm combines Kalman filter predictions and cascade matching to assign IDs to objects in videos, with an evaluation of its performance in different scenarios.', 'The algorithm successfully assigns IDs to cars in videos, demonstrating effective tracking in multiple instances.', 'Challenges arise in the algorithm due to occlusion, leading to incorrect ID assignments for objects, possibly caused by assigning a 
negligible weight to the Kalman filter predictions.']}, {'end': 4740.577, 'segs': [{'end': 3184.92, 'src': 'embed', 'start': 3146.821, 'weight': 1, 'content': [{'end': 3153.304, 'text': 'But that is an extension to the project that we developed in this video and will be looked as a future work to this project.', 'start': 3146.821, 'duration': 6.483}, {'end': 3156.645, 'text': 'Alright, so this is it for this video.', 'start': 3154.064, 'duration': 2.581}, {'end': 3164.329, 'text': 'Okay, so these are the images that are part of the KITTI 3D object detection dataset.', 'start': 3159.286, 'duration': 5.043}, {'end': 3174.798, 'text': 'As we can see, And these are the LiDAR data provided in the dataset as well.', 'start': 3164.589, 'duration': 10.209}, {'end': 3184.92, 'text': 'For instance, we can see a wall in this image over here.', 'start': 3181.379, 'duration': 3.541}], 'summary': 'Extension to project developed in video, part of kitti 3d object detection dataset.', 'duration': 38.099, 'max_score': 3146.821, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI3146821.jpg'}, {'end': 3308.25, 'src': 'embed', 'start': 3277.035, 'weight': 2, 'content': [{'end': 3280.797, 'text': 'we need to understand the concept of homogeneous transformations.', 'start': 3277.035, 'duration': 3.762}, {'end': 3289.522, 'text': 'In a general self-driving car setting, we usually have two main sensors, which are LIDAR and the camera.', 'start': 3281.518, 'duration': 8.004}, {'end': 3297.976, 'text': 'Typically, these lidar and camera sensors are not going to be placed at the same location.', 'start': 3292.128, 'duration': 5.848}, {'end': 3301.38, 'text': 'They are going to have different locations on a car.', 'start': 3298.356, 'duration': 3.024}, {'end': 3308.25, 'text': 'For instance, in this case, the lidar is on top of the car and the camera is on the front head of the car.', 'start': 3301.741, 'duration': 6.509}], 'summary': 
'Homogeneous transformations in self-driving cars involve lidar and camera sensors at different locations on the car.', 'duration': 31.215, 'max_score': 3277.035, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI3277035.jpg'}, {'end': 3781.582, 'src': 'embed', 'start': 3754.258, 'weight': 6, 'content': [{'end': 3761.725, 'text': 'we can see how that LiDAR point cloud is going to look like when we put it on an image that is generated by a camera.', 'start': 3754.258, 'duration': 7.467}, {'end': 3769.075, 'text': 'Now, this complete concept of homogeneous transformations is very fundamental to robotics as well as computer vision.', 'start': 3762.952, 'duration': 6.123}, {'end': 3775.319, 'text': 'It is used in self-driving cars, mobile robots and even robot manipulators.', 'start': 3769.876, 'duration': 5.443}, {'end': 3781.582, 'text': 'In computer vision, apart from this application, it is also used in 3D scene reconstruction.', 'start': 3776.239, 'duration': 5.343}], 'summary': 'Homogeneous transformations crucial in robotics and computer vision for applications like self-driving cars, mobile robots, and 3d scene reconstruction.', 'duration': 27.324, 'max_score': 3754.258, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI3754258.jpg'}, {'end': 3834.49, 'src': 'embed', 'start': 3804.434, 'weight': 3, 'content': [{'end': 3807.676, 'text': 'Intrinsic parameters depend on the internals of a camera.', 'start': 3804.434, 'duration': 3.242}, {'end': 3811.079, 'text': 'For instance, the focal length of the camera lens.', 'start': 3807.976, 'duration': 3.103}, {'end': 3819.874, 'text': 'There are a number of different ways we can use to determine what are the extrinsic and the intrinsic parameters of a given camera.', 'start': 3812.486, 'duration': 7.388}, {'end': 3824.879, 'text': 'One very popular way to determine these is the checkerboard method.', 'start': 
3820.755, 'duration': 4.124}, {'end': 3834.49, 'text': 'In this method, we take a printed photo of a chess or a checkerboard and we click its images in different positions and angles.', 'start': 3825.86, 'duration': 8.63}], 'summary': 'Intrinsic camera parameters, such as focal length, can be determined using methods like the checkerboard technique.', 'duration': 30.056, 'max_score': 3804.434, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI3804434.jpg'}, {'end': 3904.396, 'src': 'embed', 'start': 3873.795, 'weight': 4, 'content': [{'end': 3878.639, 'text': 'This array contains all the points of the point cloud that were captured by the lidar.', 'start': 3873.795, 'duration': 4.844}, {'end': 3884.442, 'text': 'We pass this points array to a function called velo2cam.', 'start': 3879.519, 'duration': 4.923}, {'end': 3889.546, 'text': 'Along with that, we also pass the extrinsic camera matrix.', 'start': 3885.463, 'duration': 4.083}, {'end': 3894.61, 'text': 'Velo2Cam over here stands for Velodyne to Camera.', 'start': 3890.768, 'duration': 3.842}, {'end': 3898.352, 'text': 'Velodyne is the LiDAR sensor that was used over here.', 'start': 3895.151, 'duration': 3.201}, {'end': 3904.396, 'text': 'All in all, what this Velo2Cam function does.', 'start': 3900.334, 'duration': 4.062}], 'summary': 'The function velo2cam processes lidar points with extrinsic camera matrix for velodyne to camera transformation.', 'duration': 30.601, 'max_score': 3873.795, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI3873795.jpg'}, {'end': 3970.437, 'src': 'embed', 'start': 3949.925, 'weight': 8, 'content': [{'end': 3959.491, 'text': 'We simply take the points of the point cloud and scale them in such a way that they occupy the complete frame of a given image.', 'start': 3949.925, 'duration': 9.566}, {'end': 3966.635, 'text': 'And then finally fill the different colors that are 
present over here.', 'start': 3963.093, 'duration': 3.542}, {'end': 3970.437, 'text': "And this is how we get the bird's eye view of the lidar data.", 'start': 3967.435, 'duration': 3.002}], 'summary': "Scaling points to fill image frame gives bird's eye view of lidar data", 'duration': 20.512, 'max_score': 3949.925, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI3949925.jpg'}, {'end': 4071.296, 'src': 'embed', 'start': 4030.65, 'weight': 7, 'content': [{'end': 4034.494, 'text': 'The Cityscapes dataset consists of different street view images.', 'start': 4030.65, 'duration': 3.844}, {'end': 4038.976, 'text': 'That is, images taken from a car that is being driven on the road.', 'start': 4035.293, 'duration': 3.683}, {'end': 4045.4, 'text': 'This dataset is particularly used for depth estimation and semantic segmentation.', 'start': 4039.656, 'duration': 5.744}, {'end': 4050.423, 'text': "Let's have a look at the images present in the dataset and what these two tasks are.", 'start': 4045.88, 'duration': 4.543}, {'end': 4054.086, 'text': 'So these are the different images that are present in the dataset.', 'start': 4051.324, 'duration': 2.762}, {'end': 4071.296, 'text': 'Okay, now let us discuss what the depth estimation task is.', 'start': 4068.214, 'duration': 3.082}], 'summary': 'Cityscapes dataset contains street view images for depth estimation and semantic segmentation.', 'duration': 40.646, 'max_score': 4030.65, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI4030650.jpg'}, {'end': 4293.155, 'src': 'embed', 'start': 4268.485, 'weight': 0, 'content': [{'end': 4277.433, 'text': "The advantage it has over distributed networks is that we don't have to train each and every network individually for the different tasks,", 'start': 4268.485, 'duration': 8.948}, {'end': 4279.115, 'text': 'resulting in saving some time.', 'start': 4277.433, 'duration': 
1.682}, {'end': 4284.66, 'text': 'So the technique that we are going to discuss for this multitask learning is MTAN.', 'start': 4279.575, 'duration': 5.085}, {'end': 4289.084, 'text': 'MTAN stands for Multitask Attention Network.', 'start': 4285.08, 'duration': 4.004}, {'end': 4293.155, 'text': 'And this is the architecture that MTAN follows.', 'start': 4290.054, 'duration': 3.101}], 'summary': 'Mtan saves time by training tasks collectively.', 'duration': 24.67, 'max_score': 4268.485, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI4268485.jpg'}, {'end': 4746.165, 'src': 'embed', 'start': 4720.225, 'weight': 9, 'content': [{'end': 4728.026, 'text': 'Hence, for this particular experiment, the epochs are taken to be 100 and the learning rate is taken to be 10 raised to power minus 3.', 'start': 4720.225, 'duration': 7.801}, {'end': 4730.187, 'text': 'The results of the training are decent and good.', 'start': 4728.026, 'duration': 2.161}, {'end': 4737.848, 'text': 'However, it would have been better if we use the parameters that were provided by the authors.', 'start': 4731.367, 'duration': 6.481}, {'end': 4740.577, 'text': 'Let us have a look at some of the results.', 'start': 4738.854, 'duration': 1.723}, {'end': 4746.165, 'text': 'The left one is the original segmentation map and the right one is the predicted segmentation map.', 'start': 4740.997, 'duration': 5.168}], 'summary': "Epochs: 100, lr: 10^-3. decent training results. 
author's parameters preferred.", 'duration': 25.94, 'max_score': 4720.225, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI4720225.jpg'}], 'start': 3146.821, 'title': 'Visualizing lidar data and multitask learning in computer vision', 'summary': "Explores visualizing lidar data in the KITTI 3D object detection dataset and discusses the MTAN technique for multitask learning in computer vision, emphasizing the datasets' content, tasks involved, and architecture of MTAN.", 'chapters': [{'end': 3357.171, 'start': 3146.821, 'title': 'Visualizing lidar data in 3d object detection', 'summary': 'Explores the visualizations of lidar data in the KITTI 3D object detection dataset, demonstrating the correspondence between lidar and camera images and explaining the concept of homogeneous transformations in the context of self-driving car sensors.', 'duration': 210.35, 'highlights': ['The visualizations of LiDAR data in the KITTI 3D object detection dataset, including the correspondence between LiDAR and camera images, are presented, offering insights into the perception of a 3D environment. (Relevance: 5)', 'The concept of homogeneous transformations in the context of self-driving car sensors, involving the positioning and functions of LiDAR and camera sensors, is explained. (Relevance: 4)', 'The LiDAR data in the dataset is demonstrated through visualizations, showcasing the projections of LiDAR points onto the road and the correspondence of LiDAR data with objects in camera images. 
(Relevance: 3)']}, {'end': 3775.319, 'start': 3357.171, 'title': 'Homogeneous transformations in robotics', 'summary': 'Explains the concept of homogeneous transformations, consisting of translation and rotation, represented by matrices, to convert 3d data into a 2d representation, crucial in robotics and computer vision applications.', 'duration': 418.148, 'highlights': ['Homogeneous transformation combines translation and rotation into a single matrix multiplication, simplifying the process. Homogeneous transformation reduces translation and rotation into a single matrix multiplication, simplifying the process, crucial in robotics and computer vision applications.', 'Explanation of how a camera works in converting 3D scenes into 2D images using homogeneous transformations and projection operations. The chapter explains how a camera converts 3D scenes into 2D images using homogeneous transformations and projection operations, essential in robotics and computer vision.', 'The application of homogeneous transformations in robotics, computer vision, self-driving cars, mobile robots, and robot manipulators. Homogeneous transformations are fundamental in robotics, computer vision, self-driving cars, mobile robots, and robot manipulators, showcasing their wide application in the field.']}, {'end': 3998.58, 'start': 3776.239, 'title': '3d visualization and parameters in computer vision', 'summary': 'Discusses the extrinsic and intrinsic parameters in computer vision, the checkerboard method to determine these parameters, and the process of generating 3d visualizations from lidar data using extrinsic and intrinsic camera parameters.', 'duration': 222.341, 'highlights': ['The checkerboard method is a popular way to determine extrinsic and intrinsic camera parameters. 
The checkerboard method is commonly used to determine extrinsic and intrinsic parameters of a camera.', 'The process of generating 3D visualizations from lidar data involves using extrinsic and intrinsic camera parameters. The process involves converting the lidar points array using the velo2cam function with the extrinsic camera matrix, followed by matrix multiplication with the intrinsic camera matrix to generate the visualization.', "Generating the bird's eye view of lidar data is achieved by scaling the points to occupy the complete frame of a given image and filling in different colors. The bird's eye view is created by scaling the point cloud to occupy the entire frame of an image and applying color filling."]}, {'end': 4338.269, 'start': 4002.99, 'title': 'Multitask learning in computer vision', 'summary': "Discusses the Cityscapes dataset for depth estimation and semantic segmentation, the problem of multitask learning, and the MTAN technique for multitask learning in computer vision, emphasizing the dataset's content, the tasks involved, and the architecture of MTAN.", 'duration': 335.279, 'highlights': ['The Cityscapes dataset contains street view images used for depth estimation and semantic segmentation, and the tasks involve calculating depth maps and segmentation labels respectively.', 'Multitask learning aims to derive a single network for different computer vision tasks like object detection, classification, and segmentation, with the MTAN technique using shared features and task-specific attention modules to address multiple tasks efficiently.', 'The MTAN technique involves calculating shared features for input images and utilizing task-specific attention modules to cater to different tasks, providing an efficient approach for multitask learning in computer vision.']}, {'end': 4740.577, 'start': 4338.87, 'title': 'MTAN network and training results', 'summary': 'Discusses the architecture of the SegNet network for image segmentation, with a focus on the 
attention sub-modules and dynamic weight averaging of the mtan network, along with insights into dataset loading, training, and results using the cityscapes dataset.', 'duration': 401.707, 'highlights': ['The chapter covers the architecture of the Segnet network for image segmentation, emphasizing the attention sub-modules and dynamic weight averaging of the MTAN network. Segnet network, attention sub-modules, MTAN network, dynamic weight averaging, image segmentation', 'Insights into dataset loading, training functions, and network architecture are provided, along with the specifics of training the network on the Cityscapes dataset. dataset loading, training functions, network architecture, Cityscapes dataset', 'Due to time constraints, the training of the network was limited to 100 epochs with a learning rate of 10^-3, resulting in decent but suboptimal results compared to the original parameters provided by the authors. training epochs, learning rate, results, time constraints']}], 'duration': 1593.756, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI3146821.jpg', 'highlights': ['The MTAN technique involves calculating shared features for input images and utilizing task-specific attention modules to cater to different tasks, providing an efficient approach for multitask learning in computer vision.', 'The visualizations of LiDAR data in the KITI 3D object detection dataset, including the correspondence between LiDAR and camera images, are presented, offering insights into the perception of a 3D environment.', 'The concept of homogeneous transformations in the context of self-driving car sensors, involving the positioning and functions of LiDAR and camera sensors, is explained.', 'The checkerboard method is commonly used to determine extrinsic and intrinsic parameters of a camera.', 'The process involves converting the LiDAR points array using the velo2cam function with extrinsic camera matrix, followed by 
matrix multiplication with the intrinsic camera matrix to generate the visualization.', 'Multitask learning aims to derive a single network for different computer vision tasks like object detection, classification, and segmentation, with the MTAN technique using shared features and task-specific attention modules to address multiple tasks efficiently.', 'The application of homogeneous transformations in robotics, computer vision, self-driving cars, mobile robots, and robot manipulators.', 'The Cityscapes dataset contains street view images used for depth estimation and semantic segmentation, and the tasks involve calculating depth maps and segmentation labels respectively.', "The bird's eye view is created by scaling the point cloud to occupy the entire frame of an image and applying color filling.", 'Due to time constraints, the training of the network was limited to 100 epochs with a learning rate of 10^-3, resulting in decent but suboptimal results compared to the original parameters provided by the authors.']}, {'end': 6453.342, 'segs': [{'end': 4889.763, 'src': 'embed', 'start': 4859.269, 'weight': 1, 'content': [{'end': 4865.395, 'text': 'All right, so the data set that we are going to use for this video is the KITTI 3D object detection data set.', 'start': 4859.269, 'duration': 6.126}, {'end': 4868.458, 'text': 'We will quickly go over this data set in this video as well.', 'start': 4865.595, 'duration': 2.863}, {'end': 4876.125, 'text': 'This data set has images that are taken from the front camera of a self-driving car as we can see them over here.', 'start': 4869.359, 'duration': 6.766}, {'end': 4889.763, 'text': 'We are also provided with LiDAR data that was captured by the LiDAR present on the self-driving car as we can see them over here.', 'start': 4881.994, 'duration': 7.769}], 'summary': 'Using KITTI 3D object detection dataset with front camera and lidar data.', 'duration': 30.494, 'max_score': 4859.269, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI4859269.jpg'}, {'end': 5007.931, 'src': 'embed', 'start': 4980.285, 'weight': 0, 'content': [{'end': 4988.352, 'text': 'So, SFA3D as a complete technique does 3D object detection using only the LiDAR data.', 'start': 4980.285, 'duration': 8.067}, {'end': 4995.678, 'text': "As an input, we provide SFA3D as a bird's eye view input image to the network.", 'start': 4989.213, 'duration': 6.465}, {'end': 5003.649, 'text': 'The network predicts the different objects that are present in that image and also classifies them as well.', 'start': 4996.687, 'duration': 6.962}, {'end': 5007.931, 'text': 'The data collected using a lidar is a 3D data.', 'start': 5004.65, 'duration': 3.281}], 'summary': "Sfa3d detects 3d objects using lidar, providing bird's eye view input image to network for prediction and classification.", 'duration': 27.646, 'max_score': 4980.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI4980285.jpg'}, {'end': 5061.806, 'src': 'embed', 'start': 5033.548, 'weight': 2, 'content': [{'end': 5044.652, 'text': 'the distance of the center of other objects from our ego car or the self-driving car, the angles in which the other objects are facing,', 'start': 5033.548, 'duration': 11.104}, {'end': 5053.455, 'text': 'the dimensions of the objects that we have detected, the length, breadth and the height, and the z coordinate of the center of that object,', 'start': 5044.652, 'duration': 8.803}, {'end': 5054.495, 'text': 'as we can see over here.', 'start': 5053.455, 'duration': 1.04}, {'end': 5061.806, 'text': 'So now we have discussed the input as well as the list of outputs that SFA3D provides us.', 'start': 5056.021, 'duration': 5.785}], 'summary': 'Sfa3d provides input on object distance, angles, dimensions, and coordinates.', 'duration': 28.258, 'max_score': 5033.548, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5033548.jpg'}, {'end': 5126.832, 'src': 'embed', 'start': 5094.591, 'weight': 3, 'content': [{'end': 5097.933, 'text': 'FPN stands for Feature Pyramid Network.', 'start': 5094.591, 'duration': 3.342}, {'end': 5108.759, 'text': 'Feature Pyramid Network is a feature extraction technique that given a single input image outputs different feature maps that vary in size.', 'start': 5098.294, 'duration': 10.465}, {'end': 5111.261, 'text': "It's something as we can see over here.", 'start': 5109.399, 'duration': 1.862}, {'end': 5116.885, 'text': 'Given a single input image, the outputs are different feature maps of different sizes.', 'start': 5111.561, 'duration': 5.324}, {'end': 5126.832, 'text': 'So what is feature extraction and feature maps? All the important data that is contained inside an image is what are called features.', 'start': 5117.265, 'duration': 9.567}], 'summary': 'Feature pyramid network extracts various feature maps from a single input image.', 'duration': 32.241, 'max_score': 5094.591, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5094591.jpg'}, {'end': 5351.127, 'src': 'embed', 'start': 5300.304, 'weight': 4, 'content': [{'end': 5306.712, 'text': 'For tasks like these, scale invariance is an important property that neural networks are supposed to have.', 'start': 5300.304, 'duration': 6.408}, {'end': 5314.221, 'text': 'However, for detecting points in a given image, scale invariance does not play a major role.', 'start': 5307.83, 'duration': 6.391}, {'end': 5325.786, 'text': 'If we want to detect a single point in a given image, that point is going to remain at the same location the scaling that we apply on that image.', 'start': 5315.022, 'duration': 10.764}, {'end': 5333.152, 'text': 'Hence, directly applying feature pyramid networks to detect points in an image is not an efficient technique.', 
'start': 5326.246, 'duration': 6.906}, {'end': 5342.139, 'text': 'Keypoint feature pyramid networks are an extension to the given FPN networks that help us detect points in a given image.', 'start': 5333.893, 'duration': 8.246}, {'end': 5346.503, 'text': 'The overall architecture of keypoint FPN remains the same.', 'start': 5342.76, 'duration': 3.743}, {'end': 5351.127, 'text': 'The outputs of the FPN can be visualized as we can see over here.', 'start': 5347.364, 'duration': 3.763}], 'summary': 'Neural networks need scale invariance, but for point detection in images, direct application of feature pyramid networks is inefficient.', 'duration': 50.823, 'max_score': 5300.304, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5300304.jpg'}, {'end': 5461.523, 'src': 'embed', 'start': 5433.484, 'weight': 6, 'content': [{'end': 5437.867, 'text': 'Alright, so this is the complete network architecture of the SFA3D technique.', 'start': 5433.484, 'duration': 4.383}, {'end': 5443.631, 'text': 'Now, let us discuss the different loss functions that are employed to train this network.', 'start': 5438.247, 'duration': 5.384}, {'end': 5449.034, 'text': 'First off, we are going to discuss the outputs that are generated by this network.', 'start': 5444.571, 'duration': 4.463}, {'end': 5453.337, 'text': 'The first output is a heatmap of the different classes.', 'start': 5449.735, 'duration': 3.602}, {'end': 5461.523, 'text': 'Heatmap is a data visualization technique that tells us the magnitude of a given phenomena occurring on a given input.', 'start': 5453.878, 'duration': 7.645}], 'summary': 'Sfa3d technique: network architecture, loss functions, class heatmap visualization.', 'duration': 28.039, 'max_score': 5433.484, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5433484.jpg'}, {'end': 5564.444, 'src': 'embed', 'start': 5533.11, 'weight': 7, 'content': [{'end': 
5539.594, 'text': 'As we move away from that peak, that solid red color is going to decay into a white color.', 'start': 5533.11, 'duration': 6.484}, {'end': 5547.778, 'text': 'Again, as we approach another peak, that white color is going to converge into a solid red color again.', 'start': 5540.655, 'duration': 7.123}, {'end': 5555.581, 'text': 'A heat map can also be visualized in terms of confidence of a network that it has on its predictions.', 'start': 5548.798, 'duration': 6.783}, {'end': 5564.444, 'text': "In our particular case, if we want to detect three objects, let's say a car, a truck and a pedestrian,", 'start': 5556.721, 'duration': 7.723}], 'summary': 'Heat map transitions from solid red to white and back, reflecting network confidence.', 'duration': 31.334, 'max_score': 5533.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5533110.jpg'}, {'end': 5629, 'src': 'embed', 'start': 5601.333, 'weight': 8, 'content': [{'end': 5607.398, 'text': 'In order to encourage the network to learn the right classes, we use what is called a focal loss.', 'start': 5601.333, 'duration': 6.065}, {'end': 5613.082, 'text': 'The mathematical expression of the focal loss is something like this.', 'start': 5609.799, 'duration': 3.283}, {'end': 5619.573, 'text': 'Focal loss is usually used when there exists a class imbalance in the data.', 'start': 5614.309, 'duration': 5.264}, {'end': 5629, 'text': 'Class imbalance usually occurs when there is a less amount of data for a particular class and a very high amount of data for another particular class.', 'start': 5620.253, 'duration': 8.747}], 'summary': 'Using focal loss to address class imbalance in network training.', 'duration': 27.667, 'max_score': 5601.333, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5601333.jpg'}, {'end': 5683.051, 'src': 'embed', 'start': 5655.68, 'weight': 9, 'content': [{'end': 
5659.564, 'text': 'In order to learn this, we use what is called an L1 loss.', 'start': 5655.68, 'duration': 3.884}, {'end': 5662.266, 'text': 'The mathematical expression is something like this.', 'start': 5659.924, 'duration': 2.342}, {'end': 5672.316, 'text': 'L1 loss simply takes the difference of the prediction and the actual value and computes the absolute value of that difference.', 'start': 5664.048, 'duration': 8.268}, {'end': 5683.051, 'text': 'The next output that we have are the dimensions of the object we are trying to detect and its Z coordinate the distance or the height from the ground.', 'start': 5673.263, 'duration': 9.788}], 'summary': 'Using l1 loss for prediction and detecting object dimensions and z coordinate.', 'duration': 27.371, 'max_score': 5655.68, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5655680.jpg'}, {'end': 5761.912, 'src': 'embed', 'start': 5737.723, 'weight': 10, 'content': [{'end': 5745.245, 'text': 'In order to efficiently deal with such an imbalance of loss values, we use what is called a balanced L1 loss.', 'start': 5737.723, 'duration': 7.522}, {'end': 5757.808, 'text': 'The balanced L1 loss balances the loss values that are generated by both of these inliers and outliers to get a network that performs well on both these types of data.', 'start': 5746.265, 'duration': 11.543}, {'end': 5761.912, 'text': 'Alright, so these are all the losses that are used in this network.', 'start': 5758.846, 'duration': 3.066}], 'summary': 'Balanced l1 loss balances inliers and outliers for improved network performance.', 'duration': 24.189, 'max_score': 5737.723, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5737723.jpg'}, {'end': 5806.372, 'src': 'embed', 'start': 5780.267, 'weight': 11, 'content': [{'end': 5788.874, 'text': 'During the training process, we tend to decrease or increase the learning rate values in order to 
fine-tune our optimization process.', 'start': 5780.267, 'duration': 8.607}, {'end': 5796.347, 'text': 'The learning rate decay technique that we use in this network is the cosine annealing.', 'start': 5789.864, 'duration': 6.483}, {'end': 5801.429, 'text': 'The graph of the learning rate in cosine annealing looks something like this.', 'start': 5797.107, 'duration': 4.322}, {'end': 5806.372, 'text': 'Towards the start the learning rate decreases slowly.', 'start': 5802.95, 'duration': 3.422}], 'summary': 'Cosine annealing technique is used to adjust learning rate during training.', 'duration': 26.105, 'max_score': 5780.267, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5780267.jpg'}, {'end': 5935.37, 'src': 'embed', 'start': 5908.813, 'weight': 13, 'content': [{'end': 5915.457, 'text': 'the training of these networks is really time consuming and require very expensive and good quality hardware.', 'start': 5908.813, 'duration': 6.644}, {'end': 5924.703, 'text': 'Since we have a time limit and some constraints on hardware on Kaggle, I was not able to train the network by myself.', 'start': 5916.678, 'duration': 8.025}, {'end': 5931.708, 'text': 'Hence, I have used the pre-trained weights that are provided along with this network architecture.', 'start': 5925.764, 'duration': 5.944}, {'end': 5935.37, 'text': 'So now let us have a look at the final output.', 'start': 5932.928, 'duration': 2.442}], 'summary': 'Training neural networks is time-consuming and requires expensive hardware. 
pre-trained weights were used due to time and hardware constraints.', 'duration': 26.557, 'max_score': 5908.813, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI5908813.jpg'}, {'end': 6089.493, 'src': 'embed', 'start': 6060.752, 'weight': 14, 'content': [{'end': 6063.993, 'text': 'In this case, we can see this is our self-driving car.', 'start': 6060.752, 'duration': 3.241}, {'end': 6071.376, 'text': 'And in this case, for our particular problem, we have four cameras in the front, left, right and the back.', 'start': 6064.353, 'duration': 7.023}, {'end': 6080.651, 'text': "Given these four images, we are required to process them and output the bird's eye view of the environment around the car.", 'start': 6072.588, 'duration': 8.063}, {'end': 6089.493, 'text': 'As we can see over here, the red rectangle is the ego car or our self-driving car and the other blue rectangles are the other cars.', 'start': 6081.231, 'duration': 8.262}], 'summary': "Self-driving car with 4 cameras processes images to output bird's eye view of environment.", 'duration': 28.741, 'max_score': 6060.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI6060752.jpg'}, {'end': 6199.869, 'src': 'embed', 'start': 6175.926, 'weight': 15, 'content': [{'end': 6184.253, 'text': "We already have the different classes in these images semantically segmented and we have to derive a bird's eye view image for the same.", 'start': 6175.926, 'duration': 8.327}, {'end': 6190.34, 'text': 'The cool thing about this dataset is that the dataset is derived from a simulation.', 'start': 6185.276, 'duration': 5.064}, {'end': 6199.869, 'text': 'The self-driving car drives around the simulation environment and we capture the different images from the four different cameras that are present on the self-driving car.', 'start': 6190.601, 'duration': 9.268}], 'summary': "Deriving bird's eye view images from 
semantically segmented images captured by four different cameras on a self-driving car in a simulation environment.", 'duration': 23.943, 'max_score': 6175.926, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI6175926.jpg'}, {'end': 6382.685, 'src': 'embed', 'start': 6357.684, 'weight': 16, 'content': [{'end': 6364.729, 'text': 'As the name suggests, this technique is an extension of the UNET semantic segmentation architecture.', 'start': 6357.684, 'duration': 7.045}, {'end': 6371.573, 'text': 'Therefore, the UNETXST has two components, the UNET architecture itself and its extension.', 'start': 6365.509, 'duration': 6.064}, {'end': 6374.155, 'text': 'Let us discuss the two components one by one.', 'start': 6371.933, 'duration': 2.222}, {'end': 6377.217, 'text': 'So the first component is the UNET network.', 'start': 6374.715, 'duration': 2.502}, {'end': 6382.685, 'text': 'UNET is a network that is used for semantic segmentation.', 'start': 6378.26, 'duration': 4.425}], 'summary': 'UNETXST is an extension of the UNET semantic segmentation architecture with two components: UNET architecture and its extension.', 'duration': 25.001, 'max_score': 6357.684, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI6357684.jpg'}], 'start': 4740.997, 'title': 'SFA3D for 3D object detection', 'summary': "Covers the SFA3D technique for 3D object detection using lidar data, outlining the KITTI 3D object detection dataset, discussing the architecture, keypoint FPN, loss functions, heatmap application, and bird's eye view generation, aiming to predict seven outputs from a bird's eye view input image.", 'chapters': [{'end': 5061.806, 'start': 4740.997, 'title': 'SFA3D: lidar-based 3D object detection', 'summary': "Discusses the SFA3D technique for 3D object detection using lidar data, showcasing segmentation and depth map results, and outlining the KITTI 3D object detection 
dataset and technical details of SFA3D, which predicts seven outputs from a bird's eye view input image.", 'duration': 320.809, 'highlights': ['Segmentation and Depth Map Results: The network faces slight issues while segmenting bicycles and the depth map has slight errors, but overall results look good.', 'KITTI 3D Object Detection Dataset: The dataset consists of images from the front camera of a self-driving car and LiDAR data captured by the car, with the SFA3D technique utilizing only the LiDAR data for the 3D object detection task.', 'Technical Details of SFA3D: SFA3D does 3D object detection using only LiDAR data, providing seven outputs including heat maps of different classes, distance of objects from the ego car, object angles, dimensions, and z coordinate of object centers.']}, {'end': 5453.337, 'start': 5062.567, 'title': 'SFA3D architecture & keypoint FPN', 'summary': 'Discusses the architecture of SFA3D, including keypoint FPN, a feature extraction technique for scale-invariant networks, and the different loss functions employed for training, aiming to generate a heatmap of different classes.', 'duration': 390.77, 'highlights': ['Feature Pyramid Network (FPN) is a feature extraction technique that generates different feature maps varying in size from a single input image. It provides an overview of FPN as a feature extraction technique, generating different feature maps from a single input image.', 'Keypoint FPN is an extension of FPN that helps detect points in an image, employing post-processing techniques to generate a final single output. Explains the concept of Keypoint FPN as an extension to FPN for detecting points in an image and the post-processing techniques involved.', 'Scale invariance is crucial for object detection, while Keypoint FPN is specifically designed to efficiently detect points in an image. 
Emphasizes the importance of scale invariance for object detection and the efficiency of Keypoint FPN in detecting points in an image.', 'The different loss functions employed to train the network include generating a heatmap of different classes as one of the network outputs. Discusses the use of different loss functions for training, specifically focusing on the generation of a heatmap of different classes as one of the network outputs.']}, {'end': 5831.093, 'start': 5453.878, 'title': 'Understanding heatmap & sfa3d technique', 'summary': 'Explains the concept of heatmap as a data visualization technique and its application in object detection, along with the use of focal loss, l1 loss, balanced l1 loss, and learning rate scheduling in the sfa3d technique for neural network training.', 'duration': 377.215, 'highlights': ["Heatmap Visualization Technique Heatmap is described as a visualization technique to represent the magnitude of a phenomenon, resembling a mountain terrain, and its application in visualizing the confidence of a network's predictions, as well as in object detection by generating heat maps for different objects.", 'Focal Loss for Class Imbalance The use of focal loss is explained as a method to address class imbalance in data, where the network focuses more on less confident predictions and less on more confident predictions, thereby encouraging the learning of correct classes.', 'L1 Loss for Learning Object Distance and Direction The concept of L1 loss is outlined as a method for learning the distance and direction of objects, where it computes the absolute difference between the prediction and the actual value, aiding in the training of neural networks for this purpose.', 'Balanced L1 Loss for Inliers and Outliers The balanced L1 loss is introduced as a technique to balance the loss values generated by inliers and outliers in the training data, ensuring that the neural network performs well on both types of data points.', 'Cosine Annealing for 
Learning Rate Scheduling The use of cosine annealing for learning rate decay during training is mentioned as a technique to fine-tune the optimization process, where the learning rate decreases slowly at the start, then linearly, and finally slows down towards the end.']}, {'end': 6453.342, 'start': 5831.894, 'title': "3d object detection and bird's eye view", 'summary': "Provides an overview of the sfa3d technique for 3d object detection and bird's eye view generation, discussing the problem definition, data set, and the unetxst technique, emphasizing the significance of bird's eye view for object detection and path planning.", 'duration': 621.448, 'highlights': ["The SFA3D technique focuses on 3D object detection and bird's eye view generation for self-driving cars, utilizing pre-trained weights due to the time-consuming nature of training these networks. The technique emphasizes the challenges of training 3D object detection networks and the use of pre-trained weights for the FPN ResNet network.", "The problem statement involves processing images from multiple cameras on a self-driving car to produce a bird's eye view, essential for object detection and path planning. The problem statement entails processing images from different cameras to generate a bird's eye view, crucial for object detection and path planning on self-driving cars.", "The dataset consists of images captured from four different cameras on a self-driving car, along with corresponding bird's eye view images derived from a simulation environment. The dataset includes images from four car cameras and corresponding bird's eye view images derived from a simulation, providing semantically segmented images and ground truth data for training.", 'The UNETXST technique, an extension of the UNET architecture, is employed for semantic segmentation and is particularly useful in the context of self-driving cars. 
The UNETXST technique, an extension of UNET, is utilized for semantic segmentation, featuring an encoder-decoder structure with skip connections, originally designed for biomedical image segmentation and applicable to self-driving cars.']}], 'duration': 1712.345, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI4740997.jpg', 'highlights': ['SFA3D technique utilizes only LiDAR data for 3D object detection task', 'KITTI 3D Object Detection Dataset consists of front camera images and LiDAR data', 'SFA3D provides seven outputs including heat maps, object distance, angles, and dimensions', 'FPN generates different feature maps from a single input image', 'Keypoint FPN helps detect points in an image and employs post-processing techniques', 'Importance of scale invariance for object detection and efficiency of Keypoint FPN', 'Different loss functions employed for training including generating heat maps', "Heatmap visualization technique for representing confidence of network's predictions", 'Focal loss method to address class imbalance in data', 'L1 loss for learning object distance and direction', 'Balanced L1 loss technique to balance loss values for inliers and outliers', 'Cosine annealing for learning rate decay during training', "SFA3D technique focuses on 3D object detection and bird's eye view generation", 'Utilization of pre-trained weights for FPN ResNet network due to time-consuming training', "Problem statement involves processing images from multiple cameras to produce bird's eye view", "Dataset includes images from four car cameras and corresponding bird's eye view images", 'UNETXST technique is employed for semantic segmentation in the context of self-driving cars']}, {'end': 7176.188, 'segs': [{'end': 6730.001, 'src': 'embed', 'start': 6704.824, 'weight': 0, 'content': [{'end': 6713.188, 'text': 'The spatial transformer learns the different parameters of the homography matrix in order to aid the 
convolution network.', 'start': 6704.824, 'duration': 8.364}, {'end': 6715.354, 'text': 'In the same example.', 'start': 6714.133, 'duration': 1.221}, {'end': 6727.319, 'text': 'if we have an object over here and we take its image from this point and let us assume for some reason the CNN architecture is not able to classify this object accurately', 'start': 6715.354, 'duration': 11.965}, {'end': 6730.001, 'text': 'How spatial transformers are going to help?', 'start': 6727.339, 'duration': 2.662}], 'summary': 'Spatial transformer aids cnn by learning homography matrix parameters.', 'duration': 25.177, 'max_score': 6704.824, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI6704824.jpg'}, {'end': 7073.831, 'src': 'embed', 'start': 7000.615, 'weight': 1, 'content': [{'end': 7004.778, 'text': 'The image shape has been reduced in order to get a faster learning curve.', 'start': 7000.615, 'duration': 4.163}, {'end': 7009.622, 'text': 'The batch size has also been increased in order to get the same results.', 'start': 7005.899, 'duration': 3.723}, {'end': 7017.779, 'text': 'Also, the algorithm is only trained for a total of 40 epochs, whereas we would have required to train for 100 epochs.', 'start': 7010.453, 'duration': 7.326}, {'end': 7024.505, 'text': 'In this next section, we have the different data loader functions that load the data for our machine learning model.', 'start': 7018.7, 'duration': 5.805}, {'end': 7028.828, 'text': 'In this next section, we define the different network architecture.', 'start': 7025.706, 'duration': 3.122}, {'end': 7031.551, 'text': 'First, we define the spatial transformer.', 'start': 7029.509, 'duration': 2.042}, {'end': 7036.214, 'text': 'Next, we implement UNET along with the spatial transformer extension.', 'start': 7032.071, 'duration': 4.143}, {'end': 7038.877, 'text': 'And finally, we train the network.', 'start': 7037.295, 'duration': 1.582}, {'end': 7042.276, 'text': 
'Now, let us have a look at the final output predictions.', 'start': 7039.815, 'duration': 2.461}, {'end': 7049.541, 'text': 'Just like before, we have images from the four cameras on the self-driving car, the front, rear, left and right.', 'start': 7043.037, 'duration': 6.504}, {'end': 7054.463, 'text': 'This next image is the output prediction generated by our model.', 'start': 7050.721, 'duration': 3.742}, {'end': 7058.085, 'text': "And the next image is the ground truth bird's eye view image.", 'start': 7055.284, 'duration': 2.801}, {'end': 7064.089, 'text': "As we can see, the model has learned to some extent to represent the bird's eye view image.", 'start': 7059.426, 'duration': 4.663}, {'end': 7070.76, 'text': 'However, there is some slight noise that is present in the output.', 'start': 7067.349, 'duration': 3.411}, {'end': 7073.831, 'text': 'Let us have a look at other images as well.', 'start': 7071.884, 'duration': 1.947}], 'summary': 'Reduced image shape, increased batch size, trained for 40 epochs, implemented unet with spatial transformer extension, and obtained output predictions for self-driving car images.', 'duration': 73.216, 'max_score': 7000.615, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI7000615.jpg'}, {'end': 7175.006, 'src': 'embed', 'start': 7146.337, 'weight': 4, 'content': [{'end': 7151.321, 'text': 'Read and implement new research papers related to perception for self-driving cars.', 'start': 7146.337, 'duration': 4.984}, {'end': 7158.767, 'text': 'Or you can jump to and learn more about different modules for a self-driving car like localization or motion planning.', 'start': 7151.882, 'duration': 6.885}, {'end': 7166.294, 'text': 'As far as computer vision is concerned, you can also try out different other projects like similarity learning,', 'start': 7159.408, 'duration': 6.886}, {'end': 7168.756, 'text': 'image captioning and generative modeling.', 'start': 7166.294, 
'duration': 2.462}, {'end': 7175.006, 'text': 'This is all that we had to discuss for this course and I would see you in another video or another course.', 'start': 7169.437, 'duration': 5.569}], 'summary': 'Explore perception research for self-driving cars, including modules like localization and motion planning, as well as computer vision projects like similarity learning, image captioning, and generative modeling.', 'duration': 28.669, 'max_score': 7146.337, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI7146337.jpg'}], 'start': 6454.466, 'title': 'Unet, spatial transformers, and unetxst implementation', 'summary': "Covers the use of unet and spatial transformers in image processing, including the concept of homography and practical applications, as well as the implementation of unetxst, detailing the model's output predictions and the need for further training, concluding with a recommendation for further exploration in perception for self-driving cars and computer vision projects.", 'chapters': [{'end': 6933.207, 'start': 6454.466, 'title': 'Understanding unet and spatial transformers', 'summary': 'Discusses the use of unet and spatial transformers in image processing, detailing the concept of homography, its practical applications, and the role of spatial transformers in aiding convolution networks, as well as the limitations of inverse perspective mapping in self-driving car scenarios.', 'duration': 478.741, 'highlights': ['Spatial transformers aid convolution networks in learning different spatial properties by learning the parameters of the homography matrix, enabling more accurate object classification. 
Spatial transformers learn the parameters of the homography matrix to present images from different perspectives to aid convolution networks in accurate object classification.', 'The limitations of inverse perspective mapping in self-driving car scenarios due to the assumption of a flat world, leading to inaccuracies in the obtained top view of the environment. Inverse perspective mapping assumes a flat world, resulting in inaccuracies for objects with considerable height, necessitating the use of complex architectures like UNETXST.', 'The combination of UNET and Spatial Transformers involves four inputs for different images, each processed through an encoder network and combined using spatial transformers before being fed through the skip connection for the final output. The combination of UNET and Spatial Transformers utilizes four inputs, each processed through an encoder network and combined using spatial transformers before being fed through the skip connection for the final output.']}, {'end': 7176.188, 'start': 6934.047, 'title': 'Unetxst implementation and results', 'summary': "Discusses the implementation of unetxst, covering the utility functions, configuration parameters, data loader functions, network architecture, and the model's output predictions, revealing the model's learning progress and the need for further training, concluding the course with a recommendation for further exploration in perception for self-driving cars and computer vision projects.", 'duration': 242.141, 'highlights': ["The model has learned to some extent to represent the bird's eye view image, but there is some slight noise present in the output. The model's output predictions show the learning progress in representing the bird's eye view image, with some slight noise present in the output.", "The algorithm is only trained for a total of 40 epochs, whereas it would have required training for 100 epochs, indicating the need for further training to converge to better results. 
The algorithm's training for 40 epochs, instead of the required 100, reflects the need for further training to achieve better results.", 'The chapter recommends further exploration in perception for self-driving cars, including learning about different modules for a self-driving car like localization or motion planning, and trying out different computer vision projects like similarity learning, image captioning, and generative modeling. The recommendation includes exploring different modules for self-driving cars and computer vision projects, such as localization, motion planning, similarity learning, image captioning, and generative modeling.']}], 'duration': 721.722, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/cPOtULagNnI/pics/cPOtULagNnI6454466.jpg', 'highlights': ['Spatial transformers aid convolution networks in learning different spatial properties by learning the parameters of the homography matrix, enabling more accurate object classification.', 'The combination of UNET and Spatial Transformers utilizes four inputs, each processed through an encoder network and combined using spatial transformers before being fed through the skip connection for the final output.', "The model's output predictions show the learning progress in representing the bird's eye view image, with some slight noise present in the output.", "The algorithm's training for 40 epochs, instead of the required 100, reflects the need for further training to achieve better results.", 'The recommendation includes exploring different modules for self-driving cars and computer vision projects, such as localization, motion planning, similarity learning, image captioning, and generative modeling.']}], 'highlights': ["1 Course covers projects related to perception for self-driving cars, including road segmentation, 2D object detection, 3D data visualization, multi-task learning, 3D object detection, and bird's eye view visualization.", '1 Projects involve techniques 
such as fully convolutional networks for road segmentation and YOLO algorithm for 2D object detection.', '1 Fully convolutional networks (FCN) revolutionized road segmentation by enabling end-to-end segmentation without hand-crafted techniques.', '2 Interpolation method outperformed transposed convolutions, highlighting its practical advantage.', '2 FCN8 architecture combines knowledge from FCN32 and FCN16 to identify coarser and finer structures in the image.', '3 YOLO algorithm for 2D object detection detailed, showcasing its ability to predict multiple objects in a single cell and its fast and accurate performance in self-driving cars.', '3 Distinction between 2D and 3D object detection highlighted, emphasizing the additional data representation and the ability of 3D object detection to predict 3D bounding boxes with labels.', '4 Results of using YOLOv3 algorithm on Lyft 3D object detection dataset showcasing detection of cars, buses, and trucks.', '4 SORT algorithm uses the Hungarian algorithm to solve the linear assignment problem, reducing the order complexity from n factorial to n cubed.', '5 DeepSort algorithm combines Kalman filter predictions and cascade matching to assign IDs to objects in videos, with an evaluation of its performance in different scenarios.', '6 MTAN technique involves calculating shared features for input images and utilizing task-specific attention modules to cater to different tasks, providing an efficient approach for multitask learning in computer vision.', '6 Visualizations of LiDAR data in the KITTI 3D object detection dataset, including the correspondence between LiDAR and camera images, are presented, offering insights into the perception of a 3D environment.', '7 SFA3D technique utilizes only LiDAR data for 3D object detection task.', '7 FPN generates different feature maps from a single input image.', '8 Spatial transformers aid convolution networks in learning different spatial properties by learning the parameters of the 
homography matrix, enabling more accurate object classification.', '8 Combination of UNET and Spatial Transformers utilizes four inputs, each processed through an encoder network and combined using spatial transformers before being fed through the skip connection for the final output.']}