title
Lecture 15 | Efficient Methods and Hardware for Deep Learning
description
In Lecture 15, guest lecturer Song Han discusses algorithms and specialized hardware that can be used to accelerate training and inference of deep learning workloads. We discuss pruning, weight sharing, quantization, and other techniques for accelerating inference, as well as parallelization, mixed precision, and other techniques for accelerating training. We discuss specialized hardware for deep learning such as GPUs, FPGAs, and ASICs, including the Tensor Cores in NVIDIA’s latest Volta GPUs as well as Google’s Tensor Processing Units (TPUs).
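The pruning idea mentioned above (train, remove the small-magnitude connections, retrain the survivors, repeat) is easy to sketch. Below is a minimal NumPy sketch of iterative magnitude pruning; the function names, sparsity schedule, and the toy retrain step are illustrative assumptions, not code from the course materials.

```python
# Illustrative sketch of iterative magnitude pruning (assumed helper names, not lecture code).
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights, np.ones_like(weights, dtype=bool)
    threshold = np.partition(np.abs(weights), k - 1, axis=None)[k - 1]
    mask = np.abs(weights) > threshold          # surviving connections
    return weights * mask, mask

def iterative_prune(weights, retrain, steps=3, final_sparsity=0.9):
    """Prune a little, retrain the survivors, and repeat to recover accuracy."""
    mask = np.ones_like(weights, dtype=bool)
    for step in range(1, steps + 1):
        weights, mask = prune_by_magnitude(weights, final_sparsity * step / steps)
        weights = retrain(weights, mask)        # retraining only updates surviving weights
    return weights, mask

# Toy usage with a fake layer and a stand-in retrain step.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
retrain = lambda w, m: w + 0.01 * rng.normal(size=w.shape) * m
w_pruned, mask = iterative_prune(w, retrain)
print(f"sparsity: {1 - mask.mean():.2f}")       # about 0.90, i.e. roughly 10x fewer connections
```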
Keywords: Hardware, CPU, GPU, ASIC, FPGA, pruning, weight sharing, quantization, low-rank approximations, binary networks, ternary networks, Winograd transformations, EIE, data parallelism, model parallelism, mixed precision, FP16, FP32, model distillation, Dense-Sparse-Dense training, NVIDIA Volta, Tensor Core, Google TPU, Google Cloud TPU
Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture15.pdf
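Weight sharing (trained quantization) can be sketched the same way: cluster a layer's weights with k-means, keep a small codebook of shared values, and store each weight as a short index into that codebook, so 16 shared values need only a 4-bit index per weight. The sketch below is an illustrative assumption rather than the lecture's implementation, and it omits the later fine-tuning of the shared centroids during retraining.

```python
# Illustrative sketch of k-means weight sharing (assumed helper names, not lecture code).
import numpy as np

def kmeans_1d(values, k=16, iters=20):
    """Cluster scalar weights into k shared values (the codebook)."""
    centroids = np.linspace(values.min(), values.max(), k)     # linear initialization
    for _ in range(iters):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = values[idx == j].mean()
    return centroids, idx

def share_weights(weights, bits=4):
    """Replace each weight by the nearest of 2**bits shared values."""
    codebook, idx = kmeans_1d(weights.ravel(), k=2 ** bits)
    return codebook, idx.astype(np.uint8).reshape(weights.shape)

# Toy usage: 32-bit floats become a 16-entry codebook plus 4-bit indices,
# dropping per-weight storage from 32 bits to about 4 bits (roughly 8x smaller).
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(64, 64)).astype(np.float32)
codebook, indices = share_weights(w, bits=4)
w_shared = codebook[indices]                  # decode: look up the shared value for each index
print("max quantization error:", float(np.abs(w - w_shared).max()))
```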
--------------------------------------------------------------------------------------
Convolutional Neural Networks for Visual Recognition
Instructors:
Fei-Fei Li: http://vision.stanford.edu/feifeili/
Justin Johnson: http://cs.stanford.edu/people/jcjohns/
Serena Yeung: http://ai.stanford.edu/~syyeung/
Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This lecture collection is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. From this lecture collection, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision.
Website:
http://cs231n.stanford.edu/
For additional learning opportunities please visit:
http://online.stanford.edu/
detail
{'title': 'Lecture 15 | Efficient Methods and Hardware for Deep Learning', 'heatmap': [{'end': 985.534, 'start': 780.806, 'weight': 0.888}, {'end': 1113.691, 'start': 1060.008, 'weight': 0.788}, {'end': 1438.626, 'start': 1242.017, 'weight': 0.88}, {'end': 3238.793, 'start': 3179.626, 'weight': 0.768}], 'summary': 'Discusses efficient deep learning hardware, neural network pruning, and techniques for achieving significant savings in model size, training speed, and energy efficiency, with examples achieving up to 16 times saving using two-bit indices, 10x to 510x size reduction with speedup, and 24,000x energy efficiency over CPU. It covers TPU design, GPU evolution, and the future of AI hardware, including challenges, opportunities, and trade-offs in hardware design for machine learning.', 'chapters': [{'end': 793.158, 'segs': [{'end': 70.789, 'src': 'embed', 'start': 44.591, 'weight': 1, 'content': [{'end': 53.034, 'text': 'It is changing our lives, but there is a recent trend that in order to achieve such high accuracy, the models are getting larger and larger.', 'start': 44.591, 'duration': 8.443}, {'end': 62.005, 'text': 'For example, for ImageNet recognition, from the 2012 winner to the 2015 winner, the model size increased by 16x.', 'start': 54.002, 'duration': 8.003}, {'end': 70.789, 'text': "And just in one year, for Baidu's Deep Speech, the number of training operations increased by 10x.", 'start': 63.006, 'duration': 7.783}], 'summary': "Increasing model sizes for high accuracy: the ImageNet winner's model size grew by 16x from 2012 to 2015, and Baidu's Deep Speech training operations grew by 10x in a year.", 'duration': 26.198, 'max_score': 44.591, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo44591.jpg'}, {'end': 143.963, 'src': 'embed', 'start': 115.617, 'weight': 2, 'content': [{'end': 122.705, 'text': 'For example, ResNet-152, which is less than 1% more accurate than ResNet-101,', 'start': 115.617, 'duration': 7.088}, {'end': 124.588, 'text': 'takes 1.5 weeks to train on four', 'start': 122.705, 'duration': 1.883}, {'end': 137.639, 'text': 'Maxwell M40 GPUs, for example, which greatly limits both how we do homework and how researchers design new models;', 'start': 128.711, 'duration': 8.928}, {'end': 138.439, 'text': "it's getting pretty slow.", 'start': 137.639, 'duration': 0.8}, {'end': 143.963, 'text': 'And the third challenge for those bulky models is energy efficiency.', 'start': 140.08, 'duration': 3.883}], 'summary': 'ResNet-152 is less than 1% more accurate than ResNet-101 yet takes 1.5 weeks to train on four Maxwell M40 GPUs, posing limitations for researchers and impacting energy efficiency.', 'duration': 28.346, 'max_score': 115.617, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo115617.jpg'}, {'end': 245.307, 'src': 'embed', 'start': 221.505, 'weight': 3, 'content': [{'end': 229.612, 'text': 'So how to make deep learning more efficient? 
So we have to improve the energy efficiency by this algorithm and hardware co-design.', 'start': 221.505, 'duration': 8.107}, {'end': 233.265, 'text': 'So this is the previous way we design hardware.', 'start': 231.104, 'duration': 2.161}, {'end': 245.307, 'text': 'For example, we have some benchmark, say, spec 2006, and then run those benchmarks and tune your CPU architectures for those benchmarks.', 'start': 233.345, 'duration': 11.962}], 'summary': 'Improving energy efficiency in deep learning through algorithm and hardware co-design.', 'duration': 23.802, 'max_score': 221.505, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo221505.jpg'}, {'end': 358.71, 'src': 'embed', 'start': 329.603, 'weight': 4, 'content': [{'end': 333.566, 'text': 'So the general purpose hardware includes the CPU or the GPGPU.', 'start': 329.603, 'duration': 3.963}, {'end': 338.249, 'text': 'And their difference is that CPU is latency oriented, single threaded.', 'start': 334.366, 'duration': 3.883}, {'end': 339.99, 'text': "It's like a big elephant.", 'start': 338.809, 'duration': 1.181}, {'end': 344.399, 'text': 'Well, the GPU is throughput oriented.', 'start': 340.816, 'duration': 3.583}, {'end': 352.966, 'text': 'It has many small weak threads, but there are thousands of such small weak cores, like a group of small ants, but there are so many ants.', 'start': 344.739, 'duration': 8.227}, {'end': 358.71, 'text': 'And specialized hardware, roughly, there are FPGAs and ASICs.', 'start': 354.807, 'duration': 3.903}], 'summary': 'General purpose hardware includes cpu/gpgpu; cpu is latency-oriented, single-threaded, while gpu is throughput-oriented with many small weak threads and cores. specialized hardware includes fpgas and asics.', 'duration': 29.107, 'max_score': 329.603, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo329603.jpg'}, {'end': 642.958, 'src': 'embed', 'start': 616.417, 'weight': 0, 'content': [{'end': 624.499, 'text': "So I'm gonna train the connective first, and then prove some of the connections, and train the remaining weights, and do this process iteratively.", 'start': 616.417, 'duration': 8.082}, {'end': 633.742, 'text': 'And, as a result, I can reduce the number of connections in AlexNet from 60 million parameters to only six million parameters,', 'start': 625.22, 'duration': 8.522}, {'end': 636.683, 'text': 'which is 10 times less the computation.', 'start': 633.742, 'duration': 2.941}, {'end': 642.958, 'text': 'So, So this is the accuracy.', 'start': 638.264, 'duration': 4.694}], 'summary': 'Iterative training reduces alexnet connections to 6m parameters, 10x less computation, improving accuracy.', 'duration': 26.541, 'max_score': 616.417, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo616417.jpg'}], 'start': 4.918, 'title': 'Efficient deep learning hardware and neural network pruning', 'summary': 'Discusses challenges of large deep learning models impacting deployment, training speed, and energy consumption, including examples like model size increase and training time. it explores improving energy efficiency through algorithm and hardware co-design, and introduces concepts such as general purpose and specialized hardware. 
Additionally, it covers the process of pruning neural networks to reduce parameters and computation, demonstrating a 90% reduction in parameters while maintaining accuracy and highlighting the similarity between the pruning procedure and the development of the human brain.', 'chapters': [{'end': 542.614, 'start': 4.918, 'title': 'Efficient deep learning hardware', 'summary': 'The chapter discusses the challenges of large deep learning models, including their impact on deployment, training speed, and energy consumption, with examples such as model size increase and training time. It also explores the approach to improving energy efficiency through algorithm and hardware co-design, and introduces hardware concepts such as general purpose and specialized hardware, as well as different number representations for efficient deep learning.', 'duration': 537.696, 'highlights': ["The model size for ImageNet recognition increased by 16x from 2012 to 2015, and for Baidu's Deep Speech, the number of training operations increased by 10x in just one year.", 'ResNet-152, which is only slightly more accurate than ResNet-101, takes 1.5 weeks to train on four M40 GPUs, significantly limiting the development and research pace.', 'Tasks such as AlphaGo required 2,000 CPUs and 300 GPUs, resulting in a $3,000 electric bill, emphasizing the high energy cost associated with large models and the impact on battery power and data center expenses.', 'Efforts to improve energy efficiency in deep learning involve algorithm and hardware co-design to optimize energy consumption, including exploring the algorithm-hardware boundary for overall efficiency enhancement.', 'Introduction to the concept of general purpose hardware (CPU, GPGPU) and specialized hardware (FPGAs, ASICs), highlighting their differences and applications in deep learning.']}, {'end': 793.158, 'start': 543.334, 'title': 'Efficient neural network pruning', 'summary': 'Discusses the process of pruning neural networks to reduce parameters and computation, demonstrating a 90% reduction in parameters while maintaining accuracy and highlighting the similarity between the pruning procedure and the development of the human brain.', 'duration': 249.824, 'highlights': ['Pruning neural networks can reduce the number of connections in AlexNet from 60 million parameters to only six million parameters, resulting in a 10 times reduction in computation.', 'Retraining the remaining weights after pruning can fully recover the accuracy, demonstrating the effectiveness of the iterative process.', "Pruning away 90% of the weights didn't hurt the BLEU score in RNNs and LSTMs, showcasing the robustness of the pruning approach.", 'The process of neural network pruning bears similarity to the natural pruning procedure in the development of the human brain, where the synapse count grows from 50 trillion at birth to roughly 1,000 trillion in infancy and is then pruned back to about 500 trillion in adolescence.']}], 'duration': 788.24, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo4918.jpg', 'highlights': ['Pruning neural networks reduces AlexNet connections from 60M to 6M parameters, 10x computation reduction', 'Model size for ImageNet recognition increased by 16x from 2012 to 2015', 'Training time for ResNet-152 takes 1.5 weeks on four M40 GPUs, limiting research pace', 'Efforts to improve energy efficiency involve algorithm and hardware co-design', 'Introduction to general purpose hardware (CPU, GPGPU) and 
specialized hardware (FPGAs, ASICs)']}, {'end': 1255.268, 'segs': [{'end': 865.382, 'src': 'embed', 'start': 837.101, 'weight': 2, 'content': [{'end': 841.863, 'text': 'Yeah So the next idea, weight sharing.', 'start': 837.101, 'duration': 4.762}, {'end': 843.965, 'text': 'So now we have remember.', 'start': 842.384, 'duration': 1.581}, {'end': 852.532, 'text': 'our end goal is to remove connections so that we can have less memory footprint, so that we can have more energy efficient deployment.', 'start': 843.965, 'duration': 8.567}, {'end': 855.314, 'text': 'Now we have less number of parameters by pruning.', 'start': 853.092, 'duration': 2.222}, {'end': 862.539, 'text': 'What we wanna have is we wanna have less number of bits per parameter so they multiply together to get a small model.', 'start': 855.734, 'duration': 6.805}, {'end': 865.382, 'text': 'So the idea is like this.', 'start': 863.68, 'duration': 1.702}], 'summary': 'Weight sharing reduces parameters for smaller, energy-efficient models.', 'duration': 28.281, 'max_score': 837.101, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo837101.jpg'}, {'end': 994.441, 'src': 'embed', 'start': 969.08, 'weight': 1, 'content': [{'end': 978.208, 'text': 'So remember previously after pruning, this is what the weight distribution like, and after weight sharing, they become discrete.', 'start': 969.08, 'duration': 9.128}, {'end': 985.534, 'text': 'There are only 16 different values here, meaning we can use four bits, four bits to represent each number.', 'start': 979.008, 'duration': 6.526}, {'end': 994.441, 'text': 'And by training on such with shared neural network, training on such with shared neural network, these weights can adjust.', 'start': 987.048, 'duration': 7.393}], 'summary': 'After weight sharing, weights become discrete with 16 different values, allowing representation with four bits per number.', 'duration': 25.361, 'max_score': 969.08, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo969080.jpg'}, {'end': 1113.691, 'src': 'heatmap', 'start': 1060.008, 'weight': 0.788, 'content': [{'end': 1064.331, 'text': 'And compared with the cheap SVD method, it has a better compression ratio.', 'start': 1060.008, 'duration': 4.323}, {'end': 1068.88, 'text': 'And the final idea is,', 'start': 1067.239, 'duration': 1.641}, {'end': 1082.108, 'text': 'we can apply those Huffman coding to use more number of bits for those infrequent numbers infrequent appearing weights and less number of bits for those more frequently appearing weights.', 'start': 1068.88, 'duration': 13.228}, {'end': 1086.591, 'text': 'So, by combining these three methods pruning,', 'start': 1083.829, 'duration': 2.762}, {'end': 1096.596, 'text': 'reassuring and also Huffman coding we can compress the neural networks state of the art neural networks ranging from 10x to 49x,', 'start': 1086.591, 'duration': 10.005}, {'end': 1101.641, 'text': '10x to 49x and without hurting the prediction accuracy.', 'start': 1096.596, 'duration': 5.045}, {'end': 1104.583, 'text': 'Sometimes a little bit better, but maybe that is noise.', 'start': 1101.721, 'duration': 2.862}, {'end': 1113.691, 'text': 'So the next question is, these models are just pre-trained models by, say, Google, Microsoft.', 'start': 1106.445, 'duration': 7.246}], 'summary': 'Combining pruning, reassurance, and huffman coding can compress neural networks by 10x to 49x without impacting prediction accuracy.', 
'duration': 53.683, 'max_score': 1060.008, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo1060008.jpg'}, {'end': 1096.596, 'src': 'embed', 'start': 1068.88, 'weight': 3, 'content': [{'end': 1082.108, 'text': 'we can apply Huffman coding to use more bits for those infrequently appearing weights and fewer bits for those more frequently appearing weights.', 'start': 1068.88, 'duration': 13.228}, {'end': 1086.591, 'text': 'So, by combining these three methods pruning,', 'start': 1083.829, 'duration': 2.762}, {'end': 1096.596, 'text': 'weight sharing and also Huffman coding, we can compress state-of-the-art neural networks by 10x to 49x,', 'start': 1086.591, 'duration': 10.005}], 'summary': 'Combining pruning, weight sharing, and Huffman coding can compress neural networks by 10x to 49x.', 'duration': 27.716, 'max_score': 1068.88, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo1068880.jpg'}], 'start': 794.4, 'title': 'Neural network efficiency techniques', 'summary': 'Discusses weight pruning, sharing, and compression techniques for neural networks, achieving up to 16 times saving using two-bit indices, and 10x to 510x size reduction with speedup and energy efficiency gains.', 'chapters': [{'end': 966.678, 'start': 794.4, 'title': 'Weight 
pruning and sharing for efficient neural networks', 'summary': 'Discusses weight pruning and sharing techniques to reduce memory footprint and improve energy efficiency in neural networks, including the potential 16 times saving by using two-bit indices instead of 32-bit floating point numbers.', 'duration': 172.278, 'highlights': ['Weight sharing through k-means clustering can lead to 16 times saving by using two-bit indices instead of 32-bit floating point numbers.', 'Pruning reduces the number of parameters, aiming for a smaller model and more energy-efficient deployment.', 'Pruning and weight sharing techniques are used to remove small connections and achieve energy-efficient deployment in neural networks.']}, {'end': 1255.268, 'start': 969.08, 'title': 'Neural network compression', 'summary': 'Explores the methods of weight sharing, pruning, and huffman coding to compress neural networks, achieving a 10x to 510x reduction in size while maintaining or improving accuracy, leading to significant speedup and energy efficiency gains.', 'duration': 286.188, 'highlights': ['By combining weight sharing and pruning, the model can be compressed to 3% of its original size without impacting accuracy, achieving a 10% improvement over individual methods.', 'Applying Huffman coding to weights further compresses state-of-the-art neural networks by 10x to 49x without sacrificing prediction accuracy.', 'SqueezeNet, a compact model with no fully connected layers, achieves a 50x reduction in size compared to AlexNet before compression, and further compresses to 510x smaller after compression while maintaining the same accuracy.', 'Pruning results in an average speedup of 3x on CPU, 3x on GPU, and 5x on mobile GPU, with a similar improvement in energy efficiency.']}], 'duration': 460.868, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo794400.jpg', 'highlights': ['SqueezeNet compresses to 510x smaller after compression while maintaining accuracy', 'Weight sharing through k-means clustering leads to 16 times saving using two-bit indices', 'Pruning and weight sharing techniques remove small connections for energy-efficient deployment', 'Huffman coding compresses state-of-the-art neural networks by 10x to 49x without accuracy loss', 'Pruning results in an average speedup of 3x on CPU, 3x on GPU, and 5x on mobile GPU']}, {'end': 1680.634, 'segs': [{'end': 1307.448, 'src': 'embed', 'start': 1257.49, 'weight': 0, 'content': [{'end': 1263.311, 'text': 'OK, Having talked about weight pruning and weight sharing, which is a non-linear quantization method,', 'start': 1257.49, 'duration': 5.821}, {'end': 1268.733, 'text': "I'm going to talk about quantization which is widely used, widely used in the TPU design.", 'start': 1263.311, 'duration': 5.422}, {'end': 1272.715, 'text': 'All the TPU design is using only 8-bit, or 8-bit for inference.', 'start': 1268.933, 'duration': 3.782}, {'end': 1276.837, 'text': 'And the way, how they can use that is because of the quantization.', 'start': 1273.095, 'duration': 3.742}, {'end': 1278.277, 'text': "And let's see how that will work.", 'start': 1277.177, 'duration': 1.1}, {'end': 1286.6, 'text': 'So quantization has, this is a complicated figure, but the intuition is very simple.', 'start': 1280.738, 'duration': 5.862}, {'end': 1293.424, 'text': 'the neural network and train it with the normal floating point numbers, train with floating point numbers.', 'start': 1288.342, 'duration': 5.082}, {'end': 1300.006, 'text': 'And 
quantize the weight and activations by gather the statistics for each layer.', 'start': 1294.364, 'duration': 5.642}, {'end': 1302.907, 'text': 'For example, what is the maximum number? Maximum number.', 'start': 1300.166, 'duration': 2.741}, {'end': 1307.448, 'text': 'And how many bits are enough to represent this dynamic range?', 'start': 1303.427, 'duration': 4.021}], 'summary': 'Tpu design uses 8-bit quantization for inference, gathering statistics for each layer.', 'duration': 49.958, 'max_score': 1257.49, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo1257490.jpg'}, {'end': 1367.192, 'src': 'embed', 'start': 1337.847, 'weight': 1, 'content': [{'end': 1343.488, 'text': 'And this is the result for the number of bits versus what is the accuracy.', 'start': 1337.847, 'duration': 5.641}, {'end': 1348.809, 'text': "For example, using fixed 8-bit, the accuracy for GoogleNet doesn't drop significantly.", 'start': 1343.528, 'duration': 5.281}, {'end': 1353.99, 'text': 'And for VGG16, it also remains pretty well for the accuracy.', 'start': 1349.389, 'duration': 4.601}, {'end': 1359.091, 'text': "Well, if you're going down to 6-bit, the accuracy begins to drop pretty dramatically.", 'start': 1354.57, 'duration': 4.521}, {'end': 1365.032, 'text': 'OK, next idea, low-rank approximation.', 'start': 1362.171, 'duration': 2.861}, {'end': 1367.192, 'text': 'Low-rank approximation.', 'start': 1366.192, 'duration': 1}], 'summary': 'Accuracy remains high with 8-bit, drops dramatically with 6-bit. low-rank approximation is mentioned.', 'duration': 29.345, 'max_score': 1337.847, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo1337847.jpg'}, {'end': 1536.86, 'src': 'embed', 'start': 1506.108, 'weight': 2, 'content': [{'end': 1508.809, 'text': 'Some of them are corner detector like here, this filter.', 'start': 1506.108, 'duration': 2.701}, {'end': 1514.652, 'text': "Actually we don't need such fine grain resolution, just three weights are enough.", 'start': 1509.73, 'duration': 4.922}, {'end': 1521.156, 'text': 'So this is the validation accuracy on ImageNet with the AlexNet.', 'start': 1515.573, 'duration': 5.583}, {'end': 1527.2, 'text': 'So the dash line is the baseline accuracy with floating point 32.', 'start': 1521.797, 'duration': 5.403}, {'end': 1529.101, 'text': 'And the red line is our result.', 'start': 1527.2, 'duration': 1.901}, {'end': 1536.86, 'text': 'Pretty much the same accuracy converged compared with the full precision weights.', 'start': 1530.975, 'duration': 5.885}], 'summary': 'Validation accuracy on imagenet with alexnet: our result matched full precision weights.', 'duration': 30.752, 'max_score': 1506.108, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo1506108.jpg'}, {'end': 1680.634, 'src': 'embed', 'start': 1645.48, 'weight': 3, 'content': [{'end': 1656.165, 'text': 'So, in order to get four output, we need nine times channel times four, which is 36 times channel multiplications,', 'start': 1645.48, 'duration': 10.685}, {'end': 1658.626, 'text': 'originally for the direct convolution.', 'start': 1656.165, 'duration': 2.461}, {'end': 1666.561, 'text': 'But now we need 16 times C for four outputs, 16 times number of channels for four outputs.', 'start': 1659.537, 'duration': 7.024}, {'end': 1673.946, 'text': 'So that is 2.25x less number of multiplications to perform the exact same multiplication.', 'start': 
1667.222, 'duration': 6.724}, {'end': 1680.634, 'text': 'And here is the speedup.', 'start': 1678.752, 'duration': 1.882}], 'start': 1257.49, 'title': 'TPU design and neural network optimization techniques', 'summary': 'Discusses 8-bit quantization in TPU design, low-rank approximation for speedup, ternary weights, and the Winograd transformation for neural network optimization, achieving similar accuracy with fewer weights and a reduction in convolution multiplications.', 'chapters': [{'end': 1425.121, 'start': 1257.49, 'title': 'TPU design and quantization techniques', 'summary': 'Discusses the widely used 8-bit quantization technique in TPU design, showing how quantization works, its impact on accuracy for different bit configurations, and the concept of low-rank approximation for achieving speedup with minimal accuracy loss.', 'duration': 167.631, 'highlights': ['The TPU design extensively utilizes 8-bit quantization for inference, maintaining accuracy for GoogleNet and VGG16 with fixed 8-bit while seeing a significant drop in accuracy at 6-bit quantization.', 'Low-rank approximation technique provides about 5x speedup with roughly a 6% loss of accuracy, achieved by breaking down convolution layers and fully connected layers into smaller separate problems using SVD and tensor train.', 'Quantization involves gathering statistics for each layer to determine the number of bits required for the integer part, with the remaining bits used for the fractional part, enabling better accuracy through various techniques such as fine-tuning in floating point format or using feed forward with fixed point and back propagation with floating point numbers.']}, {'end': 1680.634, 'start': 1427.842, 'title': 'Efficient neural network optimization', 'summary': 'Discusses the use of ternary weights and the Winograd transformation to optimize neural networks, achieving similar accuracy with fewer weights and a significant reduction in the number of multiplications during convolution implementation.', 'duration': 252.792, 'highlights': ['Validation accuracy on ImageNet with AlexNet showed similar accuracy with ternary weights compared to full precision weights.', 'Implementation of the Winograd transformation resulted in a 2.25x reduction in the number of multiplications during convolution.', 'Efficient neural network optimization achieved by maintaining full precision weights during training and using ternary weights during inference, resulting in a very small model.']}], 'duration': 423.144, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo1257490.jpg', 'highlights': ['TPU design extensively utilizes 8-bit quantization for inference, maintaining accuracy for GoogleNet and VGG16 with fixed 8-bit', 'Low-rank approximation technique provides about 5x speedup with roughly a 6% loss of accuracy', 'Validation accuracy on ImageNet with AlexNet showed similar accuracy with ternary weights compared to full precision weights', 'Implementation of the Winograd transformation resulted in a 2.25x reduction in the number of multiplications during convolution', 'Quantization involves gathering statistics for each layer to determine the number of bits required for the integer part, with the remaining bits used for the fractional part']}, {'end': 2463.55, 'segs': [{'end': 1888.052, 'src': 'embed', 'start': 
1810.64, 'weight': 1, 'content': [{'end': 1813.806, 'text': "for the Google TPU, don't be overwhelmed.", 'start': 1810.64, 'duration': 3.166}, {'end': 1822.722, 'text': 'The kernel part here is actually this giant matrix multiplication unit.', 'start': 1813.826, 'duration': 8.896}, {'end': 1828.542, 'text': "So it's a 256 by 256 matrix multiplication unit.", 'start': 1823.9, 'duration': 4.642}, {'end': 1840.528, 'text': 'So in one single cycle, it can perform 64K multiply-and-accumulate operations.', 'start': 1829.143, 'duration': 11.385}, {'end': 1847.432, 'text': 'So running at 700 megahertz, the throughput is 92 tera-ops per second,', 'start': 1841.529, 'duration': 5.903}, {'end': 1854.51, 'text': "because it's actually integer operations.", 'start': 1852.888, 'duration': 1.622}, {'end': 1861.097, 'text': 'That is about 25x a GPU and more than 100x a CPU.', 'start': 1855.852, 'duration': 5.245}, {'end': 1868.866, 'text': 'And notice the TPU has a really large software-managed on-chip buffer.', 'start': 1862.359, 'duration': 6.507}, {'end': 1873.161, 'text': 'It is 24 megabytes.', 'start': 1870.199, 'duration': 2.962}, {'end': 1879.646, 'text': 'For a CPU, the L3 cache is only 16 megabytes.', 'start': 1873.842, 'duration': 5.804}, {'end': 1882.428, 'text': 'This is 24 megabytes, which is pretty large.', 'start': 1880.106, 'duration': 2.322}, {'end': 1888.052, 'text': "And it's powered by two DDR3 DRAM channels.", 'start': 1884.589, 'duration': 3.463}], 'summary': 'The Google TPU performs 92 tera-ops per second at 700 MHz, about 25x a GPU and 100x a CPU, with a 24MB on-chip buffer.', 'duration': 77.412, 'max_score': 1810.64, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo1925540.jpg'}, {'end': 2141.33, 'src': 'embed', 'start': 2110.441, 'weight': 6, 'content': [{'end': 2119.969, 'text': 'when it is one here, this region it happens to be the same, as the turning point is the actual memory bandwidth of your system.', 'start': 2110.441, 'duration': 9.528}, {'end': 2129.597, 'text': "So let's see, what is it like for the TPU? The TPU's peak performance is really high, about 90 TOps per second.", 'start': 2121.931, 'duration': 7.666}, {'end': 2141.33, 'text': 'For those convolution nets, convolution nets, they are pretty much saturating the peak performance.', 'start': 2131.059, 'duration': 10.271}], 'summary': 'Tpu achieves peak performance of 90 tops per second.', 'duration': 30.889, 'max_score': 2110.441, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo2110441.jpg'}, {'end': 2345.887, 'src': 'embed', 'start': 2318.908, 'weight': 5, 'content': [{'end': 2327.954, 'text': "So I'm gonna introduce my design of EIE, the Efficient Inference Engine, which deals with those sparse and compressed model,", 'start': 2318.908, 'duration': 9.046}, {'end': 2329.075, 'text': 'to save the memory bandwidth.', 'start': 2327.954, 'duration': 1.121}, {'end': 2334.019, 'text': 'And the rule of thumb, like we mentioned before, is taking advantage of sparsity first.', 'start': 2330.216, 'duration': 3.803}, {'end': 2336.72, 'text': 'Anything times zero is zero.', 'start': 2334.479, 'duration': 2.241}, {'end': 2339.182, 'text': "So don't store it, don't compute on it.", 'start': 2336.761, 'duration': 2.421}, {'end': 2345.887, 'text': "And second idea is this kind of, you don't need that much of full precision, but you can approximate it.", 'start': 2340.123, 'duration': 5.764}], 'summary': 'Eie design aims to save memory bandwidth by leveraging sparsity and approximation.', 'duration': 26.979, 'max_score': 2318.908, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo2318908.jpg'}], 'start': 1681.314, 'title': 'Deep learning algorithms & hardware comparison', 'summary': "Covers efficient algorithms such as winograd convolution, pruning, quantization, and hardware optimizations including google tpu's 8-bit integer representation and eie architecture, with tpu achieving 92 teraflops per second and 24 megabytes of on-chip buffer memory. it also details a comparison of google's tpu with cpu and gpu, highlighting tpu's smaller size, lower power consumption of 75 watts, and peak performance of 90 tera ops per second, while also discussing the workload distribution in google's data center and the tpu's bottleneck issues. 
it further delves into the design of eie, the efficient inference engine, for handling sparse and compressed models to save memory bandwidth.", 'chapters': [{'end': 1922.638, 'start': 1681.314, 'title': 'Efficient algorithms & optimal hardware for deep learning', 'summary': "Covers efficient algorithms such as winograd convolution, pruning, quantization, and hardware optimizations including google tpu's 8-bit integer representation and eie architecture, with tpu achieving 92 teraflops per second and 24 megabytes of on-chip buffer memory.", 'duration': 241.324, 'highlights': ['Google TPU achieves 92 teraflops per second with 8-bit integer representation, providing 25x better performance than GPU and over 100x better performance than CPU.', "Google TPU features a 24 megabyte on-chip buffer memory, larger than the CPU's 16 megabyte cache.", "Google TPU's architecture includes a 256x256 matrix multiplication unit, capable of performing 64 kilo operations in a single cycle at 700 megahertz."]}, {'end': 2463.55, 'start': 1925.54, 'title': 'Google tpu vs cpu & gpu comparison', 'summary': "Details a comparison of google's tpu with cpu and gpu, highlighting tpu's smaller size, lower power consumption of 75 watts, and peak performance of 90 tera ops per second, while also discussing the workload distribution in google's data center and the tpu's bottleneck issues. it further delves into the design of eie, the efficient inference engine, for handling sparse and compressed models to save memory bandwidth.", 'duration': 538.01, 'highlights': ["Google's TPU has a peak performance of 90 Tera Ops per second, surpassing CPU and GPU performance significantly.", 'Convolution neural nets only account for 5% of data center workload, while most of the workload (61%) consists of multi-layer perceptron, and 29% is dedicated to long-shot term memory.', "The TPU's bottleneck is attributed to low utilization of neural networks, leading to achieving only 3 to 12 Tera Ops per second in real cases, primarily due to low operations per byte and memory footprint.", 'EIE, the Efficient Inference Engine, is designed to handle sparse and compressed models to save memory bandwidth, achieving significant savings in computation and memory footprint.']}], 'duration': 782.236, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo1681314.jpg', 'highlights': ['Google TPU achieves 92 teraflops per second with 8-bit integer representation, providing 25x better performance than GPU and over 100x better performance than CPU.', "Google TPU's architecture includes a 256x256 matrix multiplication unit, capable of performing 64 kilo operations in a single cycle at 700 megahertz.", "Google TPU features a 24 megabyte on-chip buffer memory, larger than the CPU's 16 megabyte cache.", "Google's TPU has a peak performance of 90 Tera Ops per second, surpassing CPU and GPU performance significantly.", 'Convolution neural nets only account for 5% of data center workload, while most of the workload (61%) consists of multi-layer perceptron, and 29% is dedicated to long-shot term memory.', 'EIE, the Efficient Inference Engine, is designed to handle sparse and compressed models to save memory bandwidth, achieving significant savings in computation and memory footprint.', "The TPU's bottleneck is attributed to low utilization of neural networks, leading to achieving only 3 to 12 Tera Ops per second in real cases, primarily due to low operations per byte and memory footprint."]}, {'end': 3289.366, 
'segs': [{'end': 2557.941, 'src': 'embed', 'start': 2493.087, 'weight': 0, 'content': [{'end': 2494.328, 'text': 'Now EIE is here.', 'start': 2493.087, 'duration': 1.241}, {'end': 2501.791, 'text': '189 times faster than the CPU and about 13 times faster than the GPU.', 'start': 2495.689, 'duration': 6.102}, {'end': 2507.192, 'text': 'So this is the energy efficiency on the log scale.', 'start': 2503.811, 'duration': 3.381}, {'end': 2514.794, 'text': "It's about 24,000x more energy efficient than the CPU and about 3,000x more energy efficient than the GPU.", 'start': 2507.412, 'duration': 7.382}, {'end': 2522.896, 'text': 'It means, for example, previously, if your battery could last for one hour, now it can last for 3,000 hours.', 'start': 2515.574, 'duration': 7.322}, {'end': 2532.118, 'text': "So you might say an ASIC is always better than CPUs and GPUs because, you know, it's customized hardware.", 'start': 2526.673, 'duration': 5.445}, {'end': 2538.504, 'text': 'So this is comparing EIE with the peer ASICs, for example, and TrueNorth.', 'start': 2532.679, 'duration': 5.825}, {'end': 2551.316, 'text': 'It has better throughput and better energy efficiency by an order of magnitude compared with other ASICs, not to mention the CPU, GPU, and FPGAs.', 'start': 2539.585, 'duration': 11.731}, {'end': 2557.941, 'text': 'Okay, so we have covered half of the journey.', 'start': 2553.638, 'duration': 4.303}], 'summary': 'EIE is 189x faster than CPU, 13x faster than GPU, 24,000x more energy efficient than CPU, and 3,000x more efficient than GPU, offering potential battery life of 3,000 hours.', 'duration': 64.854, 'max_score': 2493.087, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo2493087.jpg'}, {'end': 2611.359, 'src': 'embed', 'start': 2580.976, 'weight': 2, 'content': [{'end': 2584.278, 'text': "So for efficient training algorithms, I'm going to mention four topics.", 'start': 2580.976, 'duration': 3.302}, {'end': 2595.002, 'text': "The first one is parallelization, and then mixed precision training, which was just released about one month ago at NVIDIA GTC, so it's fresh knowledge.", 'start': 2584.618, 'duration': 10.384}, {'end': 2601.805, 'text': 'And then model distillation, followed by my work on Dense-Sparse-Dense (DSD) training, a better regularization technique.', 'start': 2596.483, 'duration': 5.322}, {'end': 2605.367, 'text': "OK, so let's start with parallelization.", 'start': 2603.186, 'duration': 2.181}, {'end': 2611.359, 'text': "So this figure shows, if you're in the hardware community, you must be very familiar with this figure.", 'start': 2606.636, 'duration': 4.723}], 'summary': 'Efficient training algorithms cover parallelization, mixed precision training, model distillation, and Dense-Sparse-Dense training as a better regularization technique.', 'duration': 30.383, 'max_score': 2580.976, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo2580976.jpg'}, {'end': 2798.932, 'src': 'embed', 'start': 2769.829, 'weight': 6, 'content': [{'end': 2776.553, 'text': 'For example, you can tune your learning rate and your weight decay on different machines for coarse-grained parallelism.', 'start': 2769.829, 'duration': 6.724}, {'end': 2779.314, 'text': 'So there are so many alternatives you have to tune.', 'start': 2776.753, 'duration': 2.561}, {'end': 2783.421, 'text': 'Okay, small summary of the parallelism.', 'start': 2781.26, 'duration': 2.161}, {'end': 2786.924, 
'text': 'So there are lots of parallelisms in deep neural networks.', 'start': 2783.962, 'duration': 2.962}, {'end': 2789.866, 'text': 'For example, the data parallelism.', 'start': 2787.504, 'duration': 2.362}, {'end': 2798.932, 'text': 'you can run multiple training images, but you cannot have a limited number of threads or processors because you are limited by batch size.', 'start': 2789.866, 'duration': 9.066}], 'summary': 'Tune learning rate, weight decay for parallelism in deep neural networks.', 'duration': 29.103, 'max_score': 2769.829, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo2769829.jpg'}, {'end': 3047.792, 'src': 'embed', 'start': 3017.568, 'weight': 3, 'content': [{'end': 3030.336, 'text': 'But in the end, after you train the model, this is result for AlexNet Inception V3, and ResNet50, with FP32 versus the FP16 mixed precision training.', 'start': 3017.568, 'duration': 12.768}, {'end': 3034.318, 'text': 'the accuracy is pretty much the same for these two methods.', 'start': 3030.336, 'duration': 3.982}, {'end': 3036.54, 'text': 'A little bit worse, but not by too much.', 'start': 3034.378, 'duration': 2.162}, {'end': 3047.792, 'text': 'Okay, so having talked about the mixed precision training, the next idea is to train with model distillation.', 'start': 3039.525, 'duration': 8.267}], 'summary': 'Alexnet, inception v3, and resnet50 achieved similar accuracy with fp32 and fp16 mixed precision training.', 'duration': 30.224, 'max_score': 3017.568, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo3017568.jpg'}, {'end': 3238.793, 'src': 'heatmap', 'start': 3154.813, 'weight': 4, 'content': [{'end': 3161.258, 'text': 'And the result is that, starting with a trained model that classifies 58.9% of the test frames correctly, the new model converges to 57%,', 'start': 3154.813, 'duration': 6.445}, {'end': 3162.319, 'text': 'only trained on 3%, 3% of the data.', 'start': 3161.258, 'duration': 1.061}, {'end': 3177.203, 'text': "So that's the magic for model distillation using this kind of soft label.", 'start': 3173.157, 'duration': 4.046}, {'end': 3185.895, 'text': 'And the last idea is my recent paper using a better regularization to train deep neural nets.', 'start': 3179.626, 'duration': 6.269}, {'end': 3188.325, 'text': 'We have seen these two figures before.', 'start': 3186.704, 'duration': 1.621}, {'end': 3192.147, 'text': 'We proved the neural network having less number of weights, but have the same accuracy.', 'start': 3188.385, 'duration': 3.762}, {'end': 3204.274, 'text': 'Now what I did is to recover and retrain those weights showing in red and make everything train out together to increase the model capacity after it is trained at a low dimensional space.', 'start': 3192.848, 'duration': 11.426}, {'end': 3210.298, 'text': "It's like you learn the trunk first and then gradually add those leaves and learn everything together.", 'start': 3205.355, 'duration': 4.943}, {'end': 3220.74, 'text': 'It turns out on ImageNet dataset, it performs relatively about 1% to 4% absolute improvement of accuracy.', 'start': 3211.593, 'duration': 9.147}, {'end': 3228.065, 'text': 'And it also general purpose works on short-term memory and also recurrent neural nets collaborated with Baidu.', 'start': 3221.44, 'duration': 6.625}, {'end': 3238.793, 'text': 'So I also open sourced this better trained model on the DSD model, zoo, where there are trained all these models GoogleNet, 
VGG,', 'start': 3229.827, 'duration': 8.966}], 'summary': 'New model achieves 57% accuracy, trained on 3% of data. better regularization yields 1-4% accuracy boost on imagenet.', 'duration': 33.512, 'max_score': 3154.813, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo3154813.jpg'}], 'start': 2464.25, 'title': 'Efficient hardware and parallelism in neural networks', 'summary': "Discusses efficient hardware architecture for neural networks, emphasizing eie's 189x speedup over cpu, 13x over gpu, and 24,000x energy efficiency over cpu, as well as parallelism in deep neural networks, mixed precision training, and model distillation for improved performance and reduced energy overhead.", 'chapters': [{'end': 2631.714, 'start': 2464.25, 'title': 'Efficient hardware architecture for neural networks', 'summary': "Discusses the efficient hardware architecture for neural networks, highlighting eie's 189x speedup over cpu and 13x over gpu, as well as its 24,000x energy efficiency over cpu and 3,000x over gpu. it also compares eie with peer asics and truenorth, showcasing its better throughput and energy efficiency. additionally, it introduces topics for efficient training algorithms, including parallelization, mixed precision training, model distillation, and dense-sparseness training, addressing the plateauing of single threaded performance and frequency due to power constraints in recent years.", 'duration': 167.464, 'highlights': ['EIE achieves 189 times speedup over CPU and about 13 times faster than GPU, with 24,000x more energy efficiency than the CPU and about 3,000x more energy efficient than the GPU, resulting in a potential increase in battery life from 1 hour to 3,000 hours.', 'EIE shows better throughput and energy efficiency by an order of magnitude compared to other ASICs, including peer ASICs and TrueNorth, surpassing the capabilities of CPU, GPU, and FPGAs.', 'The chapter introduces topics for efficient training algorithms, including parallelization, mixed precision training, model distillation, and dense-sparseness training, addressing the plateauing of single threaded performance and frequency due to power constraints in recent years.']}, {'end': 3289.366, 'start': 2632.793, 'title': 'Parallelism and mixed precision in deep neural networks', 'summary': 'Discusses the various forms of parallelism in deep neural networks, including data parallelism, model parallelism, and hyperparameter parallelism, as well as the implementation of mixed precision training and model distillation. it also covers the benefits of mixed precision training, demonstrating its effectiveness in maintaining accuracy while significantly reducing energy and area overhead. furthermore, it explores the concept of model distillation, where multiple neural networks are utilized as teachers to train a smaller neural network using soft labels, resulting in improved performance with minimal data requirement. 
additionally, it introduces a better regularization method for deep neural networks, showcasing its ability to improve accuracy by 1% to 4% on the imagenet dataset and its applicability to short-term memory and recurrent neural networks.', 'duration': 656.573, 'highlights': ['Mixed precision training with FP16 or FP32', 'Model distillation using soft labels', 'Better regularization method for deep neural networks', 'Forms of parallelism in deep neural networks']}], 'duration': 825.116, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo2464250.jpg', 'highlights': ['EIE achieves 189 times speedup over CPU and about 13 times faster than GPU, with 24,000x more energy efficiency than the CPU and about 3,000x more energy efficient than the GPU, resulting in a potential increase in battery life from 1 hour to 3,000 hours.', 'EIE shows better throughput and energy efficiency by an order of magnitude compared to other ASICs, including peer ASICs and TrueNorth, surpassing the capabilities of CPU, GPU, and FPGAs.', 'The chapter introduces topics for efficient training algorithms, including parallelization, mixed precision training, model distillation, and dense-sparseness training, addressing the plateauing of single threaded performance and frequency due to power constraints in recent years.', 'Mixed precision training with FP16 or FP32', 'Model distillation using soft labels', 'Better regularization method for deep neural networks', 'Forms of parallelism in deep neural networks']}, {'end': 4010.526, 'segs': [{'end': 3345.536, 'src': 'embed', 'start': 3313.154, 'weight': 4, 'content': [{'end': 3320.677, 'text': 'how are the hardware designed to actually take advantage of such features we can obtain?', 'start': 3313.154, 'duration': 7.523}, {'end': 3323.538, 'text': 'First, GPUs.', 'start': 3322.417, 'duration': 1.121}, {'end': 3331.381, 'text': 'This is the NVIDIA Pascal GPU, GP100, which was released last year.', 'start': 3325.699, 'duration': 5.682}, {'end': 3336.414, 'text': 'So it supports actually 20 teraflops on FP16.', 'start': 3332.813, 'duration': 3.601}, {'end': 3342.335, 'text': 'It has 16 gigabytes of high bandwidth memory.', 'start': 3336.494, 'duration': 5.841}, {'end': 3345.536, 'text': "It's 750 gigabytes per second.", 'start': 3342.615, 'duration': 2.921}], 'summary': 'Nvidia pascal gpu supports 20 teraflops on fp16 with 16gb high bandwidth memory and 750gb/s.', 'duration': 32.382, 'max_score': 3313.154, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo3313154.jpg'}, {'end': 3446.744, 'src': 'embed', 'start': 3411.312, 'weight': 0, 'content': [{'end': 3413.213, 'text': 'Just released less than a month ago.', 'start': 3411.312, 'duration': 1.901}, {'end': 3415.014, 'text': 'So it has.', 'start': 3414.154, 'duration': 0.86}, {'end': 3428.05, 'text': '15 FP32 teraflops and what is new here, there is 120 tensor TLOPs specifically designed for deep learning.', 'start': 3417.203, 'duration': 10.847}, {'end': 3433.073, 'text': "And we'll later cover what is the tensor core and what is this 120 coming from.", 'start': 3428.63, 'duration': 4.443}, {'end': 3446.744, 'text': "And rather than 750 gigabytes per second, this year, the HBM2, they're using 900 gigabytes per second memory bandwidth, very exciting.", 'start': 3436.877, 'duration': 9.867}], 'summary': 'New release with 15 fp32 teraflops and 120 tensor tlops for deep learning, utilizing 900 gigabytes per second memory 
bandwidth.', 'duration': 35.432, 'max_score': 3411.312, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo3411312.jpg'}, {'end': 3532.334, 'src': 'embed', 'start': 3502.393, 'weight': 3, 'content': [{'end': 3511.322, 'text': 'So we are using FP16 for the multiplication, but for accumulation we are doing it with FP32.', 'start': 3502.393, 'duration': 8.929}, {'end': 3514.786, 'text': "That's where the mixed precision comes from.", 'start': 3511.362, 'duration': 3.424}, {'end': 3524.644, 'text': "So let's see how many operations: it's four by four by four, which is 64 multiplications done in just one single cycle.", 'start': 3516.394, 'duration': 8.25}, {'end': 3532.334, 'text': "That's a 12x increase in the speedup of the Volta, all compared with the Pascal, which was released just last year.", 'start': 3525.445, 'duration': 6.889}], 'summary': 'Mixed precision allows 64 multiplications per cycle, achieving 12x speedup compared to the previous release.', 'duration': 29.941, 'max_score': 3502.393, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo3502393.jpg'}, {'end': 3595.898, 'src': 'embed', 'start': 3563.587, 'weight': 1, 'content': [{'end': 3571.211, 'text': 'And for training a ResNet-50, by taking advantage of the tensor cores in the V100, it is 2.4x faster than the P100 using FP32.', 'start': 3563.587, 'duration': 7.624}, {'end': 3589.836, 'text': 'So on the right hand side, it compares the inference speedup, given a seven millisecond latency requirement.', 'start': 3579.332, 'duration': 10.504}, {'end': 3595.898, 'text': 'What is the number of images per second it can process? It is a measurement of throughput.', 'start': 3590.856, 'duration': 5.042}], 'summary': 'Training ResNet-50 on the V100 is 2.4x faster than the P100 using FP32, measured as images per second of throughput.', 'duration': 32.311, 'max_score': 3563.587, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo3563587.jpg'}, {'end': 3883.697, 'src': 'embed', 'start': 3846.888, 'weight': 2, 'content': [{'end': 3849.109, 'text': 'That is the Google Cloud TPU.', 'start': 3846.888, 'duration': 2.221}, {'end': 3855.053, 'text': 'So now the TPU not only supports inference, but also supports training.', 'start': 3849.769, 'duration': 5.284}, {'end': 3860.617, 'text': 'So there is very limited information we can get beyond this Google blog.', 'start': 3855.873, 'duration': 4.744}, {'end': 3871.369, 'text': 'So the Cloud TPU delivers up to 180 teraflops to train and run machine learning models.', 'start': 3861.423, 'duration': 9.946}, {'end': 3883.697, 'text': 'And multiple Cloud TPUs make up a TPU pod, a TPU pod which is built with 16,', 'start': 3873.931, 'duration': 9.766}], 'summary': 'Google Cloud TPU supports both inference and training, delivering up to 180 teraflops for machine learning models and forming TPU pods with 16 units.', 'duration': 36.809, 'max_score': 3846.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo3846888.jpg'}], 'start': 3289.486, 'title': 'GPU evolution and efficiency', 'summary': "covers nvidia pascal and volta gpus with a 12x speedup in matrix multiplication, 2.4x faster training for resnet-50 using tensor cores, and the evolution of nvidia gpus from kepler to volta, showcasing improvements in clock speed, memory, power consumption, and manufacturing process. 
it also compares nvidia gpus with google cloud tpu, highlighting the tpu's capability of delivering up to 180 teraflops for training and efficiency gains in training large-scale translation models.", 'chapters': [{'end': 3676.728, 'start': 3289.486, 'title': 'Hardware for efficient training', 'summary': 'Discusses the nvidia pascal and volta gpus, highlighting their specifications and performance improvements, such as 12x speedup in matrix multiplication and 2.4x faster training for resnet-50 using tensor cores.', 'duration': 387.242, 'highlights': ['The Volta GPU features 15 FP32 teraflops and 120 tensor TLOPs specifically designed for deep learning, offering 900 gigabytes per second memory bandwidth, and a 12x speedup in matrix multiplication compared to the Pascal GPU.', 'For training a ResNet-50, the V100 with tensor cores is 2.4x faster than the P100 using FP32, demonstrating significant performance improvements in deep learning tasks.', 'The NVIDIA Pascal GPU, GP100, supports 20 teraflops on FP16, 16 gigabytes of high bandwidth memory, and 750 gigabytes per second memory bandwidth, providing a foundation for efficient training.', 'The tensor core in the Volta GPU can perform 64 multiplications in one clock cycle, resulting in a 12x speedup compared to the Pascal GPU for matrix multiplication.', 'The communication, computation, and memory bandwidth are highlighted as the three factors that need to be balanced to achieve good performance in hardware designed for efficient training.']}, {'end': 4010.526, 'start': 3679.437, 'title': 'Evolution of nvidia gpus and introduction of google cloud tpu', 'summary': "Discusses the evolution of nvidia gpus from kepler to volta, highlighting improvements in clock speed, memory, power consumption, and manufacturing process. 
it also compares nvidia gpus with tpus, showcasing the google cloud tpu's capability of delivering up to 180 teraflops for training and running machine learning models, as well as the efficiency gains in training large-scale translation models.", 'duration': 331.089, 'highlights': ['The Google Cloud TPU delivers up to 180 teraflops for training and running machine learning models, and the TPU pod, built with 16 second generation TPUs, delivers up to 11.5 petaflops of machine learning acceleration.', 'Comparison of TPUs and NVIDIA GPUs shows that the Google Cloud TPU supports both inference and training, with efficiency gains demonstrated in training large-scale translation models, reducing the training time from a full day on 32 GPUs to just one afternoon with one eighth of a TPU pod.', 'Evolution of NVIDIA GPUs from Kepler to Volta is characterized by improvements in clock speed, memory technology (GDDR5 to HBM), power consumption, and manufacturing process (28nm to 12nm), while maintaining relatively constant memory size.']}], 'duration': 721.04, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo3289486.jpg', 'highlights': ['The Volta GPU features 15 FP32 teraflops and 120 tensor TLOPs, offering 900GB/s memory bandwidth, and a 12x speedup in matrix multiplication compared to Pascal.', 'The V100 with tensor cores is 2.4x faster than the P100 for training a ResNet-50, demonstrating significant performance improvements in deep learning tasks.', 'The Google Cloud TPU delivers up to 180 teraflops for training and running machine learning models, with the TPU pod delivering up to 11.5 petaflops of machine learning acceleration.', 'The tensor core in the Volta GPU can perform 64 multiplications in one clock cycle, resulting in a 12x speedup compared to the Pascal GPU for matrix multiplication.', 'The NVIDIA Pascal GPU, GP100, supports 20 teraflops on FP16, 16GB of high bandwidth memory, and 750GB/s memory bandwidth, providing a foundation for efficient training.']}, {'end': 4608.3, 'segs': [{'end': 4040.569, 'src': 'embed', 'start': 4011.516, 'weight': 4, 'content': [{'end': 4017.879, 'text': 'And finally, we covered the hardware for efficient training and introduced two nuclear bombs.', 'start': 4011.516, 'duration': 6.363}, {'end': 4019.36, 'text': 'One is the Volta GPU.', 'start': 4018.139, 'duration': 1.221}, {'end': 4032.606, 'text': 'The other is the TPU version 2, the Cloud TPU, and also the amazing Tensor cores in the newest generation of NVIDIA GPUs.', 'start': 4019.58, 'duration': 13.026}, {'end': 4040.569, 'text': 'and we also reviewed the progression of a wide range the recent NVIDIA GPUs from the Kepler K40,', 'start': 4033.266, 'duration': 7.303}], 'summary': 'Covered efficient training hardware with volta gpu, tpu v2, and tensor cores in nvidia gpus.', 'duration': 29.053, 'max_score': 4011.516, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo4011516.jpg'}, {'end': 4092.211, 'src': 'embed', 'start': 4065.486, 'weight': 2, 'content': [{'end': 4075.636, 'text': 'So, in the future of the city, we can imagine there are a lot of AI applications used in, say, smart society, smart care, IoT devices, smart retail,', 'start': 4065.486, 'duration': 10.15}, {'end': 4079.96, 'text': 'for example the Amazon Go, and also smart home lots of scenarios.', 'start': 4075.636, 'duration': 4.324}, {'end': 4083.884, 'text': 'And it poses a lot of challenges on the hardware design.', 'start': 
4080.721, 'duration': 3.163}, {'end': 4089.409, 'text': 'that requires the low latency, privacy, mobility, and energy efficiency.', 'start': 4084.384, 'duration': 5.025}, {'end': 4092.211, 'text': "You don't want your battery to drain very quickly.", 'start': 4089.789, 'duration': 2.422}], 'summary': 'Ai applications in smart city pose challenges on hardware design for low latency, privacy, mobility, and energy efficiency.', 'duration': 26.725, 'max_score': 4065.486, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo4065486.jpg'}, {'end': 4225.625, 'src': 'embed', 'start': 4197.301, 'weight': 1, 'content': [{'end': 4206.822, 'text': 'Yeah, so those are the algorithm I discussed in the beginning about inference here.', 'start': 4197.301, 'duration': 9.521}, {'end': 4217.221, 'text': 'These are the techniques that can enable such inference or AI running on embedded devices by having less number of weights,', 'start': 4208.837, 'duration': 8.384}, {'end': 4225.625, 'text': 'fewer bits per weight and also quantization, low rank approximation, the small matrix, same accuracy, even going into binary or ternary weights,', 'start': 4217.221, 'duration': 8.404}], 'summary': 'Techniques for enabling ai on embedded devices with fewer weights and bits.', 'duration': 28.324, 'max_score': 4197.301, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo4197301.jpg'}, {'end': 4455.762, 'src': 'embed', 'start': 4426.743, 'weight': 3, 'content': [{'end': 4437.327, 'text': "But now, since you know the computation pattern, there's no need to do out of order execution, to do branch prediction, no such things.", 'start': 4426.743, 'duration': 10.584}, {'end': 4440.309, 'text': 'Everything is determined so you can take the amount of it.', 'start': 4437.468, 'duration': 2.841}, {'end': 4448.275, 'text': 'and maintain a fully software managed scratch pad to reduce the data movement.', 'start': 4441.73, 'duration': 6.545}, {'end': 4454.221, 'text': 'And remember, data movement is the key for reducing the memory footprint and energy consumption.', 'start': 4448.856, 'duration': 5.365}, {'end': 4455.762, 'text': 'So, yeah.', 'start': 4454.941, 'duration': 0.821}], 'summary': 'Fully software managed scratch pad reduces data movement, minimizing memory footprint and energy consumption.', 'duration': 29.019, 'max_score': 4426.743, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo4426743.jpg'}, {'end': 4574.064, 'src': 'embed', 'start': 4542.313, 'weight': 0, 'content': [{'end': 4555.622, 'text': 'doing the face alignment and detecting the eyes, the nose, and the mouth at a pretty high frame rate, consuming only three watts, only three watts.', 'start': 4542.313, 'duration': 13.309}, {'end': 4564.748, 'text': 'This is a project I did at Facebook doing the deep neural nets on the mobile phone to do image classification.', 'start': 4557.263, 'duration': 7.485}, {'end': 4566.229, 'text': "For example, it says it's a laptop.", 'start': 4564.768, 'duration': 1.461}, {'end': 4574.064, 'text': "or you can feed it with a image and says it's a selfie, has person face, et cetera.", 'start': 4567.119, 'duration': 6.945}], 'summary': 'Achieved high frame rate face alignment and detection, consuming only three watts, for image classification on mobile phones.', 'duration': 31.751, 'max_score': 4542.313, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo4542313.jpg'}], 'start': 4011.516, 'title': 'Ai hardware and machine learning design', 'summary': 'Covers the future of ai hardware, including volta gpu and tpu version 2, challenges and opportunities in hardware design, techniques for enabling ai on embedded devices, and the trade-offs in hardware design for machine learning, with emphasis on performance, power, accuracy, and programmability. it also presents examples of applying deep neural nets to low power embedded devices with quantifiable data such as real-time tracking and detection consuming only three watts.', 'chapters': [{'end': 4302.616, 'start': 4011.516, 'title': 'Ai hardware and future challenges', 'summary': 'Discusses the future of ai hardware, including the introduction of volta gpu and tpu version 2, as well as the challenges and opportunities in designing hardware for ai applications, and the techniques for enabling ai on embedded devices. it also explores the shift to the ai-first era and the excitement for brain-inspired cognitive computing research.', 'duration': 291.1, 'highlights': ['The chapter discusses the future of AI hardware, including the introduction of Volta GPU and TPU version 2.', 'The challenges and opportunities in designing hardware for AI applications are explored, emphasizing the need for low latency, privacy, mobility, and energy efficiency.', 'Techniques for enabling AI on embedded devices, such as quantization, low rank approximation, and binary or ternary weights, are detailed.', 'The shift to the AI-first era and the excitement for brain-inspired cognitive computing research are discussed.']}, {'end': 4608.3, 'start': 4303.897, 'title': 'Hardware design for machine learning', 'summary': 'Discusses the trade-offs in hardware design for machine learning, focusing on performance, power, accuracy, and programmability, and highlights the importance of data movement in reducing memory footprint and energy consumption. 
it also showcases examples of applying deep neural nets to low power embedded devices with quantifiable data such as real-time tracking and detection consuming only three watts.', 'duration': 304.403, 'highlights': ['The importance of data movement in reducing memory footprint and energy consumption is emphasized, showcasing the impact of efficient hardware design in machine learning applications.', 'Examples of applying deep neural nets to low power embedded devices are highlighted, including real-time tracking and detection consuming only three watts, demonstrating the practicality and efficiency of the hardware design.', 'The discussion focuses on the trade-offs in hardware design for machine learning, emphasizing the considerations of performance, power, accuracy, and programmability in optimizing hardware for specific applications.', 'The chapter also mentions the potential for specialized architectures for deep learning in contrast to GPUs designed for scientific computing or graphics, indicating the diverse hardware requirements for different applications.', 'The concept of software managed memory, referred to as a scratch pad, is introduced, emphasizing its role in reducing data movement and optimizing energy consumption in hardware design for machine learning applications.']}], 'duration': 596.784, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eZdOkDtYMoo/pics/eZdOkDtYMoo4011516.jpg', 'highlights': ['Examples of applying deep neural nets to low power embedded devices with real-time tracking and detection consuming only three watts demonstrate practicality and efficiency of hardware design.', 'Techniques for enabling AI on embedded devices, such as quantization, low rank approximation, and binary or ternary weights, are detailed.', 'The challenges and opportunities in designing hardware for AI applications are explored, emphasizing the need for low latency, privacy, mobility, and energy efficiency.', 'The importance of data movement in reducing memory footprint and energy consumption is emphasized, showcasing the impact of efficient hardware design in machine learning applications.', 'The chapter discusses the future of AI hardware, including the introduction of Volta GPU and TPU version 2.']}], 'highlights': ['EIE achieves 24,000x more energy efficiency than CPU', 'Google TPU achieves 92 tera-operations per second with 8-bit integer representation', 'SqueezeNet combined with Deep Compression is 510x smaller than AlexNet while maintaining accuracy', 'Pruning neural networks reduces AlexNet connections from 60M to 6M parameters', 'TPU design extensively utilizes 8-bit quantization for inference', 'The Volta GPU features 15 FP32 teraflops and 120 Tensor TFLOPS', 'Efficient Inference Engine (EIE) designed to handle sparse and compressed models', "Google TPU's architecture includes a 256x256 matrix multiplication unit", 'Model size for ImageNet recognition increased by 16x from 2012 to 2015', 'Google Cloud TPU delivers up to 180 teraflops for training and running machine learning models', 'Efforts to improve energy efficiency involve algorithm and hardware co-design', 'Low-rank approximation technique provides about 5x speedup with roughly a 6% loss of accuracy', 'Validation accuracy on ImageNet with AlexNet showed similar accuracy with ternary weights compared to full precision weights', 'Examples of applying deep neural nets to low power embedded devices with real-time tracking and detection consuming only three watts', 'The importance of data movement in reducing memory footprint and energy consumption is emphasized', 'Introduction to general purpose hardware (CPU, GPGPU) and specialized hardware (FPGAs, ASICs)']}
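
The transcript segments above describe the Tensor Core's mixed-precision multiply-accumulate only in words: FP16 operands are multiplied, the products are accumulated in FP32, and one 4x4x4 tile yields 64 multiplications per cycle. Below is a minimal NumPy sketch of that idea for readers who want to see it concretely; the function name and the emulation by up-casting FP16 operands are illustrative assumptions for this sketch, not NVIDIA's actual hardware behavior or CUDA API.

# Minimal NumPy sketch (an assumed illustration, not NVIDIA's implementation) of the
# mixed-precision multiply-accumulate a Volta Tensor Core performs on one 4x4 tile:
# D = A @ B + C, where A and B hold FP16 values and the accumulation is done in FP32.
import numpy as np

def tensor_core_tile(a_fp16, b_fp16, c_fp32):
    """Emulate one 4x4x4 fused multiply-accumulate: returns A @ B + C in FP32."""
    assert a_fp16.shape == (4, 4) and b_fp16.shape == (4, 4) and c_fp32.shape == (4, 4)
    # The 64 products come from FP16 operands; up-casting before the matmul
    # emulates accumulating those products in FP32 rather than FP16.
    products = a_fp16.astype(np.float32) @ b_fp16.astype(np.float32)
    return products + c_fp32

# Example usage: FP16 input tiles, FP32 accumulator.
a = np.random.randn(4, 4).astype(np.float16)
b = np.random.randn(4, 4).astype(np.float16)
c = np.zeros((4, 4), dtype=np.float32)
d = tensor_core_tile(a, b, c)
print(d.dtype, d.shape)  # float32 (4, 4)

In a training framework the same idea shows up as keeping FP16 copies of the weights and activations for the fast multiplies while updates are accumulated into an FP32 master copy of the weights, which is what "mixed precision" refers to in the lecture.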