title

An Observation on Generalization

description

Ilya Sutskever (OpenAI)
https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023-08-14
Large Language Models and Transformers

detail

{'title': 'An Observation on Generalization', 'heatmap': [{'end': 1037.198, 'start': 998.742, 'weight': 0.721}, {'end': 1446.719, 'start': 1272.801, 'weight': 0.925}, {'end': 1549.98, 'start': 1513.674, 'weight': 0.701}, {'end': 1792.411, 'start': 1757.153, 'weight': 0.717}, {'end': 1996.09, 'start': 1892.534, 'weight': 0.761}, {'end': 2175.075, 'start': 2133.664, 'weight': 0.842}], 'summary': "The talk 'An Observation on Generalization' covers unsupervised learning theory, unsupervised learning, Kolmogorov complexity, neural nets, low-regret algorithms, next-token prediction, and energy-based models, emphasizing mathematical conditions for learning success, challenges, implications for compression quality, and applications to GPT models and maximum likelihood training.", 'chapters': [{'end': 419.715, 'segs': [{'end': 175.158, 'src': 'embed', 'start': 144.071, 'weight': 0, 'content': [{'end': 151.558, 'text': 'Why would data have regularity that our machine learning models can capture?', 'start': 144.071, 'duration': 7.487}, {'end': 154.341, 'text': "So that's not an obvious question.", 'start': 151.578, 'duration': 2.763}, {'end': 170.075, 'text': 'And one important conceptual advance that has taken place in machine learning many years ago by multiple people was the discovery and the formalization of supervised learning.', 'start': 155.482, 'duration': 14.593}, {'end': 175.158, 'text': 'So it goes under the name of PAC learning or statistical learning theory.', 'start': 171.396, 'duration': 3.762}], 'summary': 'Supervised learning formalized by PAC learning or statistical learning theory.', 'duration': 31.087, 'max_score': 144.071, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A144071.jpg'}, {'end': 343.559, 'src': 'embed', 'start': 314.963, 'weight': 1, 'content': [{'end': 322.73, 'text': 'yeah, I forgot to mention a very important piece of this result: the test distribution and training 
distribution need to be the same.', 'start': 314.963, 'duration': 7.767}, {'end': 329.897, 'text': 'if they are the same, then your theory of supervised learning kicks in and works and will be successful.', 'start': 322.73, 'duration': 7.167}, {'end': 333.12, 'text': 'so conceptually it is trivial.', 'start': 329.897, 'duration': 3.223}, {'end': 343.559, 'text': 'we have an answer for why supervised learning works, why speech recognition should work, why image categorization should work,', 'start': 333.12, 'duration': 10.439}], 'summary': 'For supervised learning to succeed, test and training distributions must match.', 'duration': 28.596, 'max_score': 314.963, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A314963.jpg'}, {'end': 396.381, 'src': 'embed', 'start': 366.966, 'weight': 2, 'content': [{'end': 374.111, 'text': 'A lot of writings about statistical learning theory emphasize the VC dimension as a key component.', 'start': 366.966, 'duration': 7.145}, {'end': 379.675, 'text': 'But the main reason the VC dimension.', 'start': 375.993, 'duration': 3.682}, {'end': 389.883, 'text': 'in fact, the only reason the VC dimension was invented was to allow us to handle parameters which have infinite precision.', 'start': 379.675, 'duration': 10.208}, {'end': 396.381, 'text': 'VC dimension was invented to handle precision, like parameters with infinite precision.', 'start': 391.917, 'duration': 4.464}], 'summary': 'Vc dimension invented to handle parameters with infinite precision.', 'duration': 29.415, 'max_score': 366.966, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A366966.jpg'}], 'start': 0.209, 'title': 'Unsupervised learning theory', 'summary': 'Discusses the concept of unsupervised learning, its relation to supervised learning theory, and emphasizes the mathematical conditions for learning success and the importance of the training and test 
distributions being the same.', 'chapters': [{'end': 419.715, 'start': 0.209, 'title': 'Unsupervised learning theory', 'summary': 'Discusses the concept of unsupervised learning and its relation to supervised learning theory, emphasizing the mathematical conditions for learning success and the importance of the training and test distributions being the same.', 'duration': 419.506, 'highlights': ['The discovery and formalization of supervised learning under the name of PAC learning or statistical learning theory provides precise mathematical conditions for learning success, guaranteeing low test error if the training loss is low and the degrees of freedom are smaller than the training set.', 'The concept of VC dimension was invented to handle parameters with infinite precision, which is essential for handling parameters with finite precision in reality.', 'The training and test distributions need to be the same for the theory of supervised learning to work and be successful, providing a mathematical guarantee for the success of supervised learning.']}], 'duration': 419.506, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A209.jpg', 'highlights': ['The discovery and formalization of supervised learning under the name of PAC learning or statistical learning theory provides precise mathematical conditions for learning success, guaranteeing low test error if the training loss is low and the degrees of freedom are smaller than the training set.', 'The training and test distributions need to be the same for the theory of supervised learning to work and be successful, providing a mathematical guarantee for the success of supervised learning.', 'The concept of VC dimension was invented to handle parameters with infinite precision, which is essential for handling parameters with finite precision in reality.']}, {'end': 1234.911, 'segs': [{'end': 501.417, 'src': 'embed', 'start': 469.675, 'weight': 5, 'content': [{'end': 
470.936, 'text': 'But what is unsupervised learning?', 'start': 469.675, 'duration': 1.261}, {'end': 474.238, 'text': 'What can you say at all about unsupervised learning?', 'start': 470.996, 'duration': 3.242}, {'end': 481.586, 'text': "And I'll say that at least I have not seen an exposition of unsupervised learning which I found satisfying.", 'start': 475.233, 'duration': 6.353}, {'end': 483.41, 'text': 'How to reason about it mathematically?', 'start': 481.987, 'duration': 1.423}, {'end': 486.837, 'text': 'We can reason about it intuitively, but can we reason about it mathematically?', 'start': 483.43, 'duration': 3.407}, {'end': 493.714, 'text': 'And for some context, what is the old dream of unsupervised learning?', 'start': 489.332, 'duration': 4.382}, {'end': 495.254, 'text': 'Which, by the way, this dream has been fulfilled.', 'start': 493.734, 'duration': 1.52}, {'end': 497.255, 'text': "But it's fulfilled empirically.", 'start': 496.075, 'duration': 1.18}, {'end': 501.417, 'text': 'Can we go just a tiny bit beyond the empirical results?', 'start': 497.816, 'duration': 3.601}], 'summary': "The old dream of unsupervised learning has been fulfilled empirically, but a satisfying mathematical account is still missing.", 'duration': 31.742, 'max_score': 469.675, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A469675.jpg'}, {'end': 551.019, 'src': 'embed', 'start': 519.022, 'weight': 0, 'content': [{'end': 519.642, 'text': 'Should it happen?', 'start': 519.022, 'duration': 0.62}, {'end': 521.903, 'text': 'Should we expect it to happen?', 'start': 520.703, 'duration': 1.2}, {'end': 528.205, 'text': "You don't have anything remotely similar to the supervised learning guarantee.", 'start': 523.283, 'duration': 4.922}, {'end': 533.327, 'text': "The supervised learning guarantee says, yeah, get your low training error, and you're going to get your learning.", 'start': 529.045, 'duration': 4.282}, {'end': 534.267, 'text': "It's 
going to be great success.", 'start': 533.347, 'duration': 0.92}, {'end': 539.089, 'text': "With unsupervised learning, it appears that it's not this way.", 'start': 536.008, 'duration': 3.081}, {'end': 543.576, 'text': 'You know, like, people were talking about it for a long time in the 80s.', 'start': 540.174, 'duration': 3.402}, {'end': 546.577, 'text': 'The Boltzmann machine was already talking about unsupervised learning.', 'start': 543.636, 'duration': 2.941}, {'end': 551.019, 'text': 'And unsupervised learning also did not work at small scale.', 'start': 547.617, 'duration': 3.402}], 'summary': 'Supervised learning guarantees success given low training error; unsupervised learning has no comparable guarantee and did not work at small scale.', 'duration': 31.997, 'max_score': 519.022, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A519022.jpg'}, {'end': 610.565, 'src': 'embed', 'start': 584.419, 'weight': 3, 'content': [{'end': 593.484, 'text': "let's optimize some kind of reconstruction error or let's optimize some kind of denoising error or some kind of self-supervised learning error.", 'start': 584.419, 'duration': 9.065}, {'end': 595.145, 'text': 'You optimize one objective.', 'start': 593.864, 'duration': 1.281}, {'end': 597.907, 'text': 'Oh, yes, I just said that.', 'start': 595.165, 'duration': 2.742}, {'end': 599.908, 'text': 'But you care about a different objective.', 'start': 598.607, 'duration': 1.301}, {'end': 610.565, 'text': "So, then doesn't it mean that you have no reason to expect that you will get any kind of good unsupervised learning results?", 'start': 601.482, 'duration': 9.083}], 'summary': 'Optimizing different objectives may not yield good unsupervised learning results.', 'duration': 26.146, 'max_score': 584.419, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A584419.jpg'}, {'end': 730.456, 'src': 'embed', 'start': 699.518, 'weight': 2, 'content': 
[{'end': 702.801, 'text': 'similarly to supervised learning, it has to work.', 'start': 699.518, 'duration': 3.283}, {'end': 714.787, 'text': "So what kind of mysterious unsupervised learning procedure, where you're not given any labels for any of your inputs, is still guaranteed to work?", 'start': 704.342, 'duration': 10.445}, {'end': 716.968, 'text': 'Distribution matching.', 'start': 716.067, 'duration': 0.901}, {'end': 720.77, 'text': 'Distribution matching.', 'start': 719.91, 'duration': 0.86}, {'end': 725.373, 'text': "So what is distribution matching? Say I've got my data.", 'start': 722.451, 'duration': 2.922}, {'end': 728.435, 'text': "I've got x and I've got y.", 'start': 725.713, 'duration': 2.722}, {'end': 729.035, 'text': 'Data sources.', 'start': 728.435, 'duration': 0.6}, {'end': 730.456, 'text': "There's no correspondence between them.", 'start': 729.055, 'duration': 1.401}], 'summary': 'Distribution matching is an unsupervised procedure that is guaranteed to work even without labels.', 'duration': 30.938, 'max_score': 699.518, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A699518.jpg'}, {'end': 862.813, 'src': 'embed', 'start': 839.824, 'weight': 4, 'content': [{'end': 850.428, 'text': 'I independently discovered this in 2015 and I got really fascinated by it because I thought wow, maybe there is something meaningful,', 'start': 839.824, 'duration': 10.604}, {'end': 856.571, 'text': 'mathematically meaningful, that we can say about unsupervised learning, and so.', 'start': 850.428, 'duration': 6.143}, {'end': 857.991, 'text': "but let's see this.", 'start': 856.571, 'duration': 1.42}, {'end': 861.312, 'text': 'the thing about this setup is it still is a little bit artificial.', 'start': 857.991, 'duration': 3.321}, {'end': 862.813, 'text': "it's still real.", 'start': 861.312, 'duration': 1.501}], 'summary': 'Independently discovered distribution matching in 2015, fascinated by what it might say mathematically about unsupervised learning.', 
'duration': 22.989, 'max_score': 839.824, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A839824.jpg'}, {'end': 1037.198, 'src': 'heatmap', 'start': 998.742, 'weight': 0.721, 'content': [{'end': 1003.905, 'text': "You could make the same claim about prediction, but somehow it's more intuitive when you say it about compression.", 'start': 998.742, 'duration': 5.163}, {'end': 1006.847, 'text': "I don't know why that is, but I find it to be the case.", 'start': 1004.525, 'duration': 2.322}, {'end': 1008.308, 'text': "So that's a clue.", 'start': 1007.707, 'duration': 0.601}, {'end': 1017.849, 'text': "And you can make an equation like this where you say hey, if your compression is good enough, if it's like a real great compressor,", 'start': 1009.485, 'duration': 8.364}, {'end': 1027.473, 'text': 'it should say that the compression of your concatenation of your giant files should be no worse than the separate compression of your two files.', 'start': 1017.849, 'duration': 9.624}, {'end': 1037.198, 'text': 'So any additional compression that was gained by concatenation is some kind of shared structure that your compressor noticed.', 'start': 1028.414, 'duration': 8.784}], 'summary': 'Compression can share structure in concatenated files, improving overall compression quality.', 'duration': 38.456, 'max_score': 998.742, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A998742.jpg'}, {'end': 1037.198, 'src': 'embed', 'start': 1009.485, 'weight': 1, 'content': [{'end': 1017.849, 'text': "And you can make an equation like this where you say hey, if your compression is good enough, if it's like a real great compressor,", 'start': 1009.485, 'duration': 8.364}, {'end': 1027.473, 'text': 'it should say that the compression of your concatenation of your giant files should be no worse than the separate compression of your two files.', 'start': 1017.849, 'duration': 
9.624}, {'end': 1037.198, 'text': 'So any additional compression that was gained by concatenation is some kind of shared structure that your compressor noticed.', 'start': 1028.414, 'duration': 8.784}], 'summary': 'Good compression should maintain or improve file compression when concatenated.', 'duration': 27.713, 'max_score': 1009.485, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1009485.jpg'}], 'start': 419.715, 'title': 'Unsupervised learning', 'summary': 'Delves into the mysteries and insights of unsupervised learning, addressing challenges, such as lack of mathematical reasoning and historical context, while providing novel insights through distribution matching and compression.', 'chapters': [{'end': 643.66, 'start': 419.715, 'title': 'Unsupervised learning mysteries', 'summary': 'Discusses the challenges and mysteries of unsupervised learning, including the lack of mathematical reasoning, the historical context, and the confusion surrounding optimizing one objective while caring about another objective empirically.', 'duration': 223.945, 'highlights': ['Unsatisfactory exposition of unsupervised learning', 'Historical context of unsupervised learning', 'Confusion surrounding optimizing one objective and caring about another']}, {'end': 1234.911, 'start': 645.741, 'title': 'Unsupervised learning insights', 'summary': 'Explores a novel approach to unsupervised learning through distribution matching and compression, providing insights on the functioning, guarantees, and formalization of unsupervised learning.', 'duration': 589.17, 'highlights': ['Compression to the rescue', 'Distribution matching', 'Guaranteed working unsupervised learning']}], 'duration': 815.196, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A419715.jpg', 'highlights': ['Guaranteed working unsupervised learning', 'Compression to the rescue', 'Distribution matching', 'Confusion 
surrounding optimizing one objective and caring about another', 'Historical context of unsupervised learning', 'Unsatisfactory exposition of unsupervised learning']}, {'end': 1562.23, 'segs': [{'end': 1265.835, 'src': 'embed', 'start': 1236.031, 'weight': 0, 'content': [{'end': 1237.892, 'text': 'And no one could have done better than me.', 'start': 1236.031, 'duration': 1.861}, {'end': 1245.503, 'text': 'Now I want to take a detour to theory land, which is a little obscure.', 'start': 1240.939, 'duration': 4.564}, {'end': 1253.251, 'text': 'I think it is interesting Kolmogorov complexity, as the ultimate compressor gives us the ultimate low-regret algorithm,', 'start': 1246.024, 'duration': 7.227}, {'end': 1255.453, 'text': "which is actually not an algorithm because it's not computable.", 'start': 1253.251, 'duration': 2.202}, {'end': 1258.116, 'text': "But you'll see what I mean really quickly.", 'start': 1255.914, 'duration': 2.202}, {'end': 1265.835, 'text': 'So Kolmogorov, first of all, for some context, who here is familiar with Kolmogorov complexity? 
Okay, about 50%.', 'start': 1258.756, 'duration': 7.079}], 'summary': 'Discussing kolmogorov complexity and its relevance to algorithms, with 50% audience familiarity.', 'duration': 29.804, 'max_score': 1236.031, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1236031.jpg'}, {'end': 1446.719, 'src': 'heatmap', 'start': 1272.801, 'weight': 0.925, 'content': [{'end': 1275.163, 'text': "So I'll just do it.", 'start': 1272.801, 'duration': 2.362}, {'end': 1287.515, 'text': "It's like imagine, I give you some data or you give me some data and I'm gonna compress it by giving you the shortest program.", 'start': 1276.244, 'duration': 11.271}, {'end': 1304.298, 'text': 'can possibly exist, the shortest program that exists which, if, which, if you run it, outputs your data.', 'start': 1288.994, 'duration': 15.304}, {'end': 1307.039, 'text': 'yes, that is correct.', 'start': 1304.298, 'duration': 2.741}, {'end': 1310.66, 'text': 'you got me.', 'start': 1307.039, 'duration': 3.621}, {'end': 1314.621, 'text': 'it is the length of the shortest program which outputs x.', 'start': 1310.66, 'duration': 3.961}, {'end': 1327.866, 'text': 'yes, intuitively, you can see that this compressor is quite good, because you can prove this theorem, which is also really easy to prove.', 'start': 1314.621, 'duration': 13.245}, {'end': 1331.729, 'text': "Or rather, it's easy to feel it.", 'start': 1329.227, 'duration': 2.502}, {'end': 1334.672, 'text': "And once you feel it, you could kind of believe me that it's easy to prove.", 'start': 1332.35, 'duration': 2.322}, {'end': 1342.499, 'text': "And you can basically say that your Kolmogorov compressor, if you use that to compress your strings, you'll have very low regret.", 'start': 1335.493, 'duration': 7.006}, {'end': 1345.763, 'text': 'about your compression quality.', 'start': 1343.502, 'duration': 2.261}, {'end': 1347.343, 'text': 'You can prove this result.', 'start': 1346.523, 
'duration': 0.82}, {'end': 1351.585, 'text': 'It says that if you got your string x, your data set database,', 'start': 1347.964, 'duration': 3.621}, {'end': 1364.509, 'text': 'whatever the shortest program which output x is shorter than whatever your compressor needed output and however well your compressor compressed your data,', 'start': 1351.585, 'duration': 12.924}, {'end': 1371.692, 'text': 'plus a little term, which is, however many characters of code you need to implement your compressor.', 'start': 1364.509, 'duration': 7.183}, {'end': 1376.259, 'text': 'Intuitively, you can see how it makes sense, the simulation argument.', 'start': 1373.018, 'duration': 3.241}, {'end': 1378.059, 'text': 'The simulation argument.', 'start': 1377.119, 'duration': 0.94}, {'end': 1383.66, 'text': "if you tell me hey, I've got this really great compressor C, I'm going to say cool, does it come with a computer program?", 'start': 1378.059, 'duration': 5.601}, {'end': 1391.662, 'text': 'Can you give this computer program to K and K is going to run your compressor? 
Because it runs computer programs.', 'start': 1384.82, 'duration': 6.842}, {'end': 1392.882, 'text': 'You just need to pay for the program length.', 'start': 1391.682, 'duration': 1.2}, {'end': 1396.503, 'text': 'So without giving you the details, I think I gave you the feel of it.', 'start': 1393.982, 'duration': 2.521}, {'end': 1399.638, 'text': 'Kolmogorov complexity.', 'start': 1398.097, 'duration': 1.541}, {'end': 1403.601, 'text': 'the Kolmogorov compressor can simulate other computer programs, simulate other compressors.', 'start': 1399.638, 'duration': 3.963}, {'end': 1405.582, 'text': "this is also why it's not computable.", 'start': 1403.601, 'duration': 1.981}, {'end': 1407.644, 'text': "it's not computable because it simulates.", 'start': 1405.582, 'duration': 2.062}, {'end': 1413.648, 'text': 'it feels very much at liberty to simulate all computer programs,', 'start': 1407.644, 'duration': 6.004}, {'end': 1420.572, 'text': 'but it is the best compressor that exists and we were talking about good compression for unsupervised learning.', 'start': 1413.648, 'duration': 6.924}, {'end': 1429.825, 'text': 'now let us generalize a Kolmogorov complexity, a Kolmogorov compressor to be allowed to use side information.', 'start': 1420.572, 'duration': 9.253}, {'end': 1435.009, 'text': 'Oh, more in detail.', 'start': 1433.928, 'duration': 1.081}, {'end': 1437.111, 'text': "So I'll make this detail.", 'start': 1435.669, 'duration': 1.442}, {'end': 1440.874, 'text': "I'll reiterate this point several times because this point is important.", 'start': 1437.231, 'duration': 3.643}, {'end': 1445.037, 'text': 'Obviously, the Kolmogorov compressor is not computable.', 'start': 1441.334, 'duration': 3.703}, {'end': 1446.719, 'text': "It's undecidable.", 'start': 1445.918, 'duration': 0.801}], 'summary': 'Kolmogorov compressor achieves low regret in data compression by outputting the shortest program, simulating other compressors, and using side information, but it is 
undecidable and not computable.', 'duration': 173.918, 'max_score': 1272.801, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1272801.jpg'}, {'end': 1429.825, 'src': 'embed', 'start': 1399.638, 'weight': 2, 'content': [{'end': 1403.601, 'text': 'the Kolmogorov compressor can simulate other computer programs, simulate other compressors.', 'start': 1399.638, 'duration': 3.963}, {'end': 1405.582, 'text': "this is also why it's not computable.", 'start': 1403.601, 'duration': 1.981}, {'end': 1407.644, 'text': "it's not computable because it simulates.", 'start': 1405.582, 'duration': 2.062}, {'end': 1413.648, 'text': 'it feels very much at liberty to simulate all computer programs,', 'start': 1407.644, 'duration': 6.004}, {'end': 1420.572, 'text': 'but it is the best compressor that exists and we were talking about good compression for unsupervised learning.', 'start': 1413.648, 'duration': 6.924}, {'end': 1429.825, 'text': 'now let us generalize a Kolmogorov complexity, a Kolmogorov compressor to be allowed to use side information.', 'start': 1420.572, 'duration': 9.253}], 'summary': 'The Kolmogorov compressor is the best compressor but is not computable; it can simulate other programs and compressors, and is being generalized to use side information.', 'duration': 30.187, 'max_score': 1399.638, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1399638.jpg'}, {'end': 1549.98, 'src': 'heatmap', 'start': 1473.543, 'weight': 1, 'content': [{'end': 1479.178, 'text': "It's kind of magical, right? 
Neural networks can simulate little programs.", 'start': 1473.543, 'duration': 5.635}, {'end': 1480.079, 'text': 'They are little computers.', 'start': 1479.218, 'duration': 0.861}, {'end': 1480.96, 'text': "They're circuits.", 'start': 1480.42, 'duration': 0.54}, {'end': 1483.262, 'text': 'Circuits are computers, computing machines.', 'start': 1481.4, 'duration': 1.862}, {'end': 1485.864, 'text': 'And SGD searches over the program.', 'start': 1483.843, 'duration': 2.021}, {'end': 1496.434, 'text': 'And all of deep learning hinges on the SGD miracle, that we can actually train these computers with SGD.', 'start': 1487.886, 'duration': 8.548}, {'end': 1498.575, 'text': 'That works.', 'start': 1498.155, 'duration': 0.42}, {'end': 1502.439, 'text': 'Actually find the circuits from data.', 'start': 1500.317, 'duration': 2.122}, {'end': 1508.523, 'text': 'Therefore, we can compute our miniature Kolmogorov compressor.', 'start': 1504.013, 'duration': 4.51}, {'end': 1513.654, 'text': 'The simulation argument applies here as well, by the way.', 'start': 1511.533, 'duration': 2.121}, {'end': 1514.875, 'text': 'I just want to mention this one fact.', 'start': 1513.674, 'duration': 1.201}, {'end': 1518.817, 'text': "I don't know if you've ever tried to design a better neural network architecture.", 'start': 1515.195, 'duration': 3.622}, {'end': 1523.819, 'text': "What you'd find is that it's kind of hard to find a better neural network architecture.", 'start': 1519.337, 'duration': 4.482}, {'end': 1528.322, 'text': "You say, well, let's add this connection, let's add that connection, and let's modify this and that.", 'start': 1524.36, 'duration': 3.962}, {'end': 1534.785, 'text': 'Why is it hard? 
The simulation argument, because your new architecture can be pretty straightforwardly simulated by your old architecture.', 'start': 1528.742, 'duration': 6.043}, {'end': 1537.911, 'text': "Except when it can't, those are rare cases.", 'start': 1535.549, 'duration': 2.362}, {'end': 1543.395, 'text': 'And in those rare cases, you have a big improvement, such as when you switch from the little RNN to the transformer.', 'start': 1538.291, 'duration': 5.104}, {'end': 1547.338, 'text': 'The RNN has a bottleneck, the hidden state.', 'start': 1543.916, 'duration': 3.422}, {'end': 1549.98, 'text': "So it has a hard time implementing the transformer.", 'start': 1548.259, 'duration': 1.721}], 'summary': 'Neural networks simulate programs and are trained by SGD; a new architecture rarely helps unless the old one cannot simulate it, as with the RNN and the transformer.', 'duration': 50.276, 'max_score': 1473.543, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1473543.jpg'}], 'start': 1236.031, 'title': 'Kolmogorov complexity, neural nets, and low-regret algorithms', 'summary': "Discusses Kolmogorov complexity, an ultimate low-regret algorithm, and neural networks' simulation of programs. 
it explores the implications of these concepts on compression quality, unsupervised learning, and the challenges in designing better neural network architectures.", 'chapters': [{'end': 1450.287, 'start': 1236.031, 'title': 'Kolmogorov complexity and low-regret algorithms', 'summary': "Discusses kolmogorov complexity, which gives the ultimate low-regret algorithm, allowing for very good compression quality and simulation of other compressors, but it's not computable and is undecidable, making it the best compressor for unsupervised learning.", 'duration': 214.256, 'highlights': ['Kolmogorov complexity provides the ultimate low-regret algorithm for compression', 'Kolmogorov complexity is not computable and undecidable', 'Kolmogorov complexity is the best compressor for unsupervised learning']}, {'end': 1562.23, 'start': 1450.928, 'title': 'Neural nets and program search', 'summary': 'Explains that neural networks can simulate little programs and are automatically searched by sgd over the parameters, which is a fundamental aspect of deep learning, and it also highlights the difficulty in designing better neural network architectures due to the simulation argument.', 'duration': 111.302, 'highlights': ['Neural networks simulate little programs and are automatically searched by SGD over the parameters, a fundamental aspect of deep learning.', 'The difficulty in designing better neural network architectures is due to the simulation argument, where new architectures can be simulated by old ones except in rare cases, leading to big improvements.']}], 'duration': 326.199, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1236031.jpg', 'highlights': ['Kolmogorov complexity provides ultimate low-regret algorithm for compression', 'Neural networks simulate little programs and are automatically searched by SGD', 'Kolmogorov complexity is the best compressor for unsupervised learning', 'The difficulty in designing better 
neural network architectures is due to the simulation argument']}, {'end': 1855.838, 'segs': [{'end': 1593.836, 'src': 'embed', 'start': 1562.27, 'weight': 0, 'content': [{'end': 1568.874, 'text': 'You start to see how we switch from the formal land to neural network land.', 'start': 1562.27, 'duration': 6.604}, {'end': 1570.375, 'text': 'But you see the similarity.', 'start': 1569.394, 'duration': 0.981}, {'end': 1577.178, 'text': 'So conditional Kolmogorov complexity as the solution to unsupervised learning.', 'start': 1572.896, 'duration': 4.282}, {'end': 1586.383, 'text': "You can basically have a similar theorem where I'm not going to define what, well, I'm going to define what k of y given x is.", 'start': 1578.919, 'duration': 7.464}, {'end': 1593.836, 'text': "It's like the shortest program which outputs y if it's allowed to probe x.", 'start': 1586.423, 'duration': 7.413}], 'summary': 'Switching from formal land to neural network land, using conditional kolmogorov complexity for unsupervised learning.', 'duration': 31.566, 'max_score': 1562.27, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1562270.jpg'}, {'end': 1652.577, 'src': 'embed', 'start': 1618.55, 'weight': 3, 'content': [{'end': 1621.552, 'text': "Low regret solution to unsupervised learning, except that it's not computable.", 'start': 1618.55, 'duration': 3.002}, {'end': 1623.934, 'text': "But I do think it's a useful framework.", 'start': 1622.593, 'duration': 1.341}, {'end': 1627.917, 'text': 'And here we condition on a data set, not an example.', 'start': 1625.395, 'duration': 2.522}, {'end': 1634.762, 'text': 'And this thing will extract all the value out of x for predicting y.', 'start': 1629.638, 'duration': 5.124}, {'end': 1635.402, 'text': 'The data set.', 'start': 1634.762, 'duration': 0.64}, {'end': 1637.424, 'text': 'The data set, not the example.', 'start': 1635.843, 'duration': 1.581}, {'end': 1640.166, 'text': 'So this is 
the solution to unsupervised learning.', 'start': 1638.505, 'duration': 1.661}, {'end': 1643.068, 'text': 'Done. Success.', 'start': 1641.046, 'duration': 2.022}, {'end': 1652.577, 'text': 'And there is one little technicality which I need to spend a little bit of time talking about,', 'start': 1646.992, 'duration': 5.585}], 'summary': 'Unsupervised learning solution not computable, but a useful framework for extracting value from a dataset for predicting y.', 'duration': 34.027, 'max_score': 1618.55, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1618550.jpg'}, {'end': 1716.371, 'src': 'embed', 'start': 1690.683, 'weight': 1, 'content': [{'end': 1704.925, 'text': 'So this result says that, hey, if you care about making predictions about your supervised task, y, using the good old-fashioned Kolmogorov compressor,', 'start': 1690.683, 'duration': 14.242}, {'end': 1711.469, 'text': 'which just compresses the concatenation of x and y, is going to be just as good as using your conditional compressor.', 'start': 1704.925, 'duration': 6.544}, {'end': 1716.371, 'text': 'There are more details and a few subtleties to what I just said.', 'start': 1712.529, 'duration': 3.842}], 'summary': 'Compressing the concatenation of x and y with the ordinary Kolmogorov compressor predicts y as well as the conditional compressor.', 'duration': 25.688, 'max_score': 1690.683, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1690683.jpg'}, {'end': 1783.528, 'src': 'embed', 'start': 1757.153, 'weight': 2, 'content': [{'end': 1761.938, 'text': 'And yeah, so the solution to unsupervised learning, just give it all to your Kolmogorov complexity, to your Kolmogorov compressor.', 'start': 1757.153, 'duration': 4.785}, {'end': 1773.228, 'text': "And the final thing is I'll mention that this kind of joint compression is maximum likelihood if you don't overfit.", 'start': 1764.32, 'duration': 8.908}, {'end': 1780.447, 'text': 'If you 
have a data set, then the sum of the likelihood, given your parameters, is the cost of compressing the data set.', 'start': 1773.869, 'duration': 6.578}, {'end': 1783.528, 'text': 'You also need to pay the cost of compressing the parameters.', 'start': 1780.927, 'duration': 2.601}], 'summary': 'Unsupervised learning solution: use kolmogorov compressor for joint compression and maximum likelihood if not overfit.', 'duration': 26.375, 'max_score': 1757.153, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1757153.jpg'}, {'end': 1792.411, 'src': 'heatmap', 'start': 1757.153, 'weight': 0.717, 'content': [{'end': 1761.938, 'text': 'And yeah, so the solution to unsupervised learning, just give it all to your Kolmogorov complexity, to your Kolmogorov compressor.', 'start': 1757.153, 'duration': 4.785}, {'end': 1773.228, 'text': "And the final thing is I'll mention that this kind of joint compression is maximum likelihood if you don't overfit.", 'start': 1764.32, 'duration': 8.908}, {'end': 1780.447, 'text': 'If you have a data set, then the sum of the likelihood, given your parameters, is the cost of compressing the data set.', 'start': 1773.869, 'duration': 6.578}, {'end': 1783.528, 'text': 'You also need to pay the cost of compressing the parameters.', 'start': 1780.927, 'duration': 2.601}, {'end': 1786.749, 'text': 'But you can kind of see if you now want to compress two data sets, no problem.', 'start': 1783.828, 'duration': 2.921}, {'end': 1792.411, 'text': 'Just add more points to your training set, to your data set, and add the terms to the sum.', 'start': 1786.789, 'duration': 5.622}], 'summary': 'Unsupervised learning solution: use kolmogorov compressor for maximum likelihood joint compression without overfitting.', 'duration': 35.258, 'max_score': 1757.153, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1757153.jpg'}, {'end': 1845.429, 'src':
'embed', 'start': 1820.54, 'weight': 4, 'content': [{'end': 1826.562, 'text': 'Because then it says, well, OK, if you squint hard enough, you can say that this explains what our neural networks are doing.', 'start': 1820.54, 'duration': 6.022}, {'end': 1831.904, 'text': 'You can say, hey, SGD over big neural networks is our big program search.', 'start': 1827.042, 'duration': 4.862}, {'end': 1836.406, 'text': 'Bigger neural networks approximate the Kolmogorov compressor more and more and better and better.', 'start': 1832.645, 'duration': 3.761}, {'end': 1845.429, 'text': 'And so maybe this is also why we like big neural nets, because we approach the unapproachable idea of the Kolmogorov compressor,', 'start': 1837.726, 'duration': 7.703}], 'summary': 'Neural networks approximate kolmogorov compressor, big nets approach unapproachable idea.', 'duration': 24.889, 'max_score': 1820.54, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1820540.jpg'}], 'start': 1562.27, 'title': 'Unsupervised learning and kolmogorov complexity', 'summary': 'Discusses the use of conditional kolmogorov complexity as the ultimate low regret solution to unsupervised learning, and the equivalence of using conditional kolmogorov compressor and regular kolmogorov compressor for making predictions on supervised tasks, emphasizing joint compression in machine learning and its relevance to big neural networks.', 'chapters': [{'end': 1643.068, 'start': 1562.27, 'title': 'Unsupervised learning: ultimate low regret solution', 'summary': "Discusses conditional kolmogorov complexity as the ultimate low regret solution to unsupervised learning, providing a framework that extracts all the value out of the dataset for predicting y, although it's not computable.", 'duration': 80.798, 'highlights': ['Conditional Kolmogorov complexity as the ultimate low regret solution to unsupervised learning, providing a framework that extracts all the value out of 
the dataset for predicting y.', 'Using conditional Kolmogorov complexity for unsupervised learning allows one to sleep soundly at night knowing that no one does unsupervised learning better.', 'Conditioning on a dataset, not an example, and extracting all the value out of x for predicting y.']}, {'end': 1855.838, 'start': 1646.992, 'title': 'Kolmogorov complexity in machine learning', 'summary': 'Discusses the equivalence of using a conditional kolmogorov compressor and a regular kolmogorov compressor for making predictions on supervised tasks, highlighting the potential of joint compression in machine learning and its relevance to the use of big neural networks.', 'duration': 208.846, 'highlights': ['Using regular Kolmogorov compressor for compressing concatenated data can make great predictions on supervised tasks, equivalent to using conditional compressor.', 'Joint compression is maximum likelihood if not overfit, allowing natural applicability in machine learning and explaining the appeal of big neural networks.', 'Stating that SGD over big neural networks approximates the Kolmogorov compressor and helps in approaching the unapproachable idea of Kolmogorov compressor.']}], 'duration': 293.568, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1562270.jpg', 'highlights': ['Conditional Kolmogorov complexity as the ultimate low regret solution to unsupervised learning, providing a framework that extracts all the value out of the dataset for predicting y.', 'Using regular Kolmogorov compressor for compressing concatenated data can make great predictions on supervised tasks, equivalent to using conditional compressor.', 'Joint compression is maximum likelihood if not overfit, allowing natural applicability in machine learning and explaining the appeal of big neural networks.', 'Conditioning on a dataset, not an example, and extracting all the value out of x for predicting y.', 'Stating that SGD over big neural 
networks approximates the Kolmogorov compressor and helps in approaching the unapproachable idea of Kolmogorov compressor.', 'Using conditional Kolmogorov complexity for unsupervised learning allows one to sleep soundly at night knowing that no one does unsupervised learning better.']}, {'end': 2509.677, 'segs': [{'end': 1996.09, 'src': 'heatmap', 'start': 1892.534, 'weight': 0.761, 'content': [{'end': 1898.778, 'text': 'At least their few-shot behavior can definitely be explained without alluding to this theory.', 'start': 1892.534, 'duration': 6.244}, {'end': 1903.68, 'text': 'And so I thought it would be nice.', 'start': 1899.738, 'duration': 3.942}, {'end': 1915.687, 'text': 'Can we find some other direct validation of this theory? Can we find a different domain, like vision? Because vision, you have pixels.', 'start': 1903.72, 'duration': 11.967}, {'end': 1925.659, 'text': 'Can you show that doing this on pixels will lead to good unsupervised learning? And the answer is yes, you can.', 'start': 1918.029, 'duration': 7.63}, {'end': 1930.387, 'text': "This is work we've done in 2020.", 'start': 1926.921, 'duration': 3.466}, {'end': 1932.088, 'text': 'which is called the IGPT.', 'start': 1930.387, 'duration': 1.701}, {'end': 1934.19, 'text': "And it's an expensive proof of concept.", 'start': 1932.488, 'duration': 1.702}, {'end': 1937.432, 'text': 'It was not meant to be a practical procedure.', 'start': 1934.29, 'duration': 3.142}, {'end': 1943.076, 'text': "It meant to be a paper that showed that if you have really good next-step predictor, you're going to do great on supervised data.", 'start': 1937.452, 'duration': 5.624}, {'end': 1945.537, 'text': 'And it was proved in the image domain.', 'start': 1943.576, 'duration': 1.961}, {'end': 1948.159, 'text': "And there, I'll just spell it out.", 'start': 1945.957, 'duration': 2.202}, {'end': 1950.861, 'text': 'You have your image.', 'start': 1948.619, 'duration': 2.242}, {'end': 1954.543, 'text': 'You turn it into 
a sequence of pixels.', 'start': 1952.542, 'duration': 2.001}, {'end': 1958.426, 'text': 'Each pixel should be given some discrete value of intensity.', 'start': 1955.544, 'duration': 2.882}, {'end': 1962.103, 'text': 'And then just do next pixel prediction.', 'start': 1960.02, 'duration': 2.083}, {'end': 1964.086, 'text': 'Be the same transformer.', 'start': 1963.085, 'duration': 1.001}, {'end': 1965.429, 'text': "That's it.", 'start': 1965.068, 'duration': 0.361}, {'end': 1966.871, 'text': 'Different from BERT.', 'start': 1966.21, 'duration': 0.661}, {'end': 1971.939, 'text': 'Just next token prediction because this maximizes the likelihood, therefore compresses.', 'start': 1968.013, 'duration': 3.926}, {'end': 1978.822, 'text': 'And we see one immediate result.', 'start': 1976.681, 'duration': 2.141}, {'end': 1982.244, 'text': 'So these are results on CIFAR-10.', 'start': 1978.982, 'duration': 3.262}, {'end': 1984.665, 'text': 'You have models of different sizes.', 'start': 1982.284, 'duration': 2.381}, {'end': 1990.508, 'text': 'This is their next step prediction accuracy on their pixel prediction task, on their unsupervised learning task.', 'start': 1984.745, 'duration': 5.763}, {'end': 1992.769, 'text': 'And this is linear probe accuracy.', 'start': 1990.788, 'duration': 1.981}, {'end': 1996.09, 'text': 'Linear probe, when you pick some layer inside your neural net, the best layer.', 'start': 1992.809, 'duration': 3.281}], 'summary': 'In 2020, the igpt demonstrated good unsupervised learning through next-step prediction on pixels in the image domain, leading to great performance on supervised data and linear probe accuracy.', 'duration': 103.556, 'max_score': 1892.534, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1892534.jpg'}, {'end': 1937.432, 'src': 'embed', 'start': 1903.72, 'weight': 0, 'content': [{'end': 1915.687, 'text': 'Can we find some other direct validation of this theory?
Can we find a different domain, like vision? Because vision, you have pixels.', 'start': 1903.72, 'duration': 11.967}, {'end': 1925.659, 'text': 'Can you show that doing this on pixels will lead to good unsupervised learning? And the answer is yes, you can.', 'start': 1918.029, 'duration': 7.63}, {'end': 1930.387, 'text': "This is work we've done in 2020.", 'start': 1926.921, 'duration': 3.466}, {'end': 1932.088, 'text': 'which is called the IGPT.', 'start': 1930.387, 'duration': 1.701}, {'end': 1934.19, 'text': "And it's an expensive proof of concept.", 'start': 1932.488, 'duration': 1.702}, {'end': 1937.432, 'text': 'It was not meant to be a practical procedure.', 'start': 1934.29, 'duration': 3.142}], 'summary': 'In 2020, the igpt demonstrated successful unsupervised learning on pixels.', 'duration': 33.712, 'max_score': 1903.72, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1903720.jpg'}, {'end': 2175.636, 'src': 'heatmap', 'start': 2128.14, 'weight': 1, 'content': [{'end': 2132.123, 'text': 'and I think we might be able to crisply articulate it at some point.', 'start': 2128.14, 'duration': 3.983}, {'end': 2139.107, 'text': 'One thing which I thought was interesting is that these next pixel prediction models, autoregressive models,', 'start': 2133.664, 'duration': 5.443}, {'end': 2141.168, 'text': 'seem to have better linear representations than BERT.', 'start': 2139.107, 'duration': 2.061}, {'end': 2146.252, 'text': 'And like the blue, the blue accuracy is BERT versus autoregressive.', 'start': 2142.249, 'duration': 4.003}, {'end': 2147.854, 'text': "I'm not sure why that is.", 'start': 2147.014, 'duration': 0.84}, {'end': 2150.995, 'text': 'Or rather, I can speculate, have some speculations.', 'start': 2148.354, 'duration': 2.641}, {'end': 2159.536, 'text': 'But I think it would be nice to gain more understanding for really why those linear representations are formed.', 'start': 2151.055, 'duration': 
8.481}, {'end': 2163.557, 'text': 'And yeah, this is the end.', 'start': 2161.657, 'duration': 1.9}, {'end': 2165.157, 'text': 'Thank you for your attention.', 'start': 2164.117, 'duration': 1.04}, {'end': 2175.075, 'text': 'Could you provide the speculation? Yeah.', 'start': 2172.519, 'duration': 2.556}, {'end': 2175.636, 'text': 'Oh, yeah, yeah.', 'start': 2175.195, 'duration': 0.441}], 'summary': 'Autoregressive models show better linear representations than bert, with implications for pixel prediction accuracy.', 'duration': 47.496, 'max_score': 2128.14, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A2128140.jpg'}, {'end': 2316.051, 'src': 'embed', 'start': 2264.503, 'weight': 2, 'content': [{'end': 2273.985, 'text': "so the diffusion models that people use in, like high quality image generators, don't really maximize the likelihood of their input steps.", 'start': 2264.503, 'duration': 9.482}, {'end': 2279.866, 'text': 'They have a different objective, but the most original formulation is maximizing likelihood.', 'start': 2274.045, 'duration': 5.821}, {'end': 2285.107, 'text': 'And, by the way, the diffusion model is a counter argument to my.', 'start': 2280.806, 'duration': 4.301}, {'end': 2295.049, 'text': 'well, rather, the diffusion model also, I would claim, should have worse representations than a next-token prediction model,', 'start': 2285.107, 'duration': 9.942}, {'end': 2296.83, 'text': "for the same reason that BERT doesn't.", 'start': 2295.049, 'duration': 1.781}, {'end': 2304.193, 'text': 'So this further, in my mind, increases the mystery of what causes linear representations to form.', 'start': 2297.37, 'duration': 6.823}, {'end': 2309.175, 'text': 'Yeah, thanks for the talk.', 'start': 2307.815, 'duration': 1.36}, {'end': 2316.051, 'text': 'I like the analogy between Kolmogorov complexity and neural networks.', 'start': 2309.235, 'duration': 6.816}], 'summary': "Diffusion
models in high quality image generators don't maximize input likelihood, raising mystery about linear representations.", 'duration': 51.548, 'max_score': 2264.503, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A2264503.jpg'}, {'end': 2475.82, 'src': 'embed', 'start': 2430.608, 'weight': 3, 'content': [{'end': 2436.729, 'text': 'So if you sort of backtrack from cryptography, this sort of theory goes back to the 80s,', 'start': 2430.608, 'duration': 6.121}, {'end': 2442.911, 'text': 'where they talk about compression being equivalent to next-bit prediction being equivalent to being able to distinguish.', 'start': 2436.729, 'duration': 6.182}, {'end': 2450.774, 'text': 'So if you have an algorithm that can predict, then you have an algorithm that can compress.', 'start': 2442.911, 'duration': 7.863}, {'end': 2454.355, 'text': "I mean, cryptography's the other way, right? You say there is no way to compress.", 'start': 2450.874, 'duration': 3.481}, {'end': 2475.82, 'text': 'So I wonder if this idea of being able to distinguish, would that translate to anything natural? I think I understand the question.', 'start': 2455.553, 'duration': 20.267}], 'summary': 'Cryptography theory from the 80s suggests prediction is equivalent to compression, and distinguishing is crucial.', 'duration': 45.212, 'max_score': 2430.608, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A2430608.jpg'}], 'start': 1860.498, 'title': 'Unsupervised learning and next token prediction', 'summary': 'Explores the application of compression theory and supervised learning to gpt models, as demonstrated by the igpt work, leading to improved unsupervised learning on cifar-10 and imagenet.
it also delves into the concepts of next token prediction, diffusion models, and their impact on neural network representations.', 'chapters': [{'end': 2236.914, 'start': 1860.498, 'title': 'Unsupervised learning through pixel prediction', 'summary': 'the chapter discusses how the theory of compression and supervised learning can be applied to gpt models, as demonstrated by the igpt work in 2020, where next pixel prediction led to improved unsupervised learning on cifar-10 and imagenet.', 'duration': 376.416, 'highlights': ['IGPT work in 2020 demonstrated improved unsupervised learning through next pixel prediction on CIFAR-10 and ImageNet', 'Speculation that next pixel prediction models have better linear representations than BERT due to the complexity of the prediction task']}, {'end': 2509.677, 'start': 2239.615, 'title': 'Next token prediction and diffusion models', 'summary': 'Discusses the concepts of next token prediction and diffusion models, their objectives, and their impact on representations in neural networks, along with the analogy between compression and prediction in unsupervised learning.', 'duration': 270.062, 'highlights': ['Diffusion models used in high-quality image generators have a different objective, while next token prediction maximizes likelihood.', 'The impact of diffusion models and next token prediction on the quality of representations in neural networks is discussed.', 'The analogy between compression and prediction in unsupervised learning is explored, relating it to the ability to distinguish and predict in the context of cryptography.']}], 'duration': 649.179, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A1860498.jpg', 'highlights': ['IGPT work in 2020 demonstrated improved unsupervised learning through next pixel prediction on CIFAR-10 and ImageNet', 'Speculation that next pixel prediction models have better linear representations than BERT due to the complexity of the prediction task', 'The impact of diffusion
models and next token prediction on the quality of representations in neural networks is discussed', 'The analogy between compression and prediction in unsupervised learning is explored, relating it to the ability to distinguish and predict in the context of cryptography', 'Diffusion models used in high-quality image generators have a different objective, while next token prediction maximizes likelihood']}, {'end': 3432.37, 'segs': [{'end': 2565.495, 'src': 'embed', 'start': 2512.358, 'weight': 0, 'content': [{'end': 2527.683, 'text': 'So I mean, I can mention one thing that is related, which is energy-based models.', 'start': 2512.358, 'duration': 15.325}, {'end': 2539.651, 'text': 'Energy-based models offer yet another way of turning neural networks into probability distributions, where an energy-based model will say: just', 'start': 2528.687, 'duration': 10.964}, {'end': 2547.054, 'text': "give me your configuration of vectors and I'm just going to tell you how it feels, and then you normalize over all of them.", 'start': 2539.651, 'duration': 7.403}, {'end': 2555.665, 'text': 'And I feel like, when it comes to energy-based models in particular, ratios of distributions correspond to differences of energy.', 'start': 2547.054, 'duration': 8.611}, {'end': 2557.907, 'text': "Maybe there is some relation to what you're saying.", 'start': 2556.166, 'duration': 1.741}, {'end': 2565.495, 'text': "I think I'm probably not precisely commenting on the thing that you said, but I don't think I have anything more to add, unfortunately.", 'start': 2557.927, 'duration': 7.568}], 'summary': 'Energy-based models offer another way of turning neural networks into probability distributions.', 'duration': 53.137, 'max_score': 2512.358, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A2512358.jpg'}, {'end': 2720.232, 'src': 'embed', 'start': 2686.222, 'weight': 3, 'content': [{'end': 2697.932, 'text': 'And that gives you your
negative log probability literally the number of bits you need to compress this data set using this neural network as a compressor.', 'start': 2686.222, 'duration': 11.71}, {'end': 2712.695, 'text': "You're arguing for compression as a framework for understanding or motivating unsupervised learning.", 'start': 2707.09, 'duration': 5.605}, {'end': 2720.232, 'text': 'And a point you made at the end was that if you apply that framework to language models to next word prediction,', 'start': 2713.485, 'duration': 6.747}], 'summary': 'Compression as framework for unsupervised learning, using neural network as a compressor.', 'duration': 34.01, 'max_score': 2686.222, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A2686222.jpg'}, {'end': 2769.7, 'src': 'embed', 'start': 2741.125, 'weight': 2, 'content': [{'end': 2746.829, 'text': 'But then you can use the linear representation as a way of showing that compression is a good way to formulate unsupervised learning.', 'start': 2741.125, 'duration': 5.704}, {'end': 2754.273, 'text': "But then there are highly effective compressors that wouldn't give you a useful linear representation.", 'start': 2747.629, 'duration': 6.644}, {'end': 2760.717, 'text': "So I'm wondering are there any cases where unsupervised learning and supervised learning are not superficially the same?", 'start': 2754.753, 'duration': 5.964}, {'end': 2769.7, 'text': "but that you don't require your compressor to give you an effective linear representation to show that compression is a good unsupervised objective?", 'start': 2762.534, 'duration': 7.166}], 'summary': 'Compression can be a good way for unsupervised learning, but not all compressors give useful linear representation.', 'duration': 28.575, 'max_score': 2741.125, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A2741125.jpg'}, {'end': 2995.972, 'src': 'embed', 'start': 2949.398, 'weight': 
4, 'content': [{'end': 2953.64, 'text': 'And so all this theory predicts that diffusion models should also be.', 'start': 2949.398, 'duration': 4.242}, {'end': 2962.567, 'text': 'it should be possible to make diffusion models do equally great things, perhaps with some constant factors of because this is like.', 'start': 2954.802, 'duration': 7.765}, {'end': 2971.013, 'text': "as the earlier answer, this is not a compute sensitive theory, so it's going to say okay, like, maybe you need like a factor of 10 or 15 compute,", 'start': 2962.567, 'duration': 8.446}, {'end': 2974.816, 'text': 'and then things will be the same between the autoregressive and diffusion model.', 'start': 2971.013, 'duration': 3.803}, {'end': 2976.537, 'text': 'the autoregressive model has.', 'start': 2974.816, 'duration': 1.721}, {'end': 2978.338, 'text': "it's just simple, it's convenient.", 'start': 2976.537, 'duration': 1.801}, {'end': 2979.919, 'text': 'maybe the energy-based model.', 'start': 2978.338, 'duration': 1.581}, {'end': 2983.742, 'text': "you'll do even greater things, but from that perspective they're all the same.", 'start': 2979.919, 'duration': 3.823}, {'end': 2995.972, 'text': 'It seems like GPT-4 may be the best compressor at the moment, which presumably is the largest model out there as well.', 'start': 2986.83, 'duration': 9.142}], 'summary': 'Diffusion models can achieve great results with a factor of 10 or 15 compute, making them comparable to autoregressive models such as gpt-4.', 'duration': 46.574, 'max_score': 2949.398, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A2949398.jpg'}, {'end': 3152.108, 'src': 'embed', 'start': 3123.744, 'weight': 7, 'content': [{'end': 3125.804, 'text': 'So, no training, no parameters.', 'start': 3123.744, 'duration': 2.06}, {'end': 3134.888, 'text': 'GZ compress strings, just like you showed, concatenate two strings together, compress them individually and compute
distance.', 'start': 3126.825, 'duration': 8.063}, {'end': 3141.719, 'text': "The reason we're using the output of GZ as the follow-up Yeah.", 'start': 3135.668, 'duration': 6.051}, {'end': 3145.943, 'text': 'My only comment on that is that Gzip is not a very strong compressor of text.', 'start': 3142.1, 'duration': 3.843}, {'end': 3152.108, 'text': 'So I think it does show that things are possible to some degree.', 'start': 3146.624, 'duration': 5.484}], 'summary': 'No training, no parameters. gz compresses strings, computes distance, shows possibilities.', 'duration': 28.364, 'max_score': 3123.744, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A3123744.jpg'}, {'end': 3251.51, 'src': 'embed', 'start': 3219.72, 'weight': 6, 'content': [{'end': 3223.923, 'text': 'We made changes to the architecture so that training is as easy as possible.', 'start': 3219.72, 'duration': 4.203}, {'end': 3234.029, 'text': 'The easier is, the more easy the training optimization problem is, the less susceptible you are to curriculum effects.', 'start': 3224.563, 'duration': 9.466}, {'end': 3242.384, 'text': "And it's known, for example, people who were trained in all kinds of exotic architectures, like neural Turing machines, for example,", 'start': 3235.64, 'duration': 6.744}, {'end': 3244.526, 'text': 'which are these really complicated things?', 'start': 3242.384, 'duration': 2.142}, {'end': 3251.51, 'text': "And it's super, super huge numbers of heterogeneous layers, which are different.", 'start': 3246.287, 'duration': 5.223}], 'summary': 'Changes to architecture to simplify training, reduces susceptibility to curriculum effects.', 'duration': 31.79, 'max_score': 3219.72, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A3219720.jpg'}], 'start': 2512.358, 'title': 'Energy-based models, unsupervised learning, and diffusion models', 'summary': 'Explores energy-based models 
for probability distributions, unsupervised learning framework with applications in language and image models, and the significance of diffusion models, compressors, and autoregressive modeling in maximum likelihood training.', 'chapters': [{'end': 2565.495, 'start': 2512.358, 'title': 'Energy-based models for probability distributions', 'summary': 'Discusses energy-based models as a way of turning neural networks into probability distributions, where energy determines the configuration of vectors and ratios of distributions correspond to differences of energy.', 'duration': 53.137, 'highlights': ['Energy-based models offer a way of turning neural networks into probability distributions, using energy to determine the configuration of vectors.', 'Ratios of distributions correspond to differences of energy in energy-based models.']}, {'end': 2916.138, 'start': 2570.925, 'title': 'Unsupervised learning framework', 'summary': 'Discusses the application of compression as a framework for understanding unsupervised learning, exploring the implications for language models and image gpt, and the potential insights for supervised learning.', 'duration': 345.213, 'highlights': ['The chapter discusses the application of compression as a framework for understanding unsupervised learning.', 'The implications for language models and image GPT are explored in the context of the compression framework for unsupervised learning.', 'The potential insights for supervised learning are considered, particularly regarding the desired function class and the impact on parameters and compute cost.']}, {'end': 3432.37, 'start': 2916.959, 'title': 'Diffusion models and compressors', 'summary': 'Discusses the importance of autoregressive modeling, diffusion models, and compressors in the context of maximum likelihood training, size of compressors, and the empirical situation about curriculum effects.', 'duration': 515.411, 'highlights': ['GPT-4 may be the best compressor at the moment, which 
presumably is the largest model out there as well.', 'Diffusion models can be set up to be maximum likelihood models, and it should be possible to make diffusion models do equally great things with some constant factors of compute.', 'The empirical situation about curriculum effects and the susceptibility to curriculum effects based on the ease of training optimization is discussed.', 'The use of Gzip as a text compressor is discussed, highlighting its limitations and the need for more effective compression methods.']}], 'duration': 920.012, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AKMuA_TVz3A/pics/AKMuA_TVz3A2512358.jpg', 'highlights': ['Energy-based models offer a way of turning neural networks into probability distributions, using energy to determine the configuration of vectors.', 'Ratios of distributions correspond to differences of energy in energy-based models.', 'The chapter discusses the application of compression as a framework for understanding unsupervised learning.', 'The implications for language models and image GPT are explored in the context of the compression framework for unsupervised learning.', 'GPT-4 may be the best compressor at the moment, which presumably is the largest model out there as well.', 'Diffusion models can be set up to be maximum likelihood models, and it should be possible to make diffusion models do equally great things with some constant factors of compute.', 'The empirical situation about curriculum effects and the susceptibility to curriculum effects based on the ease of training optimization is discussed.', 'The use of Gzip as a text compressor is discussed, highlighting its limitations and the need for more effective compression methods.']}], 'highlights': ['The discovery and formalization of supervised learning under the name of PAC learning or statistical learning theory provides precise mathematical conditions for learning success, guaranteeing low test error if the training loss is 
low and the degrees of freedom are smaller than the training set.', 'Kolmogorov complexity provides ultimate low-regret algorithm for compression', 'Conditional Kolmogorov complexity as the ultimate low regret solution to unsupervised learning, providing a framework that extracts all the value out of the dataset for predicting y.', 'Energy-based models offer a way of turning neural networks into probability distributions, using energy to determine the configuration of vectors.', 'IGPT work in 2020 demonstrated improved unsupervised learning through next pixel prediction on CIFAR-10 and ImageNet']}