title
C5W3L06 Bleu Score (Optional)
description
Take the Deep Learning Specialization: http://bit.ly/2PQrQQd
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai
detail
{'title': 'C5W3L06 Bleu Score (Optional)', 'heatmap': [{'end': 261.492, 'start': 217.015, 'weight': 0.733}, {'end': 495.568, 'start': 423.815, 'weight': 0.711}, {'end': 540.398, 'start': 499.391, 'weight': 0.735}, {'end': 701.407, 'start': 674.582, 'weight': 0.726}, {'end': 789.669, 'start': 742.031, 'weight': 0.723}, {'end': 811.518, 'start': 789.829, 'weight': 0.728}, {'end': 872.769, 'start': 847.057, 'weight': 0.867}], 'summary': 'Discusses machine translation evaluation, introducing the BLEU score for automatic quality measurement, developed by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. It shows why naive word precision is misleading (an extreme output scores 7/7) and introduces a modified precision with clipped counts, which gives two-thirds on the bigram example. Additionally, it explains the computation of the precision values, the formation of the final BLEU score, and the brevity penalty adjustment factor, highlighting the significance of the BLEU score in machine translation and its wide usage in evaluating systems that generate text, such as image captioning.', 'chapters': [{'end': 160.937, 'segs': [{'end': 29.804, 'src': 'embed', 'start': 0.729, 'weight': 2, 'content': [{'end': 5.331, 'text': 'One of the challenges of machine translation is that, given a French sentence,', 'start': 0.729, 'duration': 4.602}, {'end': 10.813, 'text': 'there could be multiple English translations that are equally good translations of that French sentence.', 'start': 5.331, 'duration': 5.482}, {'end': 16.716, 'text': 'So how do you evaluate a machine translation system if there are multiple equally good answers?', 'start': 11.294, 'duration': 5.422}, {'end': 22.138, 'text': "Unlike, say, image recognition, where there's one right answer and you can just measure accuracy.", 'start': 17.216, 'duration': 4.922}, {'end': 25.901, 'text': 'If there are multiple great answers, how do you measure accuracy?', 'start': 22.598, 'duration': 3.303}, {'end': 29.804, 'text': 
'The way this is done conventionally is with something called the BLEU score.', 'start': 26.361, 'duration': 3.443}], 'summary': 'Evaluating machine translation with multiple good answers using the BLEU score.', 'duration': 29.075, 'max_score': 0.729, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q729.jpg'}, {'end': 106.188, 'src': 'embed', 'start': 72.774, 'weight': 0, 'content': [{'end': 75.135, 'text': 'how good is that machine translation?', 'start': 72.774, 'duration': 2.361}, {'end': 90.68, 'text': 'and the intuition is so long as the machine-generated translation is pretty close to any of the references provided by humans,', 'start': 80.715, 'duration': 9.965}, {'end': 92.641, 'text': 'then it will get a high BLEU score.', 'start': 90.68, 'duration': 1.961}, {'end': 106.188, 'text': 'BLEU, by the way, stands for, um, Bilingual Evaluation Understudy.', 'start': 97.343, 'duration': 8.845}], 'summary': 'Machine translation is evaluated by its closeness to human references; a close match gets a high BLEU score.', 'duration': 33.414, 'max_score': 72.774, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q72774.jpg'}, {'end': 160.937, 'src': 'embed', 'start': 133.189, 'weight': 1, 'content': [{'end': 139.751, 'text': 'uh could be a substitute for having humans evaluate every output of a machine translation system.', 'start': 133.189, 'duration': 6.562}, {'end': 150.125, 'text': 'So the BLEU score was due to Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.', 'start': 143.337, 'duration': 6.788}, {'end': 158.214, 'text': "Uh, this paper has been incredibly influential and it's actually quite a readable paper,", 'start': 150.826, 'duration': 7.388}, {'end': 160.937, 'text': 'so I encourage you to take a look if you have time.', 'start': 158.214, 'duration': 2.723}], 'summary': 'A paper suggests using BLEU score as a substitute for 
human evaluation in machine translation.', 'duration': 27.748, 'max_score': 133.189, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q133189.jpg'}], 'start': 0.729, 'title': 'Machine translation and BLEU score', 'summary': 'Discusses evaluating MT systems and introduces the BLEU score for automatic quality measurement, developed by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.', 'chapters': [{'end': 160.937, 'start': 0.729, 'title': 'Machine translation and BLEU score', 'summary': 'Discusses the challenge of evaluating machine translation systems when a sentence has multiple equally good translations, and introduces the BLEU score as a method to automatically compute a score that measures the quality of machine-generated translations, developed by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.', 'duration': 160.208, 'highlights': ['The BLEU score measures the quality of a machine-generated translation by automatically computing a score based on its closeness to human-generated reference translations.', 'Machine translation evaluation is challenging because a given sentence can have multiple equally good translations, unlike tasks such as image recognition, which have a single correct answer.', 'The BLEU score was developed by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, and has been influential in the field of machine translation evaluation.']}], 'duration': 160.208, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q729.jpg', 'highlights': ['The BLEU score measures the quality of a machine-generated translation by automatically computing a score based on its closeness to human-generated reference translations.', 'The BLEU score was developed by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, and has been influential in the field of machine translation evaluation.', 'Machine 
translation evaluation is challenging because a given sentence can have multiple equally good translations, unlike tasks such as image recognition, which have a single correct answer.']}, {'end': 652.221, 'segs': [{'end': 188.227, 'src': 'embed', 'start': 162.279, 'weight': 0, 'content': [{'end': 177.657, 'text': "So the intuition behind the BLEU score is we're going to look at the machine generated output and see if the types of words it generates appear in at least one of the human generated references.", 'start': 162.279, 'duration': 15.378}, {'end': 185.442, 'text': 'And so these human generated references would be provided as part of the dev set or as part of the test set.', 'start': 178.097, 'duration': 7.345}, {'end': 188.227, 'text': "Now, let's look at an extreme example.", 'start': 186.144, 'duration': 2.083}], 'summary': 'The BLEU score measures overlap between machine output and human references.', 'duration': 25.948, 'max_score': 162.279, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q162279.jpg'}, {'end': 245.408, 'src': 'embed', 'start': 217.015, 'weight': 4, 'content': [{'end': 222.017, 'text': 'And so, um, this we call the precision of the machine translation output.', 'start': 217.015, 'duration': 5.002}, {'end': 228.259, 'text': 'And in this case, there are seven words in the machine translation output.', 'start': 222.637, 'duration': 5.622}, {'end': 233.801, 'text': 'And every one of these seven words appears in either reference one or reference two.', 'start': 228.939, 'duration': 4.862}, {'end': 238.826, 'text': "So the word 'the' appears in both references.", 'start': 235.385, 'duration': 3.441}, {'end': 241.747, 'text': 'So each of these words looks like a pretty good word to include.', 'start': 238.906, 'duration': 2.841}, {'end': 245.408, 'text': 'So this will have a precision of 7 over 7.', 'start': 241.767, 'duration': 3.641}], 'summary': 'Precision of the machine translation 
output is 7 over 7.', 'duration': 28.393, 'max_score': 217.015, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q217015.jpg'}, {'end': 261.492, 'src': 'heatmap', 'start': 217.015, 'weight': 0.733, 'content': [{'end': 222.017, 'text': 'And so, um, this we call the precision of the machine translation output.', 'start': 217.015, 'duration': 5.002}, {'end': 228.259, 'text': 'And in this case, there are seven words in the machine translation output.', 'start': 222.637, 'duration': 5.622}, {'end': 233.801, 'text': 'And every one of these seven words appears in either reference one or reference two.', 'start': 228.939, 'duration': 4.862}, {'end': 238.826, 'text': "So the word 'the' appears in both references.", 'start': 235.385, 'duration': 3.441}, {'end': 241.747, 'text': 'So each of these words looks like a pretty good word to include.', 'start': 238.906, 'duration': 2.841}, {'end': 245.408, 'text': 'So this will have a precision of 7 over 7.', 'start': 241.767, 'duration': 3.641}, {'end': 247.088, 'text': 'It looks like it has a great precision.', 'start': 245.408, 'duration': 1.68}, {'end': 255.73, 'text': 'So this is why the basic precision measure of what fraction of the words in the MT output also appear in the references,', 'start': 247.568, 'duration': 8.162}, {'end': 261.492, 'text': 'this is not a particularly useful measure, because it seems to imply that this MT output has very high precision.', 'start': 255.73, 'duration': 5.762}], 'summary': 'The extreme machine translation output gets a basic precision of 7 over 7, which shows that basic precision is not a useful measure.', 'duration': 44.477, 'max_score': 217.015, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q217015.jpg'}, {'end': 292.325, 'src': 'embed', 'start': 262.872, 'weight': 3, 'content': [{'end': 275.538, 'text': "what we're going to use is a modified precision measure in which we will give each word credit only up to the 
maximum number of times it appears in, um,", 'start': 262.872, 'duration': 12.666}, {'end': 276.759, 'text': 'the reference sentences.', 'start': 275.538, 'duration': 1.221}, {'end': 280.32, 'text': "So in reference 1, the word 'the' appears twice.", 'start': 277.279, 'duration': 3.041}, {'end': 283.482, 'text': "In reference 2, the word 'the' appears just once.", 'start': 280.96, 'duration': 2.522}, {'end': 287.103, 'text': 'So 2 is bigger than 1.', 'start': 284.202, 'duration': 2.901}, {'end': 292.325, 'text': "And so we're gonna say that, um, the word 'the' gets credit up to twice.", 'start': 287.103, 'duration': 5.222}], 'summary': "Using the modified precision measure, the word 'the' gets credit up to twice, based on the reference sentences.", 'duration': 29.453, 'max_score': 262.872, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q262872.jpg'}, {'end': 376.736, 'src': 'embed', 'start': 345.728, 'weight': 5, 'content': [{'end': 350.033, 'text': "Let's define a portion of the BLEU score on bigrams.", 'start': 345.728, 'duration': 4.305}, {'end': 353.317, 'text': 'And bigrams just means pairs of words appearing next to each other.', 'start': 350.073, 'duration': 3.244}, {'end': 360.32, 'text': "So now, let's see how we could use bigrams to define the BLEU score.", 'start': 354.855, 'duration': 5.465}, {'end': 367.487, 'text': "And this would just be a portion of the final BLEU score, and we'll take unigrams or single words, as well as bigrams,", 'start': 360.741, 'duration': 6.746}, {'end': 376.736, 'text': 'which means pairs of words into account, as well as maybe even longer sequences of words, such as trigrams, which means three words appearing together.', 'start': 367.487, 'duration': 9.249}], 'summary': 'Defining the BLEU score using unigrams, bigrams, and longer n-grams such as trigrams.', 'duration': 31.008, 'max_score': 345.728, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q345728.jpg'}, {'end': 495.568, 'src': 'heatmap', 'start': 423.815, 'weight': 0.711, 'content': [{'end': 436.214, 'text': "And so let's count up how many times each of these bigrams appears. 'The cat' appears twice, 'cat the' appears once, and the others all appear just once.", 'start': 423.815, 'duration': 12.399}, {'end': 441.24, 'text': "And then finally, let's define the clipped count.", 'start': 437.839, 'duration': 3.401}, {'end': 445.181, 'text': 'So count and then subscript clip.', 'start': 441.52, 'duration': 3.661}, {'end': 446.962, 'text': 'And to define that,', 'start': 445.781, 'duration': 1.181}, {'end': 448.742, 'text': "let's take this column of numbers,", 'start': 446.962, 'duration': 1.78}, {'end': 459.785, 'text': 'but give our algorithm credit only up to the maximum number of times that that bigram appears in either reference 1 or reference 2.', 'start': 448.742, 'duration': 11.043}, {'end': 465.366, 'text': 'So "the cat" appears a maximum of once in, uh, either of the references.', 'start': 459.785, 'duration': 5.581}, {'end': 467.487, 'text': "So I'm gonna clip that count to one.", 'start': 465.606, 'duration': 1.881}, {'end': 473.668, 'text': "'Cat the', well, it doesn't appear in reference one or reference two, so I'm gonna clip that to zero.", 'start': 468.387, 'duration': 5.281}, {'end': 477.969, 'text': 'Uh, "cat on", yup, that appears once, we give it credit for once.', 'start': 473.688, 'duration': 4.281}, {'end': 482.891, 'text': '"On the" appears once, give that credit for once, and "the mat" appears once.', 'start': 478.25, 'duration': 4.641}, {'end': 484.251, 'text': 'So these are the clipped counts.', 'start': 482.951, 'duration': 1.3}, {'end': 488.052, 'text': "We're taking all the counts and clipping them, really, uh,", 'start': 484.271, 'duration': 3.781}, {'end': 495.568, 'text': 'reducing them to be no more than the number of times that bigram appears in at 
least one of the references.', 'start': 488.962, 'duration': 6.606}], 'summary': 'Bigram counts: "the cat" 2, "cat the" 1, others 1. Clipped counts: "the cat" 1, "cat the" 0, "cat on" 1, "on the" 1, "the mat" 1.', 'duration': 71.753, 'max_score': 423.815, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q423815.jpg'}, {'end': 506.077, 'src': 'embed', 'start': 473.688, 'weight': 1, 'content': [{'end': 477.969, 'text': 'Uh, "cat on", yup, that appears once, we give it credit for once.', 'start': 473.688, 'duration': 4.281}, {'end': 482.891, 'text': '"On the" appears once, give that credit for once, and "the mat" appears once.', 'start': 478.25, 'duration': 4.641}, {'end': 484.251, 'text': 'So these are the clipped counts.', 'start': 482.951, 'duration': 1.3}, {'end': 488.052, 'text': "We're taking all the counts and clipping them, really, uh,", 'start': 484.271, 'duration': 3.781}, {'end': 495.568, 'text': 'reducing them to be no more than the number of times that bigram appears in at least one of the references.', 'start': 488.962, 'duration': 6.606}, {'end': 506.077, 'text': 'And then finally, our modified bigram precision will be the sum of the clipped counts.', 'start': 499.391, 'duration': 6.686}], 'summary': 'Clipped counts for the modified bigram precision calculation.', 'duration': 32.389, 'max_score': 473.688, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q473688.jpg'}, {'end': 540.398, 'src': 'heatmap', 'start': 499.391, 'weight': 0.735, 'content': [{'end': 506.077, 'text': 'And then finally, our modified bigram precision will be the sum of the clipped counts.', 'start': 499.391, 'duration': 6.686}, {'end': 509.909, 'text': "So that's 1, 2, 3, 4,", 'start': 506.117, 'duration': 3.792}, {'end': 515.052, 'text': "divided by the total number of bigrams, that's two, three, four, five, six.", 'start': 509.909, 'duration': 5.143}, {'end': 521.674, 'text': 'So four out of six or two-thirds is the modified precision on 
bigrams.', 'start': 515.611, 'duration': 6.063}, {'end': 526.757, 'text': "So let's just formalize this a little bit further.", 'start': 524.116, 'duration': 2.641}, {'end': 540.398, 'text': 'with what we had developed on unigrams, we defined this modified precision computed on unigrams as, um, p subscript 1.', 'start': 528.692, 'duration': 11.706}], 'summary': 'Modified bigram precision: 4 out of 6 or 2/3.', 'duration': 41.007, 'max_score': 499.391, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q499391.jpg'}], 'start': 162.279, 'title': 'Translation precision measures', 'summary': 'Explores precision for the BLEU score: basic precision gives an extreme output 7/7, while the modified precision with clipped counts gives two-thirds on the bigram example.', 'chapters': [{'end': 245.408, 'start': 162.279, 'title': 'BLEU score and precision in machine translation', 'summary': 'Explains the intuition behind the BLEU score, which starts from the precision of machine translation outputs, that is, whether the generated words appear in the human references, demonstrated with an extreme example of a machine translation output that gets a precision of 7 over 7.', 'duration': 83.129, 'highlights': ['Basic precision checks what fraction of the words in the machine translation output appear in the human references; an extreme example scores 7 over 7, which shows why this measure alone is not useful.', 'Human-generated references are provided as part of the dev set or test set for evaluating machine translation outputs.', 'The example shows how basic precision is calculated by checking whether each word in the machine translation output appears in the references, resulting in a precision of 7 over 7.']}, {'end': 652.221, 'start': 245.408, 'title': 'Modified precision measure in NLP', 'summary': 'Explains the modified precision measure in natural language processing, where words and bigrams are given credit based on their maximum 
appearance in the reference sentences, resulting in a modified precision score of two-thirds for bigrams.', 'duration': 406.813, 'highlights': ['The modified bigram precision is the sum of the clipped counts divided by the total number of bigrams in the output, which gives two-thirds (4 out of 6) in the example.', 'The modified precision measure gives each word credit only up to the maximum number of times it appears in any one reference sentence, resulting in a score of two out of seven on the extreme example.', 'The chapter emphasizes the importance of considering bigrams in addition to unigrams in defining the BLEU score for evaluating machine translation output.']}], 'duration': 489.942, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q162279.jpg', 'highlights': ['Basic precision checks what fraction of the words in the machine translation output appear in the human references; an extreme example scores 7 over 7, which shows why this measure alone is not useful.', 'The modified bigram precision is the sum of the clipped counts divided by the total number of bigrams in the output, which gives two-thirds (4 out of 6) in the example.', 'Human-generated references are provided as part of the dev set or test set for evaluating machine translation outputs.', 'The modified precision measure gives each word credit only up to the maximum number of times it appears in any one reference sentence, resulting in a score of two out of seven on the extreme example.', 'The example shows how basic precision is calculated by checking whether each word in the machine translation output appears in the references, resulting in a precision of 7 over 7.', 'The chapter emphasizes the importance of considering bigrams in addition to unigrams in defining the BLEU score for evaluating machine translation output.']}, {'end': 985.345, 'segs': [{'end': 789.669, 'src': 'heatmap', 'start': 674.582, 'weight': 1, 'content': [{'end': 683.094, 'text': 'And one thing that you could pretty much convince yourself of is if the MT output is exactly the same 
as either reference 1 or reference 2,', 'start': 674.582, 'duration': 8.512}, {'end': 692.02, 'text': "then all of these values p1 and p2 and so on, they'll all be equal to 1.0.", 'start': 683.094, 'duration': 8.926}, {'end': 701.407, 'text': 'So to get a, um, precision or a modified precision of 1.0, you just have to be exactly equal to one of the references.', 'start': 692.02, 'duration': 9.387}, {'end': 709.333, 'text': "And sometimes it's possible to achieve this even if you aren't exactly the same as any of the references, but kind of combine them in a way that, uh,", 'start': 701.627, 'duration': 7.706}, {'end': 711.254, 'text': 'hopefully still results in a good translation.', 'start': 709.333, 'duration': 1.921}, {'end': 730.888, 'text': "Finally, let's put this together to form the final BLEU score.", 'start': 715.117, 'duration': 15.771}, {'end': 741.211, 'text': 'So p subscript n is the BLEU score computed on n-grams only, also the modified precision computed on n-grams only.', 'start': 731.728, 'duration': 9.483}, {'end': 754.584, 'text': 'And by convention, to compute one number, you compute p1, p2, p3, and p4, and combine them together using the following formula.', 'start': 742.031, 'duration': 12.553}, {'end': 762.95, 'text': "Um, it's gonna be the average, so sum from n equals 1 to 4 of pn, and divide that by 4.", 'start': 754.944, 'duration': 8.006}, {'end': 764.371, 'text': 'So basically taking the average.', 'start': 762.95, 'duration': 1.421}, {'end': 771.235, 'text': 'Um, by convention, the BLEU score is defined as e to the this, and', 'start': 765.051, 'duration': 6.184}, {'end': 776.82, 'text': 'exponentiation is a strictly monotonically, um, increasing operation.', 'start': 772.196, 'duration': 4.624}, {'end': 789.669, 'text': 'And then we actually adjust this with one more factor called the, uh, BP penalty.', 'start': 777.36, 'duration': 12.309}], 'summary': 'Achieving a precision of 
1.0 by matching or combining references, and forming the final BLEU score.', 'duration': 80.002, 'max_score': 674.582, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q674582.jpg'}, {'end': 811.518, 'src': 'heatmap', 'start': 789.829, 'weight': 0.728, 'content': [{'end': 801.358, 'text': 'So BP stands for, uh, brevity penalty.', 'start': 789.829, 'duration': 11.529}, {'end': 811.518, 'text': "the details maybe aren't super important, but to just give you a sense, it turns out that if you output very short translations,", 'start': 803.595, 'duration': 7.923}], 'summary': 'BP stands for brevity penalty in translation evaluations.', 'duration': 21.689, 'max_score': 789.829, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q789829.jpg'}, {'end': 872.769, 'src': 'heatmap', 'start': 847.057, 'weight': 0.867, 'content': [{'end': 854.922, 'text': "And otherwise, it's some formula like that, that, um, overall penalizes shorter translations.", 'start': 847.057, 'duration': 7.865}, {'end': 861.907, 'text': 'So the details you can find in this paper.', 'start': 859.545, 'duration': 2.362}, {'end': 865.825, 'text': 'So once again,', 'start': 864.784, 'duration': 1.041}, {'end': 872.769, 'text': 'earlier in this set of courses, you saw the importance of having a single real number evaluation metric,', 'start': 865.825, 'duration': 6.944}], 'summary': 'A formula penalizes shorter translations; details are in the paper.', 'duration': 25.712, 'max_score': 847.057, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q847057.jpg'}, {'end': 917.038, 'src': 'embed', 'start': 877.772, 'weight': 0, 'content': [{'end': 880.354, 'text': 'try to stick with the one that achieved the highest score.', 'start': 877.772, 'duration': 2.582}, {'end': 888.239, 'text': 'So the reason the BLEU score was revolutionary for machine translation was because this gave a 
pretty good, by no means perfect,', 'start': 880.794, 'duration': 7.445}, {'end': 893.282, 'text': 'but pretty good single real number evaluation metric, and so that accelerated the progress', 'start': 888.239, 'duration': 5.043}, {'end': 896.084, 'text': 'of the entire field of machine translation.', 'start': 893.682, 'duration': 2.402}, {'end': 899.866, 'text': 'I hope this video gave you a sense of how the BLEU score works.', 'start': 896.684, 'duration': 3.182}, {'end': 904.229, 'text': 'In practice, few people would implement the BLEU score from scratch.', 'start': 900.327, 'duration': 3.902}, {'end': 909.333, 'text': 'There are open source implementations you can download and just use to evaluate your own system.', 'start': 904.269, 'duration': 5.064}, {'end': 917.038, 'text': 'But today the BLEU score is used to evaluate many systems that generate text, such as machine translation systems,', 'start': 909.733, 'duration': 7.305}], 'summary': 'The BLEU score revolutionized machine translation, accelerating progress in the field by providing a good single-number evaluation metric, and is used to assess various text-generating systems.', 'duration': 39.266, 'max_score': 877.772, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q877772.jpg'}, {'end': 981.021, 'src': 'embed', 'start': 956.84, 'weight': 5, 'content': [{'end': 964.826, 'text': "one ground truth and you just use other measures to see if you've got a speech transcription pretty much exactly word for word correct.", 'start': 956.84, 'duration': 7.986}, {'end': 969.911, 'text': 'But for things like image captioning, multiple captions for a picture could be about equally good.', 'start': 965.226, 'duration': 4.685}, {'end': 973.934, 'text': 'Or for machine translation, there are multiple translations about equally good.', 'start': 970.051, 'duration': 3.883}, {'end': 981.021, 'text': 'The BLEU score gives you a way to evaluate that automatically and therefore speed up your 
algorithm development.', 'start': 974.315, 'duration': 6.706}], 'summary': 'The BLEU score evaluates outputs with multiple equally good answers automatically, speeding up development.', 'duration': 24.181, 'max_score': 956.84, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q956840.jpg'}], 'start': 652.221, 'title': 'Machine translation evaluation', 'summary': 'Discusses the computation of precision values, formation of the final BLEU score, and the brevity penalty adjustment factor in machine translation evaluation. It also explains the significance of the BLEU score in machine translation, highlighting its ability to provide a single real number evaluation metric and accelerate the progress of the field, and its wide usage in evaluating systems that generate text, such as image captioning.', 'chapters': [{'end': 877.772, 'start': 652.221, 'title': 'Evaluating machine translation quality', 'summary': 'Discusses the computation of precision values, formation of the final BLEU score, and the brevity penalty adjustment factor in machine translation evaluation.', 'duration': 225.551, 'highlights': ['The brevity penalty (BP) adjustment factor penalizes translation systems whose outputs are too short compared with the human-generated reference translations.', 'The final BLEU score is formed by computing modified precision values for n-grams (p1 through p4) and combining them using a specific formula, ultimately providing a single-number evaluation metric for machine translation quality.', 'The precision values measure the degree to which the machine translation output overlaps with the references; an output exactly equal to one of the references gets a modified precision of 1.0.']}, {'end': 985.345, 'start': 877.772, 'title': 'Understanding BLEU score evaluation', 'summary': 'Explains the significance of the BLEU score in machine translation, highlighting its ability to 
provide a single real number evaluation metric and accelerate the progress of the field, and its wide usage in evaluating systems that generate text, such as image captioning.', 'duration': 107.573, 'highlights': ['The BLEU score provides a single real number evaluation metric that accelerated the progress of the entire field of machine translation.', 'The BLEU score is used to evaluate many systems that generate text, such as machine translation systems and image captioning systems, by comparing the generated text with human-written references.', 'It is not used for speech recognition, which usually has one ground truth, but is useful when multiple outputs are about equally good, as in machine translation and image captioning systems, providing a way to automatically evaluate the quality of the generated text.']}], 'duration': 333.124, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/DejHQYAGb7Q/pics/DejHQYAGb7Q652221.jpg', 'highlights': ['The final BLEU score is formed by computing modified precision values for n-grams (p1 through p4) and combining them using a specific formula, ultimately providing a single-number evaluation metric for machine translation quality.', 'The BLEU score provides a single real number evaluation metric that accelerated the progress of the entire field of machine translation.', 'The precision values measure the degree to which the machine translation output overlaps with the references; an output exactly equal to one of the references gets a modified precision of 1.0.', 'The brevity penalty (BP) adjustment factor penalizes translation systems whose outputs are too short compared with the human-generated reference translations.', 'The BLEU score is used to evaluate many systems that generate text, such as machine translation systems and image captioning systems, by comparing the generated text with human-written references.', 'It is not used for speech recognition, which usually has one ground truth, but is useful when multiple outputs are about equally good, as in machine translation and 
image captioning systems, providing a way to automatically evaluate the quality of the generated text.']}], 'highlights': ['The BLEU score measures the quality of a machine-generated translation by automatically computing a score based on its closeness to human-generated reference translations.', 'The BLEU score was developed by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, and has been influential in the field of machine translation evaluation.', 'The final BLEU score is formed by computing modified precision values for n-grams (p1 through p4) and combining them using a specific formula, ultimately providing a single-number evaluation metric for machine translation quality.', 'The BLEU score provides a single real number evaluation metric that accelerated the progress of the entire field of machine translation.', 'The precision values measure the degree to which the machine translation output overlaps with the references; an output exactly equal to one of the references gets a modified precision of 1.0.', 'The brevity penalty (BP) adjustment factor penalizes translation systems whose outputs are too short compared with the human-generated reference translations.', 'Machine translation evaluation is challenging because a given sentence can have multiple equally good translations, unlike tasks such as image recognition, which have a single correct answer.', 'Basic precision checks what fraction of the words in the machine translation output appear in the human references; an extreme example scores 7 over 7, which shows why this measure alone is not useful.', 'The modified bigram precision is the sum of the clipped counts divided by the total number of bigrams in the output, which gives two-thirds (4 out of 6) in the example.', 'Human-generated references are provided as part of the dev set or test set for evaluating machine translation outputs.', 'The modified precision measure gives each 
word credit only up to the maximum number of times it appears in any one reference sentence, resulting in a score of two out of seven on the extreme example.', 'The example shows how basic precision is calculated by checking whether each word in the machine translation output appears in the references, resulting in a precision of 7 over 7.', 'The chapter emphasizes the importance of considering bigrams in addition to unigrams in defining the BLEU score for evaluating machine translation output.', 'The BLEU score is used to evaluate many systems that generate text, such as machine translation systems and image captioning systems, by comparing the generated text with human-written references.', 'It is not used for speech recognition, which usually has one ground truth, but is useful when multiple outputs are about equally good, as in machine translation and image captioning systems, providing a way to automatically evaluate the quality of the generated text.']}
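
The clipped-count precision, the combination of p1 through p4, and the brevity penalty described in the transcript can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation, and it rests on several assumptions that are not in the transcript: the function names are hypothetical; the two reference sentences are chosen only so that "the" appears twice in the first and once in the second, matching the counts quoted in the video; the brevity penalty uses the form from the original BLEU paper (1 if the output is longer than the closest reference length r, otherwise exp(1 - r/c)), which the transcript only alludes to; and the n-gram precisions are combined by averaging their logarithms, as in the paper, whereas the spoken description averages the p_n values directly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the highest count found in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate count to that maximum, then divide by the total.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """BP as in the original paper (assumed here): penalize short outputs."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to the candidate length (ties: the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1.0 - r / c)


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: BP times the geometric mean of p_1..p_max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Assumed reference sentences, for illustration only.
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, 1))  # 2/7
print(modified_precision("the cat the cat on the mat".split(), refs, 2))   # 4/6
print(bleu(refs[0], refs))  # an output identical to a reference
```

With these assumed references, the first call reproduces the two-out-of-seven unigram score and the second the four-out-of-six bigram score from the video. As the transcript notes, in practice one would normally use an existing open source implementation rather than code like this.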