In translation quality assessment, GEMBA (GPT Estimation Metric Based Assessment) uses a large language model to evaluate translations. The findings are promising, not just for translation assessment but potentially for other kinds of evaluation as well.
Introduction
The study behind GEMBA investigates whether large language models (LLMs) can assess translation quality effectively. The same question extends beyond translation and offers insights into designing other assessments with generative AI.
GEMBA operates both with and without a reference translation. Using zero-shot prompting, the study compares four distinct prompt variants in two modes, depending on whether a reference translation is available. Measured against the results of the WMT22 Metrics shared task, GEMBA achieves state-of-the-art system-level accuracy in assessing translations from English into German, English into Russian, and Chinese into English.
To use GEMBA for assessing translations, the following inputs are needed (reference translations are optional):
- Source Language.
- Target Language.
- Source Segments.
- Candidate Translations.
- Optional Reference Translations.
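The inputs listed above can be grouped into a small record. The following is a minimal sketch only: the class name `TranslationSample` and its helper property are hypothetical and not part of GEMBA, while the field names mirror the placeholders used in the prompts further down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranslationSample:
    """One segment to assess (hypothetical container, not part of GEMBA)."""

    source_lang: str   # e.g. "English"
    target_lang: str   # e.g. "German"
    source_seg: str    # the source segment
    target_seg: str    # the candidate translation
    reference_seg: Optional[str] = None  # leave as None for reference-free mode

    @property
    def has_reference(self) -> bool:
        # Decides which of the two prompt modes applies to this sample.
        return self.reference_seg is not None


sample = TranslationSample("English", "German", "Good morning.", "Guten Morgen.")
print(sample.has_reference)
```

Keeping the reference optional in the record lets one code path serve both modes.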
GEMBA offers four prompt variants, grouped by task type:
- For scoring tasks: GEMBA-DA and GEMBA-SQM
- For classification tasks: GEMBA-Stars and GEMBA-Classes
Scoring Tasks
GEMBA-DA: Direct Assessment
Output scores range from 0 to 100.
Accuracy with GPT-4:
- With human references: 89.8%
- Without human references: 87.6%
Prompt for assessment with human references:
Score the following translation from {source_lang} to {target_lang} with respect to the human reference on a continuous scale from 0 to 100, where a score of zero means "no meaning preserved" and score of one hundred means "perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} human reference: {reference_seg}
{target_lang} translation: "{target_seg}"
Score:
Prompt for assessment without human references:
Score the following translation from {source_lang} to {target_lang} on a continuous scale from 0 to 100, where a score of zero means "no meaning preserved" and score of one hundred means "perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} translation: "{target_seg}"
Score:
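The two GEMBA-DA prompts differ only in the reference line and one clause of the instruction, so they can be filled from templates. This is a sketch under stated assumptions: the function name `build_da_prompt` is hypothetical, and the exact whitespace between the instruction and the fields (a blank line here) is a guess. Sending the prompt to a model is left out.

```python
from typing import Optional

# Template text follows the prompts quoted above; line breaks are assumed.
DA_WITH_REF = (
    "Score the following translation from {source_lang} to {target_lang} "
    "with respect to the human reference on a continuous scale from 0 to 100, "
    'where a score of zero means "no meaning preserved" and score of one '
    'hundred means "perfect meaning and grammar".\n\n'
    '{source_lang} source: "{source_seg}"\n'
    "{target_lang} human reference: {reference_seg}\n"
    '{target_lang} translation: "{target_seg}"\n'
    "Score:"
)

DA_NO_REF = (
    "Score the following translation from {source_lang} to {target_lang} "
    "on a continuous scale from 0 to 100, "
    'where a score of zero means "no meaning preserved" and score of one '
    'hundred means "perfect meaning and grammar".\n\n'
    '{source_lang} source: "{source_seg}"\n'
    '{target_lang} translation: "{target_seg}"\n'
    "Score:"
)


def build_da_prompt(
    source_lang: str,
    target_lang: str,
    source_seg: str,
    target_seg: str,
    reference_seg: Optional[str] = None,
) -> str:
    """Fill the GEMBA-DA template, picking the mode by reference availability."""
    template = DA_NO_REF if reference_seg is None else DA_WITH_REF
    # str.format ignores keyword arguments the chosen template does not use.
    return template.format(
        source_lang=source_lang,
        target_lang=target_lang,
        source_seg=source_seg,
        target_seg=target_seg,
        reference_seg=reference_seg,
    )


print(build_da_prompt("English", "German", "Good morning.", "Guten Morgen."))
```

The same pattern would apply to the other three variants by swapping in their template text.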
GEMBA-SQM: Scalar Quality Metrics
Output scores range from 0 to 100.
Accuracy with GPT-4:
- With human references: 88.7%
- Without human references: 89.1%
Prompt for assessment with human references:
Score the following translation from {source_lang} to {target_lang} with respect to the human reference on a continuous scale from 0 to 100 that starts with "No meaning preserved", goes through "Some meaning preserved", then "Most meaning preserved and few grammar mistakes", up to "Perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} human reference: "{reference_seg}"
{target_lang} translation: "{target_seg}"
Score (0-100):
Prompt for assessment without human references:
Score the following translation from {source_lang} to {target_lang} on a continuous scale from 0 to 100 that starts with "No meaning preserved", goes through "Some meaning preserved", then "Most meaning preserved and few grammar mistakes", up to "Perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} translation: "{target_seg}"
Score (0-100):
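Both scoring variants expect a number between 0 and 100, but a model's reply is free text. Below is a minimal sketch of extracting that score. The function name `parse_score` is hypothetical, and returning `None` for unusable answers is an assumption rather than a description of the study's own handling.

```python
import re
from typing import Optional

# First signed integer or decimal number in the reply.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_score(answer: str) -> Optional[float]:
    """Return the first number in the answer if it lies in [0, 100], else None."""
    match = _NUMBER.search(answer)
    if match is None:
        return None  # no number at all, e.g. a refusal or an explanation
    value = float(match.group())
    # Out-of-range values are treated as invalid rather than clipped.
    return value if 0.0 <= value <= 100.0 else None


for reply in ["85", "Score: 92.5", "150", "I cannot rate this."]:
    print(repr(reply), "->", parse_score(reply))
```

Rejecting out-of-range values keeps invalid answers visible, so they can be counted or retried instead of silently skewing averages.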
Classification Tasks
GEMBA-Stars
Output scores range from 1 to 5. Special care is needed for answers that are not plain numbers, such as “Three stars”, “****”, or “1 star”.
Accuracy with GPT-4:
- With human references: 91.2%
- Without human references: 89.1%
Prompt for assessment with human references:
Score the following translation from {source_lang} to {target_lang} with respect to the human reference with one to five stars. Where one star means "Nonsense/No meaning preserved", two stars mean "Some meaning preserved, but not understandable", three stars mean "Some meaning preserved and understandable", four stars mean "Most meaning preserved with possibly few grammar mistakes", and five stars mean "Perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} human reference: "{reference_seg}"
{target_lang} translation: "{target_seg}"
Stars:
Prompt for assessment without human references:
Score the following translation from {source_lang} to {target_lang} with one to five stars. Where one star means "Nonsense/No meaning preserved", two stars mean "Some meaning preserved, but not understandable", three stars mean "Some meaning preserved and understandable", four stars mean "Most meaning preserved with possibly few grammar mistakes", and five stars mean "Perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} translation: "{target_seg}"
Stars:
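The GEMBA-Stars answers mentioned earlier ("Three stars", "****", "1 star") need to be normalised to an integer from 1 to 5. The following is one possible sketch, not the study's own parser: the function name `parse_stars` and the order in which the three cases are tried are assumptions.

```python
import re
from typing import Optional

_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def parse_stars(answer: str) -> Optional[int]:
    """Map a free-text star rating to an integer 1..5, or None if unparseable."""
    text = answer.strip().lower()

    # Case 1: a standalone digit, as in "1 star" or "4".
    digit = re.search(r"\b([1-5])\b", text)
    if digit:
        return int(digit.group(1))

    # Case 2: a spelled-out number, as in "Three stars".
    for word, value in _WORDS.items():
        if re.search(rf"\b{word}\b", text):
            return value

    # Case 3: a run of star symbols, as in "****".
    symbols = re.search(r"[*★]+", text)
    if symbols and 1 <= len(symbols.group()) <= 5:
        return len(symbols.group())

    return None  # anything else is treated as an invalid answer


for reply in ["Three stars", "****", "1 star", "excellent"]:
    print(repr(reply), "->", parse_stars(reply))
```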
GEMBA-Classes
The output is one of five labels: “No meaning preserved”, “Some meaning preserved, but not understandable”, “Some meaning preserved and understandable”, “Most meaning preserved, minor issues”, or “Perfect translation”.
Accuracy with GPT-4:
- With human references: 89.1%
- Without human references: 91.2%
Prompt for assessment with human references:
Classify the quality of translation from {source_lang} to {target_lang} with respect to the human reference into one of following classes: "No meaning preserved", "Some meaning preserved, but not understandable", "Some meaning preserved and understandable", "Most meaning preserved, minor issues", "Perfect translation".
{source_lang} source: "{source_seg}"
{target_lang} human reference: "{reference_seg}"
{target_lang} translation: "{target_seg}"
Class:
Prompt for assessment without human references:
Classify the quality of translation from {source_lang} to {target_lang} into one of following classes: "No meaning preserved", "Some meaning preserved, but not understandable", "Some meaning preserved and understandable", "Most meaning preserved, minor issues", "Perfect translation".
{source_lang} source: "{source_seg}"
{target_lang} translation: "{target_seg}"
Class:
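To compare or average GEMBA-Classes outputs, the five labels can be mapped to an ordinal rank. This sketch assumes a 0 to 4 numbering, which is an illustrative choice and not something stated in the source. The function name `parse_class` is hypothetical as well.

```python
from typing import Optional

# Labels in ascending order of quality, as listed in the prompt.
CLASS_LABELS = [
    "No meaning preserved",
    "Some meaning preserved, but not understandable",
    "Some meaning preserved and understandable",
    "Most meaning preserved, minor issues",
    "Perfect translation",
]


def parse_class(answer: str) -> Optional[int]:
    """Return the 0-based rank of the label in the answer, or None if unknown."""
    # Drop surrounding whitespace, quotes and a trailing period before comparing.
    text = answer.strip().strip("\"'“”. ").lower()
    for rank, label in enumerate(CLASS_LABELS):
        if text == label.lower():
            return rank
    return None  # the model answered with something outside the label set


print(parse_class('"Perfect translation"'))
print(parse_class("Some meaning preserved and understandable."))
```

Exact matching after light normalisation avoids confusing the two "Some meaning preserved" labels, which differ only in their endings.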
Conclusion
Protocol | Task type | Accuracy, % |
---|---|---|
GEMBA-DA | Scoring | 89.8 |
GEMBA-DA[noref] | Scoring | 87.6 |
GEMBA-SQM | Scoring | 88.7 |
GEMBA-SQM[noref] | Scoring | 89.1 |
GEMBA-Stars | Classification | 91.2 |
GEMBA-Stars[noref] | Classification | 89.1 |
GEMBA-Classes | Classification | 89.1 |
GEMBA-Classes[noref] | Classification | 91.2 |
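The accuracy column can be read as agreement between metric and human rankings of translation systems. The sketch below assumes the figures are system-level pairwise accuracy, meaning the share of system pairs that the metric orders the same way as human judgement. The function name and the toy numbers are illustrative only and are not results from the study.

```python
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric: Dict[str, float], human: Dict[str, float]) -> float:
    """Share of system pairs ranked in the same order by metric and human scores."""
    systems = sorted(metric.keys() & human.keys())
    pairs = list(combinations(systems, 2))
    if not pairs:
        raise ValueError("need at least two systems scored by both sides")
    # A pair agrees when both score differences have the same sign;
    # ties on either side are counted as disagreement in this sketch.
    agree = sum(
        1
        for a, b in pairs
        if (metric[a] - metric[b]) * (human[a] - human[b]) > 0
    )
    return agree / len(pairs)


# Toy system-level scores (made up for illustration).
metric_scores = {"A": 80.0, "B": 70.0, "C": 60.0}
human_scores = {"A": 0.9, "B": 0.5, "C": 0.7}
print(pairwise_accuracy(metric_scores, human_scores))
```

In the toy data the metric and the human scores agree on two of the three pairs (A over B, A over C) and disagree on B versus C.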
GEMBA shows that LLMs can handle a practical evaluation task: with GPT-4, every variant in the table above reaches between 87.6% and 91.2% accuracy, with or without a reference translation. As an adaptable method for translation quality assessment, it opens new avenues for language service applications.
Let’s keep an eye on WMT23 outcomes to discover the full potential of modern LLMs across various tasks.