In the realm of translation quality assessment, GEMBA (GPT Estimation
Metric Based Assessment) uses large language models (LLMs) to evaluate translation quality.
The findings are promising, not just for translation assessments but for other
evaluations as well.
Introduction
The study behind GEMBA primarily investigates whether LLMs can
effectively assess translation quality. The inquiry, however, extends
beyond translation: it offers insights into how to design all kinds of
assessments with generative AI.
GEMBA operates both with and without a reference translation. Using
zero-shot prompting, the study compares four distinct prompt variants in
two modes, depending on whether a reference translation is available. The
method was evaluated against the results of the WMT22 Metrics shared task,
where GEMBA achieved state-of-the-art accuracy in assessing translations
from English into German, English into Russian, and Chinese into English.
To use GEMBA for assessing translations, certain parameters are required:
Source Language.
Target Language.
Source Segments.
Candidate Translations.
Optional Reference Translations.
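As a rough sketch (the names below are mine, not from the paper), these inputs can be bundled into a single request object, with the optional references switching between the two modes:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GembaRequest:
    """Inputs for a GEMBA-style assessment (illustrative structure only)."""
    source_lang: str                    # e.g. "English"
    target_lang: str                    # e.g. "German"
    source_segments: List[str]          # segments in the source language
    candidate_translations: List[str]   # MT output, one per source segment
    # Supplying references enables the reference-based mode.
    reference_translations: Optional[List[str]] = None

req = GembaRequest(
    source_lang="English",
    target_lang="German",
    source_segments=["The weather is nice today."],
    candidate_translations=["Das Wetter ist heute schön."],
)
```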
GEMBA can be used for different assessment needs:
For scoring tasks: GEMBA-DA and GEMBA-SQM
For classification tasks: GEMBA-stars and GEMBA-classes
Scoring Tasks
GEMBA-DA: Direct Assessment
Output scores range from 0 to 100.
Accuracy with GPT-4:
With human references: 89.8%
Without human references: 87.6%
Prompt for assessment with human references:
Prompt for assessment without human references:
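The exact prompt wording is given in the paper and its accompanying repository; the function below is my paraphrase of a direct-assessment prompt, not the verbatim GEMBA-DA text. Passing a reference switches to the reference-based mode:

```python
# Illustrative paraphrase of a direct-assessment prompt; not the exact
# wording from the GEMBA paper.
def build_da_prompt(src_lang, tgt_lang, source, candidate, reference=None):
    """Build a GEMBA-DA style prompt asking for a score from 0 to 100."""
    ref_clause = "with respect to the human reference " if reference else ""
    ref_line = f'{tgt_lang} human reference: "{reference}"\n' if reference else ""
    return (
        f"Score the following translation from {src_lang} to {tgt_lang} "
        f"{ref_clause}on a continuous scale from 0 to 100, where a score of "
        f'zero means "no meaning preserved" and a score of one hundred means '
        f'"perfect meaning and grammar".\n\n'
        f'{src_lang} source: "{source}"\n'
        f"{ref_line}"
        f'{tgt_lang} translation: "{candidate}"\n'
        f"Score: "
    )
```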
GEMBA-SQM: Scalar Quality Metrics
Output scores range from 0 to 100.
Accuracy with GPT-4:
With human references: 88.7%
Without human references: 89.1%
Prompt for assessment with human references:
Prompt for assessment without human references:
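The SQM variant differs mainly in how the 0 to 100 scale is described: instead of only naming the endpoints, it anchors the scale with quality labels. Again, this is my paraphrase rather than the paper's exact wording:

```python
# Illustrative paraphrase of an SQM-style prompt; not the exact
# wording from the GEMBA paper.
def build_sqm_prompt(src_lang, tgt_lang, source, candidate, reference=None):
    """Build a GEMBA-SQM style prompt: 0-100 anchored by quality labels."""
    ref_clause = "with respect to the human reference " if reference else ""
    ref_line = f'{tgt_lang} human reference: "{reference}"\n' if reference else ""
    return (
        f"Score the following translation from {src_lang} to {tgt_lang} "
        f"{ref_clause}on a continuous scale from 0 to 100 that starts with "
        f'"No meaning preserved", goes through "Some meaning preserved", '
        f'then "Most meaning preserved and few grammar mistakes", up to '
        f'"Perfect meaning and grammar".\n\n'
        f'{src_lang} source: "{source}"\n'
        f"{ref_line}"
        f'{tgt_lang} translation: "{candidate}"\n'
        f"Score: "
    )
```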
Classification Tasks
GEMBA-Stars
Output scores range from 1 to 5. Special care is taken with
non-numerical answers such as “Three stars”, “****”, or “1 star”.
Accuracy with GPT-4:
With human references: 91.2%
Without human references: 89.1%
Prompt for assessment with human references:
Prompt for assessment without human references:
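Normalising those free-form answers back into a 1-5 rating might look something like the sketch below; the paper describes the need for this handling but does not prescribe an implementation:

```python
import re
from typing import Optional

# Spelled-out ratings the model might produce instead of digits.
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def parse_stars(answer: str) -> Optional[int]:
    """Extract a 1-5 star rating from a free-form model answer.

    Handles answers like "****", "1 star", or "Three stars".
    Returns None when no rating can be recovered.
    """
    text = answer.strip().lower()
    # A bare run of asterisks: "****" -> 4
    if re.fullmatch(r"\*{1,5}", text):
        return text.count("*")
    # A digit somewhere in the answer: "1 star" -> 1
    m = re.search(r"\b([1-5])\b", text)
    if m:
        return int(m.group(1))
    # A spelled-out number: "three stars" -> 3
    for word, value in WORDS.items():
        if re.search(rf"\b{word}\b", text):
            return value
    return None
```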
GEMBA-Classes
The output is one of five labels: “No meaning preserved”, “Some meaning
preserved, but not understandable”, “Some meaning preserved and
understandable”, “Most meaning preserved, minor issues”, or “Perfect
translation”.
Accuracy with GPT-4:
With human references: 89.1%
Without human references: 91.2%
Prompt for assessment with human references:
Prompt for assessment without human references:
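Since the model answers in free text, its output still has to be mapped onto one of the five labels. One possible approach (my sketch, not prescribed by the paper) is fuzzy matching against the label set:

```python
import difflib
from typing import Optional

# The five GEMBA-Classes labels, ordered from worst to best.
LABELS = [
    "No meaning preserved",
    "Some meaning preserved, but not understandable",
    "Some meaning preserved and understandable",
    "Most meaning preserved, minor issues",
    "Perfect translation",
]

def match_label(answer: str) -> Optional[str]:
    """Map a free-form model answer onto the closest GEMBA-Classes label.

    Uses difflib similarity; returns None when nothing is close enough.
    """
    candidates = difflib.get_close_matches(answer.strip(), LABELS,
                                           n=1, cutoff=0.6)
    return candidates[0] if candidates else None
```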
Conclusion
Protocol               Task type        Accuracy, %
GEMBA-DA               Scoring          89.8
GEMBA-DA[noref]        Scoring          87.6
GEMBA-SQM              Scoring          88.7
GEMBA-SQM[noref]       Scoring          89.1
GEMBA-Stars            Classification   91.2
GEMBA-Stars[noref]     Classification   89.1
GEMBA-Classes          Classification   89.1
GEMBA-Classes[noref]   Classification   91.2
GEMBA stands as a testament to the growing capabilities of LLMs in practical
applications. By offering an adaptable method for translation quality
assessment, it opens new avenues for language service providers.
Let’s keep an eye on the WMT23 outcomes to discover the full potential of
modern LLMs across various tasks.