Table 35 includes DA results for English-German APE systems and Table 36 shows results for German-English. Clusters are identified by grouping systems together such that every system in a cluster significantly outperforms all systems in lower-ranking clusters, according to the Wilcoxon rank-sum test.
 #   Ave %   Ave z    System
 −   84.8    0.520    HUMAN POST EDIT
 ────────────────────────────────────
 1   78.2    0.261    AMU
     77.9    0.261    FBK
     76.8    0.221    DCU
 ────────────────────────────────────
 4   73.8    0.115    JXNU
 ────────────────────────────────────
 5   71.9    0.038    USAAR
     71.1    0.014    CUNI
     70.2   −0.020    LIG
 ────────────────────────────────────
 −   68.6   −0.083    NO POST EDIT
Table 35: EN-DE DA human evaluation results showing average raw DA scores (Ave %) and average standardized scores (Ave z); lines between systems indicate clusters according to the Wilcoxon rank-sum test at p ≤ 0.05.
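The clustering procedure described above can be sketched in a few lines. This is a rough illustration on synthetic data, not the actual WMT per-segment DA scores; the system names, means, and sample sizes are invented for the example:

```python
# Sketch of cluster separation via the Wilcoxon rank-sum test, on
# synthetic data (system names and score distributions are illustrative).
from itertools import combinations

import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)

# Hypothetical standardized (z) per-segment scores for three systems.
# In a DA evaluation, each annotator's raw 0-100 scores are first
# standardized: z = (score - annotator_mean) / annotator_stddev,
# which is where the "Ave z" column comes from.
scores = {
    "SysA": rng.normal(0.26, 1.0, 500),
    "SysB": rng.normal(0.25, 1.0, 500),
    "SysC": rng.normal(-0.10, 1.0, 500),
}

# Two systems fall into different clusters when the higher-ranked one
# significantly outperforms the lower-ranked one at p <= 0.05.
for a, b in combinations(scores, 2):
    _, p = ranksums(scores[a], scores[b])
    verdict = "different clusters" if p <= 0.05 else "same cluster"
    print(f"{a} vs {b}: p = {p:.3f} -> {verdict}")
```

With scores like these, the two systems with nearly identical means land in the same cluster while the clearly weaker one is separated, mirroring how AMU, FBK, and DCU share cluster 1 in the table despite different raw averages.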
This seems to indicate that human translation is still better than machine translation, but of course it doesn't guarantee that there isn't a better translation program somewhere from before 31 Dec 2015 that simply wasn't entered at the conference.
Still, if human-level translation had existed in 2015, you would not expect to read: "This steady improvement has been mainly driven by the massive migration to the neural approach, which in 2016 allowed the winning system to achieve impressive …"
I don't believe there is a program that can justifiably claim "equal or better average quality, as professional human translations", but proving a negative is difficult. I suggest that if such a program existed it would be big news and not difficult to find, and the conference findings would be markedly different to those above.
I'm not sure how much more a judge might want before deciding how to judge the claim. Are there any more authoritative events, or other events, before the claim deadline of 31 Dec 2017? (Note the program has to exist by 31 Dec 2015 and translations have to "be of comparable cost and turnaround time".)
The comparable cost and turnaround time requirement seems to me to indicate that
secret research would not qualify.