Correction of errors in preference ratings from automated metrics for text generation

Deriu, Jan; von Däniken, Pius; Tuggener, Don; Cieliebak, Mark

doi:10.18653/v1/2023.findings-acl.404

Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-29048

Full metadata record

DC Field	Value	Language
dc.contributor.author	Deriu, Jan	-
dc.contributor.author	von Däniken, Pius	-
dc.contributor.author	Tuggener, Don	-
dc.contributor.author	Cieliebak, Mark	-
dc.date.accessioned	2023-11-10T18:02:13Z	-
dc.date.available	2023-11-10T18:02:13Z	-
dc.date.issued	2023	-
dc.identifier.uri	https://digitalcollection.zhaw.ch/handle/11475/29048	-
dc.description.abstract	A major challenge in the field of Text Generation is evaluation: Human evaluations are cost-intensive, and automated metrics often display considerable disagreements with human judgments. In this paper, we propose to apply automated metrics for Text Generation in a preference-based evaluation protocol. The protocol features a statistical model that incorporates various levels of uncertainty to account for the error-proneness of the metrics. We show that existing metrics are generally over-confident in assigning significant differences between systems. As a remedy, the model allows combining human ratings with automated ratings. We show that it can reduce the amount of human ratings required to arrive at robust and statistically significant results by more than 50%, while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of the evaluation protocol for three text generation tasks: dialogue systems, machine translation, and text summarization.	de_CH
dc.language.iso	en	de_CH
dc.publisher	Association for Computational Linguistics	de_CH
dc.rights	http://creativecommons.org/licenses/by/4.0/	de_CH
dc.subject	Preference rating	de_CH
dc.subject	Automated metrics	de_CH
dc.subject	Machine translation	de_CH
dc.subject	Text generation	de_CH
dc.subject	Bayesian	de_CH
dc.subject	Error correction	de_CH
dc.subject.ddc	410.285: Computational linguistics	de_CH
dc.title	Correction of errors in preference ratings from automated metrics for text generation	de_CH
dc.type	Conference: Paper	de_CH
dcterms.type	Text	de_CH
zhaw.departement	School of Engineering	de_CH
zhaw.organisationalunit	Centre for Artificial Intelligence (CAI)	de_CH
dc.identifier.doi	10.18653/v1/2023.findings-acl.404	de_CH
dc.identifier.doi	10.21256/zhaw-29048	-
zhaw.conference.details	61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, 9-14 July 2023	de_CH
zhaw.funding.eu	No	de_CH
zhaw.originated.zhaw	Yes	de_CH
zhaw.pages.end	6474	de_CH
zhaw.pages.start	6456	de_CH
zhaw.parentwork.editor	Rogers, Anna	-
zhaw.parentwork.editor	Boyd-Graber, Jordan	-
zhaw.parentwork.editor	Okazaki, Naoaki	-
zhaw.publication.status	publishedVersion	de_CH
zhaw.publication.review	Peer review (publication)	de_CH
zhaw.title.proceedings	Findings of the Association for Computational Linguistics: ACL 2023	de_CH
zhaw.webfeed	Natural Language Processing	de_CH
zhaw.author.additional	No	de_CH
zhaw.display.portrait	Yes	de_CH
Appears in collections:	Publications School of Engineering

Files in This Item:

File	Description	Size	Format
2023_Deriu-etal_Correction-of-errors-in-preference-ratings.pdf		623.63 kB	Adobe PDF

Deriu, J., von Däniken, P., Tuggener, D., & Cieliebak, M. (2023). Correction of errors in preference ratings from automated metrics for text generation [Conference paper]. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023 (pp. 6456–6474). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.404

Deriu, J. et al. (2023) ‘Correction of errors in preference ratings from automated metrics for text generation’, in A. Rogers, J. Boyd-Graber, and N. Okazaki (eds) Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, pp. 6456–6474. Available at: https://doi.org/10.18653/v1/2023.findings-acl.404.

J. Deriu, P. von Däniken, D. Tuggener, and M. Cieliebak, “Correction of errors in preference ratings from automated metrics for text generation,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 6456–6474. doi: 10.18653/v1/2023.findings-acl.404.

DERIU, Jan, Pius VON DÄNIKEN, Don TUGGENER and Mark CIELIEBAK, 2023. Correction of errors in preference ratings from automated metrics for text generation. In: Anna ROGERS, Jordan BOYD-GRABER and Naoaki OKAZAKI (eds.), Findings of the Association for Computational Linguistics: ACL 2023. Conference paper. Association for Computational Linguistics. 2023. pp. 6456–6474

Deriu, Jan, Pius von Däniken, Don Tuggener, and Mark Cieliebak. 2023. “Correction of Errors in Preference Ratings from Automated Metrics for Text Generation.” Conference paper. In Findings of the Association for Computational Linguistics: ACL 2023, edited by Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, 6456–74. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.404.

Deriu, Jan, et al. “Correction of Errors in Preference Ratings from Automated Metrics for Text Generation.” Findings of the Association for Computational Linguistics: ACL 2023, edited by Anna Rogers et al., Association for Computational Linguistics, 2023, pp. 6456–74, https://doi.org/10.18653/v1/2023.findings-acl.404.