The influence of audio length on the performance of Swiss-German speech translation models

van der Heide, Niklas Rijk; Saaro, Felix Matthias

doi:10.21256/zhaw-29666

Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-29666

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Cieliebak, Mark	-
dc.contributor.advisor	Deriu, Jan Milan	-
dc.contributor.author	van der Heide, Niklas Rijk	-
dc.contributor.author	Saaro, Felix Matthias	-
dc.date.accessioned	2024-01-27T13:41:27Z	-
dc.date.available	2024-01-27T13:41:27Z	-
dc.date.issued	2023	-
dc.identifier.uri	https://digitalcollection.zhaw.ch/handle/11475/29666	-
dc.description.abstract	Speech translation models that convert spoken Swiss German to written German have existed for some time. While these models generally perform well, their performance in various scenarios remains poorly understood. In this thesis, we explore the influence of audio length on the performance of Swiss-German speech translation models and identify the factors necessary for achieving better performance on longer audio segments. To this end, we examined four speech translation models from different institutions: one from the Zurich University of Applied Sciences (ZHAW), one from the University of Applied Sciences Northwestern Switzerland (FHNW), one from Microsoft, and Whisper from OpenAI. We conducted eight experiments using a Swiss-German corpus collected by the ZHAW and FHNW, in which the audio length was augmented in various ways. We found that while the ZHAW, FHNW and Microsoft models tended to perform worse on longer durations, extending the duration by adding silence did not influence performance. Changing the playback speed had a negative influence on the ZHAW, Microsoft and Whisper models, both when segments were sped up and when they were slowed down. The FHNW model was notably robust to changes in playback speed: its results when accelerated by a factor of 1.25 were nearly identical to its results at the unaltered playback speed. The biggest influence on performance came from adding more than one sentence to a segment. Without segmentation of the input audio, the ZHAW, FHNW and Microsoft models performed badly, indicating that segmentation should be introduced as soon as an audio recording contains more than one sentence.
Training a model specifically on multi-sentence segments showed promising results on single-sentence segments, on multi-sentence segments, and in scenarios where sentences are split when the audio recordings are segmented. A sentence-based segmentation, which is considered ideal for models trained on single-sentence segments, and a fixed-window segmentation with an overlap produced almost identical results. On a real-life recording, the ZHAW (lowercase) and ZHAW (multisentence) models performed considerably worse than the FHNW, Microsoft and Whisper models, indicating that more investigation is required to fully understand what makes a speech translation model work well in real-life scenarios.	de_CH
dc.format.extent	89	de_CH
dc.language.iso	en	de_CH
dc.publisher	ZHAW Zürcher Hochschule für Angewandte Wissenschaften	de_CH
dc.relation.ispartofseries	Bachelorarbeiten ZHAW School of Engineering	de_CH
dc.rights	http://creativecommons.org/licenses/by/4.0/	de_CH
dc.subject.ddc	418.02: Translation studies	de_CH
dc.subject.ddc	430: German	de_CH
dc.title	The influence of audio length on the performance of Swiss-German speech translation models	de_CH
dc.type	Thesis: Bachelor	de_CH
dcterms.type	Text	de_CH
zhaw.departement	School of Engineering	de_CH
zhaw.publisher.place	Winterthur	de_CH
dc.identifier.doi	10.21256/zhaw-29666	-
zhaw.originated.zhaw	Yes	de_CH
Appears in collections:	Bachelorarbeiten ZHAW School of Engineering
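
The abstract mentions a fixed-window segmentation with an overlap. The following is only a minimal sketch of that general idea, not the implementation used in the thesis: the function name `fixed_windows` and the window and overlap values in the usage note are assumptions chosen for illustration, since this record does not state the actual settings.

```python
def fixed_windows(total_s, window_s, overlap_s):
    """Return (start, end) times in seconds that cover [0, total_s].

    Consecutive windows share `overlap_s` seconds, so a sentence cut at
    one window boundary appears whole in at least one neighbouring window
    as long as it is shorter than the overlap.
    All parameter values are hypothetical; none come from the thesis.
    """
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")

    step = window_s - overlap_s  # distance between consecutive window starts
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)  # clip the last window
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows
```

For example, `fixed_windows(25, 10, 2)` splits a 25-second recording into three 10-second windows (the last one clipped) that each overlap their neighbour by 2 seconds. Each window would then be passed to the speech translation model separately, and the overlapping outputs would still need to be merged, a step this sketch leaves out.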

Files in This Item:

File	Description	Size	Format
2023_van-der-Heide-Niklas_Saaro-Felix_BA_SoE.pdf		10.04 MB	Adobe PDF


van der Heide, N. R., & Saaro, F. M. (2023). The influence of audio length on the performance of Swiss-German speech translation models [Bachelor’s thesis, ZHAW Zürcher Hochschule für Angewandte Wissenschaften]. https://doi.org/10.21256/zhaw-29666

van der Heide, N.R. and Saaro, F.M. (2023) The influence of audio length on the performance of Swiss-German speech translation models. Bachelor’s thesis. ZHAW Zürcher Hochschule für Angewandte Wissenschaften. Available at: https://doi.org/10.21256/zhaw-29666.

N. R. van der Heide and F. M. Saaro, “The influence of audio length on the performance of Swiss-German speech translation models,” Bachelor’s thesis, ZHAW Zürcher Hochschule für Angewandte Wissenschaften, Winterthur, 2023. doi: 10.21256/zhaw-29666.

VAN DER HEIDE, Niklas Rijk and Felix Matthias SAARO, 2023. The influence of audio length on the performance of Swiss-German speech translation models. Bachelor’s thesis. Winterthur: ZHAW Zürcher Hochschule für Angewandte Wissenschaften.

van der Heide, Niklas Rijk, and Felix Matthias Saaro. 2023. “The Influence of Audio Length on the Performance of Swiss-German Speech Translation Models.” Bachelor’s thesis, Winterthur: ZHAW Zürcher Hochschule für Angewandte Wissenschaften. https://doi.org/10.21256/zhaw-29666.

van der Heide, Niklas Rijk, and Felix Matthias Saaro. The Influence of Audio Length on the Performance of Swiss-German Speech Translation Models. ZHAW Zürcher Hochschule für Angewandte Wissenschaften, 2023, https://doi.org/10.21256/zhaw-29666.