Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-29667
Publication type: Bachelor thesis
Title: Automatic identification of Swiss German dialects using large language models
Authors: Frei, Claudio
Schneider, Philippe
Advisors / Reviewers: Cieliebak, Mark
Bogojeska, Jasmina
DOI: 10.21256/zhaw-29667
Extent: 83
Issue Date: 2023
Series: Bachelorarbeiten ZHAW School of Engineering
Publisher / Ed. Institution: ZHAW Zürcher Hochschule für Angewandte Wissenschaften
Publisher / Ed. Institution: Winterthur
Language: English
Subject (DDC): 410.285: Computational linguistics
Abstract: The publication of pre-trained language models has enabled the development of various speech technologies for low-resource languages such as Swiss German. The training data required for this became available with the creation of the SDS-200 corpus and the recent finalisation of the STT4SG-350 corpus. While Swiss German speech-to-text systems are the main research area, the ability to identify Swiss German dialects automatically can help to further improve the performance of such systems. Swiss German dialect identification systems were developed in previous work, but recent advancements such as the finalisation of the STT4SG-350 corpus and the publication of the Whisper model provide new resources with the potential to significantly increase performance in this area. This thesis evaluated how the newly available resources can be leveraged to build the best-performing model for Swiss German dialect identification by training and validating models in various configurations. Mixing the SDS-200 and STT4SG-350 corpora was found to achieve promising results, but the variety of speakers is an important factor for good generalisation. Speech augmentation was found to be a promising technique that can further increase performance by artificially increasing the number of samples and the variety of speakers. Additionally, a representation learning approach was evaluated, but it did not prove satisfactory. Finally, the newly available resources combined with the knowledge gained enabled an increase of the macro F1 score from 45.95% on classification into four canton groups to 62.76% on classification into seven subregions, an even harder task, thereby setting a new baseline for future systems.
URI: https://digitalcollection.zhaw.ch/handle/11475/29667
License (according to publishing contract): CC BY 4.0: Attribution 4.0 International
Department: School of Engineering
Appears in collections: Bachelorarbeiten ZHAW School of Engineering

Files in This Item:
File: 2023_Frei-Claudio_Schneider-Philippe_BA_SoE.pdf
Size: 3.71 MB
Format: Adobe PDF
Frei, C., & Schneider, P. (2023). Automatic identification of Swiss German dialects using large language models [Bachelor’s thesis, ZHAW Zürcher Hochschule für Angewandte Wissenschaften]. https://doi.org/10.21256/zhaw-29667
Frei, C. and Schneider, P. (2023) Automatic identification of Swiss German dialects using large language models. Bachelor’s thesis. ZHAW Zürcher Hochschule für Angewandte Wissenschaften. Available at: https://doi.org/10.21256/zhaw-29667.
C. Frei and P. Schneider, “Automatic identification of Swiss German dialects using large language models,” Bachelor’s thesis, ZHAW Zürcher Hochschule für Angewandte Wissenschaften, Winterthur, 2023. doi: 10.21256/zhaw-29667.
FREI, Claudio and Philippe SCHNEIDER, 2023. Automatic identification of Swiss German dialects using large language models. Bachelor’s thesis. Winterthur: ZHAW Zürcher Hochschule für Angewandte Wissenschaften
Frei, Claudio, and Philippe Schneider. 2023. “Automatic Identification of Swiss German Dialects Using Large Language Models.” Bachelor’s thesis, Winterthur: ZHAW Zürcher Hochschule für Angewandte Wissenschaften. https://doi.org/10.21256/zhaw-29667.
Frei, Claudio, and Philippe Schneider. Automatic Identification of Swiss German Dialects Using Large Language Models. ZHAW Zürcher Hochschule für Angewandte Wissenschaften, 2023, https://doi.org/10.21256/zhaw-29667.

