Assessing keyness using permutation tests

Mildenberger, Thoralf

doi:10.48550/arXiv.2308.13383

Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-28571

Full metadata record

DC Field	Value	Language
dc.contributor.author	Mildenberger, Thoralf	-
dc.date.accessioned	2023-09-01T13:16:30Z	-
dc.date.available	2023-09-01T13:16:30Z	-
dc.date.issued	2023-08	-
dc.identifier.other	arXiv:2308.13383	de_CH
dc.identifier.uri	https://digitalcollection.zhaw.ch/handle/11475/28571	-
dc.description.abstract	We propose a resampling-based approach for assessing keyness in corpus linguistics based on suggestions by Gries (2006, 2022). Traditional approaches based on hypothesis tests (e.g. Likelihood Ratio) model the corpora as independent identically distributed samples of tokens. This model does not account for the often observed uneven distribution of occurrences of a word across a corpus. When occurrences of a word are concentrated in a few documents, large values of LLR and similar scores are in fact much more likely than accounted for by the token-by-token sampling model, leading to false positives. We replace the token-by-token sampling model with a model where corpora are samples of documents rather than tokens, which is much closer to the way corpora are actually assembled. We then use a permutation approach to approximate the distribution of a given keyness score under the null hypothesis of equal frequencies and obtain p-values for assessing significance. We do not need any assumption on how the tokens are organized within or across documents, and the approach works with essentially any keyness score. Hence, apart from obtaining more accurate p-values for scores like LLR, we can also assess significance for e.g. the logratio, which has been proposed as a measure of effect size. An efficient implementation of the proposed approach is provided in the `R` package `keyperm`, available from GitHub.	de_CH
dc.format.extent	15	de_CH
dc.language.iso	en	de_CH
dc.publisher	arXiv	de_CH
dc.rights	Licence according to publishing contract	de_CH
dc.subject	Corpus linguistics	de_CH
dc.subject	Applied statistics	de_CH
dc.subject.ddc	400: Sprache und Linguistik	de_CH
dc.subject.ddc	510: Mathematik	de_CH
dc.title	Assessing keyness using permutation tests	de_CH
dc.type	Working Paper – Gutachten – Studie	de_CH
dcterms.type	Text	de_CH
zhaw.departement	School of Engineering	de_CH
zhaw.organisationalunit	Institut für Datenanalyse und Prozessdesign (IDP)	de_CH
dc.identifier.doi	10.48550/arXiv.2308.13383	de_CH
dc.identifier.doi	10.21256/zhaw-28571	-
zhaw.funding.eu	No	de_CH
zhaw.originated.zhaw	Yes	de_CH
zhaw.webfeed	Datalab	de_CH
zhaw.author.additional	No	de_CH
zhaw.display.portrait	Yes	de_CH
Appears in collections:	Publikationen School of Engineering
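
The permutation idea described in the abstract can be illustrated with a minimal sketch. This is not the `keyperm` implementation and does not reproduce its interface. It is a plain Python illustration under stated assumptions: each document is given as a hypothetical `(term_count, doc_length)` pair, the keyness score is the LLR for a single term, and the function names `llr` and `permutation_p_value` are invented for this example. Document labels are shuffled between the two corpora while each document is kept intact, and the p-value is the share of relabellings whose score is at least as large as the observed one.

```python
import math
import random


def llr(k_a, n_a, k_b, n_b):
    """Log-likelihood ratio (G2) for a term occurring k_a times among n_a
    tokens in corpus A and k_b times among n_b tokens in corpus B.
    Assumes both corpora are non-empty."""
    total_k, total_n = k_a + k_b, n_a + n_b
    # (observed, expected) pairs for the four cells of the 2x2 table.
    cells = [
        (k_a, n_a * total_k / total_n),                    # term in A
        (k_b, n_b * total_k / total_n),                    # term in B
        (n_a - k_a, n_a * (total_n - total_k) / total_n),  # other tokens in A
        (n_b - k_b, n_b * (total_n - total_k) / total_n),  # other tokens in B
    ]
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2.0 * math.fsum(o * math.log(o / e) for o, e in cells if o > 0)


def permutation_p_value(docs_a, docs_b, n_perm=2000, seed=0):
    """Document-level permutation test for one term (illustrative sketch).

    docs_a, docs_b: lists of (term_count, doc_length) pairs, one per document.
    Returns the observed LLR and a Monte Carlo p-value.
    """
    rng = random.Random(seed)

    def score(group_a, group_b):
        return llr(
            sum(k for k, _ in group_a), sum(n for _, n in group_a),
            sum(k for k, _ in group_b), sum(n for _, n in group_b),
        )

    observed = score(docs_a, docs_b)
    pooled = list(docs_a) + list(docs_b)
    size_a = len(docs_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign whole documents to the two corpora at random.
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point rounding in ties.
        if score(pooled[:size_a], pooled[size_a:]) >= observed - 1e-9:
            hits += 1
    # Count the observed labelling itself so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Same totals (50 hits in 10,000 tokens vs. 0 in 10,000), different spread.
    concentrated = [(50, 1000)] + [(0, 1000)] * 9
    spread = [(5, 1000)] * 10
    reference = [(0, 1000)] * 10
    print(permutation_p_value(concentrated, reference))
    print(permutation_p_value(spread, reference))
```

Both toy corpora give the same LLR, because the aggregated counts are identical. When all 50 occurrences sit in a single document, however, every relabelling puts that document in one corpus or the other and yields the same score, so the permutation p-value offers no evidence of keyness. When the occurrences are spread evenly over ten documents, almost no relabelling reaches the observed score and the p-value is small.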

Files in This Item:

File	Description	Size	Format
2023_Mildenberger_Assessing-keyness-using-permutation-tests.pdf		595.35 kB	Adobe PDF

Mildenberger, T. (2023). Assessing keyness using permutation tests. arXiv. https://doi.org/10.48550/arXiv.2308.13383

Mildenberger, T. (2023) Assessing keyness using permutation tests. arXiv. Available at: https://doi.org/10.48550/arXiv.2308.13383.

T. Mildenberger, “Assessing keyness using permutation tests,” arXiv, Aug. 2023. doi: 10.48550/arXiv.2308.13383.

MILDENBERGER, Thoralf, 2023. Assessing keyness using permutation tests. arXiv

Mildenberger, Thoralf. 2023. “Assessing Keyness Using Permutation Tests.” arXiv. https://doi.org/10.48550/arXiv.2308.13383.

Mildenberger, Thoralf. Assessing Keyness Using Permutation Tests. arXiv, Aug. 2023, https://doi.org/10.48550/arXiv.2308.13383.