Background data for: Latent-variable modeling of ordinal outcomes in language data analysis

Dataset

This dataset contains tabular files with information about the usage preferences of speakers of Maltese English with regard to 63 pairs of lexical expressions. These pairs (e.g. truck-lorry or realization-realisation) are known to differ in usage between BrE and AmE (cf. Algeo 2006). The data were elicited with a questionnaire that asks informants to indicate whether they always use one of the two variants, prefer one over the other, have no preference, or do not use either expression (see Krug and Sell 2013 for methodological details). Usage preferences were therefore measured on a symmetric 5-point ordinal scale. Data were collected between 2008 and 2018, as part of a larger research project on lexical and grammatical variation in settings where English is spoken as a native, second, or foreign language. The current dataset, which we use for our methodological study on ordinal data modeling strategies, consists of a subset of 500 speakers that is roughly balanced on year of birth.

Abstract (related publication): In empirical work, ordinal variables are typically analyzed using means based on numeric scores assigned to categories. While this strategy has met with justified criticism in the methodological literature, it also generates simple and informative data summaries, a standard often not met by statistically more adequate procedures. Motivated by a survey of how ordered variables are dealt with in language research, we draw attention to an un(der)used latent-variable approach to ordinal data modeling, which constitutes an alternative perspective on the most widely used form of ordered regression, the cumulative model. Since the latent-variable approach does not feature in any of the studies in our survey, we believe it is worthwhile to promote its benefits. To this end, we draw on questionnaire-based preference ratings by speakers of Maltese English, who indicated on a 5-point scale which of two synonymous expressions (e.g. package-parcel) they (tend to) use. We demonstrate that a latent-variable formulation of the cumulative model affords nuanced and interpretable data summaries that can be visualized effectively, while at the same time avoiding limitations inherent in mean response models (e.g. distortions induced by floor and ceiling effects). The online supplementary materials include a tutorial for its implementation in R.
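
For readers unfamiliar with the latent-variable view of the cumulative model mentioned in the abstract, the following is a minimal illustrative sketch in Python, not the authors' R tutorial. It assumes a logistic latent distribution (a cumulative logit model; the abstract does not state which link is used), and the four threshold values and the latent locations `eta` are hypothetical numbers chosen for illustration, not estimates from this dataset. The idea is that a response falls into category k when an unobserved continuous preference lies between two adjacent thresholds, so that P(Y <= k) = F(tau_k - eta).

```python
import math


def logistic_cdf(x: float) -> float:
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(eta: float, thresholds: list[float]) -> list[float]:
    """Return P(Y = k) for each ordinal category under a cumulative logit model.

    eta        -- location of the latent preference (hypothetical value)
    thresholds -- increasing cut-points tau_1 < ... < tau_{K-1} on the latent scale
    """
    # Cumulative probabilities P(Y <= k) = F(tau_k - eta); the last one is 1.
    cumulative = [logistic_cdf(tau - eta) for tau in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        # Each category probability is the difference of adjacent cumulative ones.
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical, symmetric thresholds for a 5-point scale (assumption, not fitted).
thresholds = [-2.0, -0.5, 0.5, 2.0]

for eta in (0.0, 1.5):
    probs = category_probabilities(eta, thresholds)
    print(f"eta = {eta}: {[round(p, 3) for p in probs]}")
```

With `eta = 0.0` and symmetric thresholds, the five category probabilities are symmetric around the midpoint; moving `eta` upward shifts probability mass toward the upper end of the scale. Because the latent scale is unbounded, summaries on that scale are not compressed near the scale ends the way score-based means can be.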

Identifier
DOI	https://doi.org/10.18710/WI9TEH
Metadata Access	https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/WI9TEH

Provenance
Creator	Krug, Manfred ; Vetter, Fabian ; Sönning, Lukas (ORCID: 0000-0002-2705-395X)
Publisher	DataverseNO
Contributor	Sönning, Lukas; University of Bamberg; Hilbert, Michaela; Pabel, Sebastian; Scheiner, Katharina; Linne, Anja; Schützler, Ole; Lucas, Christopher; Peterson, Nicholas; The Tromsø Repository of Language and Linguistics (TROLLing)
Publication Year	2024
Funding Reference	Bavarian Ministry for Science, Research and the Arts ; Spanish Ministry of Education and Science with European Regional Development Fund HUM2007-60706/FILO ; German Humboldt Foundation
Rights	CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess	true
Contact	Sönning, Lukas (University of Bamberg)

Representation
Resource Type	questionnaire data; Dataset
Format	text/plain; text/tsv; application/pdf
Size	8660; 4475; 1079156; 287207; 160867
Version	1.0
Discipline	Humanities
Spatial Coverage	Malta