EsCoLA - Spanish Corpus of Linguistic Acceptability - Dataset

Acceptability is one of the probing tasks in the General Language Understanding Evaluation (GLUE) benchmark, proposed to assess the linguistic capabilities acquired by a deep-learning, transformer-based language model (LM). In this paper, we introduce the Spanish Corpus of Linguistic Acceptability (EsCoLA). EsCoLA has been developed following the example of other linguistic acceptability data sets for English, Italian, Norwegian and Russian, with the aim of having a complete GLUE benchmark for Spanish. EsCoLA consists of 11,174 sentences and their acceptability judgements as found in well-known Spanish reference grammars. Additionally, following previous practice, all sentences have been annotated with the class of linguistic phenomenon they exemplify. As task baselines, we also provide the results of fine-tuning four different language models on this data set and the results of a human annotation experiment. The results are analyzed and discussed to guide future research.

Identifier
DOI	https://doi.org/10.34810/data1138
Metadata Access	https://dataverse.csuc.cat/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34810/data1138
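
The Metadata Access address above is an OAI-PMH GetRecord request. As a minimal sketch, the same address can be assembled from the DOI with the Python standard library. The function names doi_from_url and build_oai_url are illustrative and not part of any repository API; the endpoint and parameter values are copied from the fields above.

```python
from urllib.parse import urlencode, urlparse

# Endpoint copied from the Metadata Access field of this record.
OAI_ENDPOINT = "https://dataverse.csuc.cat/oai"


def doi_from_url(doi_url: str) -> str:
    """Strip the resolver prefix from a DOI URL (hypothetical helper)."""
    return urlparse(doi_url).path.lstrip("/")


def build_oai_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Build an OAI-PMH GetRecord URL for a DOI (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # Keep ':' and '/' unescaped so the result matches the address listed in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


if __name__ == "__main__":
    doi = doi_from_url("https://doi.org/10.34810/data1138")
    print(build_oai_url(doi))
```

The sketch only constructs the request address; fetching and parsing the returned DataCite XML is left out.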

Provenance
Creator	Bel, Núria; Punsola, Marta; Ruiz-Fernández, Valle
Publisher	CORA. Repositori de Dades de Recerca
Contributor	Bel, Núria; Universitat Pompeu Fabra
Publication Year	2024
Funding Reference	Agencia Estatal de Investigación PID2019-104512GB-I00
Rights	CC BY 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess	true
Contact	Bel, Núria (Universitat Pompeu Fabra)

Representation
Resource Type	Textual data; Dataset
Format	text/tab-separated-values; text/plain
Size	978145 bytes; 8220 bytes
Version	1.2
Discipline	Humanities; Linguistics