The dataset contains the data for the hierarchical cluster analysis explained in the article "A panorama of inchoative constructions in Spanish: Cluster analysis as an answer to the near-synonymy puzzle". In total, the dataset contains 3955 observations, which are tokens of the inchoative construction with the following auxiliaries: comenzar, empezar, meter, poner, echar(se), liar, arrancar and romper. The data originate from the Spanish Web corpus (esTenTen18), accessed via Sketch Engine. Only the European Spanish subcorpus was selected. The search syntax used to detect the inchoative construction was the following: “[lemma="empezar"] [tag="R."]{0,3}"a"[tag="V."] within " (replacing the concrete lemma "empezar" with the relevant lemma for each auxiliary; see Spinc_queries_20221202.txt for all concrete corpus queries). After samples of 10,000 tokens per auxiliary had been downloaded, the samples were manually cleaned. Only 500 tokens per auxiliary were retained in the dataset. Next, the data were annotated for the infinitive observed after the preposition 'a' and for the semantic class to which this infinitive belongs, following the existing ADESSE classification (see below), as well as for other criteria that are not taken into account in this study. Concretely, the variables 'INF' (infinitive) and 'Class' were used as input for the hierarchical cluster analysis (see the data-specific sections below for more information about the variables).
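
As an illustration of the lemma substitution described above, the following minimal Python sketch builds one query pattern per auxiliary from the template quoted in the description. It is not the authors' tooling. The helper name build_query and the lemma strings (in particular "echar" for echar(se)) are assumptions, and the "within" clause is left out because it is incomplete in the description. The queries actually used are those listed in Spinc_queries_20221202.txt.

```python
# Auxiliary lemmas named in the description. The exact lemma strings used in
# the real queries (e.g. for "echar(se)") are an assumption in this sketch.
AUXILIARIES = [
    "comenzar", "empezar", "meter", "poner",
    "echar", "liar", "arrancar", "romper",
]

# Pattern copied from the description, with the lemma turned into a placeholder.
# Doubled braces produce the literal {0,3} quantifier after formatting.
# The trailing "within" clause is incomplete in the description and is omitted.
TEMPLATE = '[lemma="{lemma}"] [tag="R."]{{0,3}}"a"[tag="V."]'


def build_query(lemma: str) -> str:
    """Return the query pattern for one auxiliary lemma (hypothetical helper)."""
    return TEMPLATE.format(lemma=lemma)


# One pattern per auxiliary, keyed by lemma.
queries = {lemma: build_query(lemma) for lemma in AUXILIARIES}

if __name__ == "__main__":
    for lemma, query in queries.items():
        print(f"{lemma}\t{query}")
```

Keeping the pattern in a single template makes it easy to check that every auxiliary was searched with the same structure: up to three adverbs between the auxiliary and the preposition 'a', followed by a verb.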