Terminology identification dataset KAS-term 1.0

Dataset

The dataset contains 22,950 term candidates extracted from 15 Slovenian PhD theses in the areas of chemistry, computer science and political science. The term candidates are of length 1 to 4 and were extracted using morphosyntactic patterns and a frequency threshold of 3. Each term candidate is annotated by four annotators as (1) an in-domain term, (2) an out-of-domain term, (3) a general academic term or (4) not a term. Each term candidate is also annotated with its frequency in the PhD thesis and with 7 statistical measures. The resource can serve as a training set for supervised learning of term extraction and as a benchmark for terminology extraction tools.
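
Since each candidate carries four annotator labels, a supervised setup usually needs one gold label per candidate. The following is a minimal sketch of one possible way to derive it, using a strict majority vote. The aggregation rule, the function name `majority_label` and the example rows are assumptions made for illustration; they are not taken from the dataset files or its documentation, and the actual column layout of the distributed files may differ.

```python
from collections import Counter

# Label codes as listed in the dataset description.
LABELS = {
    1: "in-domain term",
    2: "out-of-domain term",
    3: "general academic term",
    4: "not a term",
}


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators.

    Returns None when no label has more than half of the votes
    (for four annotators: a 2-2 split or a 2-1-1 split).
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label if n > len(votes) / 2 else None


# Hypothetical rows: (candidate, four annotator labels). Invented for illustration.
rows = [
    ("candidate A", [1, 1, 1, 4]),
    ("candidate B", [1, 1, 4, 4]),
    ("candidate C", [3, 3, 3, 3]),
]

for candidate, votes in rows:
    gold = majority_label(votes)
    name = LABELS[gold] if gold is not None else "no majority"
    print(f"{candidate}: {name}")
```

Candidates without a majority can be dropped from training or handled separately, depending on how strict the gold standard should be.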

Identifier
PID	http://hdl.handle.net/11356/1198
Related Identifier	http://nl.ijs.si/kas/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1198

Provenance
Creator	Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Arhar Holdt, Špela; Bren, Urban; Robnik-Šikonja, Marko; Udovič, Boštjan
Publisher	Jožef Stefan Institute
Publication Year	2018
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	application/octet-stream; text/csv; text/plain; application/pdf; text/plain; charset=utf-8; downloadable_files_count: 4
Discipline	Linguistics