Trankit model for SST 2.15

Dataset

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/). It was trained on the SST treebank of spoken Slovenian (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/dev), which features transcriptions of spontaneous speech in various everyday settings.

It performs sentence segmentation, tokenization, lemmatization, and language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological feature prediction, and dependency parsing in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/).
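As a rough illustration of how these annotation layers fit together, the following minimal sketch turns a Trankit-style result dictionary into CoNLL-U lines. It rests on several assumptions that this record does not confirm: that the unpacked archive can be loaded as a Trankit "customized" pipeline, and that the result uses the `sentences` / `tokens` layout with the keys `id`, `text`, `lemma`, `upos`, `xpos`, `feats`, `head` and `deprel`. The helper names `load_pipeline` and `to_conllu` are hypothetical, and the sample document is hand-written for illustration, not real model output. For the recommended loading procedure, see https://github.com/clarinsi/trankit-train.

```python
def load_pipeline(model_dir):
    """Load the unpacked model from model_dir.

    Assumption: model_dir holds a Trankit 'customized' pipeline. An
    `embedding` argument matching the one used in training may also be
    needed; this record does not say which one was used.
    """
    from trankit import Pipeline  # lazy import; needs `pip install trankit`
    return Pipeline(lang="customized", cache_dir=model_dir)


def to_conllu(doc):
    """Render a Trankit-style result dict as CoNLL-U text.

    Missing fields become '_'. Multiword tokens are not handled here.
    """
    lines = []
    for sent in doc["sentences"]:
        lines.append(f"# sent_id = {sent['id']}")
        lines.append(f"# text = {sent['text']}")
        for tok in sent["tokens"]:
            cols = [
                tok.get("id"), tok.get("text"), tok.get("lemma"),
                tok.get("upos"), tok.get("xpos"), tok.get("feats"),
                tok.get("head"), tok.get("deprel"),
                None, None,  # DEPS and MISC are left empty
            ]
            lines.append("\t".join("_" if c is None else str(c) for c in cols))
        lines.append("")  # blank line closes the sentence
    return "\n".join(lines)


# Hand-written sample in the assumed output layout (not real model output).
sample = {
    "text": "ja dobro",
    "sentences": [{
        "id": 1,
        "text": "ja dobro",
        "tokens": [
            {"id": 1, "text": "ja", "lemma": "ja", "upos": "PART",
             "xpos": "Q", "head": 2, "deprel": "discourse"},
            {"id": 2, "text": "dobro", "lemma": "dobro", "upos": "ADV",
             "xpos": "Rgp", "feats": "Degree=Pos", "head": 0,
             "deprel": "root"},
        ],
    }],
}

print(to_conllu(sample))
```

With the model unpacked locally, `to_conllu(load_pipeline("path/to/model")("ja dobro"))` would be the analogous call on real output, where `path/to/model` is a placeholder.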

Please note that this model has been published for archiving purposes only. For production use, we recommend the state-of-the-art Trankit model available here: http://hdl.handle.net/11356/1965. The latter was trained on both spoken (SST) and written (SSJ) data, and demonstrates significantly higher performance than the model featured in this submission.

Identifier
PID	http://hdl.handle.net/11356/1966
Related Identifier	https://arxiv.org/pdf/2101.03289.pdf
Related Identifier	http://hdl.handle.net/11356/1965
Related Identifier	https://github.com/clarinsi/trankit-train
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1966
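
The Metadata Access link above is an OAI-PMH `GetRecord` request. As a minimal sketch, the same URL can be assembled from a handle with the Python standard library. The names `OAI_ENDPOINT` and `oai_record_url` are hypothetical, and the `oai:www.clarin.si:` identifier prefix is simply copied from the link above rather than taken from repository documentation.

```python
from urllib.parse import urlencode

# Endpoint taken from the Metadata Access link in this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"


def oai_record_url(handle, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for a handle such as '11356/1966'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown above.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":/")


print(oai_record_url("11356/1966"))
```

Passing `"11356/1965"` instead yields the corresponding request for the recommended model's record, assuming it follows the same identifier scheme.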

Provenance
Creator	Krsnik, Luka; Dobrovoljc, Kaja; Terčon, Luka
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2024
Rights	Apache License 2.0; https://opensource.org/licenses/Apache-2.0; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	toolService
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics