The Trankit model for linguistic processing of standard written Slovenian 1.1

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the reference SSJ UD treebank featuring fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian.

It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/).
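As an illustration of how these annotation layers fit together, the following minimal sketch renders one annotated sentence as CoNLL-U rows. It assumes a Trankit-style sentence dictionary whose tokens carry the keys id, text, lemma, upos, xpos, feats, head and deprel. The helper name to_conllu and the SAMPLE sentence are hypothetical: the sample is hand-written for illustration and is not output produced by this model.

```python
# Hand-written illustrative sentence ("Pes laja." = "The dog barks."),
# NOT actual model output. Key names are assumed to follow the token
# dictionaries returned by Trankit.
SAMPLE = {
    "id": 1,
    "text": "Pes laja.",
    "tokens": [
        {"id": 1, "text": "Pes", "lemma": "pes", "upos": "NOUN", "xpos": "Ncmsn",
         "feats": "Case=Nom|Gender=Masc|Number=Sing", "head": 2, "deprel": "nsubj"},
        {"id": 2, "text": "laja", "lemma": "lajati", "upos": "VERB", "xpos": "Vmpr3s",
         "feats": "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
         "head": 0, "deprel": "root"},
        {"id": 3, "text": ".", "lemma": ".", "upos": "PUNCT", "xpos": "Z",
         "head": 2, "deprel": "punct"},
    ],
}


def to_conllu(sentence):
    """Render one sentence dict as CoNLL-U rows (10 tab-separated columns)."""
    rows = []
    for tok in sentence["tokens"]:
        rows.append("\t".join([
            str(tok["id"]),             # ID
            tok["text"],                # FORM
            tok.get("lemma", "_"),      # LEMMA
            tok.get("upos", "_"),       # UPOS (universal part-of-speech tag)
            tok.get("xpos", "_"),       # XPOS (MULTEXT-East morphosyntactic tag)
            tok.get("feats", "_"),      # FEATS (universal morphological features)
            str(tok.get("head", "_")),  # HEAD (0 marks the root)
            tok.get("deprel", "_"),     # DEPREL (dependency relation)
            "_",                        # DEPS (not filled in this sketch)
            "_",                        # MISC (not filled in this sketch)
        ]))
    return "\n".join(rows)
```

Calling print(to_conllu(SAMPLE)) prints one row per token. Missing layers are written as an underscore, which is the CoNLL-U convention for an unspecified value.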

The model was trained using a dataset published by Universal Dependencies in release 2.14 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14).

To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
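A minimal loading sketch is given here for orientation only; the repository instructions remain authoritative. It assumes that the ZIP file has been extracted into a local directory, that Trankit's customized-pipeline category for a language without multi-word tokens is "customized", and that trankit.Pipeline accepts the lang, cache_dir, gpu and embedding arguments as in Trankit v1.1. The helper names pipeline_kwargs, load_pipeline and demo, and any directory path passed to them, are hypothetical.

```python
from pathlib import Path

# The two embedding variants shipped in the ZIP file.
SUPPORTED_EMBEDDINGS = ("xlm-roberta-base", "xlm-roberta-large")


def pipeline_kwargs(model_dir, embedding="xlm-roberta-large", use_gpu=False):
    """Build keyword arguments for trankit.Pipeline (argument names assumed)."""
    if embedding not in SUPPORTED_EMBEDDINGS:
        raise ValueError(f"unsupported embedding: {embedding!r}")
    return {
        "lang": "customized",              # assumed category for this model
        "cache_dir": str(Path(model_dir)),  # directory with the extracted model
        "gpu": use_gpu,
        "embedding": embedding,
    }


def load_pipeline(model_dir, **options):
    """Load the retrained model; requires `pip install trankit`."""
    from trankit import Pipeline  # imported lazily, only when actually loading
    return Pipeline(**pipeline_kwargs(model_dir, **options))


def demo(model_dir):
    """Annotate one raw sentence and print a few layers per token."""
    nlp = load_pipeline(model_dir, embedding="xlm-roberta-base")
    doc = nlp("France Prešeren je bil slovenski pesnik.")
    for sentence in doc["sentences"]:
        for token in sentence["tokens"]:
            print(token["text"], token.get("lemma"), token.get("upos"),
                  token.get("head"), token.get("deprel"))
```

Choosing xlm-roberta-base in demo keeps memory requirements lower; switching to xlm-roberta-large trades more hardware resources for better performance, as noted above. The expected directory layout below model_dir is not specified here and should be checked against the repository instructions.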

Compared with the previous version of the model, this version was trained on a newer, slightly improved release of the SSJ UD treebank (UD v2.14) and produces similar results.

Identifier
PID http://hdl.handle.net/11356/1963
Related Identifier https://arxiv.org/pdf/2101.03289.pdf
Related Identifier http://hdl.handle.net/11356/1870
Related Identifier https://github.com/clarinsi/trankit-train
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1963
Provenance
Creator Krsnik, Luka; Dobrovoljc, Kaja; Terčon, Luka
Publisher Centre for Language Resources and Technologies, University of Ljubljana
Publication Year 2024
Rights Apache License 2.0; PUB; https://opensource.org/licenses/Apache-2.0
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type toolService
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics