ELMo embedding models for seven languages

An ELMo language model (https://github.com/allenai/bilm-tf), used to produce contextual word embeddings, trained on large monolingual corpora for seven languages: Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian, and Swedish. Each language's model was trained for approximately 10 epochs. Training corpus sizes range from over 270 million tokens (Latvian) to almost 2 billion tokens (Croatian). For each language, the roughly 1 million most frequent tokens were supplied as the vocabulary during training. Because the network's input is character-level, the models can also infer embeddings for out-of-vocabulary (OOV) words.

Each model is packaged in its own .tar.gz archive consisting of two files: the PyTorch weights (.hdf5) and the options (.json). Both are needed for model inference with the allennlp (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) Python library.
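
A minimal inference sketch in Python, assuming allennlp 0.9.x (the library version contemporary with these models); the file names and the example sentence are placeholders, not part of the archives:

    # Hypothetical paths: the two files extracted from one language's
    # .tar.gz archive from this repository.
    from allennlp.commands.elmo import ElmoEmbedder

    elmo = ElmoEmbedder(
        options_file="options.json",  # model hyperparameters
        weight_file="model.hdf5",     # trained weights
    )

    # Input is a pre-tokenized sentence. Tokens are processed character by
    # character, so out-of-vocabulary words still receive embeddings.
    vectors = elmo.embed_sentence(["This", "is", "a", "test", "."])

    # One vector per token from each of the three ELMo layers:
    # shape (3, number of tokens, 1024)
    print(vectors.shape)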

Identifier
PID http://hdl.handle.net/11356/1277
Related Identifier https://arxiv.org/abs/1911.10049
Related Identifier http://hdl.handle.net/11356/1257
Related Identifier http://embeddia.eu
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1277
Provenance
Creator Ulčar, Matej
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2019
Funding Reference info:eu-repo/grantAgreement/EC/H2020/825153
Rights Apache License 2.0; PUB; https://opensource.org/licenses/Apache-2.0
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene; Croatian; Finnish; Estonian; Latvian; Lithuanian; Swedish
Resource Type toolService
Format application/gzip; text/plain; charset=utf-8; downloadable_files_count: 7
Discipline Linguistics