Slovene-English parallel corpus slenWaC 1.0

Version 1.0 of the slenWaC corpus consists of parallel Slovene-English texts crawled from .si, the top-level domain of Slovenia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing, used for crawling, and Bitextor, used for bitext extraction. The accuracy of the extracted bitext is around 67% at the segment level and around 68% at the word level.

Identifier
PID http://hdl.handle.net/11356/1061
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1061
Provenance
Creator Ljubešić, Nikola; Esplà-Gomis, Miquel; Ortiz Rojas, Sergio; Klubička, Filip; Toral, Antonio
Publisher Jožef Stefan Institute
Publication Year 2016
Funding Reference info:eu-repo/grantAgreement/EC/FP7/324414
Rights CLARIN.SI User Licence for Internet Corpora; https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf; ACA
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene; English
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 1
Discipline Linguistics
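
The Metadata Access URL above is an OAI-PMH GetRecord request for the oai_dc (Dublin Core) form of this record. As a minimal sketch, assuming the endpoint still answers as that line suggests, the request URL can be rebuilt and the Dublin Core fields read from the response as follows. The function names build_getrecord_url, dc_fields and fetch_dc_fields are hypothetical helpers introduced for illustration, not part of any CLARIN.SI tooling.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Endpoint and identifier are copied from the "Metadata Access" line of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
RECORD_ID = "oai:www.clarin.si:11356/1061"

# Standard Dublin Core element namespace used inside oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_getrecord_url(identifier, metadata_prefix="oai_dc", endpoint=OAI_ENDPOINT):
    """Build an OAI-PMH GetRecord URL for one record identifier."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep ':' and '/' readable, as in the URL shown in the record
    )
    return f"{endpoint}?{query}"


def dc_fields(xml_text):
    """Collect Dublin Core elements from an OAI-PMH response into a dict of lists."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter():
        if element.tag.startswith(DC_NS) and element.text and element.text.strip():
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


def fetch_dc_fields(identifier=RECORD_ID):
    """Fetch the record over HTTP and parse it (requires network access)."""
    with urlopen(build_getrecord_url(identifier)) as response:
        return dc_fields(response.read().decode("utf-8"))
```

Calling fetch_dc_fields() would be expected to return entries such as the title, creators and rights listed above, although the exact fields depend on what the repository currently serves.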