Serbian-English parallel corpus srenWaC 1.0

The srenWaC corpus consists of sentence-aligned parallel Serbian-English texts crawled from .rs, the top-level domain of Serbia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing, used for crawling, and Bitextor, used for bitext extraction. Based on evaluation results for other languages, the accuracy of the extracted bitext is estimated at 74% on the sentence level and 76% on the word level.

Identifier
PID http://hdl.handle.net/11356/1059
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1059
Provenance
Creator Ljubešić, Nikola; Esplà-Gomis, Miquel; Ortiz Rojas, Sergio; Klubička, Filip; Toral, Antonio
Publisher Jožef Stefan Institute
Publication Year 2016
Funding Reference info:eu-repo/grantAgreement/EC/FP7/324414
Rights CLARIN.SI User Licence for Internet Corpora; https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf; ACA
OpenAccess true
Contact info(at)clarin.si
Representation
Language Serbian; English
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 1
Discipline Linguistics
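
The Metadata Access URL in this record is an OAI-PMH GetRecord request. As a minimal sketch, the Python snippet below shows how that URL could be assembled from the PID. The helper names `handle_from_pid` and `build_get_record_url` are hypothetical, and the assumption that other records in the repository use the same `oai:www.clarin.si:<handle>` identifier pattern is inferred from this single record, not confirmed by the source. The snippet only builds the URL and performs no network request.

```python
from urllib.parse import urlencode, urlsplit

# Base endpoint taken from the Metadata Access field of this record.
BASE_URL = "http://www.clarin.si/repository/oai/request"


def handle_from_pid(pid_url: str) -> str:
    """Extract the handle (e.g. '11356/1059') from a hdl.handle.net PID URL."""
    return urlsplit(pid_url).path.lstrip("/")


def build_get_record_url(handle: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a handle.

    Assumption: the identifier follows the 'oai:www.clarin.si:<handle>'
    pattern seen in this record.
    """
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL in the record.
    return f"{BASE_URL}?{urlencode(params, safe=':/')}"


handle = handle_from_pid("http://hdl.handle.net/11356/1059")
print(build_get_record_url(handle))
```

The printed URL can be passed to any HTTP client to retrieve the Dublin Core (oai_dc) metadata as XML.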