Finnish-English parallel corpus fienWaC 1.0

The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from .fi, the top-level domain of Finland. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing, used for crawling, and Bitextor, used for bitext extraction. Based on evaluation results for other languages, the accuracy of the extracted bitext is estimated at 74% on the segment level and 76% on the word level.

Identifier
PID http://hdl.handle.net/11356/1060
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1060
Provenance
Creator Ljubešić, Nikola; Esplà-Gomis, Miquel; Ortiz Rojas, Sergio; Klubička, Filip; Toral, Antonio
Publisher Jožef Stefan Institute
Publication Year 2016
Funding Reference info:eu-repo/grantAgreement/EC/FP7/324414
Rights CLARIN.SI User Licence for Internet Corpora; https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf; ACA
OpenAccess true
Contact info(at)clarin.si
Representation
Language Finnish; English
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 1
Discipline Linguistics
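
The Metadata Access URL listed under Identifier has the shape of an OAI-PMH GetRecord request (a verb, a metadata prefix, and a record identifier). As a minimal sketch, the same URL can be assembled from those three parts in Python. The helper name build_getrecord_url is hypothetical and not part of any CLARIN.SI tooling; the endpoint and identifier are copied from the record above, and no network request is made.

```python
from urllib.parse import urlencode

# Endpoint and record identifier copied from the "Metadata Access" line above.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
RECORD_ID = "oai:www.clarin.si:11356/1060"


def build_getrecord_url(endpoint: str, identifier: str,
                        metadata_prefix: str = "oai_dc") -> str:
    """Assemble an OAI-PMH GetRecord request URL (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Leave ':' and '/' unescaped so the result matches the URL as listed.
    return endpoint + "?" + urlencode(params, safe=":/")


url = build_getrecord_url(OAI_ENDPOINT, RECORD_ID)
print(url)
```

Passing the printed URL to any HTTP client should return the Dublin Core (oai_dc) description of this record, assuming the repository endpoint is still served at that address.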