DSI-enriched ParaCrawl 9 en-es corpus

This corpus is a derivative of the ParaCrawl release 9 English-Spanish corpus (https://paracrawl.eu/). This version adds, for each segment pair, a set of probabilities expressing the pair's affinity to each of the following Digital Service Infrastructures (DSIs): Cybersecurity, Electronic Exchange of Social Security Information, E-health, E-justice, Europeana, Online Dispute Resolution, Open Data Portal and Safer Internet. The probabilities were assigned by a pre-trained language model (DeBERTa-v3-large) fine-tuned on a crawled corpus of English DSI-specific texts. More information is available on the corresponding GitHub page: https://github.com/RikVN/DSI. All other information from the original version of the corpus remains unchanged.
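
As an illustration of how the per-DSI probabilities might be used, the sketch below selects the most likely DSI for one segment pair. The file layout is not specified in this record, so everything about the format is an assumption: a tab-separated line holding the English segment, the Spanish segment, and eight probabilities in the order in which the DSIs are listed above. The names `parse_line` and `best_dsi`, the 0.5 threshold, and the sample line are hypothetical and are not taken from the corpus or the DSI repository.

```python
from typing import Dict, List, Optional, Tuple

# DSI labels in the order the corpus description lists them.
# ASSUMPTION: the real files may use different column names or ordering.
DSI_LABELS: List[str] = [
    "Cybersecurity",
    "Electronic Exchange of Social Security Information",
    "E-health",
    "E-justice",
    "Europeana",
    "Online Dispute Resolution",
    "Open Data Portal",
    "Safer Internet",
]


def parse_line(line: str) -> Tuple[str, str, Dict[str, float]]:
    """Split one hypothetical TSV line into (english, spanish, probabilities)."""
    fields = line.rstrip("\n").split("\t")
    expected = 2 + len(DSI_LABELS)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    english, spanish = fields[0], fields[1]
    probs = {label: float(value) for label, value in zip(DSI_LABELS, fields[2:])}
    return english, spanish, probs


def best_dsi(probs: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the highest-scoring DSI label, or None if it is below the threshold."""
    label = max(probs, key=lambda name: probs[name])
    return label if probs[label] >= threshold else None


# Made-up example line, for illustration only (NOT taken from the corpus).
sample = "\t".join(
    [
        "The patient record was updated.",
        "Se actualizó el historial del paciente.",
        "0.01", "0.02", "0.90", "0.01", "0.01", "0.02", "0.02", "0.01",
    ]
)
english, spanish, probs = parse_line(sample)
top_label = best_dsi(probs)
print(top_label)
```

A threshold keeps segment pairs with no clear DSI affinity out of a domain-specific subset; a suitable value would depend on the actual score distribution in the released files.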

Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted; (2) clearly identify the copyrighted work claimed to be infringed; (3) clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it; (4) write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

Identifier
PID http://hdl.handle.net/11356/1526
Related Identifier https://macocu.eu/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1526
Provenance
Creator Bañón, Marta; Esplà-Gomis, Miquel; Forcada, Mikel L.; García-Romero, Cristian; Kuzman, Taja; Ljubešić, Nikola; van Noord, Rik; Pla Sempere, Leopoldo; Ramírez-Sánchez, Gema; Rupnik, Peter; Suchomel, Vít; Toral, Antonio; van der Werff, Tobias; Zaragoza, Jaume
Publisher Jožef Stefan Institute; Prompsit; Rijksuniversiteit Groningen; Universitat d'Alacant
Publication Year 2022
Rights CC0-No Rights Reserved; https://creativecommons.org/publicdomain/zero/1.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Spanish; Castilian; English
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 4
Discipline Linguistics