CLARIN - Repositories

al-qāmūs l-muḥīṭ: a digital Arabic dictionary: letter tāʾ

The dossier for the letter tāʾ contains: TXT file: the part of the plain text corresponding to the section for the letter tāʾ; XML files without translation: conversion of text into XML resulting...

OpeNER tokenizer

Java wrapper for the OpeNER tokenizer web service. Works with the ita, eng, fra, deu, esp, and nld languages. To be used in the WebLicht (https://weblicht.sfs.uni-tuebingen.de/) registry.

KrdWrd CANOLA Corpus 1.1

The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate on unseen web pages. It was harvested, annotated and...

ACTER (Annotated Corpora for Term Extraction Research) v1.3

The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised...

ACTER (Annotated Corpora for Term Extraction Research) v1.4

The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised...

DIDI - The DiDi Corpus of South Tyrolean CMC 1.0.0

The DiDi corpus has an overall size of around 600,000 tokens gathered from 136 South Tyrolean Facebook users who participated in the DiDi project. It consists of 11,102 Facebook...

LEONIDE - Longitudinal Learner Corpus in Italiano, Deutsch and English 1.1

LEONIDE is a longitudinal corpus of student essays documenting the language competences and writing development of lower secondary school students in three different languages....

KrdWrd CANOLA Corpus 1.0

The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate on unseen web pages. It was harvested, annotated and...

Code preference in OLL of accommodation in Palma

The file consists of a database in .SAV format (SPSS) of language choice and preference as reflected in the websites of accommodation establishments in the city of Palma de...

MT@BZ annotation guidelines v1.0

The MT@BZ annotation guidelines are guidelines for assessing the quality of legal Italian-German machine translation. In particular, they cover the South Tyrolean German variety. They...

Core Metadata Schema for Learner Corpora (version 2)

This document contains a list of metadata fields that can be used to describe learner corpus data. The core metadata scheme is structured around 8 metadata types: -...

Core Metadata [Schema] for Learner Corpora Draft 1.0

First proposal towards a "Core Metadata [Schema] for Learner Corpora", presented at the "CLARIN workshop on Interoperability of Second Language Resources and Tools", Gothenburg,...

ACTER (Annotated Corpora for Term Extraction Research) v1.5

ACTER (Annotated Corpora for Term Extraction Research) is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains...

Core Metadata Schema for Learner Corpora (version 1)

The Core Metadata Schema for Learner Corpora is an extensive revision of Granger & Paquot's (2017) Core Metadata [Schema] for Learner Corpora Draft 1.0 in the field of...

CEFR-based Short Answer Grading

The project through which the corpus was collected is concerned with the task of automatically assessing the written proficiency level of non-native (L2) learners of English....

Eesti avatud paralleelkorpus Estonian Open Parallel Corpus

The aim of the "Estonian Open Parallel Corpus" project is to create a substantial body of language resources for improving statistical machine translation systems. The project contributes to achieving a situation...

Eesti-inglise paralleelkorpus Estonian-English parallel corpus

Corpus. Annotated and sentence-aligned parallel text corpus (more info at http://www.cl.ut.ee/korpused/paralleel/index.php?lang=en); contains: 1. Estonian laws and their...

VESPA

The aim of the VESPA learner corpus project is to build a large collection of disciplinary writing by L2 English university students across registers, disciplines and degrees of...

678 datasets found