GECCC Grammar Error Correction Corpus for Czech

Dataset

The Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains: essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers, and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and the errors are automatically categorized with a Czech-specific version of ERRANT, released at https://github.com/ufal/errant_czech

The dataset was introduced in the paper "Czech Grammar Error Correction with a Large and Diverse Corpus", which was accepted to TACL. Until it is published in TACL, see the arXiv version: https://arxiv.org/pdf/2201.05590.pdf

Identifier
PID	http://hdl.handle.net/11234/1-4639
Related Identifier	https://arxiv.org/pdf/2201.05590.pdf
Related Identifier	http://hdl.handle.net/11234/1-4861
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-4639
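
The Metadata Access URL above has the shape of an OAI-PMH GetRecord request. As a minimal sketch, assuming the endpoint and the identifier prefix follow the pattern of that single URL (the helper name and constants below are illustrative and not taken from the repository's documentation), the request URL for a handle could be assembled like this:

```python
from urllib.parse import urlencode

# Assumption: both values are copied from the Metadata Access URL in this record.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"
OAI_ID_PREFIX = "oai:lindat.mff.cuni.cz:"


def build_get_record_url(handle: str, metadata_prefix: str = "oai_dc") -> str:
    """Hypothetical helper: build a GetRecord URL for a handle such as '11234/1-4639'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": OAI_ID_PREFIX + handle,
    }
    # Leave ':' and '/' unescaped so the result matches the URL listed in the record.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":/")


url = build_get_record_url("11234/1-4639")
print(url)
```

The sketch only constructs the URL and does not contact the server; fetching and parsing the returned XML is left out.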

Provenance
Creator	Náplava, Jakub; Straka, Milan; Straková, Jana; Rosen, Alexandr
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); http://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Czech
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics