A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials

These supplementary materials accompany an open dataset of scanned images and OCR texts from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification, and is available at http://hdl.handle.net/11234/1-4615. The supplementary materials contain the OCR texts produced by different OCR engines for those book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.

Identifier
PID http://hdl.handle.net/11234/1-4935
Related Identifier https://nlp.fi.muni.cz/projects/ahisto/ocr-dataset
Related Identifier https://nlp.fi.muni.cz/raslan/2022/paper12.pdf
Related Identifier https://starfos.tacr.cz/en/project/TL03000365
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-4935
Provenance
Creator Novotný, Vít; Horák, Aleš
Publisher Masaryk University, Brno
Publication Year 2022
Rights Public Domain Dedication (CC Zero); http://creativecommons.org/publicdomain/zero/1.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language Czech; English; German; Latin
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics
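
The Metadata Access URL listed above is an OAI-PMH GetRecord request. As a minimal sketch, assuming only Python's standard library, the request URL can be rebuilt from its parts and the Dublin Core fields of a response collected as shown below. The function names `build_getrecord_url`, `dc_fields`, and `fetch_record` are illustrative and not part of any repository tooling, and the response is assumed to use the standard Dublin Core element namespace.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Base endpoint taken from the Metadata Access field of this record.
BASE_URL = "http://lindat.mff.cuni.cz/repository/oai/request"
# Standard Dublin Core element namespace (assumed for oai_dc responses).
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_getrecord_url(identifier, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord URL for one record identifier."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown in the record.
    return base_url + "?" + urlencode(params, safe=":/")


def dc_fields(xml_bytes):
    """Collect Dublin Core elements from an XML response into a dict of lists."""
    root = ET.fromstring(xml_bytes)
    fields = {}
    for elem in root.iter():
        if elem.tag.startswith(DC_NS) and elem.text and elem.text.strip():
            name = elem.tag[len(DC_NS):]
            fields.setdefault(name, []).append(elem.text.strip())
    return fields


def fetch_record(identifier):
    """Download one record and return its Dublin Core fields (needs network)."""
    with urlopen(build_getrecord_url(identifier), timeout=30) as response:
        return dc_fields(response.read())
```

Calling `fetch_record("oai:lindat.mff.cuni.cz:11234/1-4935")` requires network access; the keys of the returned dictionary depend on what the repository actually serves.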