A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials

Dataset

These are supplementary materials for an open dataset of scanned images and OCR texts from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification, and is available at http://hdl.handle.net/11234/1-4615. The supplementary materials contain the OCR texts produced by different OCR engines for those book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.

Identifier
PID	http://hdl.handle.net/11234/1-4935
Related Identifier	https://nlp.fi.muni.cz/projects/ahisto/ocr-dataset
Related Identifier	https://nlp.fi.muni.cz/raslan/2022/paper12.pdf
Related Identifier	https://starfos.tacr.cz/en/project/TL03000365
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-4935
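
The Metadata Access URL above is an OAI-PMH GetRecord request that returns the record in Dublin Core (oai_dc). The following is a minimal sketch of how that request could be built and its response read using only the Python standard library. The endpoint and identifier are copied from the URL above; the helper names `build_get_record_url` and `dublin_core_fields` are hypothetical and not part of any LINDAT tooling, and the parser assumes the response uses the standard Dublin Core element namespace.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint and identifier are taken from the Metadata Access URL above.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"
RECORD_ID = "oai:lindat.mff.cuni.cz:11234/1-4935"

# Standard Dublin Core element namespace (assumed to be used by the response).
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_get_record_url(identifier: str = RECORD_ID, prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = urlencode(
        {"verb": "GetRecord", "metadataPrefix": prefix, "identifier": identifier},
        safe=":/",  # keep ':' and '/' unescaped, as in the URL above
    )
    return f"{OAI_ENDPOINT}?{query}"


def dublin_core_fields(xml_text: str) -> dict[str, list[str]]:
    """Collect Dublin Core elements of an OAI-PMH response by element name."""
    fields: dict[str, list[str]] = {}
    for element in ET.fromstring(xml_text).iter():
        # Keep only non-empty elements from the Dublin Core namespace.
        if element.tag.startswith("{" + DC_NS + "}") and element.text:
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append(element.text.strip())
    return fields
```

The response body can be fetched with `urllib.request.urlopen(build_get_record_url())` and its decoded text passed to `dublin_core_fields`. Repeated elements, such as multiple languages or identifiers, are kept as lists rather than collapsed to a single value.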

Provenance
Creator	Novotný, Vít; Horák, Aleš
Publisher	Masaryk University, Brno
Publication Year	2022
Rights	Public Domain Dedication (CC Zero); http://creativecommons.org/publicdomain/zero/1.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Czech; English; German; Latin
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics