- Post-OCR correction training dataset sPeriodika-postOCR
  The post-OCR correction dataset consists of paragraphs of text, at least 100 characters in length, extracted from documents randomly sampled from the sPeriodika dataset...
- A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Docum...
  This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations...
- A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Docum...
  These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The...