11 datasets found

Keywords: text corpora

Filter Results
  • CorpusExplorer

    Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 45 interactive visualizations under a user-friendly interface. Routine tasks...
  • LiFR-Lite

    Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features...
  • Engineering job ads corpus

    The corpus presented consists of job ads in Spanish related to Engineering positions in Peru. The documents were preprocessed and annotated for POS tagging, NER, and topic...
  • LiFR-Lite (2021-11-05)

    Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features...
  • jusText

    jusText is a heuristic based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool has been implemented in Python, licensed under New BSD...
  • Amharic WIC Corpus

    Substantially cleaned version of existing morphologically annotated WIC Corpus.
  • Somali Web Corpus

    Somali web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
  • Oromo web corpus

    Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
  • ParaCrawl Corpus version 1.0

    The January 2018 release of the ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of...
  • Plaintext Wikipedia dump 2018

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at...
  • Tigrinya Web Corpus

    Tigrinya web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
You can also access this registry using the API (see API Docs).