5 datasets found

Keywords: under resourced language

  • AlbNews Albanian Topic Modeling

    AlbNews is a topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample includes a headline text...
  • Amharic Web Corpus

    Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015, and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on Amharic WIC...
  • Oromo web corpus

    Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
  • AlbMoRe Movie Reviews in Albanian

    AlbMoRe is a sentiment analysis corpus of movie reviews in Albanian, consisting of 800 records in CSV format. Each record includes a text review retrieved from IMDb and...
  • OdiEnCorp 2.0

    We have collected English-Odia parallel data for NLP research on the Odia language. The data for the parallel corpus was extracted from existing parallel...
You can also access this registry using the API (see API Docs).