Dataset - B2FIND

Pre-trained POS tagging models for German social media

Pre-trained POS tagging models for the HunPos tagger (Halácsy et al. 2007), the biLSTM-char-CRF tagger (Reimers & Gurevych 2017), Online-Flors (Yin et al. 2015)....

German Twitter Titling Corpus

The German Titling Twitter Corpus consists of 1904 stance-annotated tweets collected in June/July 2018 mentioning 24 German politicians with a doctoral degree. The Addendum...

GermEval-2018 Corpus (DE)

This dataset comprises the training and test data (German tweets) from the GermEval 2018 Shared Task on Offensive Language Detection.

Dataset: input and results related to the paper 'Anticipointment detection in...

This dataset features the training models, emotion classifications and emotion patterns before and after events, related to the paper: F. Kunneman, M. van Mulken and A. Van den...

Dataset: Events and periodicity analysis related to the paper 'Automatically ...

This dataset features information on all the events that were automatically extracted from Twitter and used as input to periodicity detection, as described in the paper: F....

Analysis of the toxicity of Spanish politics on Twitter during the pandemic...

List of the tweets analyzed in the research for the article "La toxicidad de la política española en Twitter durante la pandemia de la COVID-19", published in the journal...

Data of posting strategies of five major media outlets on Twitter during Musk...

The dataset contains an Excel file with the analysis of the impact of different digital newspapers on Twitter. The file includes different sheets in which one can find an...

Slovenian Twitter hate speech dataset IMSyPP-sl

A hand-labeled training (50,000 tweets labeled twice) and evaluation set (10,000 tweets labeled twice) for hate speech on Slovenian Twitter. The data files contain tweet IDs,...

Twitter corpus Janes-Tweet 1.0

Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approx. 9,000 users that tweet mostly in Slovene. The corpus is structured into...

Tweets about impact investing

The corpus contains 668,529 tweets (tweet IDs) relevant to "impact investing", accompanied by sentiment labels given by an automated sentiment classifier. Impact investing...

Dataset of European Parliament roll-call votes and Twitter activities MEP 1.0

The resource consists of two datasets related to Members of the 8th European Parliament (MEPs). The first one is a dataset of 2,535 roll-call votes of MEPs until 2016-03-01. The...

Slovenian Twitter dataset 2018-2020 1.0

The dataset represents the Twitter production in Slovenian in the period from 2018 until 2020. It consists of tweet IDs, retweet IDs, pseudo-anonymized user IDs, publication...

Dictionary of Twitterese Janes-Dict 1.0

The Dictionary of Twitterese 1.0 is the first attempt at a lexicographic description of non-standard Slovene as found on Twitter. Version 1.0 contains 1,002 entries, of which...

Tweet code-switching corpus Janes-Preklop 1.0

Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance),...

Brexit stance annotated tweets

The corpus contains over 4.5 million tweets (tweet IDs) automatically labeled by a machine learning program with stance regarding Brexit: Positive (supporting Brexit), Negative...

xLiMe Twitter Corpus XTC 1.0.1

The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...

Twitter sentiment for 15 European languages

The dataset contains over 1.6 million tweets (tweet IDs), labeled with sentiment by human annotators. There are 15 Twitter corpora for the corresponding 15 European languages....

CMC shortening corpus Janes-Kratko 1.0

Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to the supplied typology covering different types of spelling, lexical and...

Tweet comma corpus Janes-Vejica 1.0

Janes-Vejica is a corpus of Slovene tweets where commas are annotated with the reason for their (in)correct use, according to the supplied typology. The corpus was sampled from...

The Twitter user dataset for discriminating between Bosnian, Croatian, Monten...

The Twitter-HBS dataset consists of Twitter users, their tweets, and the label of their predominantly used language - Bosnian, Croatian, Montenegrin, or Serbian. Among the...

36 datasets found