Dataset - B2FIND

Frequency lists of pivot words and GSE counts

The resource contains data used to estimate the number of words in Lithuanian texts indexed by the selected Global Search Engines (GSE), namely Google (by Alphabet Inc.), Bing...

Wizerunek Andreja Babiša i Mateusza Morawieckiego w kontekście sytuacji kryzy...

A collection of articles from the Czech press about Mateusz Morawiecki (iDnes) and from the Polish press about Andrej Babiš (Rzeczpospolita)

fronda

A selection of texts from fronda.pl

KGR10-RoBERTa

Polish RoBERTa model pre-trained on KGR10 corpora.

Inforex

Inforex is a web-based system designed for managing and annotating text corpora on the semantic level, including annotation of Named Entities (NE), anaphora, Word Sense...

Liner2

Recognizes proper names in Polish text.

zmiany klimatu kraków

Workshops in Kraków - sociology

CorpoGrabber

"CorpoGrabber: The Toolchain to Automatic Acquiring and Extraction of the Website Content". Jan Kocoń, Wroclaw University of Technology. CorpoGrabber is a pipeline of tools to get...

Polish WSD Datasets

Data and code for the paper published at ICCS 2022: "A Unified Sense Inventory for Word Sense Disambiguation in Polish". The code is available at...

ELMo Embeddings for Polish

A model of ELMo embeddings for the Polish language trained on large textual corpora (KGR10). To retrain the model, please use the checkpoint and vocabulary files available at:...

MWE Świętochowski

Aleksander Świętochowski

1990_Skubiszewski

The first exposé of the Ministry of Foreign Affairs (MSZ) of the Third Polish Republic

Word Embeddings for Polish

Distributional language models for Polish trained on different corpora (KGR10, NKJP, Wikipedia).

AspectEmo 1.0: Multi-Domain Corpus of Consumer Reviews for Aspect-Based Senti...

The AspectEmo 1.0 corpus is an extended version of PolEmo 2.0, a publicly available corpus of Polish customer reviews that was used in many projects on the use of different methods...

MWE Mniszek, Gehenna czyli dzieje nieszczęśliwej miłości, Część pierwsza, Na ...

Helena Mniszek

MWE Sienkiewicz, Ogniem i mieczem

Henryk Sienkiewicz

Big Data language model - STEMMED - RAW data

Stemmed big data language model in RAW format

KPWr annotation guidelines - coreference

Coreference annotation guidelines describing the process of manually annotating documents in the Polish Corpus of Wrocław University of Technology (KPWr)

Wikinews_luty_marzec_2020

Test corpus _ 3_03_20

PELCRA PARL corpus

The corpus comprises 50 sampled recordings (12 hours) and manual transcriptions (ca. 101 00 word tokens) of parliamentary data.

653 datasets found