Deltacorpus 1.1

Dataset

Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), limited to the first 1,000,000 tokens per language and tagged with the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).

Changes in version 1.1:

Universal Dependencies tagset instead of the older and smaller Google Universal POS tagset.
SVM classifier trained on Universal Dependencies 1.2 instead of HamleDT 2.0.
Balto-Slavic, Germanic and Romance languages were each tagged by a classifier trained only on the respective language group. All other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
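
The group-based routing described in the last change can be sketched as follows. This is an illustration only, not the authors' code: the function name `pick_classifier`, the group labels, and the partial language-to-group table are assumptions made for the example, and the real assignment covers every language in the corpus.

```python
# Hypothetical sketch of the version 1.1 routing rule: languages from three
# groups get a group-specific classifier, and every other language gets the
# classifier trained on all available languages.

# Partial, illustrative table (an assumption), not the full list used for the corpus.
LANGUAGE_GROUPS = {
    "Czech": "balto-slavic",
    "Polish": "balto-slavic",
    "Lithuanian": "balto-slavic",
    "German": "germanic",
    "Swedish": "germanic",
    "French": "romance",
    "Spanish": "romance",
}


def pick_classifier(language: str) -> str:
    """Return the label of the training set whose classifier tags `language`."""
    # Languages outside the three groups fall back to the all-languages model.
    return LANGUAGE_GROUPS.get(language, "all")


for name in ("Czech", "German", "French", "Finnish"):
    print(name, "->", pick_classifier(name))
```

A plain dictionary lookup with a default keeps the fallback to the all-languages classifier explicit.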

Identifier
PID	http://hdl.handle.net/11234/1-1743
Related Identifier	http://hdl.handle.net/11234/1-1662
Related Identifier	http://ufal.mff.cuni.cz/deltacorpus
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-1743
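
The Metadata Access address above is an OAI-PMH `GetRecord` request. A minimal sketch of how it can be assembled from the handle is shown below. The helper name `oai_record_url` is hypothetical, and the assumption that other handles follow the same `oai:lindat.mff.cuni.cz:` identifier pattern is not confirmed by this record.

```python
from urllib.parse import urlencode

# Endpoint taken from the Metadata Access field of this record.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"


def oai_record_url(handle: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a handle suffix such as '11234/1-1743'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        # Identifier pattern copied from this record (assumed to generalize).
        "identifier": f"oai:lindat.mff.cuni.cz:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL listed above.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = oai_record_url("11234/1-1743")
print(url)
```

The resulting string can be passed to any HTTP client to retrieve the Dublin Core metadata as XML.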

Provenance
Creator	Mareček, David; Yu, Zhiwei; Zeman, Daniel; Žabokrtský, Zdeněk
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year	2016
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); http://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Belarusian; Bosnian; Bulgarian; Czech; Croatian; Upper Sorbian; Macedonian; Polish; Russian; Slovak; Slovenian; Slovene; Serbian; Ukrainian; Latvian; Lithuanian; Afrikaans; Danish; German; English; Faroese; Western Frisian; Swiss German; Alemannic; Alsatian; Icelandic; Limburgan; Limburger; Limburgish; Luxembourgish; Letzeburgesch; Low German; Low Saxon; German, Low; Saxon, Low; Dutch; Flemish; Norwegian Nynorsk; Nynorsk, Norwegian; Norwegian; Scots; Swedish; Yiddish; Aragonese; Asturian; Bable; Leonese; Asturleonese; Catalan; Valencian; French; Galician; Haitian; Haitian Creole; Italian; Latin; Neapolitan; Portuguese; Romanian; Moldavian; Moldovan; Spanish; Castilian; Walloon; Breton; Welsh; Gaelic; Scottish Gaelic; Irish; Greek, Modern (1453-); Greek; Armenian; Albanian; Persian; Farsi; Kurdish; Tajik; Bengali; Bangla; Gujarati; Hindi; Marathi; Marāṭhī; Nepali; Urdu; Amharic; Arabic; Hebrew; Estonian; Finnish; Hungarian; Basque; Georgian; Chuvash; Azerbaijani; Turkish; Uzbek; Kazakh; Tatar; Yakut; Korean; Mongolian; Telugu; Kannada; Malayalam; Tamil; Nepal Bhasa; Newari; Vietnamese; Indonesian; Javanese; Malagasy; Maori; Māori; Malay; Pampanga; Kapampangan; Sundanese; Tagalog; Waray; Swahili; Esperanto; Ido; Interlingua (International Auxiliary Language Association); Volapük
Resource Type	corpus
Format	application/x-tar; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics