BPEmb: Pre-trained Subword Embeddings in 275 Languages (LREC 2018)


BPEmb is a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as a testbed, BPEmb performs competitively with alternative subword approaches, and for some languages better, while requiring vastly fewer resources and no tokenization.

This dataset is split into 275 archives, one for each language. Languages are identified via their Wikipedia ID, e.g. "en" for English or "de" for German. Each archive contains three kinds of files:

.w2v.txt: Pretrained embeddings in word2vec plain text format, for different vocabulary sizes (vs1000 to vs100000) and embedding dimensionalities (d25 to d300).
.vocab: The byte-pair vocabulary for the given vocabulary size (vs1000 to vs100000) in plain text format.
.model: A byte-pair segmentation model in binary SentencePiece format. .model files contain the same information as .vocab files and are provided for convenience.
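
As an illustration of the word2vec plain text format used by the .w2v.txt files, the following is a minimal sketch of a reader written with the Python standard library only. It assumes the usual layout of that format: a header line giving the number of entries and the dimensionality, followed by one line per subword with space-separated values. The function name parse_word2vec_text and the sample data are hypothetical and made up for this example; they are not taken from the dataset.

```python
from io import StringIO
from typing import Dict, List, TextIO


def parse_word2vec_text(handle: TextIO) -> Dict[str, List[float]]:
    """Parse word2vec plain text: a 'count dim' header, then one entry per line."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        # The last `dim` fields are the vector; everything before is the token.
        token = " ".join(parts[:-dim])
        vectors[token] = [float(x) for x in parts[-dim:]]
    if len(vectors) != count:
        raise ValueError(f"header announced {count} entries, found {len(vectors)}")
    return vectors


# Hypothetical sample: 3 subwords with 4 dimensions each (not real BPEmb values).
SAMPLE = (
    "3 4\n"
    "▁the 0.1 0.2 0.3 0.4\n"
    "ing -0.5 0.0 0.25 1.0\n"
    "▁sub 1.5 -1.0 0.0 0.5\n"
)

embeddings = parse_word2vec_text(StringIO(SAMPLE))
print(len(embeddings), len(embeddings["ing"]))
```

To read a file extracted from one of the archives, the same function could be given a handle opened with open(path, encoding="utf-8"), where path points to a .w2v.txt file. For the larger vocabulary sizes and dimensionalities, an array-based container would likely be more memory-efficient than a dictionary of Python lists.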

DOI https://doi.org/10.11588/data/V9CXPR
Related Identifier http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf
Metadata Access https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/V9CXPR
Creator Heinzerling, Benjamin
Publisher heiDATA
Contributor Heinzerling, Benjamin
Publication Year 2019
Rights info:eu-repo/semantics/openAccess
OpenAccess true
Contact Heinzerling, Benjamin (Heidelberg University and Natural Language Processing (NLP) Group at the Heidelberg Institute for Theoretical Studies (HITS))
Resource Type Dataset
Format application/gzip
Size 269684399; 576401746; 271080030; 2384679012; 116466035; 2383513486; 2396840014; 1181803888; 2384977735; 269998895; 2386000594; 2386122552; 2394598179; 2384465062; 116007006; 573328017; 1187704405; 2389781119; 2387213423; 2380793180; 2389780312; 1175764072; 2389917593; 2391749555; 116992560; 1186990993; 115662901; 2389336189; 2389482557; 1178902175; 2382380907; 2383202838; 270510532; 1181665313; 2384296710; 575069338; 2384857068; 2393589701; 575324851; 116546335; 117046258; 2387297380; 1176719776; 573728071; 116674903; 1183270336; 2386200088; 271324049; 2391394562; 2384951790; 2384933703; 2386412863; 115671987; 2401037832; 1180235365; 572528711; 2414545448; 271086514; 116473902; 2391193759; 2383771998; 2383841531; 2384846001; 2384944106; 2384018524; 1179704180; 2386608623; 270004590; 2385473709; 55660927; 2395693136; 576975623; 1183562004; 2384320459; 1184511331; 2384268402; 1185513745; 2398355052; 2389402711; 2411022849; 570641911; 2384599979; 1182576987; 2391515728; 271253209; 2386563784; 1184824663; 1179018450; 268764516; 116720162; 2383964579; 1186335016; 2388146423; 2383753127; 2397608068; 1183998584; 2387565294; 2389214696; 1176742747; 2382990196; 574462300; 269239269; 55509080; 2407145023; 2399746223; 2385561436; 2384064396; 116859227; 576378864; 2369598861; 575051241; 2381786106; 570538596; 1183447415; 2388700796; 1188638298; 573004510; 116634540; 116211576; 2390478886; 577763384; 2394078726; 2390778612; 1188323643; 2363785522; 1187110996; 1181458202; 117192246; 2383516337; 2420345967; 574319812; 2389249529; 1181487011; 2382856126; 271200807; 2384261431; 1182880432; 1184531177; 1178538554; 2385964219; 2383855282; 574134110; 2412990102; 573002636; 575368009; 2385219613; 2386003996; 1184435159; 575449672; 2399036718; 1180177945; 6316729; 2394841506; 268583805; 2388157459; 2392059811; 2390758483; 2420462038; 2388492961; 2383150277; 2404195087; 2393312779; 2391960661; 1187700670; 1180139376; 1176313151; 115985201; 2383650071; 2390969391; 2396604791; 
24809703; 2384795717; 2384814073; 2384670510; 269634847; 574072897; 269780618; 268542051; 115983688; 2382516490; 1187621508; 570749434; 2396510562; 1186124565; 570524469; 2404854592; 573879718; 2391129903; 1185372559; 576252644; 1179473816; 116995990; 116925227; 2386595611; 2396527802; 2396400426; 271065366; 2392970845; 2384528888; 2395853977; 1177473720; 270107811; 116333314; 2385231364; 1182204444; 2391855670; 576080815; 2398092360; 2390058021; 2385959384; 2385653929; 2403572004; 1179274371; 1184660571; 55611938; 2385022183; 2389235919; 2385677803; 2384105242; 115928979; 1183938495; 1176023890; 2383032317; 270347626; 2389057911; 268157934; 1182096406; 116564728; 2389057716; 2385570353; 2384422439; 2413429531; 2390075752; 1186292773; 2390309052; 268234139; 2392365521; 2395700773; 116006981; 2401761378; 2383031644; 575911468; 269483972; 116248970; 2385741332; 270368000; 2390143076; 116554397; 55333602; 55340657; 1187025928; 1188784501; 2397769669; 2391297259; 2386231688; 2382414659; 2390144198; 1179381164; 116449294; 2384032465; 2397772757; 2398260792; 2387153827; 2400166440; 574514061; 2343919478; 577752016; 1183539557; 2394821377; 2404095269; 1183541939; 573342229; 1181154201; 2333653845; 574318069
Version 1.0
Discipline Other