The KAS-abs corpus contains 108,254 automatically identified Slovenian and/or English abstracts (30 million words) from 62,000 BSc/BA, MSc/MA, and PhD theses included in the KAS Corpus of Academic Slovene. This corpus is made available because the public version of KAS (http://hdl.handle.net/11356/1244) does not include the front matter of the theses and therefore lacks the abstracts.
The abstracts were identified on a per-page basis, and are either in Slovenian (*-abs-sl.txt, 47,273 files), in English (*-abs-en.txt, 49,261 files) or, where the abstracts in both languages were on the same page, in both languages (*-abs-slen.txt, 11,720 files).
The files contain the plain text of the abstracts, one paragraph per line. Note that because the cleaning of the source PDF files and the identification of the abstracts were done automatically, this corpus contains various types of errors.
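As an illustration of the one-paragraph-per-line layout, the following is a minimal Python sketch for reading a single abstract file. The function name read_paragraphs is hypothetical, and UTF-8 encoding is an assumption, since the description does not state the encoding; undecodable bytes are replaced rather than raising an error, given that the texts come from automatic PDF conversion.

```python
def read_paragraphs(path):
    """Return the non-empty paragraphs of one abstract file (one per line)."""
    paragraphs = []
    # Assumption: files are UTF-8; errors="replace" tolerates damaged bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            paragraph = line.strip()
            if paragraph:  # skip blank lines
                paragraphs.append(paragraph)
    return paragraphs
```

Blank lines are skipped here as a precaution; whether they actually occur in the files is not stated in the description.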
The files are stored in the same manner as for the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. The file with the metadata for the corpus texts is also included.
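A simple sanity check after unpacking is to count the abstract files per language by their filename suffix. The sketch below assumes the suffixes -abs-sl.txt, -abs-en.txt and -abs-slen.txt as given in the description; the function names and the root directory passed in are hypothetical, and files that match none of the suffixes (for example the metadata file, whose name is not given here) are ignored.

```python
from collections import Counter
from pathlib import Path

# Filename suffix -> language label, following the naming in the description.
SUFFIXES = {
    "-abs-sl.txt": "sl",
    "-abs-en.txt": "en",
    "-abs-slen.txt": "sl+en",
}

def language_of(filename):
    """Return the language label for an abstract filename, or None."""
    for suffix, label in SUFFIXES.items():
        if filename.endswith(suffix):
            return label
    return None

def count_by_language(root):
    """Walk all subdirectories of root and count abstract files per language."""
    counts = Counter()
    for path in Path(root).rglob("*.txt"):
        label = language_of(path.name)
        if label is not None:
            counts[label] += 1
    return counts
```

On a complete copy of the corpus, the three counts would be expected to match the file totals stated above, and their sum the overall number of abstracts.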
The abstracts can be useful for research in, e.g., machine translation and terminology extraction, and, in combination with the full texts from the KAS corpus, for studies in automatic summarisation.