Slovene instruction-following dataset for large language models GaMS-Instruct-DH 1.0

Dataset

GaMS-Instruct-DH is an instruction-following dataset for fine-tuning Slovene large language models. It consists of prompt-response pairs, some of which contain an additional context field, as well as a field listing the source of the information included in the response.
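The record structure described above could be modelled roughly as follows. This is only a sketch: the class name `InstructionRecord`, the field names `prompt`, `response`, `context` and `source`, and the helper `build_model_input` are assumptions made for illustration, not the names or the file layout actually used in the distributed files. Both the context and the source are treated as optional here, and the sample values are placeholders rather than real dataset content.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstructionRecord:
    """One prompt-response pair (field names are hypothetical)."""
    prompt: str
    response: str
    context: Optional[str] = None  # assumed to be present only for some pairs
    source: Optional[str] = None   # assumed to name where the response's information comes from


def build_model_input(record: InstructionRecord) -> str:
    """Join the prompt with its context, if any, into one input string."""
    parts = [record.prompt]
    if record.context:
        parts.append(record.context)
    return "\n\n".join(parts)


# Placeholder values, not taken from the dataset.
with_context = InstructionRecord(
    prompt="<prompt text>",
    response="<response text>",
    context="<context text>",
    source="<source reference>",
)
without_context = InstructionRecord(prompt="<prompt text>", response="<response text>")

print(build_model_input(with_context))
print(build_model_input(without_context))
```

Keeping the optional fields as `None` rather than empty strings makes it easy to tell pairs without a context apart from pairs whose context is merely empty.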

The dataset focuses on prompts from the field of digital humanities and museum documentation. Its primary goal is to provide a resource for extending large language models already available for digital humanities to Slovene and to similar but less-resourced languages (e.g. Bosnian).

Version 1.0 includes approximately 10,000 prompt-response pairs, which were compiled entirely by hand by a team of linguists and experts from the field of digital humanities.

Identifier
PID	http://hdl.handle.net/11356/1975
Related Identifier	https://www.cjvt.si/povejmo/en/project/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1975
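The Metadata Access link above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be assembled from its parts with the Python standard library; the variable names are illustrative, and no network request is made here.

```python
from urllib.parse import urlencode

# Base endpoint and query parameters, as they appear in the Metadata Access link.
base_url = "http://www.clarin.si/repository/oai/request"
params = {
    "verb": "GetRecord",
    "metadataPrefix": "oai_dc",
    "identifier": "oai:www.clarin.si:11356/1975",
}

# safe=":/" leaves the colons and slash of the identifier unescaped,
# so the result matches the link exactly as it is listed in the record.
metadata_url = base_url + "?" + urlencode(params, safe=":/")
print(metadata_url)
```

Dropping the `safe` argument would percent-encode the identifier instead, which is also a valid form of the same query.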

Provenance
Creator	Šorn, Mojca; Cvek, Ana; Skubic, Jure; Logar, Tamara; Zagoranski, Sašo; Bratanović, Alen
Publisher	Institute of Contemporary History; Semantika d.o.o.; Faculty of Computer and Information Science, University of Ljubljana
Publication Year	2024
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics