Slovene instruction-following dataset for large language models GaMS-Instruct-DH 1.0

PID

GaMS-Instruct-DH is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions. It consists of pairs of prompts and responses, some of which contain an additional context field, as well as a field in which the source of the information included in the response is listed.

The dataset focuses on prompts from the field of digital humanities and museum documentation. Its primary goal is to provide a resource that allows existing large language models already available for the field of digital humanities to be expanded to cover Slovene and other similar, but less-resourced languages (e.g. Bosnian).

Version 1.0 include approx. 10,000 prompt-response pairs which were compiled entirely by hand by a team of linguists and experts from the field of digital humanities.

Identifier
PID http://hdl.handle.net/11356/1975
Related Identifier https://www.cjvt.si/povejmo/en/project/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1975
Provenance
Creator Šorn, Mojca; Cvek, Ana; Skubic, Jure; Logar, Tamara; Zagoranski, Sašo; Bratanović, Alen
Publisher Institute of Contemporary History; Semantika d.o.o.; Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2024
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics