Slovene instruction-following dataset for large language models GaMS-Instruct-MED 1.0

Dataset

GaMS-Instruct-MED is an instruction-following dataset for fine-tuning Slovene large language models to follow instructions in the medical domain. It consists of prompt-response pairs from the field of medicine, in particular pairs concerning the use of pharmaceutical drugs and medications.

The dataset was generated in several steps. First, in consultation with medical experts, a set of prompts was manually compiled, covering questions relevant to drug and medication use. Then, for each medication in the PoVeJMo-VeMo-Med 1.0 dataset (http://hdl.handle.net/11356/1983), approximately 10-15 questions were generated automatically using prompt tuning, based on the instructions for use of that medication. Inadequate questions were removed manually, whereas the responses were generated fully automatically by a specialized RAG (retrieval-augmented generation) system.

Please note that the current version of the dataset contains 18,897 prompt-response pairs. Their clinical accuracy is not guaranteed, and they may contain errors resulting from LLM hallucinations.

Identifier
PID	http://hdl.handle.net/11356/1982
Related Identifier	https://www.cjvt.si/povejmo/en/project/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1982
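
The Metadata Access link above is an OAI-PMH GetRecord request. As a minimal sketch, the Python below shows how such a URL could be assembled from the handle suffix and how Dublin Core fields could be read from the XML it returns. The function names `build_getrecord_url` and `dc_fields` are hypothetical helpers introduced for illustration, and the actual contents of the repository's response are not shown in this record.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint taken from the Metadata Access field of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
# Standard Dublin Core element namespace used by the oai_dc metadata format.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_getrecord_url(handle_suffix: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a handle suffix such as '11356/1982'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle_suffix}",
    }
    # Keep ':' and '/' unescaped so the URL matches the form used in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


def dc_fields(xml_text: str) -> dict[str, list[str]]:
    """Collect Dublin Core elements from an XML response into a dict of lists."""
    fields: dict[str, list[str]] = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(DC_NS) and element.text:
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields
```

Calling `build_getrecord_url("11356/1982")` reproduces the Metadata Access URL listed above. The XML obtained by fetching that URL (for instance with `urllib.request.urlopen`) could then be passed to `dc_fields`.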

Provenance
Creator	Tovornik, Robert; Pavlović, Anđela; Plesnik, Emil; Fabjan, Borut
Publisher	Better, d.o.o.; Faculty of Computer and Information Science, University of Ljubljana
Publication Year	2024
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics