Developing Large Language Models for Quantum Chemistry Simulation Input Generation

Dataset

This repository contains all code used to run the experiments described in the study "Developing Large Language Models for Quantum Chemistry Simulation Input Generation". In addition to the code for our system architecture, it includes the datasets described in the study, which can be used for further research, and several generally useful classes. To reproduce the results of the study, refer to the Scripts folder, which documents the scripts used to run our experiments and gather data. For details on the classes and how to use them in your own research, refer to the Classes folder; for instance, it shows how to use our rule-based system to generate different calculations. The datasets we used can be inspected and extracted from the Data folder, where every available dataset is explained. The Orca Output folder stores all output files gathered from running ORCA calculations. Note that to use the code in this repository, you must configure your own OpenAI API key in your system path. To use retrieval-augmented generation (RAG), you must also scrape the ORCA input library with the provided script and add the ORCA manual to the Documents/Regular folder. We do not publish these materials here because we are not their authors.
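As a hedged illustration of the API key requirement noted above, the following Python sketch checks that a key is available before any experiment script is started. It assumes the key is exposed through an environment variable named OPENAI_API_KEY, which is the name the official OpenAI Python client reads by default; whether the scripts in this repository look for the key under that exact name is an assumption, and the helper get_openai_api_key is hypothetical and not part of the repository.

```python
import os


def get_openai_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key found in the environment, or raise a clear error.

    Hypothetical helper for illustration only. `var_name` is an assumption:
    it matches the default of the official OpenAI client, but the repository
    scripts may expect a different setup.
    """
    # Allow a plain dict to be passed in so the check can be tested
    # without touching the real process environment.
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; configure your own OpenAI API key "
            "before running the scripts."
        )
    return key
```

Calling get_openai_api_key() at the top of a script makes a missing key fail immediately with a readable message, rather than partway through a batch of requests.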

Identifier
DOI	https://doi.org/10.34894/WNRHA4
Related Identifier	IsCitedBy https://doi.org/10.26434/chemrxiv-2024-9g2w2
Metadata Access	https://dataverse.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34894/WNRHA4

Provenance
Creator	Pollice, Robert ; Jacobs, Pieter Floris
Publisher	DataverseNL
Contributor	Groningen Digital Competence Centre; Pollice, Robert; Jacobs, Pieter Floris; DataverseNL network
Publication Year	2024
Rights	CC-BY-4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess	true
Contact	Groningen Digital Competence Centre (rug.nl)

Representation
Resource Type	Python code, ORCA input files, User prompts; Dataset
Format	application/zip; application/octet-stream; text/plain
Size	1886040; 1612; 265
Version	1.0
Discipline	Chemistry; Natural Sciences
Spatial Coverage	Groningen, The Netherlands