Test dataset for assessment FROGS 16S amplicon methodology

Grinder (v.0.5.3) (Angly, et al., 2012) was used to simulate the PCR amplification of full-length (V3V4 and V4) sequences from reference databases. We generated 25 sets of species manually extracted from UTAX (Simulated Data From UTAX = SDFU) and 25 others from the SILVA (v123) databank (Quast, et al., 2013) (Simulated Data From SILVA = SDFS) (Figure 1 of main text and companion website: tabs SDFU/Datasets and SDFS/Datasets). We generated amplicons by (i) filtering out sequences with ambiguous nucleotides, (ii) keeping only bacterial species with an unambiguous taxonomic affiliation and, for sequences from SILVA, with pintail>50 (Ashelford, et al., 2005), (iii) requiring a match (with 10% of mismatches allowed) to the forward (TACGGRAGGCAGCAG) and reverse (TAGGATTAGATACCCTGGTA) primers in the V3V4 region and to the forward (GTGCCAGCMGCCGCGGTAA) primer in the V4 region, and (iv) maximizing the phylogenetic diversity of the amplicons in the full-length 16S phylogenetic tree. This results in 25 increasingly complex nested databases.
Grinder requires both error and abundance profiles to generate sequences. We used the following error parameters: the error rate increases linearly from 0.301% to 0.303% per base along the read; 98.6% of errors are SNPs and 1.4% are indels. These parameters were calibrated by mapping reads from a single-strain MiSeq sequencing run to its known sequence, so as to mimic typical MiSeq error profiles, and they agree with other reported values (D'Amore, et al., 2016; Schirmer, et al., 2016; Schirmer, et al., 2015).
We used the default n-mer distribution: 89% bimeras, 11% trimeras and 0.3% quadrimeras, corresponding to the average values published in Quince, et al. (2011). The fraction of chimeras increased with the reference database size to reflect increasing sequence similarities: 5% for 20 taxa, 10% for 100 and 200 taxa, and 20% for 500 and 1000 taxa. Chimera breakpoints were distributed uniformly along the amplicon.
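As an illustration of the primer-matching criterion (iii) above, the following minimal Python sketch searches a sequence for a primer containing IUPAC ambiguity codes (such as R or M) while tolerating up to 10% mismatches. It is not the code used to build the dataset: the function names are hypothetical, rounding the mismatch allowance down is an assumption, and the reverse primer would additionally need to be reverse-complemented before searching.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, window):
    """Count positions where the sequence base is not allowed by the primer code."""
    return sum(base not in IUPAC[code] for code, base in zip(primer, window))


def find_primer(primer, sequence, max_mismatch_fraction=0.10):
    """Return the first 0-based position where the primer matches, or -1.

    The number of tolerated mismatches is the primer length times the
    allowed fraction, rounded down (an assumption of this sketch).
    """
    allowed = int(len(primer) * max_mismatch_fraction)
    for start in range(len(sequence) - len(primer) + 1):
        window = sequence[start:start + len(primer)]
        if count_mismatches(primer, window) <= allowed:
            return start
    return -1


# Forward V3V4 primer from the description; the target sequence is made up.
forward_v3v4 = "TACGGRAGGCAGCAG"
position = find_primer(forward_v3v4, "GGTACGGGAGGCAGCAGTT")
```

With this rounding, the 15-nt and 19-nt forward primers tolerate one mismatch each and the 20-nt reverse primer tolerates two.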
We considered two different abundance profiles: uniform and power law. For the power law abundance profile, parameters were calibrated to set the expected max/min abundance ratio to 100 (20 taxa), 1000 (100 and 200 taxa) or 10000 (500 and 1000 taxa). For each combination of database size (20/100/200/500/1000), abundance profile (uniform, power law) and amplicon region (V3V4/V4), we generated 5 communities with different compositions. We then simulated 10 samples of 100000 reads each from each community. Finally, we used cutadapt (v1.7.1) to trim primers from the generated reads. Trimmed sequences were not preprocessed with quality filters but were used as is in downstream analyses.
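The calibration of the power law profile and the size of the resulting design can be sketched in Python as follows. The sketch assumes that the abundance of the taxon at rank r is proportional to r^(-alpha); this parameterization is chosen for illustration and is not confirmed to be the one used by the authors or by Grinder. The function name is hypothetical, and the sample count is derived from the numbers in the description for a single reference source (SDFU or SDFS), which is one reading of the text.

```python
import math
from itertools import product


def power_law_abundances(n_taxa, max_min_ratio):
    """Relative abundances proportional to rank**(-alpha).

    alpha is chosen so that the most abundant taxon (rank 1) is
    max_min_ratio times more abundant than the least abundant one
    (rank n_taxa): n_taxa**alpha == max_min_ratio.
    """
    alpha = math.log(max_min_ratio) / math.log(n_taxa)
    weights = [rank ** -alpha for rank in range(1, n_taxa + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Target max/min ratios per database size, as given in the description.
target_ratio = {20: 100, 100: 1000, 200: 1000, 500: 10000, 1000: 10000}
abundances = power_law_abundances(20, target_ratio[20])

# Enumerate the design: size x abundance profile x amplicon region.
sizes = [20, 100, 200, 500, 1000]
profiles = ["uniform", "power law"]
regions = ["V3V4", "V4"]
combinations = list(product(sizes, profiles, regions))

communities = len(combinations) * 5   # 5 communities per combination
samples = communities * 10            # 10 samples per community
```

Under these assumptions there are 20 combinations, 100 communities and 1000 simulated samples of 100000 reads each per reference source.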

grinder, 0.5.4

Identifier
DOI https://doi.org/10.15454/VGVCIJ
Related Identifier https://doi.org/10.1093/bioinformatics/btx791
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.15454/VGVCIJ
Provenance
Creator Rué, Olivier
Publisher Recherche Data Gouv
Contributor Rué, Olivier; Pascal, Géraldine; Bernard, Maria
Publication Year 2021
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact Rué, Olivier (INRAE - Institut national de recherche pour l’agriculture, l’alimentation et l’environnement)
Representation
Resource Type Dataset
Format application/gzip
Size 4914792443; 3301562885; 4305506570; 2745409479
Version 1.2
Discipline Life Sciences; Basic Biological and Medical Research; Biology; Medicine; Omics