1) eDNA
The deposited genetic data were obtained from environmental DNA (eDNA) samples collected around Réunion Island in the southwest Indian Ocean between May 2018 and June 2019. Surface water samples were collected from a vessel without access to cold storage or onboard laboratory facilities. To limit cross-contamination, all personnel onboard wore protective gowns and nitrile gloves during sample collection. Sampling was conducted at the front of the vessel to avoid contact with the hull, using sterile 5 L bottles at the sea surface interface.
To develop an eDNA protocol, 11 surface water samples were collected during marine mammal sightings and processed with various filtration capsules and preservation buffers. Based on assessments of feasibility, filtration time, DNA concentration, and cost, a protocol was established for sampling 20 sites around Réunion Island. At each site, 10 L of seawater (2 x 5 L) was collected at the sea surface interface and processed using a Sterlitech filter, RNAlater solution, a peristaltic pump, and sterile tubing. Sites with marine mammal observations were favored so that eDNA detections could be compared with recorded sightings. Of the 20 samples, 14 were collected within 10-20 m of marine mammals.
Following primer testing, the ~230 bp hypervariable region of the 12S rRNA gene (MiMammal) was amplified. DNA amplifications were conducted with 12 PCR replicates in a final volume of 10 μL. The amplification mixture contained 1X Phusion Green Hot Start II High-Fidelity PCR Master Mix (Thermo Scientific), 0.4 μM of each of the tailed primers, 2 μM of the human blocking primer developed for this study, 0.8 μg/μL bovine serum albumin (BSA, Thermo Scientific), 3% DMSO (Thermo Scientific), and 1.5 mM MgCl2 (Invitrogen), topped up with PCR-grade water (Thermo Scientific). The human blocking primer was thus added at a 5x concentration relative to the mammal primers. PCR conditions consisted of an initial denaturation at 98 °C for 3 minutes, followed by 45 cycles of 20 seconds at 98 °C, 15 seconds at 69 °C, and 15 seconds at 72 °C, and a final elongation step at 72 °C for 5 minutes. To monitor potential contaminants, a total of 3 negative extraction controls, 3 negative PCR controls (ultrapure water, 12 replicates), and 3 positive control samples (a mock community with a known composition) were amplified and sequenced in parallel with the samples. Amplification success was determined by gel electrophoresis. DNA was purified to remove PCR inhibitors using a DNeasy PowerClean Pro Cleanup Kit (Qiagen). Purified DNA extracts were quantified using a Qubit dsDNA HS Assay Kit on a Qubit 3.0 fluorometer (Thermo Scientific). PCR replicates were pooled and sequencing adapters were added. The final library was sequenced using an Illumina MiSeq V2 kit at 15 pM with a 10% PhiX spike.
Sequence data were processed using a NatureMetrics custom bioinformatics pipeline for quality filtering, dereplication, and taxonomic assignment. Samples were demultiplexed based on the combination of the i5 and i7 index tags. Paired-end reads for each sample were merged with USEARCH with a minimum overlap of 20% of the total read length. Forward and reverse primers were trimmed from the merged sequences with CUTADAPT, and sequences were retained if the trimmed length was between 140 bp and 200 bp. These sequences were quality filtered with USEARCH to retain only those with an expected error rate per base of 0.05 or below and dereplicated by sample, retaining singletons. Unique reads from all samples were denoised in a single analysis with UNOISE, requiring retained sequences to have a minimum abundance of 8 in at least one sample. A taxon-by-sample table was generated by mapping all dereplicated reads for each sample to the denoised sequences with USEARCH at an identity threshold of 97%. Denoised sequences were identified via BLAST against the nucleotide (nt) database from GenBank. Identifications to species level were based on the highest available percentage identity ≥99%, with an e-value threshold of 1e-20 and a hit length of at least 80% of the query sequence. Where multiple reference sequences matched the query sequence equally well, a more conservative, higher-level taxonomic classification was assigned. Only sequences with species- or genus-level identifications were included in the final results. Where a species was represented by multiple Operational Taxonomic Units (OTUs), the sequence with the highest percentage match to that species was taken as the representative. Typically, the other sequences have the same occurrence pattern, and their lower sequence similarity can be attributed to PCR or sequencing errors.
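The read-filtering thresholds described above can be illustrated with a minimal Python sketch. It is not part of the NatureMetrics pipeline and does not reproduce USEARCH, CUTADAPT, or UNOISE; the function names, the Phred+33 quality encoding, and the data layout are assumptions made for illustration only. The last function applies only the minimum-abundance rule, not the denoising itself.

```python
from collections import Counter

def expected_error_rate(quality, phred_offset=33):
    """Mean expected error per base from a Phred-encoded quality string.

    Assumes Phred+33 encoding (an assumption, not stated in the source).
    """
    probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality]
    return sum(probs) / len(probs)

def passes_filters(seq, quality, min_len=140, max_len=200, max_ee_rate=0.05):
    """Apply the length window (140-200 bp) and the per-base expected error cutoff."""
    if not (min_len <= len(seq) <= max_len):
        return False
    return expected_error_rate(quality) <= max_ee_rate

def dereplicate(seqs):
    """Collapse identical sequences within one sample, keeping singletons."""
    return Counter(seqs)

def abundant_enough(per_sample_counts, min_abundance=8):
    """Sequences reaching the minimum abundance in at least one sample.

    This is only the abundance rule; it does not perform error-model denoising.
    """
    keep = set()
    for counts in per_sample_counts.values():
        keep.update(s for s, n in counts.items() if n >= min_abundance)
    return keep
```

For example, a 150 bp read whose bases all have Phred quality 40 passes both filters, whereas a read of the same length at Phred quality 13 throughout is rejected because its expected error rate per base is just above 0.05.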
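The taxonomic assignment rule can be sketched in the same way. The Python example below is a hedged illustration, not the pipeline's implementation: the `Hit` record, its field names, and the treatment of the 1e-20 value as a maximum e-value are assumptions. The source states thresholds only for species-level identifications, so the genus-level fallback here covers only the case of equally good matches to several species.

```python
from typing import List, NamedTuple, Optional, Tuple

class Hit(NamedTuple):
    """One BLAST hit (hypothetical record layout for this sketch)."""
    genus: str
    species: str      # full binomial, e.g. "Stenella longirostris"
    pident: float     # percentage identity
    align_len: int    # length of the aligned region
    evalue: float

def assign_taxon(hits: List[Hit], query_len: int,
                 min_pident: float = 99.0,
                 min_cover: float = 0.8,
                 max_evalue: float = 1e-20) -> Optional[Tuple[str, str]]:
    """Return (rank, name), or None if no species- or genus-level call is possible."""
    good = [h for h in hits
            if h.pident >= min_pident
            and h.align_len >= min_cover * query_len
            and h.evalue <= max_evalue]
    if not good:
        return None
    best = max(h.pident for h in good)
    top = [h for h in good if h.pident == best]
    species = {h.species for h in top}
    if len(species) == 1:
        return ("species", species.pop())
    # Several species match equally well: fall back to a more conservative rank.
    genera = {h.genus for h in top}
    if len(genera) == 1:
        return ("genus", genera.pop())
    return None  # ambiguous above genus level, excluded from the final results
```

With this rule, a query matched at 100% identity to both Stenella longirostris and another Stenella species would be reported at genus level only.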
These genetic data provide a valuable resource for studying marine biodiversity around Réunion Island and will contribute to a better understanding of the distribution and diversity of marine mammals in this region of the Indian Ocean.
2) Sequences
Sequencing was performed using the Sanger method on an ABI 3730xl DNA Analyzer. The amplified PCR products were subjected to cycle sequencing with fluorescently labeled dideoxynucleotides (ddNTPs), followed by separation of fragments via capillary electrophoresis. Contigs were generated for each individual sample using standard bioinformatics tools to produce high-quality short sequences suitable for downstream phylogenetic and genetic analyses. The two sequence files contain the Sanger-based contigs assembled for Tursiops aduncus (Ta2 and Ta21) and Stenella longirostris (SL7 and SL9), respectively.
3) Sequence Alignment Script
The R script provided in this dataset was specifically developed to align Stenella longirostris sequences produced during this study (SL7 and SL9) with the eDNA sequence detected around Réunion Island. This script allows for the precise calculation of nucleotide differences between sequences and generates alignment outputs that can be used for further comparative analyses. Although designed for this particular study, the script is versatile and can be applied to align other sequences, whether obtained from Stenella longirostris or other cetacean species. The alignment methodology is based on the Needleman-Wunsch algorithm for global alignment, which provides robust and reliable comparison results even when sequences are of different lengths. The script is fully documented and available in the Dataverse repository alongside this dataset. Researchers are encouraged to adapt and apply the script to their own sequence alignment needs.
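To illustrate the alignment approach, the following is a minimal Python sketch of Needleman-Wunsch global alignment followed by a count of nucleotide differences. It is not the R script deposited with this dataset; the scoring values (match +1, mismatch -1, linear gap -2) and the function names are assumptions chosen for illustration, and the deposited script may use different parameters.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences with a linear gap penalty.

    Returns (aligned_a, aligned_b, score). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

def count_differences(aln_a, aln_b):
    """Return (mismatches, gap_positions) between two aligned sequences."""
    mismatches = sum(1 for x, y in zip(aln_a, aln_b)
                     if x != y and "-" not in (x, y))
    gaps = sum(1 for x, y in zip(aln_a, aln_b) if "-" in (x, y))
    return mismatches, gaps
```

Because the alignment is global, length differences between two sequences appear as gap positions in the output, so substitutions and gaps are reported separately here.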