Four ethanol-preserved tissues obtained from different cephalopod species belonging to the orders Octopoda, Oegopsida, and Sepiida were analysed (see sample IDs and cephalopod species in Table 1). Species were identified according to morphology and DNA bar coding (Fernando Fernández-Álvarez, personal communication). The genetic p-distances between the selected individuals were between 80.1 and 85.7 for the cytochrome c oxidase subunit I gene (COI) used in this study.
Total DNA was extracted from each individual using the NZY- Tissue gDNA Isolation Kit (NZYTech). DNAs were quantified with the Qubit dsDNA HS Assay Kit (ThermoFisher Scientific) and used as input for the preparation of the libraries.
We followed a standard Illumina library preparation protocol. In brief, we amplified the COI region (i.e., the standard animal barcode, Hebert et al. 2003) and included the Illumina specific adapters and indices by following a two-step PCR approach, slightly modified from Lange et al. (2014). For the sake of clarity, we refer to these PCRs as PCR1 and PCR2.
PCR1 primers were LCO1490 and HCO2198 (Folmer et al. 1994), which proved successful in a previous study in which the same specimens were DNA barcoded (Fernando Fernández-Álvarez et al., personal communication). Oligonucleotide tails bearing the Illumina sequencing primers were attached to the 5´ ends of primers LCO1490 and HCO2198. PCR2 was carried out with tailed primers that bear the indices and adapters and anneal to the Illumina sequencing primers (see Fig. 1 for a schematic representation of the binding process).
195
FIGURE 1 Schematic representation of the primers used for PCR1 and PCR2 (see main text). The positions of the Illumina adapters, indices, and sequencing primers are also shown. Note that primers are not drawn to scale.
PCR1 was carried out using 25 ng of total DNA in a final volume of 25 μL containing 6.50 μL of Supreme NZYTaq Green PCR Master Mix (NZYTech) (nonproofreading polymerase;
error rate of 1 × 10−5 according to the manufacturer), 0.5 μM of each primer, and PCR- grade water up to 25 μL. The thermal cycling conditions were as follows: an initial denaturation step at 95 °C for 5 min, followed by 35, 40, 45, 50, or 55 cycles (see Fig. 2) of denaturation at 95 °C for 30 s; annealing at 53 °C for 30 s; extension at 72 °C for 45 s; and a final extension step at 72 °C for 10 min. The products of PCR1 were purified using the SPRI method (DeAngelis et al. 1995), with Mag- Bind RXNPure Plus magnetic beads (Omega Biotek). The purified products were loaded in a 1% agarose gel stained with GreenSafe (NZYTech) and visualised under UV light.
PCR2 was carried out using 2.5 μL of the purified PCR1 products, and the same conditions as for PCR1 except for the number of cycles, which was set to five (Fig. 2) and the annealing temperature (60 °C). The products obtained were purified following the SPRI method as indicated above. Then, the purified products were loaded in a 1% agarose gel stained with GreenSafe (NZYTech) and visualised under UV light. All samples yielded libraries of the expected size.
196
FIGURE 2 From each cephalopod sample, five different high-throughput DNA barcoding libraries were constructed and sequenced in the Illumina MiSeq platform. In each of these five libraries, the number of PCR cycles during PCR1 was different (35, 40, 45, 50, and 55 cycles).
197
Libraries were quantified using the Qubit dsDNA HS Assay Kit (ThermoFisher Scientific) and pooled in equimolar amounts. The pool was sequenced in a fraction of a 600-cycle run (MiSeq Reagent Kit v3; PE300) of an Illumina MiSeq sequencer along with a PhiX library used to increase sequence diversity of the overall library, in Macrogen (Seoul, Korea).
FASTQ files were demultiplexed using RTA 1.18.54 (Illumina) and checked with FastQC 0.11.3 (http://www.bioinformatics.babraham. ac.uk/projects/fastqc/). Then, they were quality-trimmed using very conservative parameters in Trimmomatic 0.36 (Bolger et al. 2014) with the option SLIDINGWINDOW:1:30. SLIDINGWINDOW starts scanning at the 5´ end and clips the read once the average quality within the window falls below a threshold (Trimmomatic Manual 0.32). We set the size of the window to 1 and the quality threshold to 30 (Phred Quality Score). Therefore, when the quality of a single nucleotide fell below a Phred Quality Score of 30, the read was clipped from this position to the 3´ end. We used these very conservative parameters to make sure that the mutations observed in the sequencing results were due to PCR errors and not to sequencing errors. The quality of the resulting files was checked again with FastQC.
Quality-trimmed FASTQ files were imported into Geneious 8.1.6 (http://www.geneious.com, Kearse et al. 2012). Each pair of R1 and R2 files were set as paired reads to improve the mapping. A map- to-reference analysis was carried out with the Geneious mapper using relaxed parameters (maximum number of mismatches per read, 25%; minimum overlap identity, 80%) to allow potentially mutated reads to map. The DNA barcode sequences from the four cephalopod specimens were set as references (DDBJ/EMBL/ GenBank accession numbers KX078469–KX078472). The results of the map-to-reference analysis were inspected manually to verify that the reads of each library mapped to the correct reference sequence. We obtained 20 assembly files corresponding to the four species by the five PCR treatments.
Regions including the first 50 nucleotides of the mapped R1 and R2 reads (starting immediately after the primer annealing region) were aligned in each assembly with Muscle (Edgar 2004) as implemented in Geneious 8.1.6. We selected these two 50-nucleotide
198
regions because such read length accumulated the maximum number of reads after passing the quality threshold (see above); using larger regions would have reduced the sample size and, therefore, the statistical power of the analysis. Reads were trimmed to the same length to simplify later bioinformatic analyses.
For each alignment file, we calculated the number of mutations per read by comparing every read against the consensus sequence. The consensus sequences obtained from the FASTA files of the same species were identical between them (regardless of the number of PCR cycles) and they were also identical to the corresponding COI sequences available in DDBJ/EMBL/GenBank. For this, we used a custom developed R function (R Core Team 2016) to calculate the number of mutations by multiplying the pairwise genetic p-distance by the total length of our reads. The function treated insertions and deletions (indels) as single mutational steps and the genetic p-distance was calculated with the dist.dna function (raw model) from the ape 3.4 R package (Paradis et al. 2004). Then, we ran a Poisson generalised linear mixed model (GLMM) on the entire resulting data set (glmer function from package lme4 1.1-12; Bates et al. 2015). We considered the number of mutations as the response variable, the number of cycles as the predictor variable, and the species as a random factor. We confirmed assumptions underlying GLMMs by exploring regression residuals for normality against a Q-Q plot.
Finally, to make sure that the PCR1 reaction was still functioning after 55 cycles (i.e., that the emergence of new artificial mutations was still possible), qPCRs were performed in all four samples with the same parameters as in PCR1, but with 60 cycles to cover the whole range of our experiment. The resulting fluorescence versus number of cycles plots were visually analysed, confirming that the reaction was still taking place after 55 cycles.
Results
Due to the stringent quality-filtering, only 2.26% of the raw reads were used for the statistical analyses (see supplementary material, Table S1). The average quality of both the raw and
199
quality-trimmed reads, as measured with FastQC, is available in the supplementary material, Fig. S1.
We detected mutations in 4176 out of the 69 792 reads analysed (i.e., 5.98%), which passed the quality-filtering step, mapped to the correct reference sequence, and were located within the 50-nucleotide stretches after the primer annealing regions.
The number of mutations was consistent across species and the maximum number of mutations per read was three along different treatments (Fig. 3; Table 1). Accordingly, we found no effect of the number of cycles on the number of mutations (Fig. 3; slope ± SE = 0.0002 ± 0.0024, Z = 0.096, P = 0.923).
Table 1. Percentage of reads with 0, 1, 2, or 3 mutations relative to the reference sequence.
Library ID 0 1 2 3 Number of reads
CEP007 (Bathypolypus sponsalis) 94,055 5,772 0,167 0,004 22.105 CEP016 (Ancistroteuthis lichtensteini) 94,169 5,607 0,222 0 14.408
CEP023 (Todaropsis eblanae) 93,409 6,37 0,198 0,022 22.637
SEP006 (Sepietta oweniana) 95,019 4,839 0,14 0 10.642
200
FIGURE 3 Number of mutations relative to the reference sequence observed in each PCR treatment. (a) Bathypolypus sponsalis. (b) Ancistroteuthis lichtensteini. (c) Todaropsis eblanae. (d) Sepietta oweniana.
Discussion
In this work, we investigated whether increasing the number of PCR cycles during library preparation produces a higher number of mutations that could eventually impact the outcome of a high-throughput DNA barcoding study. We demonstrated that even for a high number of cycles (60, i.e., up to 55 cycles for PCR1 and five additional cycles for PCR2) the