Research Article |
Corresponding author: Alexander M. Weigand ( alexander.weigand@mnhn.lu ) Academic editor: Michael Ohl
© 2022 Fernanda Herrera-Mesías, Imen Kharrat Ep Jarboui, Alexander M. Weigand.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Herrera-Mesías F, Ep Jarboui IK, Weigand AM (2022) A metabarcoding framework for wild bee assessment in Luxembourg. Journal of Hymenoptera Research 94: 215-246. https://doi.org/10.3897/jhr.94.84617
|
Wild bees are crucial organisms for terrestrial environments. Their ongoing decline could cause irreparable damage to ecosystem services vital to plant reproduction and human food production. The importance of taking swift action to prevent further declines is widely acknowledged, but the current deficit of reliable taxonomic information complicates the development of efficient conservation strategies targeting wild bees. DNA metabarcoding can help to improve this situation by providing rapid and standardized mass identification. This technique allows the analysis of large numbers of specimens without the need for specialized taxonomic knowledge by matching high-throughput sequencing reads against public DNA barcode reference libraries. However, the validation of this approach for wild bees requires the evaluation of potential error sources on a regional scale. Here we analyzed the effects of three potential error sources on a metabarcoding pipeline customized for the wild bee fauna of Luxembourg. In an in silico study, we checked the completeness of the BOLD reference library for 349 species found in the country, the correspondence between molecular and morphological species delimitation for these taxa, and the amplification efficiency of three commonly used metabarcoding primer pairs (mlCOlintF/HCO2198, LepF1/MLepF1-Rev and BF2/BR2). The detection power of the pipeline was evaluated based on the species recovery rates from mock communities of known composition under variable DNA concentration treatments. The reference barcode library evaluation results show that 97% of the species have at least a single barcode in BOLD Systems (minimal length 196 bp) and that 85% of species have ≥ 5 barcodes in the public domain. The mlCOlintF/HCO2198 target fragment presented the highest coverage (77.94% of the species with full barcode sequences), followed by the target fragments of LepF1/MLepF1-Rev (77.65%) and BF2/BR2 (68.48%). Only 60% of the morphospecies presented a complete coverage of the prominent Folmer region (658 bp). The in silico amplification efficiency analysis shows that the BF2/BR2 primer pair has the best-predicted amplification performance, but none of the primer combinations evaluated can be expected to efficiently amplify all local wild bee genera. Finally, all species detection rates in the mock communities, except for the sample with the most discrepant DNA concentrations, were above 97%, with no significant differences found among treatments. These results indicate that the detection capacity of the pipeline is robust enough to be used for the reliable assessment of local wild bee biodiversity, even if species from various size categories are pooled together. Primer bias has a major effect on species detection, which can be acknowledged with a preliminary assessment of primer-template mismatch and sophisticated methodological designs (e.g. mock community controls, replicates). Overall, the metabarcoding pipeline here described provides a suitable tool for quick and reliable taxonomic identification of the regional wild bee fauna to aid conservation initiatives in Luxembourg – and beyond.
biodiversity, COI, conservation, molecular taxonomic tools, pollinators, primer evaluation
Bees (Hymenoptera, Anthophila) are important insect pollinators of Angiosperms with critical ecological functions and high economic value (
Despite the high importance of the ecological services provided by insects (
According to the European Red List of Bees, 1,101 species (57% of the total) are classified as “data deficient”(
DNA-based approaches and in particular DNA metabarcoding might act as a game changer for wild bee assessments (
In this study, we tested the suitability of a DNA metabarcoding approach customized for the assessment of the wild bee biodiversity of the Grand Duchy of Luxembourg. Early metabarcoding data already indicate a potential benefit of this approach for assessing Central European wild bees (
Our central aim here is to propose an effective DNA metabarcoding approach, which ultimately can provide robust data on the wild bee species diversity and distribution in Luxembourg, while preserving bulk samples of wild bees as vouchers. Although this at first glance contradicts the often-highlighted “time-and-cost” benefits of DNA metabarcoding approaches, this strategy will enable subsequent morphological investigations in case of peculiar or doubtful findings, and allows the integration of selected specimens in the reference collection of the National Museum of Natural History Luxembourg (MNHNL).
From a technical point of view, the following methodological aspects were evaluated:
The commonly used Cytochrome C Oxidase Subunit I (COI) barcode region has a high species discrimination power in Hymenoptera (Smith et al. 2008) and a barcoding library for Central European wild bee species has been available for some time (
Numerous primers targeting the COI region have been designed for or applied in metabarcoding studies of insects (e.g.
Wild bees are a phenotypically diverse pollinator group with considerable inter-specific variation in body size (
Thus, for this study, in silico, and in vitro approaches were combined to evaluate the sensitivity of a customized metabarcoding pipeline targeting regional wild bee species to common potential error sources: reference library completeness, primer and biomass-related bias. We compared the in silico performance of three popular metabarcoding primer pairs from the literature and then tested these expectations in the laboratory using mock communities. With this strategy, we aim to determine the best candidate for regional wild bee metabarcoding and to evaluate the suitability of the proposed workflow as a potential identification tool to be used in national conservation initiatives in Luxembourg.
Barcode coverage analyses were performed for three different metabarcoding primer pairs: BF2/BR2, mlCOlintF/HCO2198 and LepF1/MLepF1-Rev (Table
Primer name | Orientation | Fragment length (bp) | Sequence (5‘-3‘) | Reference |
---|---|---|---|---|
mlCOlintF | forward | 313 | GGWACWGGWTGAACWGTWTAYCCYCC |
|
HCO2198 | reverse | TAAACTTCAGGGTGACCAAAAAATCA |
|
|
BF2 | forward | 421 | GCHCCHGAYATRGCHTTYCC |
|
BR2 | reverse | TCDGGRTGNCCRAARAAYCA |
|
|
LepF1 | forward | 218 | ATTCAACCAATCATAAAGATATTGG |
|
MLepF1-Rev | reverse | CGTGGAAAWGCTATATCWGGTG |
|
|
LCO1490 | forward | 658 | GGTCAACAAATCATAAAGATATTGG |
|
HCO2198 | reverse | TAAACTTCAGGGTGACCAAAAAATCA |
|
Since the DNA barcode sequences deposited in the BOLD reference library may cover different regions of the COI gene, the coverage of the specific amplicon of each primer pair was individually checked for the local wild bee fauna. Additionally, the coverage of the prominent COI Folmer region (i.e. the traditional 658 bp long DNA barcode fragment used for animal barcoding) was also checked.
For this purpose, the R package PrimerMiner version 0.18 (
From this original dataset, identical sequences were automatically reduced to singletons and clustered into Molecular Operational Taxonomic Units (MOTUs) based on a 3% sequence similarity threshold to reduce the bias introduced by unequal representation of sequences in the database (
To further validate the correspondence of each MOTU consensus sequence with the taxonomic data from the BOLD record details of the original barcodes, the resulting fasta files were compared against BOLD Systems using BOLDigger (
Even if only sequences uploaded under regional wild bee species binomial names were downloaded, the best BOLD match for some MOTU consensus sequences was not a wild bee present in Luxembourg. For example, OTU 416 consists of a single sequence (BOLD Process ID NOBEE085-09). Even if this sequence was downloaded as Lasioglossum fratellum, the available barcode matches the mosquito Aedes canadensis. Problematic MOTUs such as this one were deleted from the dataset, keeping only consensus sequences matching Luxembourgish wild bees up to species level.
These validated MOTU consensus sequences representing each target species were subsequently analyzed thus to determine their COI coverage for all three metabarcoding primer-pairs (Suppl. material
The PrimerMiner package was used to perform an in silico evaluation of the amplification efficiency of each metabarcoding primer based on the dataset of validated MOTU consensus sequences. Scores for primer-template mismatches were assigned based on position and mismatch type under default settings, using the tables included in the package. These scores were summed up to calculate individual penalty scores for each primer or primer pair (
Consensus sequences were visualized with Mesquite v3.6 (
To compare the overall performance of the primers across different taxonomic groups, mean penalty scores were calculated by averaging the penalty scores of all the MOTUs within each wild bee genus. Mean values were transformed with a Tukey’s Ladder of Powers transformation (λ = 0.375) to correct for skewness caused due to the presence of outliers (Suppl. material
To determine whether there were significant differences among the primer pairs regarding their mean in silico scores across the wild bee genera, the transformed values were compared with a weighted One-Way ANOVA, using the number of MOTUs in each genus as weights and the primer pairs as the grouping variable. A Tukey Honest Significant Differences (Tukey’s HSD) test was used to calculate pairwise-comparisons between the mean scores of the primer pairs. Both tests were performed in the R package “stats”.
Wild bees were sorted from collections taken in spring and summer 2019 across Luxembourg and the nearby Federal State of Rhineland-Palatinate (Germany) using sweep netting, opportunistic sampling of dead specimens and different kinds of passive trapping (pan traps, vane traps and malaise traps). Wild bee specimens were morphologically identified to the level of species or genus using the taxonomic keys of
In the case of the wild bees from Luxembourg, samples were collected over several days using traps filled with 80% propylene glycol and soap or soapy water (
Wild bees collected in Rhineland-Palatinate were stored in 80% ethanol after sampling, pinned by the end of the field season and kept dry in a drawer. Except for Ceratina chalybea, all the specimens from Germany corresponded to species present in Luxembourg (Suppl. material
For validation purposes, as well as to check the general suitability of the obtained tissue material for molecular analysis, all specimens were individually Sanger-sequenced using the Folmer primer pair LCO1490/HCO2198. All DNA extractions were performed by grinding a single mid-leg of each specimen in a Retsch TissueLyser Mixer Mill model MM200 using 3 mm beads made of either plastic (41 specimens) or metal (2 specimens), as described in the laboratory protocols of
The final assortment of specimens used for the mock community design included 43 adult females. Of them, 28 specimens (25 fresh and 3 dry) were used for “concentration adjustment” mock communities and 14 specimens (7 fresh and 7 dry) for “regular” bulk extraction mock communities, plus a single dry specimen that was used for both treatments (Suppl. material
To study the in vitro effect of primer bias and the impact of biomass differences on the metabarcoding pipeline detection capacity, three experimental set-ups (“mock communities”) were designed (Suppl. material
The first mock community (homogeneous, HOMO) was arranged by pooling 10 ng of DNA of each species. In two cases (Hylaeus nigritus and Lasioglossum morio), just 5 to 6 ng were added due to a lack of further tissue material. This roughly homogeneous treatment provides a theoretically biomass-related bias free scenario, in which differences in the detection rates are more likely to be caused by factors such as primer bias and the stochasticity of the PCR process.
The second mock community (heterogeneous, HETE) was assembled by pooling 1 ul of variable DNA concentration obtained from a single mid-leg from each specimen. This setup provides information regarding the detection limits of the pipeline when DNA from one leg per specimen is analysed and as such, how species are recovered by metabarcoding when the amount of species-specific template DNA is unequal in the PCR. In this case, false negatives are expected to be caused by biomass-related bias under unaltered conditions.
The third mock community (gradient, GRAD) uses the same specimens as in the two previous mock communities, but modifying their concentrations based on the concentration categories previously assigned to each specimen. The DNA of the bees from the “S” category was diluted with buffer in a proportion of 1:100. Six bees from the “M” category were randomly selected and their isolated DNA was diluted to approach concentrations similar to the bees from the “S” category. The bees belonging to the “L” category were not modified. This treatment creates a gradient of DNA concentrations to test the effectiveness of the pipeline under variable DNA concentrations, indicating the impact of the biomass-related bias under more extreme conditions.
Additionally, two “regular mock communities” (RmockA and RmockB) were analyzed for reference purposes (Suppl. material
Three PCR replicates for each mock community (i.e. of HETE, HOMO, GRAD, RmockA and B) were set-up and sequenced. The 16 samples (15 mock community replicates plus a negative control) were amplified using the primer set showing the best performance in the in silico evaluation. A two-step PCR protocol was used. The first PCR reaction consisted of 1× Master Mix (GoTaq G2 Hot Start Colorless), 0.5 uM of each primer, 25 ng of DNA and Nuclease-Free H2O to a final volume of 25 ul. For the second PCR, 1 ul of the amplicon (without cleanup) was used as template and the amount of reactives was modified to a final volume of 50 ul. Both PCRs were run on an Eppendorf Mastercycler nexus eco thermocycler using thermal profiles based on the ones described in
Dereplication of the samples in the same sequencing run based on inline tag combinations was done using scripts (“Demultiplexer”) developed by the Aquatic Ecosystem Research Group of the University of Duisburg-Essen. Reads that were unmatched after this module were mapped against PhiX to check for the presence of virus genome. The demultiplexed data was further processed using the JAMP (“Just Another Metabarcoding Pipeline”) R package (https://github.com/VascoElbrecht/JAMP). This package consists of a modular metabarcoding pipeline that provides extended quality filtering options and automatically generated summary statistics, integrating different functions from external programs to produce the output of the different steps (
Taxonomic sorting was performed by comparing the resulting MOTU fasta files against sequences stored in BOLD Systems using BOLDigger. The same thresholds for taxonomic identification used in the in silico evaluation were used here. The resulting data were pruned using TaxonTableTools (
The detection capacity of the metabarcoding pipeline was evaluated based on the percentage of intentionally pooled species retrieved from each treatment (“detection rates”). Kruskal-Wallis and Wilcoxon rank sum tests (both default R package “stats 3.6.2”) were used to determine whether there were significant differences among the HETE, HOMO and GRAD mock communities regarding the average read numbers per species obtained after combining the sequencing results of all replicates.
Of the 349 wild bee species evaluated, 96.84% presented at least one COI barcode sequence available in the BOLD Systems public library (Fig.
The 7,317 de-replicated sequences considered (i.e. after removing identical sequences from the set of 11,810 downloaded sequences) were clustered into 558 MOTUs. From them, the consensus sequences of 433 were included in the final dataset based on the combined assessment of their 20 top matches using BOLDigger, and supporting their identification as local wild bee taxa.
Barcode coverage of the regions targeted by the three considered metabarcoding primer pairs presented little variation (Fig.
Only 39.05% (132/338) of the morphospecies considered in the final dataset fulfilled the expected correspondence of one MOTU per Linnaean species (Suppl. material
The results of the in silico analysis of the metabarcoding primers were first sorted by genera to assess their performances across different taxonomic groups of interest. When all MOTUs are considered, the expected amplification success rates of the individual primers varied across 25 wild bee genera. However, in the majority of the cases the combined outcomes of the metabarcoding primer pairs were higher or equal to the ones of the standard Folmer barcoding primers (Suppl. material
The BF2/BR2 primer pair had the highest mean in silico amplification success rate (86.52% of the species with binding site sequence data were expected to correctly amplify), while LepF1/MLepF1-Rev (16.88% of the species) and LCO1490/HCO2198 (17.65% of the species) had the lowest success rates. The primer pair mlCOIintF/HCO2198 showed an intermediate in silico performance (amplification is expected successful for 37.50% of the species). The expected amplification success rates of the primer pairs mlCOIintF/HCO2198 and BF2/BR2 were identical for 48.57% of the wild bee genera considered. However, BF2/BR2 consistently outperforms mlCOIintF/HCO2198 in 83.33% of the remaining cases, while mlCOIintF/HCO2198 only shows higher amplification success rates than BF2/BR2 in three genera: Nomada, Heriades and Melitta.
Regarding the average penalty scores obtained from all the MOTUs within each wild bee genus, the transformed scores for BF2/BR2 were within the accepted values of amplification success, with the exception of the mean penalty scores of Anthophora, Eucera, Halictus, Melitta and Nomada (Fig.
Distribution of transformed mean penalty scores by primer pair and genus. The sizes of the diamonds that represent the mean of the genera are proportional to the number of MOTUs in each genus. The overall mean value and standard deviations for each primer pair are shown in black. Means with a penalty above the red line (penalty score of 100) are considered in silico performance failures. The LCO1490/HCO2198 primer pair is indicated for reference purposes only, and not included in the statistical analysis.
The results of the in silico analysis vary slightly when only the MOTU with the best score for a distinctive morphospecies (in the case of “multi-MOTU” species) is considered as an outcome (Suppl. material
Overall, multi-MOTU morphospecies presented congruent results for the same primer pair, despite variable penalty scores for each of their MOTUs. The exception to this were four species (A. plumipes, B. terrestris, S. albilabris, and S. geoffrellus), which presented MOTUs with scores both above and below the threshold for one or more primer combinations. Bombus terrestris presented discrepancies for all primer pairs but LepF1/MLepF1-Rev. Two species (A. plumipes, and S. geoffrellus) showed discrepancies for mlCOIintF/HCO2198 and BF2/BR2. Finally, S. albilabris only presented discrepancies for mlCOIintF/HCO2198.
A total of 6,902,568 high quality reads from the original 11,701,736 read pairs remained after trimming and quality filtering (Short Read Archive bioproject number PRJNA867321). The percentage corresponding to PhiX found in the unassigned reads (64% of the 2,251,231 reads in “no match”) was in agreement with the procedures of the sequencing center. From the original 328 MOTUs generated, 118 MOTUs remained after the 0.01% abundance filters (Suppl. material
For species detection rate assessment within mock communities, 53 Hymenoptera MOTUs – identified to species level and present in at least two replicates – were considered (Suppl. material
In the main three experimental set-ups (HETE, HOMO, GRAD), sequence reads of Andrena cineraria dominated the results, with over 30% of the average reads in all three mock communities and replicates (Fig.
Proportion of sequencing reads per species in each mock community. Each community was assembled from the DNA of the same 29 specimens (25 fresh ones and 4 dry ones), all of them belonging to different species; the HOMO community was based on equimolar pools of the individual DNA extractions; the HETE community was assembled by pooling 1 ul of isolated DNA from a single leg from each specimen and the GRAD community was made by modifying the original concentrations of each species based on their concentration categories in order to exaggerate existing biomass differences. Relative read numbers were obtained by averaging absolute read numbers from all three replicates of each mock community and then correcting by the total number of reads in each treatment. A significant difference was found only between GRAD and HOMO (see text for details).
The results of the Kruskal-Wallis rank sum test and Wilcoxon rank sum test indicate the presence of significant differences in average read numbers per species only between the GRAD and the HOMO mock community at a 95% confidence level (Kruskal-Wallis x2(2) = 8.12, p = 0.017; Wilcoxon rank sum test with Bonferroni correction p < 0.05 only for GRADxHOMO comparison). No significant differences in average read numbers per species were found between the HETE mock community and the two other treatments. Data is not normally distributed but homoscedastic (Shapiro-Wilk test: W = 0.46, p = 2.597e-16, Levene’s Test: F (2) = 0.004, p = 0.96). Also, it is important to mention that nine morphospecies were represented by multiple MOTUs in the metabarcoding results, despite only one specimen being pooled in the mock community mixtures (Suppl. material
Non-Hymenoptera metazoan MOTUs found in the samples, such as Parus major and Nephrotoma appendiculata, are likely the result of contamination with organisms present in the field. The fungi and plant DNA found in the samples are also likely to be due to carry over from the field or rather contaminations with materials from other research groups at the laboratory of the MNHNL.
The availability of reference barcode sequences is a central requirement when evaluating the performance power of a DNA-based identification method for a certain taxonomic group, geographical region or environment (
With the applied clustering threshold value of 3% sequence similarity, only 39% (132/338) of the evaluated wild bee species met the expectation of one MOTU per morphospecies based on a Linnaean species delimitation concept. In 29% of the cases, a morphospecies split into multiple MOTUs, while in 9% of the cases sequences from multiple morphospecies lumped together. Additionally, 23% of the cases showed a variable combination of both effects, including multiple barcodes merging into mixed species MOTUs. For example, the COI barcodes downloaded for Andrena bimaculata split into two MOTUs. Three barcodes clustered together with Andrena tibialis in MOTU 203, while the remaining five formed a mixed species MOTU (MOTU 345). These deviations are in agreement with the incongruences described by
It is worth noticing that at least part of the splitting and lumping situation observed here is potentially the result of sequences uploaded under incorrect species annotation into BOLD. Outstanding examples can be found in the DNA barcode material of Nomada striata, which split into three MOTUs and then lumped with different morphospecies in each mixed MOTU (MOTU20: N. ruficornis and N. fulvicornis; MOTU259: N. alboguttata; MOTU310: N. zonata). Furthermore, the BOLD_BIN ABY7961 of N. striata not only includes annotated specimens of this species, but N. villosa (4 specimens) and N. symphyti (1) -two species so far not reported for Luxembourg (
Even in cases when a reliable reference barcode library is available for the target taxa, primer bias can lead to false negatives and/or reduced detection rates (
COI metabarcoding approaches rely on degenerate primers such as BF2/BR2 to maximize taxon recovery, as this allows matching at variable binding sites and the amplification of as many (target) input sequences as possible (
In principle, it must be highlighted that a highly degenerate primer pair can generally perform well in an in silico analysis, but might mal-perform in vitro due to the co-amplification of non-target taxa.
In our study, we tested the predictions of the in silico analysis by sequencing five distinct mock communities using our best performing primer pair (i.e. homogeneous, heterogeneous, gradient and two regular mock communities). The final detection rates of input species for the HOMO and HETE mock communities were the same (97%), while the detection rate of the GRAD mock community was considerably lower (72%). The missing species in the HOMO mock community (T. byssina, “L“ category) likely represents an artifact, considering that a 7-year-old museum sample with unknown initial preservation conditions was used. This hypothesis is supported by the fact that the fresh specimen of T. byssina used for bulk extraction in the regular mock communities was found in all replicates. DNA degradation over time in insect museum samples is a well-known phenomenon and models have been developed to characterize the level of molecular damage (
Since a single specimen per species was pooled in our mock communities, the proportion of sequence reads per species should be similar in all experimental set-ups, unless error sources (i.e primer mismatch, biomass bias, etc) were biasing the relative read abundances, favoring some taxonomic groups over others (
In all three main mock communities (HETE, HOMO, GRAD), 30% to 45% of all reads corresponded to A. cineraria, a species that has a considerable biomass (pre-PCR DNA concentration: 48.8 ng/ml, dry weight: 31.9 mg) and a very low primer-template mismatch (penalty score: 18.32). In the GRAD and HETE mock communities, biomass-rich species from the “L” category tended to have higher overall read numbers. However, no significant differences were found in detection rates or in read numbers per species among the HOMO and the HETE mock communities. Therefore, there is no evidence suggesting that correcting for biomass differences (e.g. size-sorting) has a significant effect in the general assessment outcome of our wild bee metabarcoding approach, at least under the conditions here proposed. Hence, isolating a single leg from each wild bee specimen should be sufficient for its detection in an average bulk sample under the described sequencing depth. Nevertheless, it is important to acknowledge that challenging bulk sample mixtures consisting of few small-sized taxa and an overabundance of large-sized bees might result in further problems not evaluated in this study.
In summary, the comparison between the results of the HETE and the HOMO mock communities suggest that the differences found in the proportion of read numbers per species are likely due to differential amplification resulting from primer bias. In the case of the GRAD community, the proportion of input species read numbers was not significantly different from the HETE mock community and the overall detection rate was only mildly affected. Overall, these results suggest that primer bias was the principal driver behind the unequal representation of species in the mock communities, with biomass differences only adding to the effect as a secondary factor.
The results found in the HETE mock community suggest a general trend of biomass-rich bee taxa to have higher read numbers. However, it is unlikely that this information can be used to retrieve accurate quantitative results regarding species biomass or specimen abundances. If PCR-based approaches are used in a metabarcoding set-up, the effect of differential amplification efficiency would make extremely difficult to estimate any of these parameters based on the final read numbers (
It is noteworthy that multiple MOTUs from the same species were found among the mock community metabarcoding results, despite a single specimen being used for the design. The presence of multiple MOTUs may have been caused by the effect of mitochondrial heteroplasmy or by nuclear copies of mtDNA (numts). The presence of multiple mitochondrial DNA haplotypes coexisting in a single organism remains a potential problem for the use of DNA (meta)barcoding as a molecular taxonomic tool (
Alternatively, these peculiarities in the dataset may be explained by nuclear mitochondrial DNA (NUMT) sequences. NUMTs are the result of non-translated and non-transcribed regions from the mitochondrial DNA transferred to the nuclear genome, which can be amplified if effective primer binding sites are still existing (
Further studies should determine the presence and the potential impact of heteroplasmy and NUMTs in the effectiveness of barcoding identification of potentially heteroplasmic wild bees of both sexes, as well as the impact of multiple MOTUs originating from single specimens on diversity estimates.
The in silico and in vitro analyses highlight the influence of primer bias on the performance of the proposed metabarcoding approach. However, it is possible to reduce its effect by selecting the most suitable primer combinations for the taxa of interest. This can be achieved by comparing the in silico amplification efficiency of primer pair candidates and then experimentally testing the capacity of the best performing pairs in the laboratory. Among the metabarcoding primer pairs here evaluated, no combination can be expected to correctly amplify all wild bee taxa and some genera in particular are predicted to be prone to amplification problems, ultimately translating into a higher probability of producing false negatives. Therefore, primers have to be evaluated on a case-by-case basis against the target taxa at hand. Nevertheless, from the combinations available, the highly degenerate primer pair BF2/BR2 provided the best results for our regional wild bee fauna, with over 85% of available MOTUs and morphospecies expected to correctly amplify when this primer pair is used. Our experimental set-ups support these results as over 97% of the species were retrieved from four out of five mock community trials using the metabarcoding approach that incorporates this primer pair.
A deficiency of DNA barcodes in the public reference library BOLD does not seem to be a major error source for the identification of the regional wild bee species using molecular taxonomic tools. In total, 97% of the currently known morphospecies in Luxembourg present at least one barcode in BOLD, and 85% of them can be considered well covered. However, for the ~30% of the taxa whose identification might be obscured due to lumping with other wild bee species, the definition of potentially diagnostic barcodes or a multi-marker DNA metabarcoding approach (i.e. incorporating nuclear markers) may be considered as alternative strategies to discriminate lumped species. Finally, new sampling campaigns and collection revisions are likely to provide material to fill the few remaining gaps in the database, as well as to produce barcodes originating from regional specimens.
The results of the mock community experiments indicate that the overall output of the metabarcoding pipeline is expected to be robust, despite biomass differences among the wild bee specimens. Even if these biomass differences affect the number of reads per taxonomic group, the detection rates of input species (i.e. taxalists) remained stable, with the exception of the gradient treatment. Biomass-related bias is likely to have a higher impact under more extreme scenarios, where the size difference of the pooled specimens is higher (e.g. in Malaise traps). Moreover, due to the small numbers of specimens included in this analysis, a higher effect of this type of bias in bulk samples combing numerous biomass-rich specimens and few biomass-low ones cannot be ruled out. However, strategies can be used to compensate for this issue under reasonable conditions. In general, processing separately the fraction of smallest wild bee specimens in a sample should provide an appropriate countermeasure to avoid false negative results due to biomass differences, especially for genera with negative primer bias (e.g. Nomada spp.). Moreover, as the proposed metabarcoding pipeline only uses one leg for bulk extraction, the voucher specimens can be traced back for complementary analysis with Sanger sequencing or traditional morphotaxonomy, thus to provide identifications validated by multiple approaches.
Even if only a few specimens were used here to set up the mock community trial, the layout of the metabarcoding pipeline in this study can be used to analyze much larger samples. Sequencing costs for a HTS run on an Illumina platform remain stable independently of how many individuals are included in each bulk sample, and the BF2/BR2 tagging primer combinations allow the tagging of up to 288 samples within the same run (
Overall, our customized metabarcoding pipeline represents a promising alternative taxonomic identification tool to analyze large numbers of wild bees in the context of local conservation biology initiatives. As such, the further improvement of this technique would benefit projects dealing with many specimens to be swiftly analyzed, as well as restricted time frames and limited access to taxonomic specialists.
We thank Joana Margarida Teixeira Lopes, Claude Kolwelter and Dylan Thissen for their help during the wild bee fieldwork in 2019, as well as Sophie Ogan for providing the samples from the Rhineland-Palatinate state used in this work. Nico Schneider is acknowledged for his feedback on the taxonomic identification of wild bees and Gerhard Haszprunar for his collaboration. We would also like to thank Rashi Halder from the Luxembourg Centre for Systems Biomedicine in Belval for her work on the high-throughput sequencing of the library and the Aquatic Ecosystem Research Group of the University of Duisburg-Essen for their help with the bioinformatic analysis. We also thank the Zoology department at the Musée national d’histoire naturelle Luxembourg (MNHNL), especially Amanda Luttringer and Stéphanie Lippert for their useful comments on the manuscript.
Finally, we would also like to thank Christophe Praz for the constructive feedback during the revision of this manuscript. Financial support was received under the Bauer and Stemmler foundations programme “FORSCHUNGSGEIST! Next Generation Sequencing in der Oekosystemforschung”. Collection permit was issued by the Ministère de l’Environnement, du Climat et du Développement durable (MECDD) Luxembourg.
In silico penalty scores, barcode coverage and congruency analysis of the wild bee species of Luxembourg
Data type: tables (excel file)
Explanation note: In silico scores, barcode coverage and species delimitation congruence of the 349 wild bee species from Luxembourg evaluated in this study.
Mean wild bee genus penalty scores sorted by primer pair
Data type: table (excel file)
Explanation note: Mean in silico penalty score and transformed mean penalty score (T-Score) within each wild bee genus considered in this study. Number of MOTUs used as weights in the ANOVA analysis are also given.
Summary and metadata of the wild bee samples from Luxembourg and Germany used in the mock communities
Data type: table (excel file)
Explanation note: Information sheet regarding the wild bee specimes used in the mock communities. Morphological identification, molecular identification, % of identity with their best BOLD match, concentration category and set-up in which they were used are included.
Tagged primer combinations used in the mock community experiment
Data type: table (excel file)
Explanation note: Primer tags. Combinations from Elbrecht V, Leese F (2017) Validation and development of COI metabarcoding primers for freshwater macroinvertebrate bioassessment. Frontiers in Environmental Science 5: 11. Specifications about the tag combinations used in each mock community replicate are also given.
Overiew of MOTU data per mock community
Data type: table (excel file)
Explanation note: Number of reads per MOTU found in each mock community after applying the 0.01% filters. Includes non Hymenoptera MOTUs.
Mock community metabarcoding results
Data type: table (excel file)
Explanation note: Sequencing results of the three PCR replicates of each mock community. Only Hymenoptera MOTUs identified to species level are considered. Number of input species detected in each set-up is also given.
Species delimitation congruence, comparing Linnaean species assignment of the original sequences retrieved from BOLD v/s results of MOTU clustering
Data type: Image (JPG file)
Explanation note: The number of species presenting incongruent clustering and the type of anomaly is shown in each case. Cases in which the incongruence may be due to sequences uploaded under incorrect species names in the database are shown in light gray.
In silico primer performance evaluation using PrimerMiner with MOTU data sorted by wild bee genus
Data type: Image (JPG file)
Explanation note: Amplification success rates are shown for each genus (dark yellow areas = successful cases with a penalty score below 100; light yellow areas = failed cases, khaki colored areas = missing information). Missing data was excluded from calculations. Mean amplification success rates based on the whole dataset are indicated at the bottom.
Primer pair amplification success rates based on Linnaean species
Data type: Image (JPG file)
Explanation note: Squares correspond to distinct morphospecies. Dark yellow areas represent species with a penalty score below 100, light yellow areas represent species above the threshold and striped squares represent cases of discrepancy (i.e. having MOTUs in both categories). Only sequences with full length target regions were considered.