This website, and SingleM itself, is the result of a collaboration between (in no particular order) Ben Woodcroft, Rossen Zhao, Mitchell Cunningham, Joshua Mitchell, Samuel Aroney, Linda Blackall, Gene Tyson, Raphael Eisenhofer and Antton Alberdi.
Most of us are at the Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology (QUT), Translational Research Institute, Woolloongabba, Australia. Mitchell Cunningham and Linda Blackall are at the School of BioSciences, The University of Melbourne, Victoria, Australia. Raphael Eisenhofer and Antton Alberdi are at the Centre for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Denmark.
The overall citation for SingleM/Sandpiper:
Woodcroft, Ben J., Samuel TN Aroney, Rossen Zhao, Mitchell Cunningham, Joshua AM Mitchell, Linda Blackall, and Gene W. Tyson. SingleM and Sandpiper: Robust microbial taxonomic profiles from metagenomic data. bioRxiv (2024): 2024-01. https://doi.org/10.1101/2024.01.30.578060
The microbial fraction (SMF) mode of SingleM:
Eisenhofer, Raphael, Antton Alberdi, and Ben J. Woodcroft. Large-scale estimation of bacterial and archaeal DNA prevalence in metagenomes reveals biome-specific patterns. bioRxiv (2024): 2024-05. https://doi.org/10.1101/2024.05.16.594470
Community profiles for all runs are available for download from Zenodo. The data behind older versions of Sandpiper can also be downloaded there.
Please feel free to get in touch with us if you have any questions or comments, or want data of other kinds.
The data underlying Sandpiper was generated using the SingleM pipeline, applied to public metagenome datasets listed in the NCBI SRA that were designated as metagenomic, or derived from "metagenomic" organisms such as "soil metagenome". This list of public metagenomes was generated on .
SingleM is a tool to find the abundances of discrete operational taxonomic units (OTUs) directly from shotgun metagenome data, without heavy reliance on reference sequence databases. It operates by scanning for reads that, when translated into amino acids, cover highly conserved regions of single-copy marker genes (35 bacterial, 37 archaeal, 59 total). The nucleotides from each read that cover these conserved gene sections are then clustered into OTUs. Importantly, this clustering happens before the taxonomy of the cluster is determined, setting it apart from methods which rely more heavily on reference databases. With SingleM, multiple OTUs can be assigned to one taxon, indicating e.g. strain heterogeneity within a species, or multiple families from a novel taxon.
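As a rough illustration of the idea that OTUs are formed from the read sequences themselves, before any taxonomy is assigned, the toy sketch below groups identical nucleotide windows and counts them. It is a deliberate simplification and not SingleM's actual implementation: the function name `group_windows_into_otus`, the example sequences and the use of exact sequence identity as the grouping rule are all assumptions made for illustration.

```python
from collections import Counter


def group_windows_into_otus(window_sequences):
    """Group identical nucleotide windows into toy OTUs.

    Hypothetical helper: each distinct window sequence is treated as one
    OTU, and the number of reads carrying it is recorded. No taxonomy is
    consulted at this stage.
    """
    counts = Counter(window_sequences)
    # most_common() orders the OTUs from most to least frequently observed
    return [
        {"sequence": sequence, "num_hits": num_hits}
        for sequence, num_hits in counts.most_common()
    ]


# Made-up windows, as if extracted from reads covering one marker gene
windows = ["ATGGCA", "ATGGCA", "ATGGCT", "ATGGCA", "TTGACC"]
otus = group_windows_into_otus(windows)

for otu in otus:
    print(otu["sequence"], otu["num_hits"])
```

Taxonomy would only be attached to each OTU afterwards, which is why several OTUs can end up assigned to the same taxon.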
The OTU tables generated for each marker gene are then combined ("condensed") into a single taxonomic profile, representing the read coverage of each taxon in the metagenome. From this read coverage, relative abundance is found by dividing the read coverage of each taxon by the total read coverage of the metagenome. This relative abundance is then used to generate the Sandpiper visualisations.
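The relative abundance calculation described above can be sketched in a few lines. This is a minimal example assuming the condensed profile is available as a mapping from taxon name to read coverage; the function name `relative_abundance` and the taxon names and coverage values are hypothetical.

```python
def relative_abundance(coverage_by_taxon):
    """Divide each taxon's read coverage by the total read coverage."""
    total_coverage = sum(coverage_by_taxon.values())
    if total_coverage == 0:
        raise ValueError("profile has no coverage, so abundances are undefined")
    return {
        taxon: coverage / total_coverage
        for taxon, coverage in coverage_by_taxon.items()
    }


# Made-up condensed profile: taxon -> read coverage
profile = {"Taxon A": 6.0, "Taxon B": 3.0, "Taxon C": 1.0}
abundances = relative_abundance(profile)

for taxon, fraction in abundances.items():
    print(f"{taxon}: {fraction:.1%}")
```

By construction the resulting fractions sum to 1 across the profile.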
These raw SingleM taxonomic profiles, which contain OTUs derived from the 59 genes, are available for download from each run's page. However, for ease of interpretation and search, runs on this website are usually represented as a 'condensed' profile. These condensed profiles are a unified version of the profiles derived from each marker gene, so there is only one profile to inspect (instead of 59), though condensed profiles collapse the OTUs from each taxon into a single group.
For more complicated analyses, such as searching for OTUs that cannot be easily isolated through their taxonomy (e.g. if they are novel), a more bespoke search procedure might be more appropriate. These kinds of analyses cannot currently be done on the Sandpiper website, so in such cases please get in touch with us.
To estimate the number of reads in each metagenome which are either bacterial or archaeal, the microbial_fraction (SMF) mode of SingleM was used. SMF bases its estimate on the coverage of each taxon in the SingleM community profile, the genome lengths of those taxa, and the total size of the metagenome. It does not rely on mapping reads to non-microbial reference genomes.
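One way to read how these three inputs could combine is sketched below: multiplying each taxon's coverage by its genome length gives an estimate of the bases belonging to that taxon, and the sum over all taxa is divided by the total size of the metagenome. This is a simplified interpretation under stated assumptions, not SingleM's implementation; the function name `estimate_microbial_fraction`, the tuple layout and all numbers are hypothetical.

```python
def estimate_microbial_fraction(taxa, metagenome_size_bases):
    """Estimate the fraction of a metagenome that is bacterial or archaeal.

    taxa: iterable of (coverage, genome_length_bases) pairs, one per taxon.
    metagenome_size_bases: total number of sequenced bases in the metagenome.

    Assumption for illustration: coverage * genome length approximates the
    number of sequenced bases that came from that taxon.
    """
    if metagenome_size_bases <= 0:
        raise ValueError("metagenome size must be positive")
    microbial_bases = sum(
        coverage * genome_length for coverage, genome_length in taxa
    )
    return microbial_bases / metagenome_size_bases


# Made-up profile: two taxa in a metagenome of 100 million bases
example_taxa = [
    (10.0, 4_000_000),  # 10x coverage of a 4 Mbp genome
    (5.0, 2_000_000),   # 5x coverage of a 2 Mbp genome
]
fraction = estimate_microbial_fraction(example_taxa, 100_000_000)
print(f"Estimated microbial fraction: {fraction:.0%}")
```

Note that nothing in this calculation requires a host or other non-microbial reference genome, which matches the description above.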
Each metagenome was predicted to be either "eukaryote host-associated" or "ecological" by a machine learning model (XGBoost, achieving ~93% accuracy), using the taxonomic profile as the input data and the "organism" metadata field recorded at NCBI as the target for prediction. Host-associated samples are those recorded or predicted to fall under the organismal metagenome NCBI taxonomy; ecological samples are all others. We anticipate that predictions based on microbial community profiles will become an increasingly important method for characterising microbiomes in the future, and we hope that future versions of this website will provide more detailed predictions about each community.
Sandpiper uses data from the NCBI SRA and associated databases to add metadata to each sequence dataset. This metadata is often incorrect, vague or missing. If you notice such a problem, we are collecting corrections in a public repository. Any corrections submitted there (or submitted directly upstream e.g. to NCBI) are appreciated.
Development of Sandpiper and SingleM was funded through Australian Research Council Future Fellow (#FT210100521), Discovery Project (#DP230101171) and Discovery Early Career Research Award (#DE160100248) grants, as well as the EMERGE National Science Foundation (NSF) Biology Integration Institute (#2022070) and Genomic Science Program of the United States Department of Energy (DOE) Office of Biological and Environmental Research (BER), grants DE-SC0004632, DE-SC0010580 and DE-SC0016440. Cloud computing was generously contributed by Amazon Web Services (AWS) and Google Cloud (GCP).
The sandpiper background image on the front page was derived from an image by Frans Vandewalle (CC-NC).