Motivation: Massively parallel phylogenetic analyses make it possible to reconstruct a phylogenetic tree for every gene in
a genome, typically using a set of potential homologues detected by similarity search against reference databases via
BLAST or BLAST-like algorithms. However, the number of similarity hits between a query sequence and its
targets is often too high, so it may be necessary to reduce the number of sequences for downstream phylogenetic analyses.
Currently available automatic and semi-automatic methods for dataset reduction are error-prone and may depend
on additional metadata, whereas reduction “by hand” is labour-intensive and becomes intractable once
multiple genes have to be analysed.
Results: We propose a distance-based algorithm, termed Distant Joining (DJ), for phylogenetic dataset reduction that
requires no input other than the sequences being analyzed. DJ was shown to robustly subsample sets of sequences
from large and complex sequence data sets with minimal loss of dataset divergence. In the context of our study, the
underlying assumptions and limitations of different subsampling approaches are discussed, and guidance is given on
selecting a subsampling method when building phylogenomic pipelines.
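As a rough illustration of what distance-based dataset reduction can look like in general, the sketch below greedily keeps sequences that are far from those already selected (a max-min, or farthest-point, heuristic). It is not the Distant Joining algorithm, whose details are not given in this abstract. The function names `p_distance` and `maxmin_subsample`, the use of uncorrected p-distance, and the toy alignment are all assumptions made for the example.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def maxmin_subsample(seqs: dict, k: int) -> list:
    """Greedily pick k sequence IDs that are mutually distant.

    Hypothetical helper: a generic farthest-point heuristic,
    not the published Distant Joining procedure.
    """
    ids = list(seqs)
    if k >= len(ids):
        return ids
    if k < 2:
        raise ValueError("k must be at least 2")

    # Pairwise distances, keyed by the unordered pair of IDs.
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(ids, 2)
    }

    # Seed the sample with the most distant pair.
    first, second = max(combinations(ids, 2), key=lambda p: dist[frozenset(p)])
    chosen = [first, second]

    # Repeatedly add the sequence whose nearest chosen neighbour is farthest.
    while len(chosen) < k:
        remaining = [i for i in ids if i not in chosen]
        best = max(
            remaining,
            key=lambda i: min(dist[frozenset((i, c))] for c in chosen),
        )
        chosen.append(best)
    return chosen


# Toy alignment (made-up data for illustration only).
toy = {"a": "AAAA", "b": "AAAT", "c": "TTTT", "d": "AATT"}
print(maxmin_subsample(toy, 3))
```

A heuristic of this kind needs only the sequences themselves, which is the property the abstract emphasises. A real pipeline would likely replace the p-distance with a corrected evolutionary distance.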
Availability: A proof-of-concept Python implementation is available at https://github.com/SynedraAcus/sampler under
the terms of the CC-BY-4.0 license.
Supplementary information: Supplementary data are available at Journal of Bioinformatics and Genomics online.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.