Motivation: Massively parallel phylogenetic analyses make it possible to reconstruct a phylogenetic tree for every gene in a genome, typically from the set of potential homologues detected via BLAST or a similar tool. However, if the number of hits is too high, the dataset should be reduced to a tractable size, preferably without human intervention. Currently available methods are error-prone on at least some datasets, and some of them also depend on additional data that may not be available.
Results: We propose a distance-based algorithm, termed Distant Joining, for phylogenetic dataset reduction that requires no input besides the sequences themselves. It was shown to be robust to both complex evolutionary histories and large datasets. We also discuss the assumptions and limitations of different sequence sampling approaches and provide guidelines for selecting a method for a phylomic pipeline.
Availability: A proof-of-concept Python implementation is available at https://github.com/SynedraAcus/sampler under the terms of the CC-BY-4.0 license. Please check the README for dependencies.
Supplementary information: Supplementary data are available at Journal of Bioinformatics and Genomics online.