FUSTr: a tool to find gene families under
selection in transcriptomes
T. Jeffrey Cole and Michael S. Brewer
Department of Biology, East Carolina University, Greenville, NC, United States of America
ABSTRACT
Background. The recent proliferation of large amounts of biodiversity transcriptomic
data has resulted in an ever-expanding need for scalable and user-friendly tools capable
of answering large scale molecular evolution questions. FUSTr identifies gene families
involved in the process of adaptation. This is a tool that finds genes in transcriptomic
datasets under strong positive selection that automatically detects isoform designation
patterns in transcriptome assemblies to maximize phylogenetic independence in
downstream analysis.
Results. When applied to previously studied spider transcriptomic data as well as
simulated data, FUSTr successfully grouped coding sequences into proper gene families
as well as correctly identified those under strong positive selection in relatively little
time.
Conclusions. FUSTr provides a useful tool for novice bioinformaticians to charac-
terize the molecular evolution of organisms throughout the tree of life using large
transcriptomic biodiversity datasets and can utilize multi-processor high-performance
computational facilities.
Subjects Bioinformatics, Evolutionary Studies, Genomics
Keywords Transcriptomics, Positive selection, Molecular evolution, Gene family reconstruction
Submitted 15 July 2017
Accepted 15 December 2017 BACKGROUND
Published 17 January 2018 Elucidating patterns and processes involved in the adaptive evolution of genes and
Corresponding authors genomes of organisms is fundamental to understanding the vast phenotypic diversity
T. Jeffrey Cole,
coleti16@students.ecu.edu found in nature. Recent advances in RNA-Seq technologies have played a pivotal role in
Michael S. Brewer, expanding knowledge of molecular evolution through the generation of an abundance
brewermi14@ecu.edu
of protein coding sequence data across all levels of biodiversity (Todd, Black & Gemmell,
Academic editor
Li Shen 2016). In non-model eukaryotic systems, transcriptomic experiments have become the
Additional Information and de facto approach for functional genomics in lieu of whole genome sequencing. This is
Declarations can be found on due largely to lower costs, better targeting of coding sequences, and enhanced exploration
page 7 of post-transcriptional modifications and differential gene expression (Wang, Gerstein &
DOI 10.7717/peerj.4234 Snyder, 2009). This influx of transcriptomic data has resulted in a need for scalable tools
Copyright capable of elucidating broad evolutionary patterns in large biodiversity datasets.
2018 Cole and Brewer Billions of years of evolutionary processes gave rise to remarkably complex genomic
Distributed under architectures across the tree of life. Numerous speciation events along with frequent
Creative Commons CC-BY 4.0 whole genome duplications have given rise to myriad multigene families with varying
OPEN ACCESS roles in the processes of adaptation (Benton, 2015). Grouping protein encoding genes
How to cite this article Cole and Brewer (2018), FUSTr: a tool to find gene families under selection in transcriptomes. PeerJ 6:e4234;
DOI 10.7717/peerj.4234
into their respective families de novo has remained a difficult task computationally. This
typically entails homology searches in large amino acid sequence similarity networks with
graph partitioning algorithms to cluster coding sequences into transitive groups (Andreev
& Racke, 2006). This is further complicated in eukaryotic transcriptome datasets that
contain several isoforms via alternative splicing (Matlin, Clark & Smith, 2005). Further
exploration of Darwinian positive selection in these families is also nontrivial, requiring
robust Maximum Likelihood and Bayesian phylogenetic approaches.
Here we present a fast tool for finding Families Under Selection in Transcriptomes
(FUSTr), to address the difficulties of characterizing molecular evolution in large-scale
transcriptomic datasets. FUSTr can be used to classify selective regimes on homologous
groups of phylogenetically independent coding sequences in transcriptomic datasets and
has been verified using large transcriptomic datasets and simulated datasets. The presented
pipeline implements a simplified user experience withminimized third-party dependencies,
in an environment robust to breaking changes to maximize long-term reproducibility.
While FUSTr fills a novel niche among sequence evolution pipeline, a recent tool, VESPA
(Webb, Thomas & Mary, 2017), performs several similar functions. Our tool differs in that
it can accept de novo transcriptome assemblies that are not predicted ORFs. VESPA requires
nucleotide data to be in complete coding frames and does not filter isoforms or utilize
transitive clustering to deal with domain chaining. Additionally, VESPA makes use of slow
maximum likelihood methods for tests of selection and provides no information about
purifying selection, whereas FUSTr utilizes a Fast Unconstrained Bayesian Approximation
(FUBAR) (Murrell et al., 2013) to analyze both pervasive and purifying regimes of selection.
IMPLEMENTATION
FUSTr is written in Python with all data filtration, preparation steps, and command line
arguments/parameters for external programs contained in the workflow engine Snakemake
(Köster & Rahmann, 2012). Snakemake allows FUSTr to operate on high performance
computational facilities, while also maintaining ease of reproducibility. FUSTr and all
third-party dependencies are distributed as a Docker container (Merkel, 2014). FUSTr
contains ten subroutines that takes transcriptome assembly FASTA formatted files from
any number of taxa as input and infers gene families that are either under diversifying or
purifying selection. A graphical overview of this workflow and parallelization scheme has
been outlined in Fig. 1.
Data Preprocessing. The first subroutine of FUSTr acts as a quality check step to ensure
input files are in valid FASTA format. Spurious special characters resulting from transferring
text files between multiple operating system architectures are detected and removed to
facilitate downstream analysis.
Isoform detection. Header patterns are analyzed to auto-detectwhether the given assembly
includes isoforms by detecting naming convention redundancies commonly used in isoform
designations, in addition to comparing the header patterns to common assemblers such
as Trinity de novo assemblies (Haas et al., 2013) and Cufflinks reference genome guided
assemblies (Trapnell et al., 2014).
Cole and Brewer (2018), PeerJ, DOI 10.7717/peerj.4234 2/9
Clean
New 
headers 
Gene 
prediction
Longest 
Isoform
Concatenate
BLAST
SiLiX
Gene 
families
mafft
trimAL
PHYLIP FastTree
revTrans
FUBAR
Statistics
Results
Figure 1 Parallelization scheme and workflow of FUSTr. Color coding denotes functional subroutines
in the pipeline: preparation and open reading frame prediction (red); homology inferenece and gene fam-
ily clustering (green); multiple sequence alignment, phylogenetics, and selection detection (brown); and
model selection and reconciliation (blue).
Full-size DOI: 10.7717/peerj.4234/fig-1
Cole and Brewer (2018), PeerJ, DOI 10.7717/peerj.4234 3/9
Gene prediction. Coding sequences are extracted from transcripts using Transdecoder
v3.0.1 (Haas et al., 2013). Transdecoder predicts Open Reading Frames (ORFs) using
likelihood-based approaches. A single best ORF for each transcript with predicted coding
sequence is extracted providing nucleotide coding sequences (CDS) and complementary
amino acid sequences. This facilitates further analyses requiring codon level sequences
while using the more informative amino acid sequences for homology inferences and
multiple sequence alignments. If the data contain several isoforms of the same gene, at
this point only the longest isoform is kept for further analysis to ensure phylogenetic
independence. The user may customize the use of TransDecoder by changing minimum
coding sequence length (default: 30 codons) or strand-specificity (default: off). Users also
have the option to only retain ORFs with homology to known proteins through a BLAST
search against Uniref90 or Swissprot in addition to searching PFAM to identify common
protein domains.
Homology search. The remaining coding sequences are assigned a unique identifier and
then concatenated into one FASTA file. Homologies among peptide sequences are assessed
via BLASTP acceleration through DIAMOND (v.0.9.10) with an e-value cutoff of 10?5.
Gene Family inference. The resulting homology network is parsed into putative gene
families using transitive clustering with SiLiX v.1.2.11, which is faster and has better
memory allocation than other clustering algorithms such as MCL and greatly reduces the
problem of domain chaining (Miele, Penel & Duret, 2011). Sequences are only added to
a family with 35% minimum identity, 90% minimum overlap, with minimum length to
accept partial sequences in families as 100 amino acids, and minimum overlap to accept
partial sequences of 50%. These are the optimal configurations of SiLiX (Bernardes et al.,
2015), but the user is free to configure these options.
Multiple sequence alignment and phylogenetic reconstruction. Multiple amino acid
sequence alignments of each family are then generated using the appropriate algorithm
automatically detected using MAFFT v7.221 (Katoh & Standley, 2013). Spurious columns
in alignments are removed with Trimal v1.4.1’s gappyout algorithm (Capella-Gutiérrez
& Silla-Martínez, 2009). Phylogenies of each family’s untrimmed amino acid multiple
sequence alignment are reconstructed using FastTree v2.1.9 (Price, Dehal & Arkin, 2010).
Trimmed multiple sequence codon alignments are then generated by reverse translation of
the amino acid alignment using the CDS sequences.
Tests for selective regimes. Families containing at least 15 sequences have the necessary
statistical power for tests of adaptive evolution (Wong et al., 2004). Tests of pervasive
positive selection at site specific amino acid level are implemented with FUBAR (Murrell
et al., 2013). Unlike codeml, FUBAR allows for tests of both positive and negative selection
using an ultra-fast Markov chain Monte Carlo routine that averages over numerous
predefined site-classes. When compared to codeml, FUBAR performs as much as 100
times faster (Murrell et al., 2013). Default settings for FUBAR, as used in FUSTr, include
twenty grid points per dimension, five chains of length 2,000,000 (with the first 1,000,000
discarded as burn-in), 100 samples drawn from each chain, and concentration parameter
of the Dirichlet prior set to 0.5.
Cole and Brewer (2018), PeerJ, DOI 10.7717/peerj.4234 4/9
Users have the option to also run tests for pervasive selection using the much slower
CODEML v4.9 (Yang, 2007) with the codon alignments and inferred phylogenies. Log-
likelihood values of codon substitution models that allow positive selection are then
compared to respective nested models not allowing positive selection (M0/M3, M1a/M2a,
M7/M8, M8a/M8); Bayes Empirical Bayes (BEB) analysis then determines posterior
probabilities that the ratio of nonsynonymous to synonymous substitutions (dN/dS)
exceeds one for individual amino acid sites.
Final output and results. The final output is a summary file describing which gene
families were detected and those that are under strong selection and the average dN/dS
per family. A CSV file for each family under selection is generated giving the following
details per codon position of the family alignment: alpha mean posterior synonymous
substitution rate at a site; beta mean posterior non-synonymous substitution rate at a site;
mean posterior beta-alpha; posterior probability of negative selection at a site; posterior
probability of positive selection at a site; Empiricial Bayes Factor for positive selection at a
site; potential scale reduction factor; and estimated effective sample site for the probability
that beta exceeds alpha.
Validation
We tested FUSTr on six published whole body transcriptome sequences from an
adaptive radiation of Hawaiian Tetragnatha spiders (NCBI Short Read Archive accession
numbers: SRX612486, SRX612485, SRX612477, SRX612466, SRX559940, SRX559918)
assembled using the same methods as the original publication (Brewer et al., 2015). Spider
genomes contain numerous gene duplications lending to gene family rich transcriptomes.
Additionally, this adaptive radiation has been shown to facilitate strong, positive, sequence-
level selection in these transcriptomes (Brewer et al., 2015). This dataset provides an ideal
case use for FUSTr.
A total of 273,221 transcripts from all six Tetragnatha samples were provided as input
for FUSTr, and a total of 4,258 isoforms were removed leaving 159,464 coding sequences
for analysis after gene prediction. The entire analysis ran in 13.7 core hours, completing
within an hour when executed on a 24-core server. Time to completion and memory usage
for each of FUSTr’s subroutines performance in this analysis is reported in Table 1. FUSTr
recovered 134 families containing at least 15 sequences. Of these 46 families contained
sites under pervasive positive selection while all families also contained sites under strong
purifying selection. This can be contrasted with the analysis by Brewer et al. (2015), which
found 2,647 one-to-one six-member orthologous loci (one ortholog per each of the same
samples), with 65 loci receiving positive selection based on branch-specific analysis. The
original analysis did not allow paralogs whereas FUSTr does not reconstruct one-to-one
orthogroups but entire putative gene families, and the selection analysis utilized by FUSTr
is site-specific and not branch-specific. Thus, it is not expected that the results from FUSTr
would perfectly match up with the original analysis; however, five of the 46 families FUSTr
found to be under selection included loci from Brewer et al.’s (2015) original 65 under
selection based on branch-specific analysis.
Cole and Brewer (2018), PeerJ, DOI 10.7717/peerj.4234 5/9
Table 1 Benchmarks for each subroutines’ time andmemory used for the Tetragnatha transcriptome
assembly analysis. Red highlighted row represents subroutine consuming the most memory and time per
task, blue highlighted row represents subroutine consuming the most memory and time in total.
Subroutine Tasks Seconds per Total seconds RAM per Total RAM (MiB)
task task (MiB)
Clean fastas 6 1.40 8.38 46.5 278.9
New headers 6 1.65 9.90 43.6 261.5
Long isoform 6 0.512 3.07 51.5 309.13
Transdecoder 1 10,436.7 10,436.7 3,249.8 3,249.8
Diamond 1 32.1 32.1 234.0 234.0
SiLiX 1 4.51 4.51 22.8 22.8
Mafft 135 3.24 437.8 18.3 2,466.5
FastTree 135 3.09 417.4 18.5 2,491.3
TrimAL 135 1.87 252.2 17.9 2,415.6
FUBAR 135 278.6 37,605.5 28.8 3,886.2
The same 273,221 transcripts were entered as input for VESPA as a comparative analysis.
Because VESPA cannot detect and filter ORFs in transcripts, it was unable to infer proper
coding sequences. In its first phase of cleaning input FASTA files, 86,269 transcripts were
wrongly removed for having ‘‘internal stop codons’’ via improper reading frame inference,
and 182,000 transcripts were removed due to ‘‘abnormal sequence length.’’ Approximately
98% of the transcripts were removed in the first phase of VESPA with no gene predictions,
rendering further analysis unnecessary for proper comparison of the performance of the
two pipelines.
We further validated FUSTr using coding sequences from simulated gene families with
predetermined selective regimes. We used EvolveAGene (Hall, 2007) on 3,000 random
coding sequences of a random length of 300–500 codons to generate gene families
containing 16 sequences evolved along a symmetric phylogeny each with average branch
lengths chosen randomly between 0.01–0.20 evolutionary units. Selective regimes with a
selection modifier of 3.0 were randomly chosen for each family so that a random 10%
partition of the family received pervasive positive selection, purifying selection, or constant
selection. All other settings for EvolveAGene were left as their defaults: the probability of
accepting an insertion= 0.1, the probability of accepting a deletion= 0.025, the probability
of accepting a replacement = 0.016, and no recombination was allowed. A visual schema
for these simulations can be found in Fig. 2.
The resulting 48,000 simulated sequences were used as input for FUSTr with
TransDecoder set to be strand-specific. FUSTr correctly recovered all 3,000 families,
and all 975 that were randomly selected to undergo strong positive selection were
correctly classified as receiving pervasive positive selection. Additionally, the families
selected to undergo purifying selection were correctly classified, and families selected
to receive constant selection were classified as not having any specific sites undergoing
purifying or pervasive positive selection. Scripts for these simulations can be found at
https://github.com/tijeco/FUSTr.
Cole and Brewer (2018), PeerJ, DOI 10.7717/peerj.4234 6/9
ATG GAG AAT GAC GCC CTC TAA
ATG GAG AAT GAC GCC CTT TAA
ATG GAG AAT GAT GCC CTA TAA
ATG GAG AAT GAT GCA CTC TAA
1. Seed sequence (300-500 codons)
ATG GAA AAT GAC GAA CTA TAA
ATG GAA AAT GAC GAA CTA TAA
ATG GAG AAT GAC GAA CTA TAA
ATG GAG AAC GAT GAG TTA TAA
ATG GAA AAT GAT GAG TTA TAA
ATG GAG AAT AAC GGC CTT TAA
ATG GAG AAT AAC GGA CTC TAA
ATG GAG AAC AAC GAA CTA TAA
ATG GAG AAT AAT GAA CTA TAA
2. Evolve gene along 16 node phylogeny
    (x ?branch length of 0.01-0.20) ATG GAA AAT AAT GAG CTA TAA
ATG GAA AAT AAC GAA CTA TAA
ATG GAA AAT AAT GAA CTA TAA
ATG GAA AAC AAT GAA CTT TAA
3. Predefine selective regime 
    (positive, purifying, constant)               
    across sequences
Figure 2 Schematic of EvolveAGene methods used to simulate sequences for the validation of FUSTr.
Sequences were randomly generated and evolved along a symmetric phylogeny under a given selective
regime (positive, negative, or constant across the entire gene).
Full-size DOI: 10.7717/peerj.4234/fig-2
CONCLUSIONS
Current advances in RNA-seq technologies have allowed for a rapid proliferation of
transcriptomic datasets in numerous non-model study systems. It is currently the only tool
equipped to deal with the nuances of transcriptomic data, allowing for proper prediction
of gene sequences and isoform filtration. FUSTr provides a fast and useful tool for novice
bioinformaticians to detect gene families in transcriptomes under strong selection. Results
from this tool can provide information about candidate genes involved in the processes of
adaptation, in addition to contributing to functional genome annotation.
ACKNOWLEDGEMENTS
This work would not have been possible without XSEDE computational allocations
(BIO160060). We also thank Chris Cohen for editing this manuscript.
ADDITIONAL INFORMATION AND DECLARATIONS
Funding
This work was supported by National Science Foundation Graduate Research Fellowship
and the East Carolina University Department of Biology. The funders had no role in study
design, data collection and analysis, decision to publish, or preparation of the manuscript.
Cole and Brewer (2018), PeerJ, DOI 10.7717/peerj.4234 7/9
Grant Disclosures
The following grant information was disclosed by the authors:
National Science Foundation Graduate Research Fellowship.
East Carolina University Department of Biology.
Competing Interests
The authors declare there are no competing interests.
Author Contributions
• T. Jeffrey Cole conceived and designed the experiments, performed the experiments,
analyzed the data, wrote the paper, prepared figures and/or tables, reviewed drafts of the
paper.
• Michael S. Brewer conceived and designed the experiments, performed the experiments,
analyzed the data, contributed reagents/materials/analysis tools, wrote the paper,
reviewed drafts of the paper.
Data Availability
The following information was supplied regarding data availability:
Github: https://github.com/tijeco/FUSTr.
REFERENCES
Andreev K, Racke H. 2006. Balanced graph partitioning. Theory of Computing Systems
39:929–939 DOI 10.1007/s00224-006-1350-7.
Benton R. 2015.Multigene family evolution: perspectives from insect chemoreceptors.
Trends in Ecology & Evolution 30:590–600 DOI 10.1016/j.tree.2015.07.009.
Bernardes JS, Vieira FR, Costa LM, Zaverucha G. 2015. Evaluation and improvements
of clustering algorithms for detecting remote homologous protein families. BMC
Bioinformatics 16:34 DOI 10.1186/s12859-014-0445-4.
Brewer MS, Carter RA, Croucher PJP, Gillespie RG. 2015. Shifting habitats, morphol-
ogy, and selective pressures: developmental polyphenism in an adaptive radiation of
Hawaiian spiders. Evolution 69:162–178 DOI 10.1111/evo.12563.
Capella-Gutiérrez S, Silla-Martínez JM. 2009. trimAl: a tool for automated alignment
trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972–1973
DOI 10.1093/bioinformatics/btp348.
Haas BJ, Papanicolaou A, YassourM, Grabherr M, Blood PD, Bowden J, Couger MB,
Eccles D, Li B, Lieber M, Macmanes MD, Ott M, Orvis J, Pochet N, Strozzi F,
Weeks N,Westerman R,William T, Dewey CN, Henschel R, Leduc RD, Friedman
N, Regev A. 2013. De novo transcript sequence reconstruction from RNA-seq
using the Trinity platform for reference generation and analysis. Nature Protocols
8:1494–1512 DOI 10.1038/nprot.2013.084.
Hall B. 2007. EvolveAGene 3: a DNA coding sequence evolution simulation program.
Molecular Biology and Evolution 25(4):688–695 DOI 10.1093/molbev/msn008.
Cole and Brewer (2018), PeerJ, DOI 10.7717/peerj.4234 8/9
Katoh K, Standley DM. 2013.MAFFT multiple sequence alignment software version
7: improvements in performance and usability.Molecular Biology and Evolution
30(4):772–780 DOI 10.1093/molbev/mst010.
Köster J, Rahmann S. 2012. Snakemake—a scalable bioinformatics workflow engine.
Bioinformatics 28:2520–2522 DOI 10.1093/bioinformatics/bts480.
Matlin A, Clark F, Smith C. 2005. Understanding alternative splicing: towards a cellular
code. Nature Reviews Molecular Cell Biology 386–398 DOI 10.1038/nrm1645.
Merkel D. 2014. Docker: lightweight linux containers for consistent development and
deployment. Linux Journal 239:2.
Miele V, Penel S, Duret L. 2011. Ultra-fast sequence clustering from similarity networks
with SiLiX. BMC Bioinformatics 12:1–9 DOI 10.1186/1471-2105-12-116.
Murrell B, Moola S, Mabona A,Weighill T, Sheward D, Pond S, Scheffler K. 2013.
FUBAR: a fast, unconstrained bayesian approximation for inferring selection.
Molecular Biology and Evolution 30:1196–1205 DOI 10.1093/molbev/mst030.
Price MN, Dehal PS, Arkin AP. 2010. FastTree 2—approximately maximum-likelihood
trees for large alignments. PLOS ONE 5(3):e9490 DOI 10.1371/journal.pone.0009490.
Todd E, BlackM, Gemmell N. 2016. The power and promise of RNA-seq in ecology and
evolution.Molecular Ecology 25(6):1224–1241 DOI 10.1111/mec.13526.
Trapnell C, Roberts A, Goff L, Pertea G, KimD, Kelley D, Pimentel H, Salzberg S,
Rinn J, Pachter L. 2014. Differential gene and transcript expression analysis of
RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 562–578
DOI 10.1038/nprot.2012.016.
Wang Z, Gerstein M, Snyder M. 2009. RNA-Seq: a revolutionary tool for transcrip-
tomics. Nature Reviews Genetics 57–63 DOI 10.1038/nrg2484.
Webb AE, Thomas AW,Mary JO. 2017. VESPA: very large-scale evolutionary and
selective pressure analyses. PeerJ Computer Science 3:e118 DOI 10.7717/peerj-cs.118.
WongWSW, Yang Z, Goldman N, Nielsen R. 2004. Accuracy and power of statistical
methods for detecting adaptive evolution in protein coding sequences and for
identifying positively selected sites. Genetics 168(2):1041–1051
DOI 10.1534/genetics.104.031153.
Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood.Molecular Biology
and Evolution 24(8):1586–1591 DOI 10.1093/molbev/msm088.
Cole and Brewer (2018), PeerJ, DOI 10.7717/peerj.4234 9/9