Genomic Database Conundrum: Widespread Misannotation of rRNA Sequences as Protein Sequences

Raymond, Miranda

Files

RAYMOND-HONORSTHESIS-2017.pdf (1.27 MB)

Date

2017-12-11

Authors

Raymond, Miranda

Publisher

East Carolina University

Abstract

The genomics revolution introduced affordable technology capable of rapidly analyzing and comparing massive amounts of biological sequence data. In this study, a highly expressed gene sequence obtained from the plant Leptosiphon jepsonii was analyzed using the Basic Local Alignment Search Tool (BLAST) on the National Center for Biotechnology Information (NCBI) website. The sequence was compared against other sequences archived in the NCBI database to identify similarities. These comparisons spanned a broad taxonomic range, including other green plants, fungi, metazoans, algae and single-celled organisms. The original query sequence was first compared to inferred protein sequences. The mRNA sequences corresponding to these proteins were then searched against complete nucleotide accessions through reciprocal BLAST searches to verify the results. The most similar sequences returned by these reciprocal searches were rRNA rather than mRNA sequences. This result indicates that numerous accessions in NCBI are incorrectly annotated as mRNAs and proteins rather than as ribosomal sequences. To explore the breadth of this misannotation issue, sequences from a wide range of organisms, including model genomes, were also examined. This study indicates that rapid, automated computational analyses of massive amounts of sequence data, combined with a heightened focus on novel findings, have led to a sizable influx of erroneous data within even the most reputable databases.
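The reciprocal-search check described in the abstract can be illustrated with a minimal sketch. This is not the thesis's own code or pipeline. It assumes the reciprocal BLAST results have already been saved as tab-separated text with the columns `qseqid sseqid pident evalue bitscore stitle` (a custom tabular layout that BLAST+ can produce), and it uses a simple keyword heuristic on the hit title to decide whether a hit looks ribosomal. The function names, the pattern, and the example rows are all hypothetical and made up for illustration.

```python
import re

# Assumed column layout of the saved reciprocal BLAST output.
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "bitscore", "stitle"]

# Heuristic (an assumption): a hit title mentioning "rRNA" or "ribosomal RNA"
# is treated as ribosomal. "ribosomal protein" is deliberately not matched.
RRNA_PATTERN = re.compile(r"\brRNA\b|ribosomal RNA", re.IGNORECASE)


def best_hits(tabular_text):
    """Return the highest-bitscore hit per query from tab-separated text."""
    best = {}
    for line in tabular_text.splitlines():
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, fields))
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


def flag_possible_rrna(tabular_text):
    """Map each query ID to True if its best hit's title looks like rRNA."""
    return {
        query_id: bool(RRNA_PATTERN.search(hit["stitle"]))
        for query_id, hit in best_hits(tabular_text).items()
    }


# Made-up rows for illustration only (not real accessions or results).
EXAMPLE = (
    "mrna_A\tsubj_1\t98.5\t0.0\t850\tHypothetical species 26S ribosomal RNA gene\n"
    "mrna_A\tsubj_2\t91.0\t1e-50\t400\tHypothetical species predicted protein mRNA\n"
    "mrna_B\tsubj_3\t99.0\t0.0\t900\tHypothetical species ribosomal protein L3 mRNA\n"
)

if __name__ == "__main__":
    for query_id, is_rrna in flag_possible_rrna(EXAMPLE).items():
        label = "best hit looks like rRNA" if is_rrna else "best hit not flagged"
        print(query_id, label)
```

An accession annotated as mRNA whose best reciprocal hit is flagged this way would be a candidate for the kind of misannotation the abstract reports. A keyword match on titles is only a first-pass screen, and any flagged accession would still need manual inspection.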