Thursday, June 22, 2023

It's In The GenBank - 8

Fort Knox Mythology
I. Background

In the previous post in this series (It's In The GenBank - 7), Dredd Blog took a look at a small percentage of genome record errors involving a mismatch between the codons in the CDS segments of .GBFF files and the amino acids said to be associated with them.

In other words, if in a range of nucleotides a codon therein was supposed to "code for" a particular amino acid, did the GenBank record have the correct amino acid listed in the "/translation=" section of that .gbff file?
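As a rough illustration of where that listed amino acid sequence lives, here is a minimal Python sketch that pulls the text of every "/translation=" qualifier out of a chunk of .gbff text. The function name `extract_translations` and the sample record are made up for this example, and the sketch assumes the qualifier value is wrapped in double quotes and may continue over several lines.

```python
import re

def extract_translations(gbff_text):
    """Return the amino acid strings found in /translation="..." qualifiers."""
    # A qualifier value starts after /translation=" and runs to the next
    # double quote, possibly spanning several lines of the record.
    pattern = re.compile(r'/translation="([^"]*)"')
    # Strip the line breaks and indentation that fall inside each value.
    return ["".join(value.split()) for value in pattern.findall(gbff_text)]

# Hypothetical, shortened CDS feature used only to exercise the function.
sample = '''     CDS             1..18
                     /codon_start=1
                     /translation="MKT
                     AL"
'''

print(extract_translations(sample))
```

The sketch returns only the amino acid strings. Pairing each one with its nucleotide range still means reading the CDS location and the sequence data from the same record.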

Small percentages of errors have been discovered in public databases for many a year:

"Overall, we conclude that, as a conservative estimate, 1 in every 20 public database records is likely to be corrupt. Our results support concerns recently expressed over the quality of the public repositories. With 16S rRNA sequence data increasingly playing a dominant role in bacterial systematics and environmental biodiversity studies, it is vital that steps be taken to improve screening of sequences prior to submission."

(At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies, 2005, emphasis added). Got 'science' (If science was perfect, it wouldn’t be science)?

The errors in that cited paper took an incredible amount of work to discover (ibid). The errors Dredd Blog is pointing out, however, are very easy to detect, and are not related to sample collection errors as was the case in the cited paper.

Anyway, the appendices in the previous version (A-K) are supplemented with Appendix L in today's post.

These translation errors are based solely on the data in GenBank sequence files, where a particular codon is not a codon that codes for the amino acid specified at that position.
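The check described here can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the method used to build the appendices: it assumes the standard genetic code (NCBI translation table 1), a CDS that is already extracted in reading order starting at its first codon, and it ignores "/transl_table=", "/codon_start=", and "/transl_except=" qualifiers as well as alternative start codons (which a record may legitimately list as M). The names `CODON_TABLE` and `find_mismatches` are made up for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), with codons in the
# order TTT, TTC, TTA, TTG, TCT, ... (third base varying fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def find_mismatches(cds, translation):
    """Compare each listed amino acid with the one its codon codes for.

    Returns a list of (index, codon, expected, listed) tuples, where index
    is the 0-based position in the translation. Codons that are incomplete
    or contain letters other than A, C, G, T get an expected value of "?".
    """
    cds = cds.upper()
    mismatches = []
    for index, listed in enumerate(translation):
        codon = cds[3 * index : 3 * index + 3]
        expected = CODON_TABLE.get(codon, "?")
        if expected != listed:
            mismatches.append((index, codon, expected, listed))
    return mismatches

# Hypothetical data: ATG AAA ACC TAA codes for M, K, T, then a stop.
print(find_mismatches("ATGAAAACCTAA", "MKT"))  # consistent record
print(find_mismatches("ATGAAAACCTAA", "MRT"))  # second amino acid is wrong
```

A mismatch reported by a sketch like this is only a candidate error. It still has to be checked against the record's own translation table and exception qualifiers before it is counted.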

The errors could be a word processor user's typo, or any of several other causes.

The purpose of this post and the previous one in this series is not to determine the cause of the errors. The purpose is simply to point out that some errors do exist in the public database arena.

Users should know how to detect and correct such errors before using the records in any number of pursuits.

II. Closing Comments

The GenBank is a wonderful public database, and a wonderful public service.

Links have now been provided (HTTP, FTP, etc.) to that rich source of information concerning many microbial genomes found on our planet.

I hope it provides some benefits to Dredd Blog users. 

The previous post in this series is here.
