Fig. 1 Transcription: DNA~>mRNA
I. Background
Dredd Blog posts have featured genes of microbes and viruses for years (On the Origin of the Genes of Viruses, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17).
The general aim is to find details that mainstream media cover less, including not only those that may affect local disease outbreaks, but also those that bear on outbreaks that may become pandemics (On The Origin Of The Home Of COVID-19, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27).
II. Lately
I have been trying to persuade microbiologists, virologists, and government (e.g. NCBI GenBank) to correct errors in the format used for genome nucleotide (base pair) sequences (It's In The GenBank, 2, 3; Small Things Considered).
My requests for discussion, posted on 'scientific blogs', tend to be marked as spam, deleted, delayed, or otherwise 'disappeared'.
But, "lo and behold", honest, full, professional, and friendly consideration was given by the esteemed and helpful NCBI (of the dot gov realm) when Eric Cox forwarded my email to Wayne Matten:
[my email to Customer services was forwarded to a specialist]
NCBI Customer Services Division NCBI | NLM | NIH
Case Information:
Case #: CAS-867526-D3B5Z4
Customer Name: unknown unknown
Customer Email: dreddblog@gmail.com
Case Created: 3/11/2022, 10:59:57 AM
Summary: Feedback via Datasets feedback form
Details:
Hi Helpdesk,
We got the following feedback via the Datasets feedback button. Since this is not a Datasets issue but rather an NCBI policy (and I don't know the answer!), I'm forwarding to you guys.
[my email:] "I have tried and tried to communicate with anyone having information about why RNA genomes are not represented (in GenBank GBFF or FASTA files) with a 'U' (Uracil) nucleotide, but are instead represented by a 'T' (Thymine) nucleotide. see https://blogdredd.blogspot.com/2022/02/its-in-genbank-3.html (email: dreddblog@gmail.com)"
Thanks! Eric Cox
[this is the result which was sent back to Dredd Blog]
Re: case #CAS-867526-D3B5Z4: Feedback via Datasets feedback form TRACKING:000270000004949
Hello, Thank you for your question. The short answer is that this is the convention for GenBank and the other members of the INSDC, http://www.insdc.org/ [https://www.insdc.org/documents/feature-table#7.1.2]
The practical explanation is that RNA sequences are generated via cDNA technology, so all U's are replaced by T's.
The cDNA sequences are then submitted to one of the INSDC members. Directly sequencing RNA is possible, although on a smaller scale than DNA sequencing, yet the INSDC maintains the convention, most likely to simplify data management.
Best regards, Wayne
-=-=-=-=-=-=-=-=-=-=-=-
Wayne Matten, PhD
At least the .gov sites did not censor Dredd Blog, or mark my emails as spam, which was a pleasant surprise.
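The convention Wayne describes is a one-for-one substitution of symbols ('T' written where the RNA has 'U'), so a reader who wants to see an RNA genome in RNA notation can reverse it locally. Here is a minimal Python sketch of that idea, assuming plain FASTA text as input. The function names and the sample record are hypothetical, made up for illustration, and are not taken from GenBank or from the files I sent to NCBI:

```python
def dna_notation_to_rna(sequence: str) -> str:
    """Replace every T/t with U/u; all other symbols are left untouched."""
    return sequence.translate(str.maketrans("Tt", "Uu"))


def convert_fasta(fasta_text: str) -> str:
    """Convert the sequence lines of FASTA text to RNA notation.

    Header lines (those starting with '>') are copied unchanged so that
    record names and descriptions are not altered.
    """
    converted = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            converted.append(line)  # header: keep as-is
        else:
            converted.append(dna_notation_to_rna(line))  # sequence data
    return "\n".join(converted)


# Hypothetical record, invented for this example only.
example = ">hypothetical_rna_virus example record\nATTAAAGGTTTATACCTTCC\n"
print(convert_fasta(example))
```

This changes notation only. It is appropriate only when the record is already known to describe an RNA genome; applied to a DNA genome it would produce a wrong result.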
I responded as follows:
Wayne replied,
Wayne
I replied:
Some example files that were in the .zip file I sent to Wayne are in today's Appendix.
III. Closing Comments
The graphic in Fig. 1 shows how RNA is depicted in drawings of its chemical makeup: with a 'U' for uracil.
Fig. 2 GenBank can do this too
In the case of the scientific paper (see link at Fig. 2), if 'T' for thymine had been used instead of 'U' for uracil, the paper would not have been published in The Journal of Virology.
I don't think GenBank (a very good source of data) should continue its practice of using a 'T' in place of the 'U' when it stores RNA genomes in its vast database.
We must take this and related issues very seriously:
"Assembly pipelines often result in viral genomes contaminated with host genetic material, some of which are currently deposited into public databases ... For years now the study of viruses and their genetic composition has been important in their identification and classification. Especially in these times of the pandemic turmoil, accurate knowledge of a virus’ exact genetic composition can help identify its strengths and weaknesses allowing us to track its evolution and assist in the development of vaccines and antiviral agents. The reconstruction of these genomic sequences is called the assembly process, a bioinformatics approach which can be complicated and full of pitfalls. This work identifies one such issue, concerning artifacts introduced in viral genomes from the new technologies of nucleic acid sequencing. The proposed algorithm helps alleviate this problem by tentatively removing these problematic regions while keeping the vast majority of the genetic information required to produce a more complete viral genome. This work is anticipated to assist in the submission of higher integrity and accuracy viral genomes in public databases used for novel virus identification and characterization ... Open databases, such as GenBank, while they contain most of the current sequences, are poorly curated regarding the integrity and the accuracy of the submitted sequences. Indeed, various reports have highlighted contamination of sequences with bacterial moieties that were erroneously incorporated in the final assembly. Such errors are especially important as they result in false positive identification of viruses that happen to contain in their proposed viral genomes parts of host DNA or RNA."
(ZWA: Viral genome assembly and characterization hindrances from virus-host chimeric reads; a refining approach, emphasis added; cf. Some Of My Best Friends Are Germs - 2).
"Parts is parts" is not the way to deal with this.
On another front, Dredd Blog recently pointed out that the IPCC's latest report is critical of the way officialdom reacts to problems that are a concern to the public, calling it a "crime" (How Microbes Communicate In The Tiniest Language - 3).
The next post in this series is here, the previous post in this series is here.
Welcome to Wayne's World: