Dredd Blog: 2022-03-13

Tuesday, March 15, 2022

It's In The GenBank - 4

I. Background

Dredd Blog posts have featured genes of microbes and viruses for years (On the Origin of the Genes of Viruses, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17).

The general focus and effort is to find details less covered by mainstream media, including not only those which may impact disease outbreaks locally, but also those that may become pandemics (On The Origin Of The Home Of COVID-19, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27).

II. Lately

I have been trying to persuade microbiologists, virologists, and government (e.g. NCBI GenBank) to correct errors in the format used for genome nucleotide (base pair) sequences (It's In The GenBank, 2, 3; Small Things Considered).

My requests for a discussion made on 'scientific blogs' tend to be either marked as spam, deleted, delayed or otherwise 'disappeared'.

But, lo and behold, honest, full, professional, and friendly consideration was given by the esteemed and helpful NCBI (of the dot gov realm) when Eric Cox forwarded my email to Wayne Matten:

[my email to Customer services was forwarded to a specialist]

NCBI Customer Services Division NCBI | NLM | NIH

Case Information:
Case #: CAS-867526-D3B5Z4
Customer Name: unknown unknown
Customer Email: dreddblog@gmail.com
Case Created: 3/11/2022, 10:59:57 AM
Summary: Feedback via Datasets feedback form
Details:

Hi Helpdesk,

We got the following feedback via the Datasets feedback button. Since this is not a Datasets issue but rather an NCBI policy (and I don't know the answer!), I'm forwarding to you guys.

[my email:] "I have tried and tried to communicate with anyone having information about why RNA genomes are not represented (in GenBank GBFF or FASTA files) with a 'U' (Uracil) nucleotide, but are instead represented by a 'T' (Thymine) nucleotide. see https://blogdredd.blogspot.com/2022/02/its-in-genbank-3.html (email: dreddblog@gmail.com)"

Thanks! Eric Cox

[this is the result which was sent back to Dredd Blog]

Re: case #CAS-867526-D3B5Z4: Feedback via Datasets feedback form TRACKING:000270000004949

Hello, Thank you for your question. The short answer is that this is the convention for GenBank and the other members of the INSDC, http://www.insdc.org/ [https://www.insdc.org/documents/feature-table#7.1.2]

The practical explanation is that RNA sequences are generated via cDNA technology, so all U's are replaced by T's.

The cDNA sequences are then submitted to one of the INSDC members. Directly sequencing RNA is possible, although on a smaller scale than DNA sequencing, yet the INSDC maintains the convention, most likely to simplify data management.

Best regards, Wayne

-=-=-=-=-=-=-=-=-=-=-=-

Wayne Matten, PhD

At least the .gov sites did not censor Dredd Blog, or mark my emails as spam, which was a pleasant surprise.

I responded as follows:

Hello Wayne,

Thank you kindly for your response.

The problem I have is that RNA does not contain 'T' (thymine) and DNA does not contain 'U' (uracil) (NIH, Some Of My Best Friends Are Germs).

When we place the 'T' in a database of RNA nucleotides we are promoting false information.

Attached are some examples in a .zip file of what they would look like if corrected.

Dredd

Wayne replied,

Hello Dredd,

I see your point of view. Another way to look at it is, since researchers submit their sequences with T's in place of U's, we would be misrepresenting their data by replacing the T's with U's.

The reason researchers submit sequences the way they do is because that is the convention. So within the constraints of the convention, the data are not false.

Here is a bit of official documentation:

https://www.insdc.org/files/feature_table.html#7.4.1

Best regards,
Wayne

I replied:

Wayne,

That is not the whole picture. Misrepresenting 'U' (uracil) as 'T' (thymine) is not the only falsification. The sequence itself changes. The positions of the base pairs change. "We've been doing it wrong so long that makes it right" is not a practice I subscribe to. Here is a paper by a serious scientist who knows a 'U' from a 'T' (Mutational Analysis of the Influenza Virus cRNA Promoter and Identification of Nucleotides Critical for Replication). If he sends a GBFF file, or FASTA, he will send the right letters (don't reject his data for that).

The .zip file I sent you shows how the locations of the molecules change during translation.

I know you can't do anything about it, so thanks for responding anyway.

Dredd

Some example files that were in the .zip file I sent to Wayne are in today's Appendix.

III. Closing Comments

The graphic at Fig. 1 reveals how RNA is depicted in drawings intended to show the chemical makeup of RNA ('U' for uracil).

Fig. 2 GenBank can do this too

The graphic at Fig. 2 shows how serious scientific papers use that same technique to reveal deep-down RNA genomic dynamics ('U' for uracil).

In the case of the scientific paper (see link at Fig. 2), if 'T' for thymine had been used instead of 'U' for uracil, the paper would not have been published in The Journal of Virology.

I don't think GenBank (a very good source of data) should continue its practice of using a 'T' in place of the 'U' when storing RNA data in its vast database.

We must take this and related issues very seriously:

"Assembly pipelines often result in viral genomes contaminated with host genetic material, some of which are currently deposited into public databases ... For years now the study of viruses and their genetic composition has been important in their identification and classification. Especially in these times of the pandemic turmoil, accurate knowledge of a virus’ exact genetic composition can help identify its strengths and weaknesses allowing us to track its evolution and assist in the development of vaccines and antiviral agents. The reconstruction of these genomic sequences is called the assembly process, a bioinformatics approach which can be complicated and full of pitfalls. This work identifies one such issue, concerning artifacts introduced in viral genomes from the new technologies of nucleic acid sequencing. The proposed algorithm helps alleviate this problem by tentatively removing these problematic regions while keeping the vast majority of the genetic information required to produce a more complete viral genome. This work is anticipated to assist in the submission of higher integrity and accuracy viral genomes in public databases used for novel virus identification and characterization ... Open databases, such as GenBank, while they contain most of the current sequences, are poorly curated regarding the integrity and the accuracy of the submitted sequences. Indeed, various reports have highlighted contamination of sequences with bacterial moieties that were erroneously incorporated in the final assembly. Such errors are especially important as they result in false positive identification of viruses that happen to contain in their proposed viral genomes parts of host DNA or RNA."

(ZWA: Viral genome assembly and characterization hindrances from virus-host chimeric reads; a refining approach, emphasis added; cf. Some Of My Best Friends Are Germs - 2).

"Parts is parts" is not the way to deal with this.

On another front, Dredd Blog recently pointed out that the IPCC's latest report is critical of the way officialdom reacts to problems that are a concern to the public, calling it a "crime" (How Microbes Communicate In The Tiniest Language - 3).

The next post in this series is here, the previous post in this series is here.

Welcome to Wayne's World:

Appendix NTB

This is an appendix to: It's In The GenBank - 4

Proposed GenBank RNA format:

>NC_026431.1 982bp cRNA Influenza A virus (A/California/07/2009(H1N1))
uacucagaag auuggcucca gcuuugcaug caagaaagau aguagggcag uccgggggag
uuucggcucu agcgcgucuc ugaccuuuca cagaaacguc cuuucuugug ucuagaacuc
cgagaguacc uuaccgauuu cuguucuggu uagaacagug gagacugauu cccuuaaaau
ccuaaacaca agugcgagug gcacggguca cucgcuccug acgucgcauc ugcgaaacag
guuuuacggg auuuacccuu accccugggc uuguuguacc uaucucguca auuugauaug
uucuucgagu uuucucuuua uugcaaggua ccccgguucc uccacaguga uucgauaagu
ugaccacgug aacggucaac guacccggag uauauguugu ccuacccuug ucacuggugu
cuucgacgaa aaccagauca cacacgguga acacuugucu aacgacuaag ugucguagcc
agagugucug ucuaccgaug auggugguua ggugauuagu ccguacuuuu gucuuaccac
gaccgaucgu gaugccguuu ccgauaccuu gucuaccgac cuagcucacu uguccgucgc
cuccgguacc uccaacgauu agucugaucc gucuaccaug uacguuacuc uugauaaccc
ugaguaggau cgaggucacg accagacuuu cuacuggaag aacuuuuaaa cguccggaug
gucuucgcuu acccucacgu cuacgucgcu aaguucacua ggagagcagu aacgucguuu
auaguaaccc uagaacgugg acuauaacac cuaaugacua gcagaaaaaa aguuuacaua
aauagcagcg aaauuuaugc caaacuuuuc ucccggaaga ugccuuccuc acggacucag
guacucccuu cuuauaguug uccuugucgu cucacgacac cuacaacugc uaccaguaaa
acaguuguau cucgaucuca uu
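
For readers who want to experiment with the layout shown above, here is a minimal Python sketch. It is not the tool that produced the files sent to NCBI. It assumes only two things: that the letter 'T' in a GenBank-style record is rewritten as 'U', and that the result is printed in lowercase groups of 10 letters, 6 groups per line, as in the record above. The function names (to_rna_letters, format_blocks, proposed_record) and the short demo input are hypothetical, chosen for illustration.

```python
def to_rna_letters(seq: str) -> str:
    """Rewrite thymine letters as uracil letters, preserving case.

    Assumption: only the letters T/t are changed; every other
    character is passed through untouched.
    """
    return seq.translate(str.maketrans("Tt", "Uu"))


def format_blocks(seq: str, group: int = 10, per_line: int = 6) -> str:
    """Lay out a sequence as lowercase groups, several groups per line.

    The defaults (10 letters per group, 6 groups per line) mirror the
    appendix record; they are not an official GenBank setting.
    """
    seq = "".join(seq.split()).lower()  # drop spaces/newlines, lowercase
    groups = [seq[i:i + group] for i in range(0, len(seq), group)]
    lines = [
        " ".join(groups[i:i + per_line])
        for i in range(0, len(groups), per_line)
    ]
    return "\n".join(lines)


def proposed_record(header: str, t_style_seq: str) -> str:
    """Build a FASTA-like record: '>' header line, then 'U'-style body."""
    return f">{header}\n{format_blocks(to_rna_letters(t_style_seq))}"


if __name__ == "__main__":
    # Hypothetical demo input: the first 20 letters of the record above,
    # written with 'T' the way a GenBank FASTA file would carry them.
    print(proposed_record("demo", "TACTCAGAAGATTGGCTCCA"))
```

Run as a script, the demo prints a ">demo" header followed by "uacucagaag auuggcucca", which matches the first two groups of the record above.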