Dredd Blog: The Advent Of The Sulfur Atom

Thursday, January 4, 2024

The Advent Of The Sulfur Atom - 4

In The Advent Of The Sulfur Atom - 2 it was shown that the way proteins and nucleotides are presented in GenBank .gbff files is derived internally ("in-house").

But to save space in databases and simplify the use of proteins (composed of amino acids) and base pairs/nucleotides ('A', 'C', 'G', 'T', and 'U') when they are in separate lists or groups (as they are in .gbff files), I needed a solution: a simplification, if you will.

So, I developed a code that is only two characters long which identifies both the amino acid and its specific codon complement.

In Fig. 1 the outer perimeter is the amino acid layer: 20 unique amino acids (with a few repeats), three terminators (stop), and one promoter (start).

Beneath that outer layer are the codon/nucleotide layers.

A codon is 3 sequential nucleotides (e.g. "AUG", where 'A' is the first nucleotide, 'U' the second, and 'G' the third).
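As a minimal sketch of that definition (the function name `split_codons` is hypothetical and is not taken from my software), an mRNA string can be cut into consecutive three-nucleotide codons like this:

```python
def split_codons(mrna: str) -> list[str]:
    """Split an mRNA string into consecutive 3-nucleotide codons.

    Any trailing 1 or 2 nucleotides that do not form a full codon are dropped.
    """
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


codons = split_codons("AUGGUUUAG")
print(codons)  # ['AUG', 'GUU', 'UAG']
```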

The third nucleotide layer, just under the amino acid layer, contains 16 nucleotides.

Then inside that layer is the second nucleotide layer containing 4 nucleotides.

In the very center is the first nucleotide layer (1 nucleotide, which is an 'A', or 'C', or 'G', or 'U').

Another way of looking at it is that there are four quadrants, so using a clock analogy, one quadrant would extend from 12 o'clock to 3 o'clock, the next in circular sequence would be 3 o'clock to 6 o'clock, then on around to 6 o'clock to 9 o'clock, and finally the last quadrant would be 9 o'clock back to 12 o'clock.
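To make the layer counts concrete, the sketch below enumerates one quadrant per first nucleotide. It assumes, as one reading of Fig. 1, that the counts of 1, 4, and 16 apply to a single quadrant; the names `BASES` and `quadrants` are illustrative only.

```python
from itertools import product

BASES = "ACGU"  # the four possible nucleotides

# One quadrant per first (center) nucleotide; each quadrant fans out to
# 4 second nucleotides and then 4 x 4 = 16 third-layer positions.
quadrants = {
    first: ["".join(p) for p in product(first, BASES, BASES)]
    for first in BASES
}

for first, codons in quadrants.items():
    print(first, len(codons))

total = sum(len(codons) for codons in quadrants.values())
print(total)  # 64 codons across the four quadrants
```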

Without a 'compression code' like the one I developed, programmers who write software to process these genetic components will likely end up with 'spaghetti code'.

So, I simplified genetic component processing with a code whose first member is an uppercase character, 'A' through 'Z', which represents the amino acid.

Then the second member of 'the code' is a single digit, '0' through '5'.

With that two-member code I can reference all 20 amino acids plus the stop signal, and all 64 codons (three of which are stop codons), including the start amino acid 'M' and the stop ('Z' in my nomenclature).

The software components are, among other things, an array of amino acid information and an array of codons relating to those amino acids.
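A rough sketch of how two such arrays could fit together is shown here. It is not my production code: the digit assigned to each codon below simply follows the conventional U, C, A, G ordering of the standard genetic code, which is an assumption made for illustration, and the names `CODONS_BY_AMINO`, `encode`, and `decode` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons in U, C, A, G order; '*' (stop) is renamed 'Z'.
AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG").replace("*", "Z")

# Amino acid letter -> list of its codons; the list index is the digit.
# ASSUMPTION: this digit ordering is illustrative, not the post's actual one.
CODONS_BY_AMINO: dict[str, list[str]] = {}
for codon, amino in zip(map("".join, product("UCAG", repeat=3)), AMINO):
    CODONS_BY_AMINO.setdefault(amino, []).append(codon)


def encode(codon: str) -> str:
    """Return the two-character code (letter + digit) for a codon."""
    for amino, codons in CODONS_BY_AMINO.items():
        if codon in codons:
            return f"{amino}{codons.index(codon)}"
    raise ValueError(f"unknown codon: {codon!r}")


def decode(code: str) -> str:
    """Return the codon named by a two-character code."""
    return CODONS_BY_AMINO[code[0]][int(code[1])]


print(encode("AUG"), encode("UAG"))  # M0 Z1
print(decode("V3"))                  # GUG
```

Because each letter carries at most six codons, a single digit from 0 to 5 is enough to make every one of the 64 codons addressable.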

The effect is to cut the size of the data in the database by at least half, compared with what is required to store each nucleotide sequence and protein sequence separately.
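The arithmetic behind that saving can be sketched as follows, assuming one stored character per symbol and ignoring any annotation overhead; the function name `stored_chars` and the 300-codon example are hypothetical.

```python
def stored_chars(n_codons: int) -> tuple[int, int]:
    """Characters needed for n_codons codons under each storage scheme."""
    # Separate lists: 3 nucleotide letters plus 1 amino acid letter per codon.
    separate = n_codons * 3 + n_codons * 1
    # Two-character code: 1 letter plus 1 digit per codon.
    compressed = n_codons * 2
    return separate, compressed


separate, compressed = stored_chars(300)
print(separate, compressed, compressed / separate)  # 1200 600 0.5
```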

So, let's look at some appendices with information generated using the compressed information.

The output is the same, with no loss, and it works (Appendices 1, 2, 3, 4).

Appendix 1 is an example of the base two-character coding system.

The data in the other appendices shows partially decoded two-character mRNA in codon^protein format (e.g. AUG^M) in accord with the manner depicted in the video below.
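The decoding step into codon^protein format might look like the sketch below. The digit-to-codon assignment follows the conventional U, C, A, G ordering purely as an assumption, so the codons it prints will not necessarily match those in the appendices; `to_codon_protein` is a hypothetical name.

```python
from itertools import product

# Standard genetic code in U, C, A, G order; '*' (stop) is renamed 'Z'.
AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG").replace("*", "Z")

# ASSUMPTION: the digit is the codon's position in this illustrative ordering.
CODONS_BY_AMINO: dict[str, list[str]] = {}
for codon, amino in zip(map("".join, product("UCAG", repeat=3)), AMINO):
    CODONS_BY_AMINO.setdefault(amino, []).append(codon)


def to_codon_protein(packed: str) -> str:
    """Expand a string of letter+digit pairs into codon^protein items."""
    pairs = [packed[i:i + 2] for i in range(0, len(packed), 2)]
    return " ".join(f"{CODONS_BY_AMINO[a][int(d)]}^{a}" for a, d in pairs)


print(to_codon_protein("M0V3Z1"))  # AUG^M GUG^V UAG^Z
```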

The TATA BOX (composed of a promoter and a terminator) is not included.

Only the mRNA start (AUG) and stop codons (e.g. "UAG^Z") are depicted.

Note that these are decoded and generated from the aforementioned software code (e.g. 'A0') that I use to store DNA and mRNA data.

The graphic at Fig. 1 will help you understand.

Note: The most important issue involved in this subject matter is that the current method used in .gbff files obscures the atomic structure of the protein because it also obscures the atomic structure of some of the amino acids.

When either the one-letter or the three-letter code of an amino acid can mean several different nucleotide combinations, and therefore several different atom combinations, the real chemistry of the protein is unknown.

That also means that the typical descriptions of the dynamics of a polymerase or other 'observer' choosing a codon are not fully coherent.

For example, how can it choose which atoms/nucleotides to remove from or add to the protein/amino acid/codon mix if the current list of atoms in, for example, 'V', 'R', 'Y', 'T', or 'L' isn't known?

"Does the 'V' mean 'GUU', 'GUC', 'GUA', or 'GUG'; does the 'S' mean 'UCU', 'UCC', 'UCA', 'UCG', 'AGU', or 'AGC'?", etc.

In Appendix 1, all of the digits above '0' (e.g. "V3R3Y1T3L5") that follow the letters symbolizing the various amino acids indicate that there are multiple nucleotide combinations which can be contained in that amino acid.

But that digit also specifies exactly which codon, and therefore exactly which atoms, are in that amino acid and later in the protein. In the current nomenclature used by geneticists, a 'V' alone means guessing which one of 0, 1, 2, 3, 4, or 5 on the lists of nucleotides/atoms is contained in 'V': "Parts Is Parts" (cf. Fig. 2 here; and this table, which shows some atomic variations in detail).
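The ambiguity of a bare letter can be checked against the standard genetic code with a short sketch like this one. The table string is the conventional U, C, A, G layout of that code, with 'Z' substituted for stop; the variable names are illustrative only.

```python
from itertools import product

# Standard genetic code in U, C, A, G order; '*' (stop) is renamed 'Z'.
AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG").replace("*", "Z")

# Letter -> every codon that the bare letter could stand for.
synonyms: dict[str, list[str]] = {}
for codon, amino in zip(map("".join, product("UCAG", repeat=3)), AMINO):
    synonyms.setdefault(amino, []).append(codon)

print(synonyms["V"])  # ['GUU', 'GUC', 'GUA', 'GUG']
print(synonyms["S"])  # ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC']

# Letters that do not pin down a single codon on their own.
ambiguous = sorted(a for a, codons in synonyms.items() if len(codons) > 1)
print(len(ambiguous), "of", len(synonyms))  # 19 of 21
```

Only 'M' and 'W' map to a single codon, so every other letter needs the digit to say which codon is meant.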

Atoms matter.

Thus, the coding system I mentioned in this post is not just a space-saving technique; it is also a technique that reveals the correct atomic composition.

The previous post in this series is here (upcoming topics: Cosmic Ray Record, Southern Ocean, despotic minority 1934 France).