Saturday, May 31, 2025

Genetic Constants In DNA and RNA

What?

I. Problems? What Problems?

What is the cause of difficulties in analyzing DNA and RNA sequences?

The accuracy of collected samples is a serious problem:

"A computer-aided analysis of almost 12,000 human-genetics papers has found more than 700 studies with errors in the DNA or RNA sequences of their experimental reagents. That amounts to a “problem of alarming proportions”, because it suggests that a worrying fraction of studies on human genes are not reliable, says the team that conducted the analysis, led by cancer researcher Jennifer Byrne at the University of Sydney in Australia."

(Nature, "Errors in genetic sequences", 2021; cf. this). But beyond that there is also:

"The problem of the creation of numerical constants has haunted the Genetic Programming (GP) community for a long time and is still considered one of the principal open research issues. Many problems tackled by GP include finding mathematical formulas, which often contain numerical constants."

(Creation of Numerical Constants in Robust Gene Expression Programming). As a Dylan lyric puts it, "You can't win with a losing hand" (Things Have Changed).

And since the "dealers who hand out the cards" are dealing 'cards' that are the hundreds of thousands of genetic sequences (e.g. GenBank), it would be nice to have a simple (by comparison) method of scanning the sequence data to find deviations from official values.

II. A Place To Begin

The basics to take note of are that DNA sequences are composed of "ACTG" nucleotides, while RNA sequences are composed of "ACUG" nucleotides.



DNA:

(Adenine)  'A' = "C5H5N5" (5 Carbon, 5 Hydrogen, 5 Nitrogen)

(Cytosine) 'C' = "C4H5N3O1" (4 Carbon, 5 Hydrogen, 3 Nitrogen, 1 Oxygen)

(Thymine) 'T' = "C5H6N2O2" (5 Carbon, 6 Hydrogen, 2 Nitrogen, 2 Oxygen)

(Guanine)  'G' = "C5H5N5O1" (5 Carbon, 5 Hydrogen, 5 Nitrogen, 1 Oxygen)

The four DNA bases together contain 59 atoms (15 + 13 + 15 + 16)


Carbon Atoms in DNA molecules
(Adenine 5, Cytosine 4, Thymine 5, Guanine 5) = 19
19÷59 = 0.322033898
a DNA molecule is 32.2033898 percent Carbon

Hydrogen Atoms in DNA molecules
(Adenine 5, Cytosine 5, Thymine 6, Guanine 5) = 21
21÷59 = 0.355932203
a DNA molecule is 35.5932203 percent Hydrogen

Nitrogen Atoms in DNA molecules
(Adenine 5, Cytosine 3, Thymine 2, Guanine 5) = 15
15÷59 = 0.254237288
a DNA molecule is 25.4237288 percent Nitrogen

Oxygen Atoms in DNA molecules
(Adenine 0, Cytosine 1, Thymine 2, Guanine 1) = 4
4÷59 = 0.06779661
a DNA molecule is 6.779661 percent Oxygen

Thus, the DNA Genetic Constant:
32.2033898 + 35.5932203 + 25.4237288 + 6.779661 = 99.9999999
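
The arithmetic above can be reproduced in a few lines of Python. This is only a sketch, not code from the earlier posts: the names DNA_BASE_FORMULAS, parse_formula, element_totals and to_percent are made up for illustration. It parses each base's formula, adds up the atoms per element, and divides by the total.

```python
import re
from collections import Counter

# Formulas of the four DNA bases, exactly as listed above.
DNA_BASE_FORMULAS = {
    "A": "C5H5N5",
    "C": "C4H5N3O1",
    "T": "C5H6N2O2",
    "G": "C5H5N5O1",
}

def parse_formula(formula):
    """Turn a formula such as 'C5H6N2O2' into Counter({'H': 6, 'C': 5, ...})."""
    atoms = Counter()
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        atoms[element] += int(digits) if digits else 1
    return atoms

def element_totals(formulas):
    """Atoms per element across one copy of every base in `formulas`."""
    totals = Counter()
    for formula in formulas.values():
        totals += parse_formula(formula)
    return totals

def to_percent(totals):
    """Convert atom totals into percentages of the overall atom count."""
    total_atoms = sum(totals.values())
    return {element: 100 * n / total_atoms for element, n in totals.items()}

totals = element_totals(DNA_BASE_FORMULAS)
percent = to_percent(totals)
print(sum(totals.values()))  # 59
for element in "CHNO":
    print(element, totals[element], round(percent[element], 7))
```

Swapping the 'T' entry for 'U' = "C4H4N2O2" gives the RNA version of the same calculation.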



RNA:

(Adenine)  'A' = "C5H5N5" (5 Carbon, 5 Hydrogen, 5 Nitrogen)

(Cytosine) 'C' = "C4H5N3O1" (4 Carbon, 5 Hydrogen, 3 Nitrogen, 1 Oxygen)

(Uracil) 'U' = "C4H4N2O2" (4 Carbon, 4 Hydrogen, 2 Nitrogen, 2 Oxygen)

(Guanine)  'G' = "C5H5N5O1" (5 Carbon, 5 Hydrogen, 5 Nitrogen, 1 Oxygen)

The four RNA bases together contain 56 atoms (15 + 13 + 12 + 16)


Carbon Atoms in RNA molecules
(Adenine 5, Cytosine 4, Uracil 4, Guanine 5) = 18
18÷56 = 0.321428571
an RNA molecule is 32.1428571 percent Carbon

Hydrogen Atoms in RNA molecules
(Adenine 5, Cytosine 5, Uracil 4, Guanine 5) = 19
19÷56 = 0.339285714
an RNA molecule is 33.9285714 percent Hydrogen


Nitrogen Atoms in RNA molecules
(Adenine 5, Cytosine 3, Uracil 2, Guanine 5) = 15
15÷56 = 0.267857143 
an RNA molecule is 26.7857143 percent Nitrogen

Oxygen Atoms in RNA molecules
(Adenine 0, Cytosine 1, Uracil 2, Guanine 1) = 4
4÷56 = 0.071428571
an RNA molecule is 7.1428571 percent Oxygen

Thus, the RNA Genetic Constant:
32.1428571 + 33.9285714 + 26.7857143 + 7.1428571 = 99.9999999
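
Both sums come out as 99.9999999 rather than 100 only because each percentage was rounded to seven decimal places before adding. A small Python sketch (illustrative, not from the earlier posts; the names DNA_ATOMS, RNA_ATOMS and exact_percentages are hypothetical) shows this by keeping the percentages as exact fractions:

```python
from fractions import Fraction

# Element totals tallied above for one copy of each base.
DNA_ATOMS = {"C": 19, "H": 21, "N": 15, "O": 4}  # 59 atoms
RNA_ATOMS = {"C": 18, "H": 19, "N": 15, "O": 4}  # 56 atoms

def exact_percentages(atoms):
    """Percent of each element as an exact fraction (no rounding)."""
    total = sum(atoms.values())
    return {element: Fraction(100 * n, total) for element, n in atoms.items()}

for name, atoms in (("DNA", DNA_ATOMS), ("RNA", RNA_ATOMS)):
    exact = exact_percentages(atoms)
    # Sum of the values after rounding each one to 7 decimals, as in the text.
    rounded_sum = sum(round(float(p), 7) for p in exact.values())
    print(name, "exact sum:", sum(exact.values()), "rounded sum:", rounded_sum)
```

The exact sum is 100 in both cases, so the missing 0.0000001 is a rounding artifact and not a property of the sequences.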



III. Using These Constants

There are many examples of the use of these constants in previous series, including both DNA (On The Origin Of A Genetic Constant, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13), and RNA (On The Origin Of Another Genetic Constant, 2, 3, 4, 5).

The basic process, using GenBank "FASTA" format sequences, is to:

1) load the entire sequence into your software analyzer

2) count the individual 'A', 'C', 'T', 'G' letters in that sequence (for DNA; 'U' takes the place of 'T' for RNA)

3) convert those letter counts into atom counts and sum them for each atom type (carbon, hydrogen, nitrogen, oxygen)

4) divide each atom type's sum by the total atom count (section IV. below)
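
Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration and not the analyzer used for the earlier posts; count_nucleotides is a hypothetical name, and the short in-memory sample stands in for a real GenBank FASTA file (for a real file, pass an open file handle instead).

```python
import io
from collections import Counter

def count_nucleotides(fasta_lines):
    """Count each letter in the sequence lines of a FASTA stream.

    Header lines (starting with '>') and blank lines are skipped. Letters are
    upper-cased so that lower-case (soft-masked) bases are counted as well.
    """
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        counts.update(line.upper())
    return counts

# Tiny stand-in for a FASTA download; the header text is made up.
sample = io.StringIO(">demo header (hypothetical)\nTAACCCTAAC\nCCTAAACCCT\n")
counts = count_nucleotides(sample)
print(counts["A"], counts["C"], counts["G"], counts["T"])  # 7 9 0 4
```

Reading line by line keeps memory use low even for chromosome-sized files, since only the running counts are held.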

A general example is found in the appendix of a previous post, where the results of a scan of Cuculus canorus are detailed (Appendix Cuckoo Chromosomes).

Here is a section from that large sequence in FASTA format:

>NC_071419.1 Cuculus canorus isolate bCucCan1 chromosome 19
TAACCCTAACCCTAAACCCTAAGCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCAAACCCATAAC
CTACCCTAACCCTAACCCTAACCCTAACCATAAACCTAACCCCTAACCCTAAACCCTAAACCCTAACCCG
AACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAAAC
CCTAACCACTAAAACCCTAACCCTAACCCTAAACCCTAACCCTAACCCTAACCCTAACCTAACCCTACCA
CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTACCCTACCCCTAA
CCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAA
CCCCTAACCCTAACCCTAAACCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCT
AACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
...

The process is to consider each 'A' 'C' 'G' and 'T' in DNA sequences, or each 'A' 'C' 'G' and 'U' in RNA sequences, and tally the total atom count:

each (Adenine) 'A' = "C5H5N5" (5 Carbon, 5 Hydrogen, 5 Nitrogen);

each (Cytosine) 'C' = "C4H5N3O1" (4 Carbon, 5 Hydrogen, 3 Nitrogen, 1 Oxygen);

each (Guanine) 'G' = "C5H5N5O1" (5 Carbon, 5 Hydrogen, 5 Nitrogen, 1 Oxygen);

each (Thymine) 'T' = "C5H6N2O2" (5 Carbon, 6 Hydrogen, 2 Nitrogen, 2 Oxygen), or for RNA each (Uracil) 'U' = "C4H4N2O2" (4 Carbon, 4 Hydrogen, 2 Nitrogen, 2 Oxygen).
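
The tally can be sketched in Python like this. It is an illustration under stated assumptions, not the code behind the appendices: ATOMS_PER_BASE and tally_atoms are hypothetical names, and letters without a formula (such as 'N') are simply skipped and counted separately, which is one possible choice and not something the posts prescribe.

```python
from collections import Counter

# Atoms per base, taken from the formulas listed above.
ATOMS_PER_BASE = {
    "A": {"C": 5, "H": 5, "N": 5, "O": 0},
    "C": {"C": 4, "H": 5, "N": 3, "O": 1},
    "G": {"C": 5, "H": 5, "N": 5, "O": 1},
    "T": {"C": 5, "H": 6, "N": 2, "O": 2},
    "U": {"C": 4, "H": 4, "N": 2, "O": 2},  # RNA only
}

def tally_atoms(sequence):
    """Return (atom totals per element, number of letters skipped)."""
    atoms = Counter()
    skipped = 0
    for letter, n in Counter(sequence.upper()).items():
        base = ATOMS_PER_BASE.get(letter)
        if base is None:  # e.g. 'N': no formula, so no atoms are added
            skipped += n
            continue
        for element, per_base in base.items():
            atoms[element] += per_base * n
    return atoms, skipped

# First line of the Cuculus canorus excerpt above.
line = "TAACCCTAACCCTAAACCCTAAGCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCAAACCCATAAC"
atoms, skipped = tally_atoms(line)
total = sum(atoms.values())
for element in "CHNO":
    print(element, atoms[element], f"{100 * atoms[element] / total:.2f}%")
```

A single line of telomere-like repeats is far too short and too biased to compare against the constants; it only shows the mechanics.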

Sometimes there are 'N' letters in the sequence:

"How to handle 'N' in Nucleotide/Genes Sequences retrieved from NCBI GeneBank?

We know that the four native bases for DNA are AGTC, however, some of the sequences, retrieved from NCBI, contain letter 'N', which illustrates that these nucleotide bases are not deciphered correctly, leaving an unidentified nucleotide. Should I replace N with any other base i.e. AGTC, assuming N can be any nucleotide, or I should exclude such sequences assuming that the sequencing done was not of good quality. If none of these, what I can do with such sequences in my dataset?
P.S I can't find any help in Entrez Sequences Help Catalog."

(ResearchGate, cf. SEQanswers). The number of unknown nucleotides ('N') is usually small, but it can be a large enough percentage of the sequence to require more analysis.
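
One simple way to see whether 'N' matters for a given sequence is to measure its share of all letters. The Python sketch below is only an illustration; unknown_fraction is a hypothetical helper, and how large a share is "too large" is left to the reader.

```python
from collections import Counter

def unknown_fraction(sequence, known="ACGTU"):
    """Fraction of letters in `sequence` that are not in `known` (e.g. 'N')."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    unknown = sum(n for letter, n in counts.items() if letter not in known)
    return unknown / total if total else 0.0

print(f"{unknown_fraction('ACGTNNACGT'):.1%}")  # 20.0%
```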

IV. ... And Then

When all is said and done, useful percentages can be derived by dividing each atom type's count by the total atom count, "188,585,905":

carbon count: 60,816,461 ÷ 188,585,905 = 32.25%

hydrogen count: 67,198,608 ÷ 188,585,905 = 35.63%

nitrogen count: 47,829,318 ÷ 188,585,905 = 25.36%

oxygen count: 12,741,518 ÷ 188,585,905 = 6.76%

As shown in the following "Table 1" from that appendix:



Full Genome
Table: 1
Link and genome info: NC_071419.1
Cuculus canorus isolate bCucCan1 chromosome 19
bCucCan1.pri, whole genome shotgun sequence


Nucleotide count: 12,779,222
'N' Nucleotide Count: 300

Atom        Atom Count    Percent
Carbon      60,816,461      32.25
Hydrogen    67,198,608      35.63
Nitrogen    47,829,318      25.36
Oxygen      12,741,518       6.76
Totals     188,585,905     100.00
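
The table can be checked with a short Python sketch. The counts are the ones reported in Table 1; the variable names are made up for illustration.

```python
# Atom counts reported in Table 1 for NC_071419.1.
ATOM_COUNTS = {
    "Carbon": 60_816_461,
    "Hydrogen": 67_198_608,
    "Nitrogen": 47_829_318,
    "Oxygen": 12_741_518,
}
NUCLEOTIDES = 12_779_222
UNKNOWN_N = 300

total_atoms = sum(ATOM_COUNTS.values())
percents = {name: round(100 * count / total_atoms, 2)
            for name, count in ATOM_COUNTS.items()}

print(f"{total_atoms:,}")  # 188,585,905
for name, value in percents.items():
    print(f"{name:9s} {value:6.2f}")
print(f"'N' share of nucleotides: {100 * UNKNOWN_N / NUCLEOTIDES:.4f}%")
```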

Finally, after taking note of the 'N' count ("'N' Nucleotide Count: 300") and its percentage, the degree to which the sequence percents deviate from the official percents can be considered:

"the DNA Genetic Constant:
32.2033898 + 35.5932203 + 25.4237288 + 6.779661"

So, we can calculate the deviation (sequence percent minus official percent, in percentage points) as:

carbon: 32.25 - 32.2033898 = +0.0466102;

hydrogen: 35.63 - 35.5932203 = +0.0367797;

nitrogen: 25.36 - 25.4237288 = -0.0637288;

oxygen: 6.76 - 6.779661 = -0.019661.
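
The same comparison can be sketched in Python. This is an illustration, not code from the earlier posts: the function name deviations is hypothetical, and the relative figure (the difference as a percent of the official value) is an extra added here for context. The differences themselves are in percentage points.

```python
# Official shares from the base formulas (atoms of each element out of 59).
EXPECTED = {
    "carbon": 100 * 19 / 59,
    "hydrogen": 100 * 21 / 59,
    "nitrogen": 100 * 15 / 59,
    "oxygen": 100 * 4 / 59,
}
# Observed shares from Table 1 (already rounded to two decimals there).
OBSERVED = {"carbon": 32.25, "hydrogen": 35.63, "nitrogen": 25.36, "oxygen": 6.76}

def deviations(observed, expected):
    """Return {element: (difference in percentage points, relative difference in %)}."""
    result = {}
    for element, exp in expected.items():
        diff = observed[element] - exp
        result[element] = (diff, 100 * diff / exp)
    return result

for element, (points, relative) in deviations(OBSERVED, EXPECTED).items():
    print(f"{element:9s} {points:+.7f} points  {relative:+.4f}% relative")
```

Because the observed values are rounded to two decimals, differences much smaller than 0.005 points cannot be resolved from the table alone; the raw atom counts would be needed for that.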

The appendices in the On The Origin Of A Genetic Constant and On The Origin Of Another Genetic Constant series contain thousands of such examples taken from scores of different flora, fauna, humans, viruses, and microbes.

V. Closing Comments

The quality required in genetic sequences will vary with the type of project being considered.

For example, DNA in a murder case in a criminal court would seem to require more accuracy in the sequences at issue than determining DNA content of ancient mummies would.

Anyway, give me a shout at dreddblog@gmail.com if need be.

(Thanks to Christie L. Mills for editing this post, and others, for grammar).

The next post in this series is here.
