Where's It At?
That is because DNA and RNA each have a well-defined and unique chemical fingerprint that can be used to determine the accuracy of a nucleotide sequence made up of the letters A, C, T, G, and U:
'A' = "C5H5N5" (5 Carbon, 5 Hydrogen, 5 Nitrogen)
'C' = "C4H5N3O" (4 Carbon, 5 Hydrogen, 3 Nitrogen, 1 Oxygen)
'T' = "C5H6N2O2" (5 Carbon, 6 Hydrogen, 2 Nitrogen, 2 Oxygen)
'G' = "C5H5N5O" (5 Carbon, 5 Hydrogen, 5 Nitrogen, 1 Oxygen)
'U' = "C4H4N2O2" (4 Carbon, 4 Hydrogen, 2 Nitrogen, 2 Oxygen)
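The formulas above can be turned into a small atom counter. What follows is a minimal Python sketch, not the program used to build the appendix. The names `BASE_ATOMS` and `count_atoms` are hypothetical, and any character other than A, C, G, T, or U (such as 'N') is simply skipped.

```python
from collections import Counter

# Atoms per base, copied from the formulas listed above.
BASE_ATOMS = {
    "A": {"carbon": 5, "hydrogen": 5, "nitrogen": 5},
    "C": {"carbon": 4, "hydrogen": 5, "nitrogen": 3, "oxygen": 1},
    "T": {"carbon": 5, "hydrogen": 6, "nitrogen": 2, "oxygen": 2},
    "G": {"carbon": 5, "hydrogen": 5, "nitrogen": 5, "oxygen": 1},
    "U": {"carbon": 4, "hydrogen": 4, "nitrogen": 2, "oxygen": 2},
}

def count_atoms(sequence):
    """Return total atom counts for the known bases in a sequence."""
    base_counts = Counter(sequence.upper())
    totals = Counter()
    for base, atoms in BASE_ATOMS.items():
        for atom, per_base in atoms.items():
            # Bases absent from the sequence contribute zero.
            totals[atom] += per_base * base_counts[base]
    return totals

print(dict(count_atoms("ACGT")))
```

For the four-letter sequence "ACGT" this gives 19 carbon, 21 hydrogen, 15 nitrogen, and 4 oxygen atoms, 59 in all.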
It is well known that current technology for collecting those genetic sequences is not yet perfect, but we can determine how accurate any sequence is by how well its fingerprint matches the real fingerprint.
Counting the atoms helps gauge the accuracy, or lack thereof, of the sequences collected and placed into public databases such as GenBank.
Today's appendix (Appendix 1) focuses on that issue: every sequence featured in it has one or more 'N' positions.
So, first let's consider the 'N' positions in human chromosome 1, which contains 18,475,408 of them.
Check out this official record if you don't believe it: GenBank FASTA.
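Anyone who wants to check an 'N' count independently can tally the characters in a FASTA file. The sketch below is one simple way to do that, assuming the usual FASTA layout in which header lines begin with ">". The function name `count_unknown_positions` and the tiny sample record are made up for illustration.

```python
import io

def count_unknown_positions(fasta_lines):
    """Count 'N' characters in the sequence lines of FASTA-formatted text."""
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):  # header line, not sequence data
            continue
        # upper() so that lowercase 'n' is counted as well
        total += line.upper().count("N")
    return total

# Hypothetical two-line record; a real file would be passed as open(path).
example = io.StringIO(">example record (hypothetical)\nACGTNNNN\nNNACGT\n")
print(count_unknown_positions(example))  # 6
```

Because the function reads line by line, it should also cope with a chromosome-sized file without loading the whole thing into memory.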
The tables in today's appendix are structured as follows:
Atom Type | Atom Count | Atom Percent | DNA Const | Variation From Const | Atom Count Variation | Error Percent | Plus 'N' uncert. |
carbon | 1,104,350,017 | 32.3930726 | 32.2033898 | 0.1896828 | 209,476,253 | 18.97 | 20.38% |
hydrogen | 1,219,649,224 | 35.7750580 | 35.5932203 | 0.1818377 | 221,778,214 | 18.18 | 19.53% |
nitrogen | 854,562,482 | 25.0662418 | 25.4237288 | 0.3574870 | 305,494,976 | 35.75 | 38.40% |
oxygen | 230,654,899 | 6.7656275 | 6.7796610 | 0.0140335 | 3,236,884 | 1.40 | 1.51% |
Totals | 3,409,216,622 | 100 | 100 | 0.7430410 | 739,986,327 | 21.71% | 23.32% |
Here is a description of those table parts:
Link: link to GenBank data URL
Organism: microbiology nomenclature of current genome
Nucleotide count: nucleotides (A,C,G,T,U) in GenBank sequence
'A' count: Adenine
'C' count: Cytosine
'G' count: Guanine
'T' count: Thymine
'U' count: Uracil
'N' count: unknown nucleotide type
Plus 'N' uncert.: 'N' count relative percent
Atom Type: atoms that make up DNA/RNA nucleotides
Atom Count: quantity of the atom type in the subject genome
Atom Percent: % of the atom type in the subject genome
DNA/RNA const: DNA or RNA constant of the nucleotides
Variation From Const: atom % variation from the constant
Atom Count Variation: number of atoms varying from the constant
Error Percent: variation percent
Plus 'N' uncert.: error % after 'N' uncertainty is considered
The focus of the operation to analyze those parts is to note how many atoms of each type should be in the sequence compared to how many there actually are.
Even though this will determine whether the sequence is scientifically accurate, whether it is adequate depends on the purpose for which it is being used at the time.
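The comparison of expected and actual atom shares can be sketched in a few lines of Python, using the atom counts from the example table. One assumption is built in: the DNA Const values in that table coincide with 19/59, 21/59, 15/59, and 4/59, which are the atom shares of one each of A, C, G, and T under the formulas listed earlier. That reading is inferred from the numbers and is not stated in the post. The names `atom_table` and `reference_atoms` are hypothetical. The Atom Count Variation, Error Percent, and Plus 'N' uncert. columns are not reproduced, because their formulas are not spelled out here.

```python
# Atom counts copied from the example table.
atom_counts = {
    "carbon": 1_104_350_017,
    "hydrogen": 1_219_649_224,
    "nitrogen": 854_562_482,
    "oxygen": 230_654_899,
}

# Assumption: the constant is the atom share of one A, C, G and T together
# (19 carbon, 21 hydrogen, 15 nitrogen, 4 oxygen out of 59 atoms).
reference_atoms = {"carbon": 19, "hydrogen": 21, "nitrogen": 15, "oxygen": 4}

def atom_table(counts, reference):
    """Return {atom: (percent, constant, variation)} in percentage points."""
    total = sum(counts.values())
    ref_total = sum(reference.values())
    rows = {}
    for atom, count in counts.items():
        percent = 100 * count / total
        const = 100 * reference[atom] / ref_total
        rows[atom] = (percent, const, abs(percent - const))
    return rows

for atom, (percent, const, variation) in atom_table(atom_counts, reference_atoms).items():
    print(f"{atom:9s} {percent:.7f} {const:.7f} {variation:.7f}")
```

Under that assumption, the printed values should closely match the Atom Percent, DNA Const, and Variation From Const columns of the table.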
Different strokes for different folks?
The previous post in this series is here.
AI Overview
Grammatical Rules for DNA Sequence Representation
"In a genetic sequence, the letter "N" signifies that the nucleotide base at that specific position could not be identified during DNA sequencing. It represents an unknown or ambiguous base, meaning any of the four DNA bases (adenine, guanine, cytosine, or thymine) could potentially occupy that spot. This ambiguity often arises from low-quality sequence data or technical limitations during the sequencing process, such as hairpin loops or overlapping traces."