Thursday, May 4, 2023

MetaSUB - 6

Fig. 1 The Zone
Today's appendix (MS Appendix Codons) shows the results of one way to extract codons from MetaSUB's sequences published in government databases.

I search for the Genetic Code's 64 in-frame codons and extract them as genome segments between a 'promoter' sequence and a 'terminator' sequence (Fig. 1, see the video below too). 
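For readers curious about the mechanics, here is a minimal sketch of that kind of in-frame extraction in Python. It is only an illustration of the general idea, not the exact code behind the appendix: ATG stands in for the 'promoter' marker and the three stop codons (TAA, TAG, TGA) stand in for the 'terminator' sequences.

# Minimal sketch of in-frame codon extraction (an illustration of the idea,
# not the appendix's actual code). ATG stands in for the 'promoter' marker
# and TAA/TAG/TGA stand in for the 'terminator' sequences.
STOPS = {"TAA", "TAG", "TGA"}

def extract_in_frame_codons(seq, promoter="ATG", terminators=STOPS):
    """Each time the promoter marker is found, read three bases at a time
    until a terminator codon (or the end of the sequence) is reached."""
    segments = []
    i = seq.find(promoter)
    while i != -1:
        codons = []
        for j in range(i, len(seq) - 2, 3):
            codon = seq[j:j + 3]
            codons.append(codon)
            if codon in terminators:
                break
        segments.append(codons)
        i = seq.find(promoter, i + 3)
    return segments

print(extract_in_frame_codons("CCATGAAACCCGGGTAGTT"))
# [['ATG', 'AAA', 'CCC', 'GGG', 'TAG']]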

The reason for extracting codons that way, among other things, is the format MetaSUB uses.

Here is a sample of "their FASTA" format, which I concatenate into one large sequence before analyzing each of their files (a sketch of that concatenation step follows the sample):

"@SRR10153697.1.1 1 length=250
TAGGGAATCTTCCGCAATGGGCGAAAGC...

@SRR10153697.2.1 2 length=250
TAGGGAATCTTCCGCAATGGGCGAAAGC...

@SRR10153697.3.1 3 length=250
TAGGGAATCTTCCGCAATGGGCGAAAGC..."
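A minimal sketch of that concatenation step (my reading of it, not the appendix's actual script) looks like this; the file name in the comment is hypothetical:

def concatenate_reads(path):
    # Skip the "@SRR... length=250" header lines and blank lines, then join
    # the remaining 250-base lines into one long string of ACGT characters.
    bases = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("@"):
                continue
            bases.append(line)
    return "".join(bases)

# big_seq = concatenate_reads("SRR10153697.txt")  # hypothetical file name
# print(len(big_seq))  # tens of millions of characters for a full file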

I noticed that 250 cannot be divided by 3 without a remainder (250 / 3 = 83 with 1 left over), which means each segment in their list contains out-of-frame material.
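The remainder is easy to confirm (a trivial check, nothing MetaSUB-specific):

print(divmod(250, 3))  # (83, 1): 83 whole codons per segment, plus 1 leftover base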

After making one huge sequence out of their 250-base segments (e.g. 20, 30, 40-plus million ACGT characters), I blank out each valid three-nucleotide codon location (millions of them).

When I have finished that, the bases that remain are "out-of-frame", i.e., 'erroneous elements' (mutants? or chimera?).
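Here is a minimal sketch of that blank-out-and-count step (again my reading of the general idea, using the same stand-in ATG and stop-codon markers as above, not the appendix's actual code). The percentage printed at the end is the kind of out-of-frame figure the appendix tabulates.

STOPS = {"TAA", "TAG", "TGA"}

def blank_valid_codons(seq, promoter="ATG", terminators=STOPS):
    # Blank every valid in-frame codon location found after a start marker,
    # then count the bases that were never covered ("out-of-frame" leftovers).
    mask = list(seq)
    i = seq.find(promoter)
    while i != -1:
        for j in range(i, len(seq) - 2, 3):
            codon = seq[j:j + 3]
            mask[j:j + 3] = "---"
            if codon in terminators:
                break
        i = seq.find(promoter, i + 3)
    blanked = "".join(mask)
    leftover = sum(1 for base in blanked if base != "-")
    return blanked, leftover

blanked, leftover = blank_valid_codons("CCATGAAACCCGGGTAGTT")
print(blanked, leftover, round(100 * leftover / len(blanked), 2))
# CC---------------TT 4 21.05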

The appendix details how many valid codons and how many out-of-frame bases were in each MetaSUB file on NCBI's SRA database (I used 33 of their files).

It also specifies the percentage of those out-of-frame elements, and those percentages seem very mechanical rather than random, even though randomness is what one would expect of random samples taken at random locations around the globe.

Which leads me to suspect that their use of software, whether AI or otherwise, is involved in their data production.

They indicate that they use "shotgun sequencing", as I pointed out in the previous post in this series:

"Imagine taking a page in a book, making hundreds of copies of it, taking those copies and putting them in a paper shredder that magically creates strips of paper that contain various bits and pieces of the original sentences on the starting page. Then imagine reading each of those strips containing the bits and pieces of the sentences, looking for overlapping words and phrases, and eventually being able to reconstruct the entire text on the starting page by having read enough of those shreds of paper. In essence, that is shotgun DNA sequencing. Instead of a page of a book, a scientist starts with a piece of clone DNA or even an entire genome. The DNA is broken apart and many many many many many sequence reads are generated from the DNA pieces. All of those data are analyzed by a computer program that looks for overlapping stretches of sequence and eventually puts that DNA puzzle back together."

(MetaSUB - 5). Another location points out that:

"In reality, this process uses enormous amounts of information that are rife with ambiguities and sequencing errors."

(Wikipedia, Shotgun Sequencing). Add to that machine learning/AI, and chimera are well hidden as part of the noise:

"Proposed approaches show high accuracy of prediction, but require careful inspection before making any decisions due to sample noise or complexity."

(MetaSUB Machine Learning). One would think that there is only one right way to "shotgun sequence" a genome, especially with mathematically oriented software.

But there seem to be lots of different programs for doing that in more than one way (Github genome-sequencing, shotgun sequencing).

The long and short of it is that sequencing the world's public places should follow the mastering of more stable sequencing locations (microbes, viruses, plants, one human, one bat, etc.) ... you get my drift.

The next post in this series is here, the previous post in this series is here.

1 comment:

  1. The genome business seems big on AI doing genetics (Link).
