"The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation."
(The Human Pangenome Project, Nature, 2020). But that does not mean the data cannot be improved upon.
That is what the "pangenome" project is all about (The Human Pangenome Reference Consortium).
In a recent series, Dredd Blog took a gander at the MetaSUB project (the one that took a smorgasbord view of DNA residue at public places in "cities").
"Cities" are where most people now reside in countries all around the globe (MetaSUB, 2, 3, 4, 5, 6, 7).
The MetaSUB DNA data is likely to include bat, bird, bacteria, human, and other DNA mixed together, engendering "chimera" samples (ibid).
Anyway, today the table presented in MetaSUB-7 is updated in two ways. It now includes the human Pangenome FASTA data, and the MetaSUB column now holds DNA data (the previous MetaSUB data was mRNA, due to my mistake).
The updated table:
Codon | Australia | Egypt | Human | MetaSUB | Pangenome | Viking |
GCT | 2.05% | 1.51% | 1.10% | 0.25% | 1.68% | 2.06% |
GCC | 1.94% | 2.32% | 0.94% | 4.27% | 1.52% | 2.00% |
GCA | 1.90% | 2.03% | 1.35% | 2.52% | 1.64% | 1.97% |
GCG | 0.37% | 2.29% | 0.22% | 4.52% | 0.25% | 0.36% |
GCU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
TGT | 2.29% | 1.43% | 2.10% | 0.44% | 2.56% | 2.17% |
TGC | 0.02% | 0.13% | 0.01% | 0.03% | 0.02% | 0.03% |
UGU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UGC | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
GAT | 1.20% | 2.62% | 0.85% | 0.31% | 1.09% | 1.10% |
GAC | 0.56% | 0.88% | 0.43% | 0.56% | 0.53% | 0.69% |
GAU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
GAA | 2.39% | 2.90% | 1.91% | 6.40% | 2.21% | 2.42% |
GAG | 0.70% | 0.50% | 0.58% | 0.38% | 0.60% | 0.74% |
TTT | 1.09% | 0.43% | 1.44% | 0.03% | 1.96% | 1.16% |
TTC | 0.99% | 0.67% | 1.38% | 0.21% | 1.24% | 1.06% |
UUU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UUC | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
GGT | 1.19% | 0.93% | 0.66% | 0.27% | 1.00% | 1.07% |
GGC | 0.04% | 0.14% | 0.02% | 0.66% | 0.03% | 0.04% |
GGA | 0.22% | 0.29% | 0.16% | 0.13% | 0.20% | 0.26% |
GGG | 0.31% | 0.40% | 0.22% | 0.49% | 0.32% | 0.38% |
GGU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
CAT | 0.61% | 0.29% | 0.71% | 0.05% | 0.68% | 0.56% |
CAC | 1.24% | 0.91% | 0.89% | 0.06% | 0.97% | 1.03% |
CAU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
ATT | 0.22% | 0.10% | 0.21% | 0.08% | 0.35% | 0.19% |
ATC | 0.41% | 0.35% | 0.42% | 0.15% | 0.47% | 0.37% |
ATA | 0.27% | 0.15% | 0.34% | 0.02% | 0.51% | 0.25% |
AUU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
AUC | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
AUA | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
AAA | 1.34% | 0.48% | 1.24% | 2.01% | 1.59% | 1.19% |
AAG | 0.22% | 0.10% | 0.16% | 0.01% | 0.22% | 0.23% |
CTT | 0.36% | 0.15% | 0.21% | 0.01% | 0.42% | 0.38% |
CTC | 1.08% | 0.54% | 0.70% | 0.11% | 0.96% | 0.97% |
CTA | 0.24% | 0.16% | 0.27% | 0.07% | 0.32% | 0.28% |
CTG | 0.22% | 0.14% | 0.13% | 0.01% | 0.21% | 0.28% |
TTA | 0.12% | 0.06% | 0.13% | 0.05% | 0.19% | 0.13% |
TTG | 0.10% | 0.05% | 0.06% | 0.00% | 0.08% | 0.09% |
CUU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
CUC | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
CUA | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
CUG | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UUA | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UUG | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
ATG | 0.02% | 0.01% | 0.03% | 0.00% | 0.03% | 0.02% |
AUG | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
AAT | 0.06% | 0.03% | 0.06% | 0.00% | 0.08% | 0.05% |
AAC | 0.20% | 0.19% | 0.18% | 0.04% | 0.22% | 0.22% |
AAU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
CCT | 0.28% | 0.12% | 0.19% | 0.44% | 0.27% | 0.31% |
CCC | 0.37% | 0.26% | 0.29% | 0.02% | 0.36% | 0.41% |
CCA | 0.31% | 0.21% | 0.28% | 0.03% | 0.31% | 0.37% |
CCG | 0.05% | 0.26% | 0.04% | 0.11% | 0.05% | 0.08% |
CCU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
CAA | 0.16% | 0.11% | 0.15% | 0.02% | 0.18% | 0.16% |
CAG | 0.47% | 0.16% | 0.24% | 0.02% | 0.37% | 0.43% |
CGT | 0.08% | 0.37% | 0.07% | 0.10% | 0.08% | 0.10% |
CGC | 0.00% | 0.06% | 0.00% | 0.00% | 0.00% | 0.00% |
CGA | 0.02% | 0.14% | 0.01% | 0.01% | 0.01% | 0.01% |
CGG | 0.02% | 0.12% | 0.01% | 0.02% | 0.02% | 0.03% |
AGA | 0.19% | 0.24% | 0.25% | 0.18% | 0.27% | 0.22% |
AGG | 0.05% | 0.03% | 0.05% | 0.00% | 0.06% | 0.07% |
CGU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
TCT | 0.14% | 0.37% | 0.10% | 0.44% | 0.15% | 0.16% |
TCC | 0.09% | 0.09% | 0.07% | 0.01% | 0.10% | 0.12% |
TCA | 0.15% | 0.07% | 0.10% | 0.00% | 0.13% | 0.13% |
TCG | 0.03% | 0.53% | 0.02% | 0.01% | 0.02% | 0.02% |
AGT | 0.13% | 0.07% | 0.09% | 0.03% | 0.14% | 0.16% |
AGC | 0.00% | 0.02% | 0.00% | 0.00% | 0.00% | 0.00% |
UCU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UCC | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UCA | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UCG | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
AGU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
ACT | 0.14% | 0.05% | 0.09% | 0.03% | 0.13% | 0.12% |
ACC | 0.11% | 0.09% | 0.06% | 0.03% | 0.08% | 0.08% |
ACA | 0.12% | 0.06% | 0.12% | 0.00% | 0.14% | 0.12% |
ACG | 0.01% | 0.06% | 0.02% | 0.00% | 0.02% | 0.02% |
ACU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
GTT | 0.06% | 0.05% | 0.03% | 0.00% | 0.04% | 0.05% |
GTC | 0.04% | 0.11% | 0.03% | 0.00% | 0.04% | 0.05% |
GTA | 0.05% | 0.07% | 0.04% | 0.13% | 0.06% | 0.06% |
GTG | 0.11% | 0.06% | 0.06% | 0.00% | 0.07% | 0.09% |
GUU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
GUC | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
GUA | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
GUG | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
TGG | 0.28% | 0.11% | 0.15% | 1.99% | 0.24% | 0.27% |
UGG | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
TAT | 0.02% | 0.01% | 0.02% | 0.00% | 0.03% | 0.01% |
TAC | 0.04% | 0.04% | 0.04% | 0.01% | 0.06% | 0.05% |
UAU | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UAC | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
TAA | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
TAG | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
TGA | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UAA | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UAG | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
UGA | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
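For readers who want to try this at home, here is a minimal Python sketch of how codon percentages like those in the table can be tallied from FASTA data. It is not the program used for the table above. It assumes the sequence is read as non-overlapping triplets starting at the first base, that triplets containing other symbols (such as N) are skipped, and that each percentage is relative to all counted triplets. The function names `read_fasta_sequences` and `codon_percentages` are made up for illustration.

```python
from itertools import product


def read_fasta_sequences(lines):
    """Yield one concatenated sequence string per FASTA record."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):      # header line starts a new record
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:                    # sequence line (blank lines ignored)
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def codon_percentages(sequence, alphabet="ACGT"):
    """Return {codon: percent} for non-overlapping triplets from position 0.

    Assumption: reading frame 0 only; triplets with symbols outside
    `alphabet` are skipped and do not count toward the total.
    """
    seq = "".join(sequence.split()).upper()
    # Start every possible codon at zero so absent codons still get a row.
    counts = {"".join(c): 0 for c in product(alphabet, repeat=3)}
    total = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:
            counts[codon] += 1
            total += 1
    if total == 0:
        return {codon: 0.0 for codon in counts}
    return {codon: 100.0 * n / total for codon, n in counts.items()}


# Toy input (hypothetical record, not real data).
demo_lines = [">toy_record", "GCTGCCGAA", "GCTNNNTTT"]
demo_sequence = "".join(read_fasta_sequences(demo_lines))
demo_percentages = codon_percentages(demo_sequence)
for codon in ("GCT", "GCC", "GAA", "TTT"):
    print(f"{codon} | {demo_percentages[codon]:.2f}%")
```

Note that a DNA FASTA file spells its bases with T rather than U, so codons written with U can never be counted in DNA input, which matches the 0.00% rows in the table.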
Closing Comments
The pop culture of the upper middle class and wealthy is wild about "ancestry".
But just as GenBank and similar databases prepared by hard-core scientists have errors in them, so does the pop culture data:
"Although some people purchase kits from multiple companies, the majority of people take just one test. Each person who buys genetic analysis from Ancestry, for example, consents to having his/her data become part of Ancestry’s enormous database, which is used to perform the analyses that people pay for. There are some interesting implications to how these databases are built.
First, they are primarily made up of paying customers, which means that the vast majority of genetic datasets in Ancestry’s database come from people who have enough disposable income to purchase the kit and analysis. It may not seem like an important detail, but it shows that the comparison population is not the same as the general population.
Second, because the analyses compare the sample DNA to DNA already in the database, it matters how many people from any given area have taken the test and are in the database. An article in Gizmodo describes one family’s experience with DNA testing and some of the pitfalls. The author quotes a representative from the company 23andMe as saying, “Different companies have different reference data sets and different algorithms, hence the variance in results. Middle Eastern reference populations [for example] are not as well represented as European, an industry-wide challenge.”
The same is true for any population where not many members have taken the test for a particular company. In an interview with NPR about trying to find information about her ancestry, journalist Alex Wagner described a similar problem, saying, “There are not a lot of Burmese people taking DNA tests … and so, the results that were returned were kind of nebulous.”
Wagner’s mother and grandmother both immigrated to the US from Burma in 1965, and when Wagner began investigating her ancestry, she, both of her parents, and her grandmother, all took tests from three different direct-to-consumer DNA testing companies. To Wagner’s surprise, her mother and grandmother both had results that showed they were Mongolian, but none of the results indicated Burmese heritage. In the interview she says that one of the biggest things she learned through doing all these tests was that “a lot of these DNA test companies [are] commercial enterprises. So, they basically purchase or acquire DNA samples on market-demand.”
As it turns out, there aren’t many Burmese people taking DNA tests, so there’s not much reason for the testing companies to pursue having a robust Burmese or even Southeast Asian database of DNA."
(The Problems with Ancestry DNA Analyses). Fortunately, non-genetic historical records such as obituaries and birth records diminish the uncertainty:
"Although it has been studied for many decades, DNA is not entirely understood. There could be significant SNPs that are not evaluated or recognized as important genetic markers. You also might not inherit certain genes that show your Scandinavian heritage even if your siblings have. Even with the best DNA testing, genes are tricky and cannot tell you everything about your family. Some companies (like Ancestry.com) incorporate the use of historical records to increase their ancestry DNA accuracy."
(GenomeLink). Even noted experts in the field disagree on DNA interpretations to various degrees (some of those degrees are hot):
"A New York laboratory has cut its ties with James Watson, the Nobel prize-winning scientist who helped discover the structure of DNA, over 'reprehensible' comments in which he said race and intelligence are connected.
The Cold Spring Harbor Laboratory said it was revoking all titles and honors conferred on Watson, 90, who led the lab for many years.
The lab 'unequivocally rejects the unsubstantiated and reckless personal opinions Dr James D Watson expressed on the subject of ethnicity and genetics', its president, Bruce Stillman, and chair of the board of trustees, Marilyn Simons, said in a statement.
'Dr Watson’s statements are reprehensible, unsupported by science, and in no way represent the views of CSHL, its trustees, faculty, staff, or students. The laboratory condemns the misuse of science to justify prejudice.'"
(Guardian, 2019). The bottom line is "don't bet the farm" on DNA ancestry results until you do an exhaustive, detailed investigation.
The video below features experts who detail a lot of myths about DNA (genes) which are prevalent in various cultures.