A Short Introduction to Algebraic Taxonomy on Genes Regions

The Binary Alphabet of DNA

On the DNA Computer Binary Code

A partial order, like a binary operation, can be defined on any finite set in many different ways. Here, a partial order is defined on the set of four DNA bases in such a way that a Boolean lattice structure is obtained. A Boolean lattice is an algebraic structure that captures essential properties of both set operations and logic operations. The partial order is based on two physico-chemical properties of the DNA bases: the number of hydrogen bonds and the chemical type, purine {A, G} or pyrimidine {U, C}. This physico-mathematical description permits the study of the genetic information carried by DNA molecules as a computer binary code of zeros (0) and ones (1).

1. Boolean lattice of the four DNA bases

In any four-element Boolean lattice every element is comparable to every other, except two of them that are, nevertheless, complementary. Consequently, to build a four-base Boolean lattice it is necessary for the bases with the same number of hydrogen bonds in the DNA molecule and of different chemical types to be complementary elements in the lattice. In other words, the complementary bases in the DNA molecule (G≡C and A=T, or A=U during the translation of mRNA) should be complementary elements in the Boolean lattice. Thus, there are four possible lattices, each one with a different base as the maximum element.

2. Boolean (logic) operations in the set of DNA bases

The Boolean algebra on the set of elements X will be denoted by $(B(X), \vee, \wedge)$. Here the operators $\vee$ and $\wedge$ represent the classical “OR” and “AND” logical operations, applied term by term. From the Boolean algebra definition it follows that this structure is (among other things) a partially ordered set in which any two elements $\alpha$ and $\beta$ have upper and lower bounds. In particular, the greatest lower bound of the elements $\alpha$ and $\beta$ is the element $\alpha\wedge\beta$ and the least upper bound is the element $\alpha\vee\beta$. This equivalent partially ordered set is called a Boolean lattice.

  • In every Boolean algebra (denoted by $(B(X), \vee, \wedge)$), for any two elements $\alpha,\beta \in X$ we have $\alpha \le \beta$ if and only if $\neg\alpha\vee\beta=1$, where the symbol “$\neg$” stands for logic negation. If the last equality holds, then it is said that $\beta$ is deduced from $\alpha$. Furthermore, if $\alpha \le \beta$ or $\alpha \ge \beta$, the elements $\alpha$ and $\beta$ are said to be comparable. Otherwise, they are said not to be comparable.

In the set of four DNA bases, we can build twenty-four isomorphic Boolean lattices [1]. Herein, we focus our attention on the one described in reference [2], where the DNA bases C and G are taken as the maximum and minimum elements, respectively, of the Boolean lattice. The logic operations in this DNA computer code are given in the following table:

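OR

| $\vee$ | G | A | U | C |
|---|---|---|---|---|
| G | G | A | U | C |
| A | A | A | C | C |
| U | U | C | U | C |
| C | C | C | C | C |

AND

| $\wedge$ | G | A | U | C |
|---|---|---|---|---|
| G | G | G | G | G |
| A | G | A | G | A |
| U | G | G | U | U |
| C | G | A | U | C |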

It is well known that all Boolean algebras with the same number of elements are isomorphic. Therefore, our algebra $(B(X), \vee, \wedge)$ is isomorphic to the Boolean algebra $(\mathbb{Z}_2^2(X), \vee, \wedge)$, where $\mathbb{Z}_2 = \{0,1\}$. Then, we can represent this DNA Boolean algebra by means of the correspondence: $G \leftrightarrow 00$; $A \leftrightarrow 01$; $U \leftrightarrow 10$; $C \leftrightarrow 11$. So, in accordance with the operation table:

  • $A \vee U = C \leftrightarrow 01 \vee 10 = 11$
  • $U \wedge G = G \leftrightarrow 10 \wedge 00 = 00$
  • $G \vee C = C \leftrightarrow 00 \vee 11 = 11$

The logic negation ($\neg$) of a base yields the DNA complementary base: $\neg A = U \leftrightarrow \neg 01 = 10$;  $\neg G = C \leftrightarrow \neg 00 = 11$
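The correspondence above can be checked mechanically. The following is a minimal sketch, assuming the encoding G ↔ 00, A ↔ 01, U ↔ 10, C ↔ 11 given in the text; the function names `base_or`, `base_and` and `base_not` are illustrative only and do not come from the cited references.

```python
# Assumed 2-bit encoding from the text: G <-> 00, A <-> 01, U <-> 10, C <-> 11.
BITS = {"G": 0b00, "A": 0b01, "U": 0b10, "C": 0b11}
BASE = {bits: base for base, bits in BITS.items()}

def base_or(x, y):
    """Boolean OR of two bases, computed bit by bit."""
    return BASE[BITS[x] | BITS[y]]

def base_and(x, y):
    """Boolean AND of two bases, computed bit by bit."""
    return BASE[BITS[x] & BITS[y]]

def base_not(x):
    """Negation flips both bits, which yields the complementary base."""
    return BASE[BITS[x] ^ 0b11]

print(base_or("A", "U"))             # C
print(base_and("U", "G"))            # G
print(base_not("A"), base_not("G"))  # U C
```

Looping `base_or` and `base_and` over all pairs of bases reproduces the OR and AND rows of the operation table.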

  • A Boolean lattice has in correspondence a directed graph called Hasse diagram, where two nodes (elements) $\alpha$ and $\beta$ are connected with a directed edge from $\alpha$ to $\beta$ (or connected with a directed edge from $\beta$ to $\alpha$) if, and only if, $\alpha \le \beta$ ($\alpha \ge \beta$) and there is no other element between $\alpha$ and $\beta$.

The figure shows the Hasse diagram corresponding to the Boolean algebra $(B(X), \vee, \wedge)$. There are twenty-four possible Hasse diagrams of four DNA bases, and they form a group isomorphic to the symmetric group of degree four $S_4$ [1].
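As an illustration, the order criterion $\neg\alpha\vee\beta=1$ and the edges of the Hasse diagram can be computed directly. This is a sketch under the assumption that G is the bottom element (00) and C the top element (11), as in the operation table; the helper names `leq`, `comparable` and `covers` are hypothetical.

```python
from itertools import product

# Assumed encoding: G <-> 00 (bottom), A <-> 01, U <-> 10, C <-> 11 (top).
BITS = {"G": 0b00, "A": 0b01, "U": 0b10, "C": 0b11}

def leq(x, y):
    """x <= y if and only if (NOT x) OR y equals the top element 11."""
    return ((BITS[x] ^ 0b11) | BITS[y]) == 0b11

def comparable(x, y):
    return leq(x, y) or leq(y, x)

def covers():
    """Pairs (x, y) with x < y and no third element strictly between them."""
    edges = []
    for x, y in product(BITS, repeat=2):
        if x != y and leq(x, y):
            between = [z for z in BITS
                       if z not in (x, y) and leq(x, z) and leq(z, y)]
            if not between:
                edges.append((x, y))
    return edges

print(comparable("A", "U"))  # False
print(sorted(covers()))      # [('A', 'C'), ('G', 'A'), ('G', 'U'), ('U', 'C')]
```

The four edges form the familiar diamond: G at the bottom, the non-comparable pair A and U in the middle, and C at the top.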

3. The Genetic code Boolean Algebras

Boolean algebras of codons are explicitly derived as the direct product $C(X) = B(X) \times B(X) \times B(X)$. These algebras are isomorphic to the dual Boolean algebras $(\mathbb{Z}_2^6, \vee, \wedge)$ and $(\mathbb{Z}_2^6, \wedge, \vee)$ induced by the isomorphism $B(X) \cong \mathbb{Z}_2^2$, where $X$ runs over the twenty-four possible ordered sets of four DNA bases [1]. For example:

CAG $\vee$ AUC = CCC $\leftrightarrow$ 110100 $\vee$ 011011 = 111111

ACG $\wedge$ UGA = GGG $\leftrightarrow$ 011100 $\wedge$ 100001 = 000000

$\neg$ (CAU) = GUA $\leftrightarrow$ $\neg$ (110110) = 001001

The Hasse diagram for the corresponding Boolean algebra derived from the direct product of the Boolean algebra of four DNA bases given in the above operation table is:

In the Hasse diagram, chains and anti-chains are located. A Boolean lattice subset is called a chain if any two of its elements are comparable but, on the contrary, if any two of its elements are not comparable, the subset is called an anti-chain. In the Hasse diagram of codons shown in the figure, all chains with maximal length have the same minimum element GGG and the maximum element CCC. It is evident that two codons are in the same chain with maximal length if and only if they are comparable, for example the chain: GGG $\leftrightarrow$ GAG $\leftrightarrow$ AAG $\leftrightarrow$ AAA $\leftrightarrow$ AAC $\leftrightarrow$ CAC $\leftrightarrow$ CCC

The Hasse diagram symmetry reflects the role of hydrophobicity in the distribution of codons assigned to each amino acid. In general, codons that code for amino acids with extreme hydrophobic differences are in different chains of maximal length. In particular, a chain of maximal length that contains codons with U as the second base contains no codon with A as the second base, because U and A are not comparable. For that reason, it is impossible to obtain a hydrophobic amino acid encoded by codons with U in the second position through deductions from hydrophilic amino acids encoded by codons with A in the second position.
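The separation between codons with U and codons with A in the second position can be confirmed by exhaustive enumeration. This is a minimal sketch under the same assumed encoding (G ↔ 00, A ↔ 01, U ↔ 10, C ↔ 11); it only tests comparability in the lattice and says nothing about hydrophobicity itself.

```python
from itertools import product

# Assumed encoding: G <-> 00, A <-> 01, U <-> 10, C <-> 11.
BITS = {"G": 0b00, "A": 0b01, "U": 0b10, "C": 0b11}

def encode(codon):
    """Codon -> 6-bit integer, two bits per base."""
    n = 0
    for base in codon:
        n = (n << 2) | BITS[base]
    return n

def comparable(x, y):
    """Two codons are comparable if one bit pattern contains the other."""
    a, b = encode(x), encode(y)
    return (a & b) == a or (a & b) == b

codons = ["".join(p) for p in product("GAUC", repeat=3)]
second_u = [c for c in codons if c[1] == "U"]
second_a = [c for c in codons if c[1] == "A"]

# Comparable pairs mixing a U-second codon with an A-second codon.
mixed_pairs = [(x, y) for x in second_u for y in second_a if comparable(x, y)]
print(len(second_u), len(second_a), len(mixed_pairs))  # 16 16 0
```

No mixed pair is comparable, so no chain can contain both kinds of codon.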

There are twenty-four Hasse diagrams of codons, corresponding to the twenty-four genetic-code Boolean algebras. These algebras form a group isomorphic to the symmetric group of degree four $S_4$ [1]. In summary, the DNA binary code is not arbitrary, but subject to logic operations with underlying biophysical meaning.

References

  1. Sanchez R. Symmetric Group of the Genetic-Code Cubes. Effect of the Genetic-Code Architecture on the Evolutionary Process. MATCH Commun Math Comput Chem, 2018, 79:527–60.
  2. Sánchez R, Morgado E, Grau R. A genetic code Boolean structure. I. The meaning of Boolean deductions. Bull Math Biol, 2005, 67:1–14.

The genetic-code vector space B^3 over the Galois field GF(5)

The $\mathbb{Z}_5$-vector space $\mathfrak{B}^3$ over the field $(\mathbb{Z}_5, +, \cdot)$

1. Background

This is a formal introduction to the genetic code $\mathbb{Z}_5$-vector space $\mathfrak{B}^3$ over the field $(\mathbb{Z}_5, +, \cdot)$. This mathematical model is defined based on the physicochemical properties of DNA bases (see previous post). This introduction can be complemented with a Wolfram Computable Document Format (CDF) file named IntroductionToZ5GeneticCodeVectorSpace.cdf, available on GitHub. It provides a graphical user interface with an interactive, didactic introduction to the mathematical biology background explained here. To interact with a CDF, users need the Wolfram CDF Player or Mathematica. The Wolfram CDF Player is freely available (easy installation on Windows and on Linux).

2. Biological mathematical model

If the Watson-Crick base pairings are symbolically expressed by means of the sum “+” operation, in such a way that G + C = C + G = D and U + A = A + U = D hold, then this requirement leads us to define an additive group $(\mathfrak{B}, +)$ on the set of five DNA bases $\mathfrak{B}$. Explicitly, it is required that the bases with the same number of hydrogen bonds in the DNA molecule and different chemical types be algebraic inverses in the additive group defined on the set of DNA bases $\mathfrak{B}$. In fact, eight sum tables (like the one shown below) that satisfy these constraints can be defined, one on each of eight ordered sets: {D, A, C, G, U}, {D, U, C, G, A}, {D, A, G, C, U}, {D, U, G, C, A}, {D, G, A, U, C}, {D, G, U, A, C}, {D, C, A, U, G} and {D, C, U, A, G} [1,2]. The sets originated by these base orders are called the strong-weak ordered sets of bases [1,2] since, for each one of them, the algebraic-complementary bases are DNA complementary bases as well, pairing with three hydrogen bonds (strong, G:::C) and two hydrogen bonds (weak, A::U). We shall denote this collection of ordered sets by SW.
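The count of eight orderings can be reproduced by brute force. The sketch below assumes that D always occupies position 0 (the neutral element) and that a base at position $i$ stands for the residue $i$ modulo 5, so that complementary bases must occupy positions adding up to 5; `strong_weak_orders` is a hypothetical helper name.

```python
from itertools import permutations

# Watson-Crick complements among the four standard bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def strong_weak_orders():
    """Orderings (D, b1, b2, b3, b4) in which complementary bases are additive inverses mod 5."""
    orders = []
    for perm in permutations("ACGU"):
        order = ("D",) + perm  # assumption: D is fixed at index 0
        index = {base: i for i, base in enumerate(order)}
        if all((index[b] + index[COMPLEMENT[b]]) % 5 == 0 for b in perm):
            orders.append(order)
    return orders

orders = strong_weak_orders()
print(len(orders))  # 8
for order in orders:
    print("".join(order))
```

The count follows from a simple argument: any of the four bases may take position 1, which forces its complement into position 4, and either of the two remaining bases may take position 2, giving 4 × 2 = 8 orderings.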

A set of extended base triplets is defined as $\mathfrak{B}^3$ = {XYZ | X, Y, Z $\in\mathfrak{B}$}, where, to keep the usual biological notation for codons, the triplet of letters $XYZ\in\mathfrak{B}^3$ denotes the vector $(X,Y,Z)\in\mathfrak{B}^3$ and $\mathfrak{B} =$ {D, A, C, G, U}. An Abelian group on the set of extended triplets can be defined as the direct third power of the group $(\mathfrak{B},+)$:

$(\mathfrak{B}^3,+) = (\mathfrak{B},+)\times(\mathfrak{B},+)\times(\mathfrak{B},+)$

where X, Y, Z $\in\mathfrak{B}$, and the operation “+” as shown in the table [2]. Next, for all elements $\alpha\in\mathbb{Z}_{(+)}$ (the set of positive integers) and for all codons $XYZ\in(\mathfrak{B}^3,+)$, the element:

$\alpha \bullet XYZ = \overbrace{XYZ+XYZ+\dots+XYZ}^{\hbox{$\alpha$ times}}\in(\mathfrak{B}^3,+)$

is well defined. In particular, $0 \bullet XYZ =$ DDD for all $XYZ\in(\mathfrak{B}^3,+)$. As a result, $(\mathfrak{B}^3,+)$ is a three-dimensional (3D) $\mathbb{Z}_5$-vector space over the field $(\mathbb{Z}_5, +, \cdot)$ of the integers modulo 5, which is isomorphic to the Galois field GF(5). Notice that the Abelian groups $(\mathbb{Z}_5, +)$ and $(\mathfrak{B},+)$ are isomorphic. For the sake of brevity, the same notation $\mathfrak{B}^3$ will be used to denote the group $(\mathfrak{B}^3,+)$ and the vector space defined on it.

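| + | D | A | C | G | U |
|---|---|---|---|---|---|
| D | D | A | C | G | U |
| A | A | C | G | U | D |
| C | C | G | U | D | A |
| G | G | U | D | A | C |
| U | U | D | A | C | G |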

This operation is only one of the eight sum operations that can be defined on each one of the ordered sets of bases from SW.
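The sum table and the scalar multiplication $\alpha \bullet XYZ$ can be sketched in a few lines. The code assumes the ordered set {D, A, C, G, U}, so that each base corresponds to its index modulo 5; the names `add_bases`, `add_codons` and `scalar_mul` are illustrative and not taken from the cited references.

```python
# Assumed ordered set: D <-> 0, A <-> 1, C <-> 2, G <-> 3, U <-> 4.
ORDER = "DACGU"
INDEX = {base: i for i, base in enumerate(ORDER)}

def add_bases(x, y):
    """Sum of two bases as addition of their indices modulo 5."""
    return ORDER[(INDEX[x] + INDEX[y]) % 5]

def add_codons(u, v):
    """Coordinate-wise sum of two extended triplets."""
    return "".join(add_bases(a, b) for a, b in zip(u, v))

def scalar_mul(alpha, codon):
    """alpha * codon as the codon added to itself alpha times (0 gives DDD)."""
    result = "DDD"
    for _ in range(alpha):
        result = add_codons(result, codon)
    return result

print(add_bases("G", "C"), add_bases("A", "U"))  # D D
print(add_codons("ACG", "UGC"))                  # DDD
print(scalar_mul(2, "ACG"))                      # CUA
print(scalar_mul(5, "ACG"))                      # DDD
```

Since adding any triplet to itself five times returns DDD, the integer scalar only matters modulo 5, which is why the scalars can be taken from $\mathbb{Z}_5$.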

3. The canonical basis of the $\mathbb{Z}_5$-vector space $\mathfrak{B}^3$

In the vector space $\mathfrak{B}^3$, the vectors (extended codons) $e_1=$ ADD, $e_2=$ DAD and $e_3=$ DDA are linearly independent, i.e., for $c_1, c_2, c_3 \in\mathbb{Z}_5$, the equality $\sum\limits_{i=1}^3 c_i e_i =$ DDD implies $c_1=0, c_2=0$ and $c_3=0$. Moreover, the representation of every extended triplet $XYZ\in\mathfrak{B}^3$ over the field $\mathbb{Z}_5$ as $XYZ=xe_1+ye_2+ze_3$ is unique, so the generating set $\{e_1, e_2, e_3\}$ is a canonical basis for the $\mathbb{Z}_5$-vector space $\mathfrak{B}^3$. The elements $x, y, z \in\mathbb{Z}_5$ are called the coordinates of the extended triplet $XYZ\in\mathfrak{B}^3$ in the canonical basis ($e_1, e_2, e_3$) [3].
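The coordinates of a triplet in the canonical basis can be illustrated as follows. This is a sketch assuming the ordered set {D, A, C, G, U} (D ↔ 0, A ↔ 1, C ↔ 2, G ↔ 3, U ↔ 4); the helper names `coordinates`, `scale`, `add` and `combine` are hypothetical.

```python
from itertools import product

# Assumed ordered set: D <-> 0, A <-> 1, C <-> 2, G <-> 3, U <-> 4.
ORDER = "DACGU"
INDEX = {base: i for i, base in enumerate(ORDER)}
E = ("ADD", "DAD", "DDA")  # e1, e2, e3

def coordinates(codon):
    """Coordinates (x, y, z) of an extended triplet in the canonical basis."""
    return tuple(INDEX[b] for b in codon)

def scale(c, codon):
    """Product of the scalar c in Z5 by a triplet, coordinate by coordinate."""
    return "".join(ORDER[(c * INDEX[b]) % 5] for b in codon)

def add(u, v):
    return "".join(ORDER[(INDEX[a] + INDEX[b]) % 5] for a, b in zip(u, v))

def combine(x, y, z):
    """Linear combination x*e1 + y*e2 + z*e3."""
    total = "DDD"
    for c, e in zip((x, y, z), E):
        total = add(total, scale(c, e))
    return total

print(coordinates("GCU"))  # (3, 2, 4)
print(combine(3, 2, 4))    # GCU

# Linear independence: only the zero coefficients give DDD.
zero_solutions = [c for c in product(range(5), repeat=3) if combine(*c) == "DDD"]
print(zero_solutions)      # [(0, 0, 0)]
```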

References

  1. José M V, Morgado ER, Sánchez R, Govezensky T. The 24 Possible Algebraic Representations of the Standard Genetic Code in Six or in Three Dimensions. Adv Stud Biol, 2012, 4:119–52.
  2. Sanchez R. Symmetric Group of the Genetic-Code Cubes. Effect of the Genetic-Code Architecture on the Evolutionary Process. MATCH Commun Math Comput Chem, 2018, 79:527–60.
  3. Sánchez R, Grau R. An algebraic hypothesis about the primeval genetic code architecture. Math Biosci, 2009, 221:60–76.

Group operations on the set of five DNA bases

General biochemical background. The genetic information on how to build proteins able to perform different biological/biochemical functions is encoded in the DNA sequence. Code-words of three letters/bases, called triplets or codons, are used to encode the information that will be used to synthesize proteins. Every codon encodes the information for one amino acid and every amino acid can be encoded by one or more codons. The genetic code is the biochemical system that establishes the rules by which the nucleotide sequence of a protein-encoding gene is transcribed into mRNA codon sequences and then translated into the amino acid sequences of the corresponding proteins. The genetic code is an extension of the four-letter alphabet found in DNA molecules. These “letters” are the DNA bases: Adenine, Guanine, Cytosine, and Thymine, usually denoted A, G, C, and T respectively (in an RNA molecule, T is changed to U, uracil).

In the DNA molecules, nucleotide bases are paired according to the following rule (Watson-Crick base pairings): G:C, A:T. That is, base G is the complementary base of C, and A is the complementary base of T (or U) in the DNA (or in the RNA) molecule and vice-versa.

Each DNA/RNA base can be classified according to three criteria: chemical type (purines: A and G, or pyrimidines: C and T), number of hydrogen bonds (strong or weak), and whether the base has an amino group or a keto group [1]. Each criterion produces a partition of the set of bases into two classes [2].

Base pair AT
Two hydrogen bonds are established in the pairing of adenine (A) and thymine (T). Yikrazuul [Public domain], from Wikimedia Commons
Base pair GC
The pairing between guanine (G) and cytosine (C) involves three hydrogen bonds. Yikrazuul [Public domain], from Wikimedia Commons

The relationships between the DNA nucleotide bases, quantitatively expressed through the Watson-Crick base pairings, permit the representation of the standard genetic code as a cube inserted in the Euclidean three-dimensional vector space $\mathbb{R}^3$ [3,4]. In particular, it is plausible that the present standard genetic code was derived from an ancestral code architecture with five or more bases (see main text for full discussion). The algebraic and biological model suggests the plausibility of the transition from a primeval code with an extended DNA alphabet $\mathfrak{B}$ = {D, A, C, G, U} to the present standard code, where the symbol “D” represents one or more hypothetical bases with unspecific pairings. It is important to observe that, although evidence from organic chemistry experiments supports the necessity of five or more DNA bases in the primordial genetic system apparatus, the formal development of the algebraic theory necessarily leads to an extension of the DNA base alphabet. In fact, the additional base was already implicit in multiple sequence alignments of DNA sequences, as the gaps that represent insertion and deletion mutations (indel mutations). The importance of considering indel mutations in phylogenetic analysis was analyzed in [2]. Perhaps the most significant role of a fifth base in current DNA molecules is the epigenetic role played by cytosine DNA methylation (CDM). CDM patterning represents one feature of the epigenome that is highly responsive to environmental stress and is associated with trans-generational adaptation in plants and in animals.

If the Watson-Crick base pairings are symbolically expressed by means of the sum “+” operation in such a way that the following relationships hold: G + C = C + G = D, and A + U = U + A = D, then this requirement leads to the definition of an additive group or Abelian group on the set of five RNA (DNA) bases. Explicitly, it will be required that bases with the same number of hydrogen bonds in the DNA molecule and different chemical types be algebraic inverses of each other in the additive group defined on the set of DNA bases. In other words, the complementary RNA bases G:C and A:U (or G:C and A:T in the DNA) are, respectively, algebraic complements. This definition also reflects the non-specific pairings of the ancient hypothetical base(s) D, which is taken as the neutral element of the sum operation. Next, there is only one possible definition for the multiplication operation ($\times$) (with base A as the neutral element for this operation) in such a way that it completes a finite (Galois) field structure isomorphic to the field $\mathbb{Z_5}$ and defined over the set of integers modulo 5 (GF(5)). The simplicity of these operations can be noticed in Table 1.

Table 1. Operation tables of the Galois field (GF(5)) on the ordered set of the extended bases alphabet $\mathfrak{B}$={D, A, C, G, U}, and on $\mathbb{Z_5}$.

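Sum (bases)

| + | D | A | C | G | U |
|---|---|---|---|---|---|
| D | D | A | C | G | U |
| A | A | C | G | U | D |
| C | C | G | U | D | A |
| G | G | U | D | A | C |
| U | U | D | A | C | G |

Product (bases)

| × | D | A | C | G | U |
|---|---|---|---|---|---|
| D | D | D | D | D | D |
| A | D | A | C | G | U |
| C | D | C | U | A | G |
| G | D | G | A | U | C |
| U | D | U | G | C | A |

Sum ($\mathbb{Z}_5$)

| + | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 1 | 2 | 3 | 4 | 0 |
| 2 | 2 | 3 | 4 | 0 | 1 |
| 3 | 3 | 4 | 0 | 1 | 2 |
| 4 | 4 | 0 | 1 | 2 | 3 |

Product ($\mathbb{Z}_5$)

| × | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 2 | 3 | 4 |
| 2 | 0 | 2 | 4 | 1 | 3 |
| 3 | 0 | 3 | 1 | 4 | 2 |
| 4 | 0 | 4 | 3 | 2 | 1 |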
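Both pairs of tables in Table 1 describe the same arithmetic, which the following sketch makes explicit. It assumes the correspondence D ↔ 0, A ↔ 1, C ↔ 2, G ↔ 3, U ↔ 4 implied by the ordered set {D, A, C, G, U}; the function names are illustrative only.

```python
# Assumed correspondence: D <-> 0, A <-> 1, C <-> 2, G <-> 3, U <-> 4.
ORDER = "DACGU"
INDEX = {base: i for i, base in enumerate(ORDER)}

def add(x, y):
    """Sum of two bases: addition of indices modulo 5."""
    return ORDER[(INDEX[x] + INDEX[y]) % 5]

def mul(x, y):
    """Product of two bases: multiplication of indices modulo 5."""
    return ORDER[(INDEX[x] * INDEX[y]) % 5]

def table(op):
    """Operation table as a list of rows, in the order D, A, C, G, U."""
    return ["".join(op(x, y) for y in ORDER) for x in ORDER]

print(table(add))  # ['DACGU', 'ACGUD', 'CGUDA', 'GUDAC', 'UDACG']
print(table(mul))  # ['DDDDD', 'DACGU', 'DCUAG', 'DGAUC', 'DUGCA']

# Every base other than D has a multiplicative inverse (A is the identity).
inverses = {x: y for x in "ACGU" for y in "ACGU" if mul(x, y) == "A"}
print(inverses)    # {'A': 'A', 'C': 'G', 'G': 'C', 'U': 'U'}
```

Under this correspondence the complementary pair G and C are inverses for the product as well as for the sum, while A and U are each their own multiplicative inverse.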

Further readings are provided in the Computable Document Format (CDF) named IntroductionToZ5GeneticCodeVectorSpace.cdf. In this CDF, readers will gain a better comprehension of the biological and algebraic background on the genetic code algebras presented in reference [4]. This is a didactic and interactive visualization to  introduce the Genetic code $\mathbb{Z}_5$-Vector Space.

References

  1. Cornish-Bowden A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res, 1985, 13:3021–30.
  2. Jimenez-Montano MA, de la Mora-Basanez CR, Poschel T. The hypercube structure of the genetic code explains conservative and non-conservative aminoacid substitutions in vivo and in vitro. Biosystems, 1996, 39:117–25.
  3. Sanchez R, Grau R, Morgado E. A novel Lie algebra of the genetic code over the Galois field of four DNA bases. Math Biosci, 2006, 202:156–74.
  4. Sánchez R, Grau R. An algebraic hypothesis about the primeval genetic code architecture. Math Biosci, 2009, 221:60–76.