The genetic-code vector space B^3 over the Galois field GF(5)

The $\mathbb{Z_5}$-vector space $\mathfrak{B}^3$ over the field $(\mathbb{Z_5}, +, .)$

1. Background

This is a formal introduction to the genetic code $\mathbb{Z_5}$-vector space $\mathfrak{B}^3$ over the field $(\mathbb{Z_5}, +, .)$. The mathematical model is defined on the basis of the physicochemical properties of DNA bases (see previous post). This introduction can be complemented with a Wolfram Computable Document Format (CDF) file named IntroductionToZ5GeneticCodeVectorSpace.cdf, available on GitHub. It is a graphical user interface that offers an interactive, didactic introduction to the mathematical biology background explained here. To interact with a CDF, users need Wolfram CDF Player or Mathematica. Wolfram CDF Player is freely available and easy to install on Windows and Linux.

2. Biological mathematical model

If the Watson-Crick base pairings are symbolically expressed by means of the sum “+” operation in such a way that G + C = C + G = D and U + A = A + U = D hold, then this requirement leads us to define an additive group ($\mathfrak{B}$, +) on the set of five DNA bases $\mathfrak{B}$. Explicitly, it is required that the bases with the same number of hydrogen bonds in the DNA molecule and different chemical types be algebraic inverses in the additive group defined on the set of DNA bases $\mathfrak{B}$. In fact, eight sum tables (like the one shown below) satisfying these constraints can be defined, one on each of the eight ordered sets {D, A, C, G, U}, {D, U, C, G, A}, {D, A, G, C, U}, {D, U, G, C, A}, {D, G, A, U, C}, {D, G, U, A, C}, {D, C, A, U, G} and {D, C, U, A, G} [1,2]. These ordered sets are called the strong-weak ordered sets of bases [1,2] since, for each of them, the algebraically complementary bases are also DNA complementary bases, pairing with three hydrogen bonds (strong, G:::C) or two hydrogen bonds (weak, A::U). We shall denote the collection of these ordered sets by SW.
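The count of eight can be checked by brute force. The following is a minimal sketch, assuming that D is fixed at position 0 as the neutral element and that each base is identified with its position in the ordered set, so that the sum of bases becomes addition modulo 5. The function name `strong_weak_orderings` is hypothetical and introduced only for this illustration.

```python
from itertools import permutations

def strong_weak_orderings():
    """Return every ordering (D, b1, b2, b3, b4) of the five bases in which
    the Watson-Crick partners are additive inverses modulo 5."""
    found = []
    for perm in permutations("ACGU"):
        order = ("D",) + perm  # assumption: D sits at index 0 (neutral element)
        index = {base: i for i, base in enumerate(order)}
        gc_ok = (index["G"] + index["C"]) % 5 == 0  # G + C = D
        au_ok = (index["A"] + index["U"]) % 5 == 0  # A + U = D
        if gc_ok and au_ok:
            found.append("".join(order))
    return found

orderings = strong_weak_orderings()
print(len(orderings))  # number of admissible ordered sets
print(orderings)
```

Only the index pairs (1, 4) and (2, 3) sum to 0 modulo 5, so each Watson-Crick pair must occupy one of them. That gives 2 choices of pair assignment and 2 orientations for each pair, hence 8 orderings.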

A set of extended base triplets is defined as $\mathfrak{B}^3$ = {XYZ | X, Y, Z $\in\mathfrak{B}$}, where, to keep the usual biological notation for codons, the triplet of letters $XYZ\in\mathfrak{B}^3$ denotes the vector $(X,Y,Z)\in\mathfrak{B}^3$ and $\mathfrak{B} =$ {D, A, C, G, U}. An Abelian group on the set of extended triplets can be defined as the direct third power of the group ($\mathfrak{B}$, +):

$(\mathfrak{B}^3,+) = (\mathfrak{B},+)\times(\mathfrak{B},+)\times(\mathfrak{B},+)$

where X, Y, Z $\in\mathfrak{B}$, and the operation “+” is as shown in the table below [2]. Next, for all elements $\alpha\in\mathbb{Z}_{(+)}$ (the set of positive integers) and for all codons $XYZ\in(\mathfrak{B}^3,+)$, the element:

$\alpha \bullet XYZ = \overbrace{XYZ+XYZ+…+XYZ}^{\hbox{$\alpha$ times}}\in(\mathfrak{B}^3,+)$ is well defined. Since $5 \bullet XYZ =$ DDD for every codon, this product depends only on $\alpha$ modulo 5. In particular, $0 \bullet XYZ =$ DDD for all $XYZ\in(\mathfrak{B}^3,+)$. As a result, $(\mathfrak{B}^3,+)$ is a three-dimensional (3D) $\mathbb{Z_5}$-vector space over the field $(\mathbb{Z_5}, +, .)$ of the integers modulo 5, which is isomorphic to the Galois field GF(5). Notice that the Abelian groups $(\mathbb{Z}_5, +)$ and $(\mathfrak{B},+)$ are isomorphic. For the sake of brevity, the same notation $\mathfrak{B}^3$ will be used to denote the group $(\mathfrak{B}^3,+)$ and the vector space defined on it.

+ D A C G U
D D A C G U
A A C G U D
C C G U D A
G G U D A C
U U D A C G

This operation is only one of the eight sum operations that can be defined, one on each of the ordered sets of bases from SW.
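The codon sum and the scalar product can be sketched in a few lines. This is only an illustration under the assumption that the ordered set {D, A, C, G, U} is used, with D = 0, A = 1, C = 2, G = 3 and U = 4. The helper names `add_codons` and `scalar_mul` are hypothetical and do not come from the original references.

```python
ORDER = "DACGU"  # ordered set {D, A, C, G, U}: D = 0, A = 1, C = 2, G = 3, U = 4
TO_INT = {base: i for i, base in enumerate(ORDER)}

def add_codons(c1, c2):
    """Coordinate-wise sum of two extended triplets, modulo 5."""
    return "".join(ORDER[(TO_INT[x] + TO_INT[y]) % 5] for x, y in zip(c1, c2))

def scalar_mul(alpha, codon):
    """alpha . XYZ, i.e. XYZ added to itself alpha times (only alpha mod 5 matters)."""
    return "".join(ORDER[(alpha * TO_INT[x]) % 5] for x in codon)

print(add_codons("GCA", "CGU"))  # each position pairs a base with its complement
print(scalar_mul(3, "ACG"))      # same as ACG + ACG + ACG
print(scalar_mul(5, "ACG"))      # five copies of any codon give the neutral element
```

Adding a codon to its position-wise complement returns DDD, which is the codon-level form of the requirement G + C = A + U = D.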

3. The canonical base of the $\mathbb{Z_5}$-vector space $\mathfrak{B}^3$

Next, in the vector space $\mathfrak{B}^3$, the vectors (extended codons) $e_1=$ ADD, $e_2=$ DAD and $e_3=$ DDA are linearly independent, i.e., for $c_1, c_2, c_3 \in\mathbb{Z_5}$, $\sum\limits_{i=1}^3 c_i e_i =$ DDD implies $c_1=0, c_2=0$ and $c_3=0$. Moreover, the representation of every extended triplet $XYZ\in\mathfrak{B}^3$ over the field $\mathbb{Z_5}$ as $XYZ=xe_1+ye_2+ze_3$ is unique, and the generating set $e_1, e_2, e_3$ is a canonical base for the $\mathbb{Z_5}$-vector space $\mathfrak{B}^3$. The elements $x, y, z \in\mathbb{Z_5}$ are said to be the coordinates of the extended triplet $XYZ\in\mathfrak{B}^3$ in the canonical base ($e_1, e_2, e_3$) [3].
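Both claims can be verified exhaustively, since $\mathfrak{B}^3$ has only 125 elements. The sketch below assumes the ordered set {D, A, C, G, U} with D = 0, ..., U = 4. The names `coordinates` and `combine` are hypothetical helpers written for this illustration.

```python
from itertools import product

ORDER = "DACGU"  # assumed ordered set: D = 0, A = 1, C = 2, G = 3, U = 4
TO_INT = {base: i for i, base in enumerate(ORDER)}
BASIS = ("ADD", "DAD", "DDA")  # e1, e2, e3

def coordinates(codon):
    """Coordinates (x, y, z) of XYZ in the canonical base (e1, e2, e3)."""
    return tuple(TO_INT[b] for b in codon)

def combine(coeffs, vectors):
    """Linear combination sum_i c_i * v_i over Z_5, returned as a triplet."""
    total = [0, 0, 0]
    for c, v in zip(coeffs, vectors):
        for k, b in enumerate(v):
            total[k] = (total[k] + c * TO_INT[b]) % 5
    return "".join(ORDER[t] for t in total)

# Linear independence: only the trivial combination gives DDD.
null_combos = [c for c in product(range(5), repeat=3) if combine(c, BASIS) == "DDD"]

# Every one of the 125 extended triplets is rebuilt from its coordinates.
all_rebuilt = all(
    combine(coordinates("".join(t)), BASIS) == "".join(t)
    for t in product(ORDER, repeat=3)
)

print(null_combos)
print(coordinates("GCU"), all_rebuilt)
```

In this base the coordinates of a codon are simply the integer labels of its three letters, which is why the representation is unique.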
  1. José M V, Morgado ER, Sánchez R, Govezensky T. The 24 Possible Algebraic Representations of the Standard Genetic Code in Six or in Three Dimensions. Adv Stud Biol, 2012, 4:119–52.
  2. Sanchez R. Symmetric Group of the Genetic-Code Cubes. Effect of the Genetic-Code Architecture on the Evolutionary Process. MATCH Commun Math Comput Chem, 2018, 79:527–60.
  3. Sánchez R, Grau R. An algebraic hypothesis about the primeval genetic code architecture. Math Biosci, 2009, 221:60–76.

Group operations on the set of five DNA bases

General biochemical background. The genetic information on how to build proteins able to perform different biological/biochemical functions is encoded in the DNA sequence. Code-words of three letters/bases, called triplets or codons, are used to encode the information that will be used to synthesize proteins. Every codon encodes the information for one amino acid and every amino acid can be encoded by one or more codons. The genetic code is the biochemical system that establishes the rules by which the nucleotide sequence of a protein-encoding gene is transcribed into mRNA codon sequences and then translated into the amino acid sequences of the corresponding proteins. The genetic code is an extension of the four-letter alphabet found in DNA molecules. These “letters” are the DNA bases: Adenine, Guanine, Cytosine, and Thymine, usually denoted A, G, C, and T respectively (in an RNA molecule, T is changed to U, uracil).

In the DNA molecules, nucleotide bases are paired according to the following rule (Watson-Crick base pairings): G:C, A:T. That is, base G is the complementary base of C, and A is the complementary base of T (or U) in the DNA (or in the RNA) molecule and vice-versa.

Each DNA/RNA base can be classified according to three criteria: chemical type (purines: A and G, or pyrimidines: C and T), number of hydrogen bonds (strong or weak), and whether the base has an amino group or a keto group [1]. Each criterion produces a partition of the set of bases [2].
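The three partitions can be written down explicitly. The sketch below uses the RNA alphabet (U in place of T) and takes class membership from the standard IUPAC groupings (R/Y for purine/pyrimidine, S/W for strong/weak, M/K for amino/keto). The names `CRITERIA`, `is_partition` and `profile` are hypothetical and serve only this illustration.

```python
BASES = {"A", "C", "G", "U"}

# Class membership follows the IUPAC one-letter groupings (R/Y, S/W, M/K).
CRITERIA = {
    "chemical type": {"purine": {"A", "G"}, "pyrimidine": {"C", "U"}},
    "hydrogen bonds": {"strong": {"C", "G"}, "weak": {"A", "U"}},
    "functional group": {"amino": {"A", "C"}, "keto": {"G", "U"}},
}

def is_partition(classes, universe):
    """True if the classes are pairwise disjoint and together cover the universe."""
    blocks = list(classes.values())
    covered = set().union(*blocks)
    return covered == universe and sum(len(b) for b in blocks) == len(covered)

def profile(base):
    """Class of a base under each criterion, in the order listed in CRITERIA."""
    return tuple(
        name
        for classes in CRITERIA.values()
        for name, members in classes.items()
        if base in members
    )

print(all(is_partition(c, BASES) for c in CRITERIA.values()))
print(profile("G"), profile("C"))
print(profile("A"), profile("U"))
```

Watson-Crick partners share the hydrogen-bond class and differ in the other two criteria.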

Base pair AT
Two hydrogen bonds are established in the pairing of adenine (A) and thymine (T). Yikrazuul [Public domain], from Wikimedia Commons
Base pair GC
The pairing between guanine and cytosine involves three hydrogen bonds. Yikrazuul [Public domain], from Wikimedia Commons

The relationships between the DNA nucleotide bases, quantitatively expressed through the Watson-Crick base pairings, permit the representation of the standard genetic code as a cube inserted in the Euclidean three-dimensional vector space $\mathbb{R}^3$ [3,4]. In particular, it is plausible that the present standard genetic code was derived from an ancestral code architecture with five or more bases (see main text for full discussion). The algebraic and biological model suggests the plausibility of a transition from a primeval code with an extended DNA alphabet $\mathfrak{B}$ = {D, A, C, G, U} to the present standard code, where the symbol “D” represents one or more hypothetical bases with unspecific pairings. It is important to observe that, although evidence from organic chemistry experiments supports the necessity of five or more DNA bases in the primordial genetic system apparatus, the formal development of the algebraic theory necessarily leads to an extension of the DNA base alphabet. In fact, the additional base was already implicit in multiple sequence alignments of DNA sequences, where gaps represent insertion and deletion mutations (indel mutations). The importance of considering indel mutations in phylogenetic analysis was analyzed in [2]. Perhaps the most significant role of a fifth base in current DNA molecules is the epigenetic role played by cytosine DNA methylation (CDM). CDM patterning represents one feature of the epigenome that is highly responsive to environmental stress and is associated with trans-generational adaptation in plants and animals.

If the Watson-Crick base pairings are symbolically expressed by means of the sum “+” operation in such a way that the following relationships hold: G + C = C + G = D, and A + U = U + A = D, then this requirement leads to the definition of an additive (Abelian) group on the set of five RNA (DNA) bases. Explicitly, it will be required that bases with the same number of hydrogen bonds in the DNA molecule and different chemical types be algebraic inverses of each other in the additive group defined on the set of DNA bases. In other words, the complementary RNA bases G:C and A:U (or G:C and A:T in the DNA) are, respectively, algebraic complements. This definition also reflects the non-specific pairings of the ancient hypothetical base(s) D, which is taken as the neutral element of the sum operation. Next, there is only one possible definition of the multiplication operation ($\times$), with base A as its neutral element, that completes a finite (Galois) field structure isomorphic to the field $\mathbb{Z_5}$ of the integers modulo 5 (GF(5)). The simplicity of these operations can be noticed in Table 1.
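A short sketch of why the product is forced, assuming the sum on the ordered set {D, A, C, G, U} and writing $k \bullet A$ for the sum of $k$ copies of A (a notation used here only for illustration): every base is a multiple of A, namely D $= 0 \bullet A$, A $= 1 \bullet A$, C $= 2 \bullet A$, G $= 3 \bullet A$ and U $= 4 \bullet A$. If $\times$ distributes over + and A is its neutral element, then for all $k, m \in \mathbb{Z_5}$:

$$(k \bullet A) \times (m \bullet A) = (km) \bullet (A \times A) = (km \bmod 5) \bullet A$$

For instance, C $\times$ G $= (2 \bullet A) \times (3 \bullet A) = 6 \bullet A = 1 \bullet A =$ A, and U $\times$ U $= 16 \bullet A = 1 \bullet A =$ A, so once the sum and the neutral element A are fixed, no other product is compatible with the field axioms.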

Table 1. Operation tables of the Galois field (GF(5)) on the ordered set of the extended bases alphabet $\mathfrak{B}$={D, A, C, G, U}, and on $\mathbb{Z_5}$.

Sum Product
+ D A C G U × D A C G U
D D A C G U D D D D D D
A A C G U D A D A C G U
C C G U D A C D C U A G
G G U D A C G D G A U C
U U D A C G U D U G C A
Sum Product
+ 0 1 2 3 4 × 0 1 2 3 4
0 0 1 2 3 4 0 0 0 0 0 0
1 1 2 3 4 0 1 0 1 2 3 4
2 2 3 4 0 1 2 0 2 4 1 3
3 3 4 0 1 2 3 0 3 1 4 2
4 4 0 1 2 3 4 0 4 3 2 1
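Both halves of Table 1 can be regenerated from the integer tables by relabeling. The following is a minimal sketch, assuming the correspondence D = 0, A = 1, C = 2, G = 3, U = 4 given by the ordered set {D, A, C, G, U}. The function names `base_sum`, `base_product` and `table` are hypothetical.

```python
ORDER = "DACGU"  # assumed correspondence: D = 0, A = 1, C = 2, G = 3, U = 4
TO_INT = {base: i for i, base in enumerate(ORDER)}

def base_sum(x, y):
    """Sum of two bases, computed as addition modulo 5."""
    return ORDER[(TO_INT[x] + TO_INT[y]) % 5]

def base_product(x, y):
    """Product of two bases, computed as multiplication modulo 5."""
    return ORDER[(TO_INT[x] * TO_INT[y]) % 5]

def table(op):
    """Operation table as a list of rows, with rows and columns in ORDER."""
    return [[op(r, c) for c in ORDER] for r in ORDER]

for label, row in zip(ORDER, table(base_product)):
    print(label, " ".join(row))

# A is the neutral element of the product, and every base other than D
# has a multiplicative inverse.
identity_ok = all(base_product("A", x) == x for x in ORDER)
inverses = {x: next(y for y in "ACGU" if base_product(x, y) == "A") for x in "ACGU"}
print(identity_ok, inverses)
```

The printed rows should match the product half of Table 1. Replacing `base_product` with `base_sum` in the loop reproduces the sum half.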

Further reading is provided in the Computable Document Format (CDF) file named IntroductionToZ5GeneticCodeVectorSpace.cdf. With this CDF, readers can gain a better understanding of the biological and algebraic background of the genetic code algebras presented in reference [4]. It is a didactic, interactive visualization that introduces the genetic code $\mathbb{Z}_5$-vector space.

  1. Cornish-Bowden A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Research, 1985, 13(9):3021–3030.
  2. Jimenez-Montano MA, de la Mora-Basanez CR, Poschel T. The hypercube structure of the genetic code explains conservative and non-conservative aminoacid substitutions in vivo and in vitro. Biosystems, 1996, 39:117–25.
  3. Sanchez R, Grau R, Morgado E. A novel Lie algebra of the genetic code over the Galois field of four DNA bases. Math Biosci, 2006, 202:156–174.
  4. Sánchez R, Grau R. An algebraic hypothesis about the primeval genetic code architecture. Math Biosci, 2009, 221:60–76.