Group Theory of Messenger RNA Metabolism and Disease

Michel Planat; Marcelo Amaral; Fang Fang; David Chester; Raymond Aschheim; Klee Irwin

doi:10.20944/preprints202307.0107.v1

Submitted: 29 June 2023

Posted: 03 July 2023


Abstract

Our recent work has focused on the application of infinite group theory and related algebraic geometric tools in the context of transcription factors and microRNAs. We were able to differentiate between “healthy” nucleotide sequences and disrupted sequences that may be associated with various diseases. In this paper, we extend our efforts to the study of messenger RNA metabolism. We investigate (i) mRNA translation in prokaryotes and eukaryotes, (ii) polyadenylation in eukaryotes, which is crucial for nuclear export, translation, stability, and splicing of mRNA, (iii) miRNAs involved in RNA silencing and post-transcriptional regulation of gene expression, and (iv) the identification of disrupted sequences that could lead to potential illnesses. To achieve this, we employ (a) infinite (finitely generated) groups $f_{p}$, with generators representing the $r+1$ distinct nucleotides and a relation between them [e.g., the consensus sequence in the mRNA translation (i), the poly(A) tail in item (ii), and the miRNA seed in item (iii)], (b) aperiodicity theory, which connects “healthy” groups $f_{p}$ to free groups $F_{r}$ of rank $r$ and their profinite completion $\hat{F}_{r}$, and (c) the representation theory of groups $f_{p}$ over the space-time-spin group $SL_{2}(C)$, highlighting the role of surfaces with isolated singularities in the character variety. Our approach could potentially contribute to the understanding of the molecular mechanisms underlying various diseases and help develop new diagnostic or therapeutic strategies.

Keywords: RNA metabolism; micro-RNAs; diseases; finitely generated group; $SL(2, C)$ character variety; aperiodicity

Subject: Biology and Life Sciences - Biochemistry and Molecular Biology

1. Introduction

Genome-scale metabolic pathways [1,2], genome–environment interactions [3], the immune response [4], post-transcriptional regulatory mechanisms [5] and oncohistones [6] represent aspects of a research field connecting the heritable genetic code to other biological codes.

The aforementioned genetic code is defined precisely as a non-injective map from the 64 codons to the 20 amino acids. Both finite groups [7,8,9] and quantum groups [10] play a leading role in modeling this code.

We refer to the “epigenetic code" as all processes that reveal and execute gene expression. This includes DNA methylation processes [11], mRNA translation preparation, the poly(A) tail, the RNA-induced silencing complex (RISC) — a vital tool in gene regulation comprising single strands of RNA (ssRNA) and double strands of small interfering RNA (siRNA) — and other regulatory nucleotide sequence fragments that are discarded after splicing. For a relation between the epigenetic code and morphogenesis, see [12].

For studying the epigenetic code (hereinafter referred to as the e-code), we employ infinite (finitely generated) groups denoted by $f_{p}$, and their representations over the group $SL_{2}(C)$ of $(2 \times 2)$ matrices whose entries are complex numbers [13]. This group is significant across all fields of physics because it represents a space-time-spin group. In this study, we use tools from algebraic geometry to define the e-code.

Our crucial observation is that an $f_{p}$ group associated with a “healthy” sequence usually approximates a free group $F_{r}$, where the rank $r$ equals the number of distinct nucleotides (nt) minus one. A sequence deviating from this may suggest a potential e-code deregulation leading to a disease. However, an $f_{p}$ group closely resembling a free group does not provide sufficient assurance against a disease. It is also necessary to examine the $SL_{2}(C)$ representations of $f_{p}$, organized in the character variety, and specifically its Groebner basis $G$.

The Groebner basis comprises a set of surfaces. A surface within $G$ containing isolated singularities indicates another potential disease that can be identified specifically, e.g., relating to an oncogene or a neurological disorder [13] (Figure 6, Tables 2 to 4). The e-code we define comprises such algebraic geometric characteristics.

An additional attribute of “healthy” sequences, which leads to a group $f_{p}$ approximating the free group $F_{r}$ and was not mentioned in [13], is their connection to aperiodicity. Schrödinger’s book [14] proposes aperiodicity of living “crystals”. Our paper [15] characterizes aperiodic DNA sequences. We extend this concept by introducing the so-called profinite completion $\hat{F}_{r}$ of the free group $F_{r}$. A sequence $f_{p}^{(l)}$ of finitely generated groups approaching $F_{r}$ emerges by applying $l$ repeated substitutions to the generators of $f_{p}$. However, all distinct groups $f_{p}^{(l)}$ should possess the same profinite completion $\hat{F}_{r}$. The profinite groups $\hat{F}_{1}$ (corresponding to sequences containing two distinct nt) and $\hat{F}_{2}$ (corresponding to sequences containing three distinct nt) have been examined in a prominent algebraic geometry treatise [16]. We present the details below in a manner that is accessible to a non-specialist reader.

In Section 2, we illustrate our mathematical concepts through a few simple pedagogical examples. In Section 3, we apply these concepts to the cases of mRNA translation and microRNAs. In Section 4, we provide additional comments, a summary diagram and perspectives.

2. Methods and preliminary results

2.1. Infinite finitely generated groups $f_{p}$ and free groups $F_{r}$

The TATA box

We’ll start with a simple example of an infinite, finitely generated group taken from the context of introns. The DNA sequence located in the core promoter region of many eukaryotic genes is the Goldberg-Hogness sequence, also known as the TATA box. This sequence contains a non-coding segment with repeated T and A base pairs. The TATA box serves as the binding site for the TATA-binding protein and other transcription factors in some eukaryotic genes. Its consensus sequence takes the form rel=TATAAAA. Variations in this consensus sequence, resulting from genetic polymorphism, can lead to diseases like Gilbert’s syndrome and immune suppression [17].

In our methodology, we define the group $f_{p} = \langle A, T \mid rel \rangle$, which contains infinitely many elements. There are numerous ways to investigate this group; we opted for a specific one, which is to calculate the number of conjugacy classes of subgroups of index $d$ of $f_{p}$ for $d = 1, 2, 3, \dots$ (a sequence we’ll refer to as the card seq of $f_{p}$). The card seq of $f_{p}$ for the selected TATA sequence is $[1, 1, 2, 3, 2, 8, 7, 10, 18, 28, \dots]$. Interestingly, the group $H_{3} = \langle A, T \mid A^{2} = T^{3} \rangle$ shows the same card seq (at least up to the highest index we can reach with the calculations). The group $H_{3}$, as defined, has an intriguing topological interpretation as the fundamental group of the trefoil knot complement. Its quotient by its center is the so-called modular group $PSL(2, Z)$, the group of $(2 \times 2)$ matrices of determinant 1 with integer entries, taken up to sign. Thus, we find that the group

f_{p}

is ’close’ to

H_{3}

since the card seq of both groups is the same, but we can easily verify that

f_{p}

and

H_{3}

are not isomorphic.

In paper [19] (Section 3.1 and Table 2), we found that Hecke groups $H_{q} = \langle A, T \mid A^{2} = T^{q} \rangle$, with $q = 3$ or 4, have a card seq corresponding to ‘healthy’ TATA box sequences. An $f_{p}$ group for a TATA box whose card seq resembles that of a Hecke group with $q$ different from 3 and 4, or even that of a group slightly different from $H_{3}$ and $H_{4}$, signifies Gilbert’s syndrome.

Polyadenylation signals

For our second example, we select a sequence from the context of eukaryotic polyadenylation [18]. Polyadenylation involves the addition of a poly(A) tail to an RNA transcript, usually a messenger RNA (mRNA). A consensus poly(A) sequence takes the form rel1=AAUAAA. This corresponds to a two-generator group of the form

f_{p} = 〈A, U | rel1〉

. The card seq of such a group is found to be

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \dots]

, implying a single conjugacy class for each index. It appears that the free group

F_{1} = 〈A, U | A U〉

, of rank 1, has the same card seq as the

f_{p}

group with relation rel1, even though both groups are not isomorphic.

Another consensus poly(A) sequence takes the form rel2=UGUAA. This corresponds to a three-generator group of the form

f_{p} = 〈A, U, G | r e l 2〉

. The card seq of such a group is found to be

[1, 3, 7, 26, 97, 624, 4163, \dots]

. Intriguingly, the free group

F_{2} = 〈A, U, G | A U G〉

, of rank 2, has the same card seq as the

f_{p}

group with relation rel2, despite both groups not being isomorphic.

From our perspective, DNA/RNA sequences that lead to $f_{p}$ groups closely resembling a free group are considered ‘healthy’ sequences [13,15,19]. The standard poly(A) sequences mentioned earlier play a regulatory role in producing mature mRNA ready for translation. Sequences that generate an $f_{p}$ group diverging from a free group $F_{r}$ may be indicative of a disease.
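For small indices, the card seq values quoted in this subsection can be reproduced without specialized software. The following is a minimal brute-force sketch, not the computation used for the paper. It relies on the standard correspondence between conjugacy classes of index-$d$ subgroups and transitive actions on $d$ points up to relabeling, and it assumes that the word rel is imposed as the relation rel = 1. The function names (`card_seq`, `canonical`, and so on) are hypothetical, and the search is only practical for very small $d$.

```python
from itertools import permutations, product


def compose(p, q):
    """Permutation 'apply p first, then q' on the points 0..d-1."""
    return tuple(q[i] for i in p)


def is_transitive(perms, d):
    """True if the permutations generate a group with a single orbit."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for p in perms:
            if p[i] not in seen:
                seen.add(p[i])
                stack.append(p[i])
    return len(seen) == d


def canonical(perms, d):
    """Relabel the points by breadth-first search from every base point and
    keep the smallest result, so that conjugate tuples share one key."""
    forms = []
    for start in range(d):
        label, queue = {start: 0}, [start]
        for i in queue:  # the queue grows while we iterate over it
            for p in perms:
                if p[i] not in label:
                    label[p[i]] = len(label)
                    queue.append(p[i])
        relabeled = []
        for p in perms:
            new = [0] * d
            for i in range(d):
                new[label[i]] = label[p[i]]
            relabeled.append(tuple(new))
        forms.append(tuple(relabeled))
    return min(forms)


def card_seq(generators, relator, max_index):
    """Number of conjugacy classes of subgroups of index 1..max_index
    in the group <generators | relator = 1> (assumed convention)."""
    seq = []
    for d in range(1, max_index + 1):
        identity = tuple(range(d))
        sym = list(permutations(range(d)))
        classes = set()
        for images in product(sym, repeat=len(generators)):
            rho = dict(zip(generators, images))
            w = identity
            for letter in relator:
                w = compose(w, rho[letter])
            if w == identity and is_transitive(images, d):
                classes.add(canonical(images, d))
        seq.append(len(classes))
    return seq


print(card_seq("AT", "TATAAAA", 5))  # TATA box: [1, 1, 2, 3, 2]
print(card_seq("AU", "AAUAAA", 5))   # poly(A) rel1: [1, 1, 1, 1, 1]
print(card_seq("AUG", "UGUAA", 4))   # poly(A) rel2: [1, 3, 7, 26]
```

The cost grows like $(d!)^{k}$ for $k$ generators, so dedicated low-index subgroup routines in a computer algebra system are needed to reach the higher indices listed in the text.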

2.2. Aperiodic sequences, their attached groups $f_{p}$ and free groups

In this subsection, we’ll elucidate how a group $f_{p}$, with a card seq identified to be close to that of a free group $F_{r}$, can be linked to an aperiodic sequence and the profinite completion $\hat{F}_{r}$. We introduced the concept of aperiodic groups and sequences in our earlier papers [19] (Section 4) and [15] (Section 2).

Consider the motif $rel = TTTATTA$, which serves as a consensus sequence for the transcription factor of the DBX gene in Drosophila melanogaster (fruit flies). This gene is involved in neuronal specification and differentiation. The group $f_{p} = \langle A, T \mid rel \rangle$ has the same card seq as the free group $F_{1}$ of rank 1. Furthermore, by splitting rel into two segments $rel = rel_{A}\, rel_{T}$ and applying the substitution maps $A \to rel_{A} = TTTA$, $T \to rel_{T} = TTA$, we generate the substitution sequence $S_{DBX} = A, T, AT, TTTATTA, TTATTATTATTTATTATTATTTA, \dots$. One can check that all finitely generated groups $f_{p}^{(l)}$, whose relations are the successive words $AT$, $TTTATTA$, $TTATTATTATTTATTATTATTTA, \dots$, respectively, possess the card seq of $F_{1}$.

As per Reference [19] (Section 4), a substitution rule must satisfy two conditions to be considered aperiodic: (1) the substitution matrix $M$ must be primitive, meaning that $M$ is a non-negative matrix and $M^{k}$ is strictly positive (all entries $> 0$) for some $k$, a condition denoted as $M^{k} >> 0$ (a primitive matrix is in particular irreducible), and (2) the Perron-Frobenius eigenvalue $\lambda_{PF}$ must be irrational. It’s worth noting that the Perron-Frobenius eigenvector of an irreducible non-negative matrix is, up to scaling, the only eigenvector whose entries are all positive.

The aforementioned sequence has the substitution matrix $M = \begin{pmatrix} 1 & 3 \\ 1 & 2 \end{pmatrix}$. One can verify that $M$ is primitive, since $M^{2} >> 0$, and that $\lambda_{PF} = (3 + \sqrt{13}) / 2$ is irrational. Conditions (1) and (2) are satisfied, implying that the substitution $S_{DBX}$ is aperiodic.
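The DBX substitution can be replayed in a few lines. This is a minimal sketch with hypothetical helper names (`substitute`, `substitution_matrix`). It assumes the row convention in which row $i$ counts the letters in the image of the $i$-th letter, with letters ordered as (A, T), which reproduces the matrix $M$ given above.

```python
from math import isqrt, sqrt

RULES = {"A": "TTTA", "T": "TTA"}  # substitution maps from the text
LETTERS = "AT"                     # assumed ordering of rows and columns


def substitute(word, rules):
    """Apply the substitution to every letter of the word."""
    return "".join(rules[c] for c in word)


def substitution_matrix(rules, letters):
    """Row i lists how often each letter occurs in the image of letters[i]."""
    return [[rules[a].count(b) for b in letters] for a in letters]


words = ["AT"]
for _ in range(2):
    words.append(substitute(words[-1], RULES))

M = substitution_matrix(RULES, LETTERS)
trace = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
disc = trace * trace - 4 * det           # discriminant of x^2 - trace*x + det
lambda_pf = (trace + sqrt(disc)) / 2     # larger root, the PF eigenvalue
is_irrational = isqrt(disc) ** 2 != disc  # non-square discriminant

print(words)
print(M, lambda_pf, is_irrational)
print([len(w) for w in words])  # word lengths 2, 7, 23
```

The ratio of successive word lengths (7/2, 23/7, and so on) approaches $\lambda_{PF} \approx 3.3028$, which is how the Perron-Frobenius eigenvalue shows up as the growth rate of the substitution.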

Numerous other genes have transcription factors with a motif rel generating an aperiodic sequence [15].

2.3. Aperiodic sequences and the profinite groups ${\hat{F}}_{r}$

This section can be skipped without affecting the comprehension of the rest of the paper. It endeavors to answer the following question: why do the aforementioned groups $f_{p}^{(l)}$ produce the same card seq as that of the free group $F_{1}$? The tentative answer is that the profinite completion of all groups $f_{p}^{(l)}$ is the profinite group $\hat{F}_{1}$. This observation relates the aperiodicity of sequences to profinite groups. Profinite groups were introduced by Grothendieck in the context of algebraic geometry [16]. Here, we describe the necessary ingredients for the layperson, focusing first on $\hat{F}_{1}$ and then on $\hat{F}_{2}$, and their relevance to our present work.

A group G can be considered a ’topological group’ by applying the ’discrete topology’, in which the elements of G are points of a ’discrete space’, form a ’discontinuous sequence’ and are isolated from each other. Every subset is ’open’ in the discrete topology. A profinite group is a topological group that, in a certain sense, is assembled from a system of finite groups. A profinite group requires a system of finite groups and group homomorphisms between them.

Given a group $G$, there is a related profinite group $\hat{G}$ defined as the inverse limit

$$\hat{G} = \varprojlim_{N} G / N$$

of the groups $G / N$, where $N$ runs through the normal subgroups of $G$ of finite index [a normal subgroup is a subgroup that remains invariant under conjugation by members of the group]. Each finite quotient group corresponds to a normal subgroup $N$ of $G$, and the profinite completion $\hat{G}$ can be perceived as containing an analogue of each of these normal subgroups.
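As a worked illustration of this definition, consider the standard textbook case of the infinite cyclic group $G = Z$ (the symbols $x_{n}$ are our own notation for this example). Its subgroups of finite index are the subgroups $nZ$ with $n \geq 1$, all of them normal, so the finite quotients are the cyclic groups $Z / nZ$ and the inverse limit can be written out explicitly:

```latex
\hat{Z} \;=\; \varprojlim_{n} Z / nZ
\;=\; \Big\{ (x_{n})_{n \geq 1} \in \prod_{n \geq 1} Z / nZ \;:\;
      x_{m} \equiv x_{n} \pmod{n} \ \text{whenever } n \mid m \Big\}.
```

An ordinary integer $k$ gives the compatible family $(k \bmod n)_{n \geq 1}$, so $Z$ sits inside $\hat{Z}$, but most compatible families do not come from an integer.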

The profinite group $\hat{G}$ exhibits several properties: it is compact, residually finite [meaning that for any non-identity element $g$ in $\hat{G}$, there exists a finite quotient of $\hat{G}$ in which $g$ is not the identity], and totally disconnected [meaning that the only connected subsets of $\hat{G}$ are singletons, sets containing only one element].

In general, an explicit construction of a profinite group $\hat{G}$ cannot be obtained. However, $\hat{F}_{1}$ and $\hat{F}_{2}$ are not overly complex to handle.

About the profinite group ${\hat{F}}_{1}$

Let’s begin with $\hat{F}_{1}$. The free group $F_{1}$ on a single generator can be described as a group with one generator, say $a$, and no relations. It consists of all finite strings that can be formed by combining the generator and its inverse, that is, it is the infinite cyclic group $Z = \{1, a, a^{-1}, a^{2}, a^{-2}, a^{3}, a^{-3}, \dots\}$. Now, let’s discuss the profinite completion of $F_{1}$. The profinite group $\hat{F}_{1}$ is isomorphic to $\hat{Z} = \prod_{p} Z_{p}$, the direct product, over all primes $p$, of the additive groups of the commutative rings of $p$-adic integers $Z_{p}$. The $p$-adic integers are a special class of numbers utilized in number theory and algebraic geometry.

About the profinite group ${\hat{F}}_{2}$

Let’s briefly discuss $\hat{F}_{2}$. This topic was initiated in [16]. The subject is deep and complex. It is connected to the so-called Belyi theorem, a fundamental result relating algebraic curves defined over the algebraic closure of the rationals, $\bar{Q}$, to certain rational functions called Belyi functions.

An algebraic curve defined over the complex numbers can be represented as a covering of the Riemann sphere (the complex projective line $P^{1}(C)$) branched only over three points (usually taken as 0, 1, and $\infty$) if and only if the curve can be defined over a number field, which is a finite extension of the field of rational numbers $Q$.

In other words, the Belyi theorem implies that an algebraic curve defined over a number field can be mapped to the Riemann sphere in such a way that the ramification (branching) is restricted to just three points. The rational functions that provide these branched coverings are known as Belyi functions.

The significance of the Belyi theorem lies in the fact that it provides a method to study algebraic curves defined over number fields by analyzing their ramified coverings and the associated dessins d’enfants, which are combinatorial objects encoding the ramification data.

Specifically, we have the crucial result that

$$\hat{\pi}_{1}(P^{1}(C) \setminus \{0, 1, \infty\}) \cong \hat{F}_{2},$$

i.e., the so-called étale fundamental group of the triply punctured projective line is the profinite group $\hat{F}_{2}$.

2.4. $S L_{2} (C)$ representations of groups $f_{p}$ and a Groebner basis $G$

While the previous section about profinite groups showcases the importance of algebraic geometry in the context of DNA/RNA sequences, it remains somewhat abstract. To make it more concrete, we consider the representations of an $f_{p}$ group over the space-time-spin group $SL_{2}(C)$, as we did in [13,15].

Representations of $f_{p}$ in $SL_{2}(C)$ are homomorphisms $\rho : f_{p} \to SL_{2}(C)$ with character $\kappa_{\rho}(g) = tr(\rho(g))$, $g \in f_{p}$, where $tr(\rho(g))$ denotes the trace of the matrix $\rho(g)$. The set of characters is employed to determine an algebraic set by taking the quotient of the set of representations $\rho$ by the group $SL_{2}(C)$, which acts by conjugation on representations [20,21].
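The reason traces are the natural coordinates on this quotient is that they do not change under conjugation. The sketch below checks this numerically on an arbitrary illustrative choice of matrices `A`, `B`, `P` with determinant 1 (they are assumptions for the example, not representations of any $f_{p}$ group from this paper). It also checks the standard $SL_{2}$ trace identity $tr(AB) + tr(AB^{-1}) = tr(A)\,tr(B)$, which is what allows every character to be expressed as a polynomial in a few basic traces.

```python
def mul(a, b):
    """Product of two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def inv(a):
    """Inverse of a 2x2 matrix of determinant 1."""
    return ((a[1][1], -a[0][1]), (-a[1][0], a[0][0]))


def tr(a):
    """Trace of a 2x2 matrix."""
    return a[0][0] + a[1][1]


# Illustrative elements of SL_2(C); each has determinant 1.
A = ((1j, 1), (0, -1j))
B = ((2, 1j), (1j, 0))
P = ((1, 1), (0, 1))

# Trace coordinates of the pair (A, B).
x, y, z = tr(A), tr(B), tr(mul(A, B))

# The same pair after simultaneous conjugation by P.
A2 = mul(mul(P, A), inv(P))
B2 = mul(mul(P, B), inv(P))
x2, y2, z2 = tr(A2), tr(B2), tr(mul(A2, B2))

print((x, y, z), (x2, y2, z2))                   # identical triples
print(tr(mul(A, B)) + tr(mul(A, inv(B))), x * y)  # both sides of the identity
```

For a two-generator group, the triple $(x, y, z) = (tr\,\rho(A), tr\,\rho(B), tr\,\rho(AB))$ is the usual choice of coordinates in which the surfaces of the next paragraphs are written.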

In our paper [13], we explained that the character variety of $f_{p}$ is defined by a sequence $X$ of multivariate polynomials. A particular basis related to $X$ is the Groebner basis $G(X)$, whose factors define hypersurfaces.

For a two-generator group $f_{p}$, the factors are surfaces in three-dimensional space. In general, these surfaces can be classified by mapping them to a rational surface across five categories [13]. Frequently encountered surfaces are degree $p$ Del Pezzo surfaces, where $1 \leq p \leq 9$. A rational surface may be non-singular, ‘almost non-singular’ (having only isolated singularities), or singular. Almost non-singular surfaces are crucial in our context. A simple singularity is referred to as an A-D-E singularity and must be of the type $A_{n}$ ($n \geq 1$), $D_{n}$ ($n \geq 4$), $E_{6}$, $E_{7}$ or $E_{8}$.

The A-D-E type is mirrored in the notation we employ. For instance, $S^{(l A_{1}, m A_{2}, n A_{3}, \dots)}$ denotes a surface containing $l$ singularities of type $A_{1}$, $m$ of type $A_{2}$, $n$ of type $A_{3}$, etc. A generic surface is the Cayley cubic we encountered in our previous papers, defined as $S^{(4 A_{1})} = x y z + x^{2} + y^{2} + z^{2} - 4$ [13] (Figure 5).
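The label $4 A_{1}$ can be checked directly on the Cayley cubic. The sketch below looks for points where the polynomial and all three partial derivatives vanish. Restricting the search to a small integer grid is an assumption of the sketch, justified here because setting the gradient to zero forces each coordinate to be 0 or $\pm 2$. A non-zero Hessian determinant at such a point indicates an ordinary double point, that is, a singularity of type $A_{1}$.

```python
from itertools import product


def f(x, y, z):
    """Cayley cubic in the form used in the text."""
    return x * y * z + x * x + y * y + z * z - 4


def grad(x, y, z):
    """Partial derivatives of f with respect to x, y and z."""
    return (y * z + 2 * x, x * z + 2 * y, x * y + 2 * z)


def hessian_det(x, y, z):
    """Determinant of the Hessian [[2, z, y], [z, 2, x], [y, x, 2]]."""
    return 8 - 2 * (x * x + y * y + z * z) + 2 * x * y * z


# Singular points: f and its gradient vanish simultaneously.
singular = [
    p for p in product(range(-3, 4), repeat=3)
    if f(*p) == 0 and grad(*p) == (0, 0, 0)
]

print(singular)                              # four points with entries +-2
print([hessian_det(*p) for p in singular])   # all non-zero
```

The four points found are exactly those with coordinates $\pm 2$ and product $xyz = -8$. The same test (vanishing gradient on the surface, then inspection of the Hessian) is a first step for any of the surfaces listed in this section, although higher types such as $A_{2}$ or $A_{4}$ need more than the Hessian to be identified.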

For a three-generator group $f_{p}$, the factors of $G(X)$ are hypersurfaces in seven variables, of the form $S_{a, b, c, d}(x, y, z)$. Some of them belong to the Fricke family [13] (Equation (3)), which is associated with the four-punctured sphere. For a chosen set of parameters $a, b, c, d$, however, the hypersurface reduces to an ordinary surface in the three variables $x, y, z$.

For a four-generator group $f_{p}$, the factors of $G(X)$ are hypersurfaces in 14 variables containing 4 copies of the form $S(x, y, z)$, $S(x, u, v)$, $S(y, u, v)$ and $S(z, v, w)$ for selected choices of 8 parameters.

Groebner basis for the TATA box

The Groebner basis for the character variety associated with the $f_{p}$ group with relation rel=TATAAAA of the TATA box, studied in Section 2.1, is found to be:

$$\begin{aligned} G_{TATA} = {} & (z^{4} - x y^{2} - x y z + x^{2} + y^{2} + y z - 3 z^{2} + x - 2)\,(x^{2} z - x y - x z + y - z) \\ & \times S^{(A_{2})}\, S^{(A_{4})}\, (x^{3} - z^{2} - 3 x + 2), \end{aligned}$$

where $S^{(A_{2})} = x^{2} y - z^{3} - x z - y + 3 z$ and $S^{(A_{4})} = x z^{2} - x^{2} - y z - x + 2$ are degree 3 Del Pezzo surfaces.

The Groebner basis $G_{TATA}$ comprises a degree 2 Del Pezzo surface (see Figure 1, top) and a rational scroll, whose analytic expressions are in the first row. Both surfaces are singular. The second row consists of two surfaces with simple singularities of type $A_{2}$ and $A_{4}$, respectively. The last term represents a curve (not a surface).

Groebner basis for polyadenylation signals

For the first polyadenylation signal considered in Section 2.1, the relation of the $f_{p}$ group is rel1=AAUAAA. The corresponding Groebner basis is:

$$G_{rel1} = 3\ \text{rational scrolls} \times P^{2} \times S^{(4 A_{1})} \times S^{(A_{1})} \times \text{curve}.$$

The Groebner basis $G_{rel1}$ contains three rational scrolls, a surface birationally equivalent to the projective plane $P^{2}$, the Cayley cubic $S^{(4 A_{1})}$, the degree 3 Del Pezzo surface $S^{(A_{1})} = x^{2} y - x z^{2} - x z + y z + x - y$ (see Figure 1, bottom) and a curve.

For the second polyadenylation signal considered in Section 2.1, the relation of the $f_{p}$ group is rel2=UGUAA. The factors of $G(X)$ are hypersurfaces $S_{a, b, c, d}(x, y, z)$ in seven variables. However, choosing specific parameters, such as $S_{0, 0, 0, 0}(x, y, z)$ or $S_{1, 1, 1, 1}(x, y, z)$, we obtain surfaces in the three variables $x, y, z$. These are found to be degree 3 Del Pezzo surfaces with simple singularities of the form $S^{(l A_{2})}$, with $l = 1$, 2 or 3, quadrics, or curves.

Groebner basis for the transcription factor of DBX gene

For the DBX gene studied in Section 2.2, the Groebner basis takes the form

$$G_{DBX} = \text{scroll} \times P^{2} \times S^{(A_{4})} \times S^{(A_{2})} \times S^{(4 A_{1})} \times \text{curve},$$

where scroll $= y^{2} z - x y - y z + x - z$ and $P^{2} = z^{4} - x^{2} y + x z - 4 z^{2} + y + 2$ are singular. The other factors are the degree 3 Del Pezzo surfaces with isolated singularities $S^{(A_{4})} = y z^{2} - y^{2} - x z - y^{2}$, $S^{(A_{2})} = z^{3} - x y^{2} + y z + x - 3 z$ and the Cayley cubic $S^{(4 A_{1})}$, together with curve $= y^{3} - z^{2} - 3 y + 2$.

3. Further results

In this section, we produce further results related to mRNA metabolism and miRNA.

3.1. Algebraic geometry of mRNA translation

The Shine-Dalgarno box

Ribosomal RNA (rRNA), a type of non-coding RNA, is the main component of a macromolecular machine, called the ribosome, whose role is to ensure mRNA translation. The initiation of translation requires the recognition of the appropriate sequences on the mRNA by the ribosome. A major factor in this recognition is an mRNA-rRNA interaction first proposed by Shine and Dalgarno [22]. They proposed that the ribosomal nucleotides recognize the complementary purine-rich sequence rel3=AGGAGGU, which is found around 8 bases upstream of the start codon AUG in a number of mRNAs found in viruses that affect Escherichia coli.

Let us study the group

$$f_{p} = \langle A, G, U \mid rel3 \rangle.$$

The card seq of $f_{p}$ is found to be the same as that of the free group $F_{2}$.

The $SL_{2}(C)$ character variety is a scheme $X$ whose Groebner basis $G(X)$ is made of hypersurfaces $S_{a, b, c, d}(x, y, z)$ in seven variables. By fixing the parameters $(a, b, c, d)$, that is, by projecting to 3 dimensions, one gets surfaces like $S_{0, 0, 0, 0}(x, y, z)$ and $S_{1, 1, 1, 1}(x, y, z)$ as in Section 2.4. We find degree 3 Del Pezzo surfaces with isolated singularities $S^{(A_{1})} = x^{2} y + y z^{2} + 4 x z + 4 y$ and $x^{2} y + y z^{2} + x^{2} + z^{2} + 6 x z + 5 y - 6 z - 7$, $S^{(A_{2})} = x y z + 2 x^{2} + z^{2} + 4$ and $S^{(A_{4})} = x y z + 3 x^{2} + z^{2} - 5 z$, quadrics and curves.

Kozak consensus sequence

The Kozak consensus sequence is a nucleotide motif that functions as the protein translation initiation site in most eukaryotic mRNA transcripts [23]. The small (40S) subunit of eukaryotic ribosomes binds initially at the capped $5'$-end of messenger RNA and then migrates, stopping at the first AUG codon in a favorable context for initiating translation. In eukaryotes, the Kozak sequence ensures that a protein is correctly translated from the genetic message, mediating ribosome assembly and translation initiation. A sequence logo of the most conserved bases around the initiation codon AUG for human mRNAs may be found in the first caption of [24] as $rel4 = ACCAUGGC$.

Let us study the group

$$f_{p} = \langle A, C, G, U \mid rel4 \rangle.$$

The card seq of $f_{p}$ is found to be the same as that of the free group $F_{3}$ of rank 3. This group can be linked to an aperiodic sequence by following the steps given in Section 2.2.

By splitting rel4 into four segments $rel4 = rel_{C}\, rel_{A}\, rel_{U}\, rel_{G}$ and applying the substitution maps $C \to rel_{C} = A$, $A \to rel_{A} = CCAUG$, $U \to rel_{U} = G$, $G \to rel_{G} = C$, we generate the substitution sequence $S_{Kozak} = C, A, U, G, CAUG, ACCAUGGC, CCAUGA^{2}CCAUGGC^{2}A, \dots$.

One can check that all finitely generated groups $f_{p}^{(l)}$, whose relations are the successive words $CAUG$, $ACCAUGGC$, $CCAUGA^{2}CCAUGGC^{2}A, \dots$, respectively, possess the card seq of $F_{3}$.

The aforementioned sequence has the substitution matrix

$$M = \begin{pmatrix} 0 & 2 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix},$$

with letters ordered as (C, A, U, G) and column $j$ counting the letters in the image of the $j$-th letter. One can verify that $M$ is primitive, since $M^{4} >> 0$, and that $\lambda_{PF} \approx 2.2055694$ is the only real (and irrational) solution of the equation $x^{3} - 2 x^{2} - 1 = 0$. Conditions (1) and (2) of Section 2.2 are satisfied, implying that the substitution $S_{Kozak}$ is aperiodic. See [25] for a connection of the latter Perron-Frobenius eigenvalue to random Fibonacci sequences.
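Both aperiodicity conditions for the Kozak substitution can be checked numerically. The following is a minimal sketch with hypothetical helper names. It builds the matrix from the substitution maps, with letters ordered as (C, A, U, G) and columns indexed by the substituted letter (the convention that reproduces the matrix $M$ above). It then searches for the smallest $k$ with $M^{k} >> 0$ up to Wielandt's bound $n^{2} - 2n + 2$, and estimates $\lambda_{PF}$ by power iteration, which is a choice made for this sketch.

```python
RULES = {"C": "A", "A": "CCAUG", "U": "G", "G": "C"}  # maps from the text
ORDER = "CAUG"                                        # assumed letter ordering

# Entry (i, j) counts the letter ORDER[i] in the image of ORDER[j].
M = [[RULES[src].count(dst) for src in ORDER] for dst in ORDER]


def matmul(a, b):
    """Product of two square integer matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def primitivity_exponent(m):
    """Smallest k such that every entry of m^k is positive, or None if no
    such k exists up to Wielandt's bound n^2 - 2n + 2."""
    n = len(m)
    power = m
    for k in range(1, n * n - 2 * n + 3):
        if all(entry > 0 for row in power for entry in row):
            return k
        power = matmul(power, m)
    return None


def perron_eigenvalue(m, iterations=200):
    """Power-iteration estimate of the dominant eigenvalue."""
    n = len(m)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w) / sum(v)
        total = sum(w)
        v = [c / total for c in w]
    return lam


k = primitivity_exponent(M)
lam = perron_eigenvalue(M)
print(M)
print(k)                              # smallest exponent with M^k >> 0
print(lam, lam**3 - 2 * lam**2 - 1)   # eigenvalue and residual of the cubic
```

Irrationality follows from the rational root test: the only candidate rational roots of $x^{3} - 2 x^{2} - 1$ are $\pm 1$, and neither is a root.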

Mutation of a purine at position $-3$ with respect to the AUG codon is known to be associated with disease, for example a type of thalassemia due to faulty initiation of $\alpha$-globin translation [23]. In our approach, the mutation from rel4 to rel4’=CCCAUGGC leads to a substitution matrix $M'$ that is no longer primitive, so that the aperiodicity of the sequence is lost. However, the card seq of the associated $f_{p}$ group is still that of the free group $F_{3}$. No other substitution in the sequence rel4’ can be found to restore the aperiodicity.

3.2. Algebraic geometry of miRNAs

A microRNA (miRNA) is a small, single-stranded, non-coding RNA molecule containing approximately 22 nucleotides. miRNAs play crucial roles in RNA silencing and post-transcriptional regulation of gene expression by specifically targeting certain mRNAs for degradation and translational repression [26,27]. miRNA genes are typically transcribed by RNA polymerase II (Pol II), which binds to a promoter located near the DNA sequence, encoding what will become the hairpin loop of a pre-miRNA (for precursor-miRNA). The pre-miRNAs are approximately 70 nucleotides in length and fold into imperfect stem-loop structures.

Each miRNA is synthesized as a miRNA duplex comprising two strands (-5p and -3p). However, only one of the two strands is selectively incorporated into the RNA-induced silencing complex to act as a template for the transcript of a complementary mRNA [28,29]. For details about the miRNA sequences, we use the Mir database [30,31,32].

Plant miRNAs usually have near-perfect pairing with their mRNA targets, leading to gene repression through cleavage of the target transcripts. In contrast, animal miRNAs can recognize their target mRNAs using as few as 6 to 8 nucleotides (the seed region), which is not sufficient pairing to induce cleavage of the target mRNAs. A given miRNA may have hundreds of different mRNA targets, and a single target might be regulated by multiple miRNAs.

For previous results about how to define an $f_{p}$ group from the seed of a miRNA, the reader may consult [13] (Section 4.3).

Below, we focus on other examples.

miRNA hsa-mir-503

The slowest evolving miRNA gene in the human species (hsa) is hsa-mir-503 [31]. It regulates gene expression in various pathological processes of diseases, including carcinogenesis, angiogenesis, tissue fibrosis, and oxidative stress [33].

The seed region for mir-503-5p is seed1=AGCAGCGG. The corresponding group $f_{p} = \langle A, C, G \mid seed1 \rangle$ has the card seq of the free group $F_{2}$. For this group, the Groebner basis with parameters $(a, b, c, d) = (0, 0, 0, 0)$ is quite simple: $G_{\text{mir-503-5p}}(0, 0, 0, 0) = S^{(4 A_{1})}(x, y, z)$, which is the already mentioned Cayley cubic.

For $(a, b, c, d) = (1, 1, 0, 0)$, $G_{\text{mir-503-5p}}(1, 1, 0, 0) = - 3 x y z\, \kappa_{3}(x, y, z)$, where $\kappa_{3}(x, y, z)$ is the Fricke surface found in [34]. For $(a, b, c, d) = (1, 1, 1, 1)$, there are several more polynomials. One of them defines the Fricke surface $x y z + x^{2} + y^{2} + z^{2} - 2 x - y - 2$.

The considered seed region for mir-503-3p is GGGUAUU. The surfaces in the Groebner basis are very simple in this case, and no simple singularities exist within them.

miRNA hsa-mir-146a

Mir-146 is primarily involved in the regulation of inflammation and other processes functioning in the innate immune system. It plays a role in neuropathogenesis.

The considered seed region for hsa-mir-146a-5p is seed2=GAGAAC [31]. Again, the corresponding group $f_{p} = \langle A, C, G \mid seed2 \rangle$ has the card seq of the free group $F_{2}$.

The Groebner basis with parameters $(a, b, c, d) = (0, 0, 0, 0)$ is

$$G_{\text{hsa-146a-5p}}(0, 0, 0, 0) = (x z + y + 2)\,(y - z^{2} + 2)^{2}\,(x^{2} + z^{2} - 2 y - 4)\,S^{(3 A_{2})},$$

where $S^{(3 A_{2})} = z^{3} - x y - 2 y z - 2 x - 4 z$.

The Groebner basis with parameters $(a, b, c, d) = (1, 1, 1, 1)$ is of the form $G_{\text{hsa-146a-5p}}(1, 1, 1, 1) = DP^{4} \times S^{(2 A_{2})} \times \text{quadric} \times \text{curves}$, where $DP^{4}$ is a degree 4 Del Pezzo surface.

miRNAs and disease

As we found earlier, a potential disease is associated with $f_{p}$ groups whose character variety has a Groebner basis containing surfaces with isolated singularities, even though the $f_{p}$ group has the card seq of a free group [13] (Figure 6). This is the case for the latter two miRNAs. Additional examples can be found in [13] (Table 3).

Besides surfaces with isolated singularities, the Groebner basis may contain singular surfaces whose singularities are not simple. The $DP^{4}$ surface in $G_{\text{hsa-146a-5p}}(1, 1, 1, 1)$ is an example of such a singular surface. Further mathematical techniques are required to investigate these surfaces [35]. However, we will not discuss these methods in this paper.

4. Discussion

In this section, we summarize our paper by referring to the diagram in Figure 2. Given a short DNA/RNA sequence, rel, which represents a consensus sequence in a transcription factor, the seed of a miRNA, or a relevant sequence in mRNA recognition and processing, we construct a finitely generated group $f_{p}$. The card seq of this group, which encodes its architecture of subgroups, is computed (see Section 2.1). If the card seq of $f_{p}$ matches that of the free group $F_{r}$ (of rank $r$ equal to the number of distinct nt minus one), we proceed to path 4; otherwise, a potential disease could be in sight (path 3). After reaching path 4, the next step involves checking the aperiodicity of rel and the corresponding $f_{p}$ group, as described in Section 2.2. The final step is to examine the presence (or absence) of isolated singularities in the Groebner basis $G$ for the $SL_{2}(C)$ character variety associated with $f_{p}$, as outlined in Section 2.4. For a healthy sequence, the path concludes at 6, while a potential disease may be indicated if the path ends at 3, 7 or 8.

In Table 1, we provide several examples of paths. All three checks can be performed, even if paths 4 or 5 are not followed. For instance, the termination $\{7, 8\}$ signifies that the sequence fails both in being aperiodic and in being devoid of simple singularities.

For sequences with 4 nt (like the sequence of transcription factor DBX or the Shine-Dalgarno sequence rel3), it is difficult to conclude about the risk of a disease. The generic Groebner basis $G(x, y, z)$ always contains surfaces with isolated singularities such as $S^{(4 A_{1})}$ and $S^{(3 A_{1})}$, and there are four copies of them. The termination $\{6, 8\}$ applies in these cases.

Our approach is quite comprehensive and can be applied in numerous contexts beyond those we have considered thus far. It has the potential to impact the search for underlying causes of diseases and aid in the discovery of therapeutic strategies. The e-code, the processes that reveal and execute gene expression, has a sophisticated structure, which our mathematical approach aims to elucidate.

Author Contributions

Conceptualization, M.P., F.F., and K.I.; methodology, M.P., D.C., and R.A.; software, M.P.; validation, R.A., F.F., D.C., and M.M.A.; formal analysis, M.P. and M.M.A.; investigation, M.P., D.C., F.F., and M.M.A.; writing (original draft preparation), M.P.; writing (review and editing), M.P.; visualization, F.F. and R.A.; supervision, M.P. and K.I.; project administration, K.I.; funding acquisition, K.I. All authors have read and agreed to the published version of the manuscript.

Funding

Funding was obtained from Quantum Gravity Research in Los Angeles, CA.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Computational data are available from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gu, C.; Kim, G. B.; Kim, W. J.; Kim, H. U.; Lee, S. Y. Current status and applications of genome-scale metabolic models. Genome Biology 2019, 20, 121.
2. Romão, L. mRNA metabolism in health and disease. Biomedicines 2022, 10, 2262.
3. Peedicayil, J. Genome–environment interactions and psychiatric disorders. Biomedicines 2023, 11, 1209.
4. Scharf, S.; Ackerman, J.; Bender, L.; Wurzel, P.; Schäfer, H.; Hansmann, M. L.; Koch, I. Holistic view on the structure of immune response: Petri net model. Biomedicines 2023, 11, 452.
5. Marques, A. R.; Santos, J. X.; Martiniano, H.; Vilela, J.; Rasga, C.; Romão, L.; Vicente, A. M. Gene variants involved in nonsense-mediated mRNA decay suggest a role in autism spectrum disorder. Biomedicines 2022, 10, 665.
6. Wan, Y. C. E. Histone H2B Mutations in Cancer. Biomedicines 2021, 9, 694.
7. Fimmel, E.; Giannerini, S.; Gonzalez, D. L.; Strüngmann, L. Circular codes, symmetries and transformations. J. Math. Biol. 2014, 70, 1623–1644.
8. Planat, M.; Aschheim, R.; Amaral, M. M.; Fang, F.; Irwin, K. Complete quantum information in the DNA genetic code. Symmetry 2020, 12, 1993.
9. Sanchez, R.; Barreto, J. Genomic abelian finite groups. Available online.
10. Frappat, L.; Sciarrino, A.; Sorba, P. Crystalizing the Genetic Code. J. Biol. Phys. 2001, 27, 1–34.
11. Sanchez, R.; Mackenzie, S. A. On the thermodynamics of DNA methylation process. Scient. Rep. 2023, 13, 8914.
12. Bessonov, N.; Butuzova, O.; Minarsky, A.; Penner, R.; Soulé, C.; Tosenberger, A.; Morozova, D. Morphogenesis software based on epigenetic code concept. Comp. Struct. Biotech. J. 2019, 17, 1203–1216.
13. Planat, M.; Amaral, M. M.; Irwin, K. Algebraic morphology of DNA–RNA transcription and regulation. Symmetry 2023, 15, 770.
14. Schrödinger, E. What Is Life? The Physical Aspect of the Living Cell; Cambridge University Press: Cambridge, UK, 1944.
15. Planat, M.; Amaral, M. M.; Fang, F.; Chester, D.; Aschheim, R.; Irwin, K. DNA sequence and structure under the prism of group theory and algebraic surfaces. Int. J. Mol. Sci. 2022, 23, 13290.
16. Grothendieck, A. Esquisse d’un programme (1984). In Geometric Galois Actions I: Around Grothendieck’s Esquisse d’un Programme; Schneps and Lochak, Eds.; London Mathematical Society Lecture Note Series, vol. 242; Cambridge University Press, 1997; pp. 243–283. Available online: http://matematicas.unex.es/~navarro/res/esquisseeng.pdf.
17. TATA box. Available online: https://en.wikipedia.org/wiki/TATA_box (accessed on 1 September 2021).
18. Polyadenylation. Available online: https://en.wikipedia.org/wiki/Polyadenylation (accessed on 1 May 2023).
19. Planat, M.; Amaral, M. M.; Fang, F.; Chester, D.; Aschheim, R.; Irwin, K. Group theory of syntactical freedom in DNA transcription and genome decoding. Curr. Issues Mol. Biol. 2022, 44, 1417–1433.
20. Goldman, W. M. Trace coordinates on Fricke spaces of some simple hyperbolic surfaces. Eur. Math. Soc. 2009, 13, 611–684.
21. Ashley, C.; Burelle, J. P.; Lawton, S. Rank 1 character varieties of finitely presented groups. Geom. Dedicata 2018, 192, 1–19.
22. Jacob, W. F.; Santer, M.; Dahlberg, A. E. A single-base change in the Shine-Dalgarno region of 16S rRNA of Escherichia coli affects translation of many proteins. Proc. Natl. Acad. Sci. USA 1987, 84, 4757–4761.
23. Kozak, M. The scanning model for translation: an update. J. Cell Biology 1989, 108, 229–241.
24. Kozak consensus sequence. Available online: https://en.wikipedia.org/wiki/Kozak_consensus_sequence (accessed on 1 January 2023).
25. Rittaud, B. On the average growth of random Fibonacci sequences. J. Int. Seq. 2007, 10.
26. microRNA. Available online: https://en.wikipedia.org/wiki/MicroRNA (accessed on 1 September 2022).
27. Fang, Y.; Pan, X.; Shen, H. B. Recent deep learning methodology development for RNA–RNA interaction prediction. Symmetry 2022, 14, 1302.
28. Medley, C. M.; Panzade, G.; Zinovyeva, A. Y. MicroRNA strand selection: unwinding the rules. WIREs RNA 2021, 12, e1627.
29. Dawson, O.; Piccinini, A. M. miR-155-3p: processing by-product or rising star in immunity and cancer? Open Biol. 2022, 12, 220070.
30. Kozomara, A.; Birgaonu, M.; Griffiths-Jones, S. miRBase: from microRNA sequences to function. Nucl. Acids Res. 2019, 47, D155–D162.
31. miRBase: the microRNA database. Available online: https://www.mirbase.org/ (accessed on 1 November 2022).
32. Fromm, B.; Billipp, T.; Peck, L. E.; Johansen, M.; Tarver, J. E.; King, B. L.; Newcomb, J. M.; Sempere, L. F.; Flatmark, K.; Hovig, E.; Peterson, K. J. A uniform system for the annotation of human microRNA genes and the evolution of the human microRNAome. Annu. Rev. Genet. 2015, 49, 213–242.
33. He, Y.; Cai, Y.; Paii, P. M.; Ren, X.; Xia, Z. The Causes and Consequences of miR-503 Dysregulation and Its Impact on Cardiovascular Disease and Cancer. Front. Pharmacol. 2021, 12, 629611.
34. Planat, M.; Chester, D.; Amaral, M.; Irwin, K. Fricke topological qubits. Quant. Rep. 2022, 4, 523–532.
35. Planat, M.; Amaral, M. M.; Chester, D.; Irwin, K. SL(2,C) scheme processing of singularities in quantum computing and genetics. Axioms 2023, 12, 233.
36. Mir-132. Available online: https://en.wikipedia.org/wiki/MiR-132 (accessed on 1 June 2023).
37. Sonkoly, E.; Stahle, M.; Pivarsci, A. MicroRNAs and immunity: novel players in the regulation of normal immune function and inflammation. Sem. in Cancer Biol. 2008, 18, 131–140.
38. Micro-ARN 7. Available online: https://fr.wikipedia.org/wiki/Micro-ARN_7 (accessed on 1 June 2023).

Figure 1. Top: The degree 2 Del Pezzo surface within $G_{TATA}$. Bottom: The degree 3 Del Pezzo surface $S^{(A_{1})}$ within $G_{rel1}$.

Figure 2. A diagram illustrating the main results discussed in the text. For example, for the transcription factor of the gene EGR1, rel = GCGTGGGCG [19], the path is $1 \to 2 \to 4 \to 5 \to 6$, showing no risk of disease. But for the transcription factor of the gene DBX (see Section 2.2 and Section 2.4), rel = TTTATTA, the path is $1 \to 2 \to 4 \to 5 \to 8$, meaning a potential disease (see Table 1).

Table 1. A few possible paths in the diagram of Figure 2, terminating at 6 (healthy) or at 3, 7 or 8 (potential disease). The selected examples fall into three groups: transcription factors (group 1), regulating elements in introns (group 2) and miRNAs (group 3). Details are given in the text; otherwise, a reference is provided.

Sequence	rel	path
EGR1 [19]	GCGTGGGCG	$1 \to 2 \to 4 \to 5 \to 6$
FOS [19]	TGAGTCA	$1 \to 2 \to 4 \to 5 \to \{6, 8\}$
Nanog [19]	TAATGG	$1 \to 2 \to \{7, 8\}$
DBX	TTTATTA	$1 \to 2 \to 4 \to 5 \to 8$
TATA	TATAAAA	$1 \to 2 \to 3 \to (7, 8)$
poly(A) (rel1)	AAUAAA	$1 \to 2 \to 4 \to \{7, 8\}$
poly(A) (rel2)	UGUAA	$1 \to 2 \to 4 \to \{7, 8\}$
Shine-Dalgarno (rel3)	AGGAGGU	$1 \to 2 \to 4 \to 5 \to 8$
Kozak (rel4)	ACCAUGGC	$1 \to 2 \to 4 \to 5 \to \{6, 8\}$
Kozak (rel4’)	CCCAUGGC	$1 \to 2 \to 7$
hsa-mir-132-5p [36]	CCGUGGC	$1 \to 2 \to 4 \to 5 \to 6$
hsa-mir-503-5p (seed1) [33]	AGCAGCGG	$1 \to 2 \to 5 \to 8$
hsa-mir-146a-5p (seed2) [37]	GAGAAC	$1 \to 2 \to \{7, 8\}$
hsa-mir-7-5p [38]	GGAAGA	$1 \to 2 \to \{3, 7, 8\}$
hsa-mir-7-5p	GGAAGAC	$1 \to 2 \to 4 \to 5 \to 6$
hsa-mir-7-3p	AACAAAU	$1 \to 2 \to 7$
hsa-mir-155-3p [29,37]	UCCUAC	$1 \to 2 \to 4 \to \{7, 8\}$
hsa-mir-155-3p	UCCUACA	$1 \to 2 \to \{3, 7\}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
