1. Introduction
Assembly theory (AT), formulated in 2017, introduced the concept of an initial pool [1].
Definition 1. We call a set that contains different basic symbols c the initial assembly pool.
The reader will find numerous results on AT in refs. [1,2,3,4,5,6,7,8,9,10], for example. Here, we extend the results of our previous study [9] concerning bitstrings to strings of any natural radix b. We consider the formation of strings of length N containing symbols from the initial assembly pool within the AT framework in consecutive assembly steps from basic symbols c and strings assembled in previous steps.
In fact, any embodiment of AT, with basic symbols representing LEGO® blocks, chemical bonds, graphs, monomers, etc., assembled in any n-dimensional space [11], corresponds to the string version of AT. This is because an assembly step in AT always consists of joining exactly two parts, which can be thought of as the left and right fragments of the newly formed string. Put simply, AT explains and quantifies selection and evolution [7], but it is through the word (aka string or message), in particular a nucleotide sequence in the case of $b = 4$, that all AT things come into existence [12].
Definition 2. We call a set that contains basic symbols and strings assembled in previous steps the working assembly pool.
An assembly step
s may consist of
where
,
, and
. We note that the joining operator "∘" does not, in general, commute. Using Definitions 1 and 2, the assembly index (ASI) of a string is the minimal achievable difference between the cardinalities of the working and initial assembly pools leading to this string, since the cardinality of the working assembly pool increases by one at each assembly step. Therefore, the working assembly pool (Definition 2) cannot be identified with the initial assembly pool (Definition 1); the initial assembly pool must not contain strings of basic symbols (see Section H).
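The definition of the ASI can be checked on short strings by exhaustive search. The following Python sketch is an illustration only, not the method used in this study. It assumes that strings are ordinary Python `str` objects, that joining is concatenation, and that only substrings of the target need to be assembled, since no other string can end up inside the target. The name `assembly_index` is hypothetical.

```python
def assembly_index(target: str) -> int:
    """Smallest number of joining steps that assembles `target` from its symbols."""
    n = len(target)
    if n <= 1:
        return 0  # a basic symbol is already in the initial assembly pool
    # Only substrings of the target can take part in a pathway that ends in it.
    useful = {target[i:j] for i in range(n) for j in range(i + 2, n + 1)}

    def reachable(pool: frozenset, steps_left: int) -> bool:
        """True if `target` can be assembled from `pool` in at most `steps_left` steps."""
        if target in pool:
            return True
        if steps_left == 0:
            return False
        for left in pool:
            for right in pool:
                joined = left + right  # joining is ordered: left ∘ right
                if joined in useful and joined not in pool:
                    # Each step enlarges the working assembly pool by one string.
                    if reachable(pool | {joined}, steps_left - 1):
                        return True
        return False

    initial_pool = frozenset(target)  # basic symbols occurring in the target
    # Appending one symbol at a time always works, so n - 1 is an upper bound.
    for limit in range(1, n):
        if reachable(initial_pool, limit):
            return limit
    return n - 1


if __name__ == "__main__":
    for s in ("00", "000", "0101", "0011"):
        print(s, assembly_index(s))
```

The search is exponential in the string length, so it is meant only for strings of about ten symbols or fewer.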
2. Results
Theorems 1 and 2 were already stated in our previous study [9] for $b = 2$. We restate them here for clarity.
Theorem 1. A quadruplet is the shortest string that allows for more than one ASI for all b.
Proof. $N = 2$ provides $b^2$ available doublets with unit ASI. $N = 3$ provides $b^3$ available triplets with ASI equal to two. Only $N = 4$ provides $b^4$ quadruplets that include $b$ clear quadruplets (one symbol repeated four times) and $b^2 - b$ quadruplets made of a mixed doublet joined to itself, all with ASI equal to two, while the ASI of the remaining $b^4 - b^2$ quadruplets is three. For example, to assemble the quadruplet [0101], we need to assemble the doublet [01] in the first step and reuse it from the working assembly pool, while there is nothing available to reuse in the case of the quadruplet [0011]. □
Where the symbol value can be arbitrary, we write *, assuming that it is the same within the string. If we allow for a second symbol value different from *, we write ★. Furthermore, we consider the degenerate case of just one basic symbol ($b = 1$).
Theorem 2. The smallest ASI as a function of N corresponds to the shortest addition chain for N (OEIS A003313) for all b.
Proof. Strings of length $N = 2^s$ can be formed in subsequent steps s by joining the longest string assembled so far with itself until N is reached. Therefore, if $N = 2^s$, then the minimum ASI is $s = \log_2 N$. Only $b^2$ strings have such ASI if $N = 2^s$, including respectively b and $b^2 - b$ strings built by repeatedly doubling a clear or a mixed doublet, and the assembly pathway of each of the strings (2) is unique. At each assembly step, the length of the string doubles.
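An addition chain for N having the shortest length s (commonly denoted as $l(N)$) is defined as a sequence $1 = a_0 < a_1 < \dots < a_s = N$ of integers such that $\forall j \geq 1$, $a_j = a_k + a_l$ for $0 \leq l \leq k \leq j - 1$. The first step in creating an addition chain for N is always $a_1 = 1 + 1 = 2$, and this corresponds to assembling a clear or a mixed doublet from the initial assembly pool. Thus, the lower bound $s \geq \log_2 N$ for the length s of the addition chain for N is achieved for $N = 2^s$ by the strings (2).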
The second step in creating an addition chain can be $a_2 = 1 + 2 = 3$ or $a_2 = 2 + 2 = 4$. Thus, finding the shortest addition chain for N corresponds to finding the ASI of a string containing basic symbols and/or doublets and/or triplets containing these doublets, since by Theorem 1 only strings shorter than a quadruplet have an ASI determined by their length alone. □
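The correspondence in Theorem 2 can be explored numerically by computing the shortest addition chain length directly. The sketch below uses iterative deepening over strictly increasing chains, which is a standard search technique and not a procedure taken from this study. The function name `shortest_addition_chain_length` is hypothetical.

```python
def shortest_addition_chain_length(n: int) -> int:
    """Length l(n) of a shortest addition chain 1 = a_0 < a_1 < ... < a_s = n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n == 1:
        return 0

    def extend(chain: list, limit: int) -> bool:
        """True if `chain` can be extended to reach n using at most `limit` steps in total."""
        last = chain[-1]
        if last == n:
            return True
        steps_used = len(chain) - 1
        if steps_used == limit:
            return False
        # Doubling is the fastest possible growth; prune if even that falls short.
        if last << (limit - steps_used) < n:
            return False
        # Every new element is a sum of two (not necessarily distinct) earlier ones.
        for i in range(len(chain) - 1, -1, -1):
            for j in range(i, -1, -1):
                nxt = chain[i] + chain[j]
                if last < nxt <= n and extend(chain + [nxt], limit):
                    return True
        return False

    limit = 1
    while not extend([1], limit):
        limit += 1
    return limit


if __name__ == "__main__":
    print([shortest_addition_chain_length(n) for n in range(1, 17)])
```

For N = 2^s the returned value equals s, in line with the doubling pathway described in the proof.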
At least some of the following seven simple theorems are useful for further consideration.
Theorem 3. The strings can contain at most two symbols if . Other minimal ASI strings of length can contain at most three symbols if .
Proof. Minimal ASI strings of length $N = 2^s$ are formed by joining the newly assembled string to itself, where a clear or mixed doublet is created in the first step. Minimal ASI strings of other lengths admit a doublet and a triplet containing this doublet and an additional basic symbol.
To formally prove the first part, we can also use mathematical induction on the assembly step s. If $s = 1$, then the minimal ASI strings of length $2^s = 2$ are doublets of the form $[c_1 c_2]$, where $c_1$ and $c_2$ are basic symbols. If $c_1 = c_2$, the string contains one distinct symbol, and if $c_1 \neq c_2$, the string contains two distinct symbols. In both cases, the number of distinct symbols does not exceed two. Now assume that for some $s \geq 1$, all minimal ASI strings of length $2^s$ contain at most two distinct symbols. We must show that a minimal ASI string of length $2^{s+1}$ also contains at most two distinct symbols. Such a string is constructed by joining a minimal ASI string of length $2^s$ with itself. By the inductive hypothesis, the string of length $2^s$ contains at most two distinct symbols. Therefore, its concatenation with itself also contains at most two distinct symbols. By induction, for all $s \geq 1$, the minimal ASI string of length $2^s$ contains at most two distinct symbols.
We will now show that other minimal ASI strings of length can contain at most three distinct symbols if . We provide the construction of minimal ASI strings with three symbols. In the first step , we create a doublet where and . Next, we combine the existing doublet with a new symbol where . This forms a triplet , introducing a third distinct symbol and further increasing the ASI by 1. We continue assembling by joining the longest string formed so far with itself or with previously formed strings, maintaining the minimal increase in ASI.
Assume a contrario that there exists a minimal ASI string of length that contains four or more distinct symbols. To incorporate a fourth symbol, at least one additional assembly step is required beyond what is needed for the three symbols. This additional step implies an increase in ASI, which contradicts the minimality of . Thus, Theorem 3 is proven. □
Theorem 4. A string containing the same three doublets has the same ASI as a string containing two pairs of the same doublets, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. Without loss of generality (w.l.o.g.), consider the following two strings of the same length
with
and the same distributions of other repetitions (if there are any other repetitions)
where
. Creating a doublet takes one assembly step. Each appending of a doublet to an assembled string counts as another assembly step. Hence, in a general case (i.e., for strings
,
containing also other symbols), the string
requires six additional assembly steps, the same as the string
, which completes the proof. □
Theorem 5. A string containing the same three doublets has the same ASI as a string containing the same two triplets, provided that both strings have the same distributions of other repetitions.
Proof. W.l.o.g. consider the following two strings of the same length
with the same distributions of other repetitions
Creating a triplet takes two assembly steps. Hence, in the general case, the string
requires four additional assembly steps, the same as the string
, which completes the proof. □
Theorem 6. A string containing the same two triplets has the same ASI as a string containing two pairs of the same doublets, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. The proof stems from Theorems 4 and 5. □
Theorem 7. A string containing the same two quadruplets of the minimum ASI has the same ASI as a string containing the same three triplets, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. W.l.o.g. consider the following two strings of the same length
with the same distributions of other repetitions
Creating such a quadruplet takes two assembly steps. Hence, in a general case, the string
requires five additional assembly steps, the same as the string
, which completes the proof. □
Theorem 8. A string containing the same two quadruplets of the maximum ASI has the same ASI as a string containing a doublet and the same two triplets based on this doublet, provided that both strings have the same distributions of other repetitions.
Proof. W.l.o.g. consider the following two strings of the same length
with the same distributions of other repetitions
Creating such a quadruplet takes three assembly steps. Hence, in a general case, the string
requires five additional assembly steps, the same as the string
, which completes the proof. □
Theorem 9. A string containing the same two doublets and the same two triplets not based on this doublet has the same ASI as a string containing a doublet and the same two triplets based on this doublet, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. W.l.o.g. consider the following two strings of the same length
with the same distributions of other repetitions
where
. In a general case, the string
requires seven additional assembly steps, the same as the string
, which completes the proof. □
In general, Theorems 1-9 show that
k copies of a doublet in a string decrease the ASI of this string at least by ;
k copies of a triplet in a string decrease the ASI of this string at least by ;
k copies of a minimum ASI quadruplet in a string decrease the ASI of this string at least by ;
k copies of a maximum ASI quadruplet in a string decrease the ASI of this string at least by .
Here, the phrase "at least" indicates that other repetitions, such as doublets forming multiple quadruplets, can further decrease the ASI of the string.
Another quantity related to the string assembly is the assembly depth, defined [13] as
where
and
are the assembly depths of two parts of this string that were joined in step
s, where
. If several assembly pathways with different assembly depths lead to a string, which happens if at least two assembly steps can occur independently, ref. [13] assigns the minimum d value to the string. Here, we relax this assumption. Any string has a unique assembly index but can have different assembly depths if its ASI is not minimal.
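The dependence of the assembly depth on the chosen pathway can be illustrated with a short Python sketch. It assumes the recursive rule d = max(d_left, d_right) + 1 with d = 0 for basic symbols; this rule is stated here as an assumption and should be checked against the definition in ref. [13]. The function name `pathway_depth` and the representation of a pathway as a list of (left, right) pairs are hypothetical.

```python
def pathway_depth(steps, basic_symbols):
    """Return (assembled string, its depth) for a pathway given as (left, right) joins.

    Assumed rule: basic symbols have depth 0 and a joined string has depth
    max(depth(left), depth(right)) + 1.
    """
    depth = {symbol: 0 for symbol in basic_symbols}
    assembled = None
    for left, right in steps:
        if left not in depth or right not in depth:
            raise ValueError(f"{left!r} or {right!r} is not in the working pool yet")
        assembled = left + right
        candidate = max(depth[left], depth[right]) + 1
        # Keep the smaller depth if the same string is assembled twice in one pathway.
        depth[assembled] = min(candidate, depth.get(assembled, candidate))
    if assembled is None:
        raise ValueError("a pathway needs at least one assembly step")
    return assembled, depth[assembled]


if __name__ == "__main__":
    # Two three-step pathways to the same quadruplet.
    sequential = [("0", "0"), ("00", "0"), ("000", "1")]
    independent = [("0", "0"), ("0", "1"), ("00", "01")]
    print(pathway_depth(sequential, "01"))
    print(pathway_depth(independent, "01"))
```

Both pathways use three assembly steps, but the second assembles its two doublets independently and therefore ends at a smaller depth under the assumed rule.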
Theorem 10.
The assembly depth of a string has a value between the minimum ASI for the length N of this string and the ASI of this string, that is
Proof. For strings having the minimum ASI, the assembly pathway is unique, and the assembly depth increases stepwise along with the ASI. Hence, a string with
can be constructed with the assembly depth
only. for all
b. A non-
string can be assembled in many ways with different depths between
and the ASI of this non-
string. For example, the string
with
can be assembled with the assembly depths
. Similarly,
with
can be assembled in six steps in a range of depths
, e.g., as
Similarly,
with
can be assembled in a range of depths
□
The seven-bit string is the longest bitstring that can have the maximum ASI $N - 1 = 6$. There are four such bitstrings containing two clear triplets and the starting bit at the end or the ending bit at the start, that is
[0001110], [1110001], [1000111], [0111000],
and their lengths cannot be increased without a repetition of a doublet, which keeps the ASI at the same level, 6.
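The seven-bit observation can be reproduced by enumeration. The sketch below relies on an assumption that is not stated in this form in the text: a string of length N has the ASI N - 1 exactly when no doublet occurs in it twice without overlap, so that nothing can be reused. The function names are hypothetical.

```python
from itertools import product


def has_disjoint_repeated_doublet(s: str) -> bool:
    """True if some doublet occurs at two non-overlapping positions of `s`."""
    first_seen = {}
    for i in range(len(s) - 1):
        doublet = s[i:i + 2]
        # Occurrences starting at least two positions apart do not overlap.
        if doublet in first_seen and i - first_seen[doublet] >= 2:
            return True
        first_seen.setdefault(doublet, i)
    return False


def strings_without_reuse(length: int, symbols: str = "01") -> list:
    """All strings of the given length in which no doublet can be reused."""
    candidates = ("".join(t) for t in product(symbols, repeat=length))
    return [s for s in candidates if not has_disjoint_repeated_doublet(s)]


if __name__ == "__main__":
    print(strings_without_reuse(7))
    print(strings_without_reuse(8))
```

Under this assumption the enumeration yields four strings of length seven and none of length eight for two symbols.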
This observation and Theorem 2 motivated us to develop a general method to construct the longest possible string having the ASI , as a function of the radix b. We denote the length of this string by or , and we call this string a string.
After some trial and error, we arrived at two stable methods (cf. Appendices A and B, Methods A and B). In both methods, we start with an initial balanced string of length $3b$ containing b clear triplets ordered as
The doublets that can be inserted into the initial string (14) can be arranged in a $b \times b$ matrix
where the crossed-out entries on the diagonal cannot be reused, as they would create repetitions in this string. If we assume that we shall not insert doublets between the clear triplets of the string (14), we can also cross out the entries in the first superdiagonal of the matrix (15). The strings of odd lengths generated by these general methods are not only the longest but also the most balanced. This can be stated in the following theorem.
Theorem 11 (
string).
The longest length of a string that has the ASI of is given by
(OEIS A353887) and this string is nearly balanced, that is
where is the number of occurrences of all but one symbol within the string, and its Shannon entropy is
The proof of Theorem 11 is given in Appendix D. A
string must contain all clear triplets and all doublets. Although the case for
is degenerate, as no information can be conveyed using only one symbol (
in this case), nothing precludes the assembly of such defunct strings and the formula (
16) yields the correct result; the string
is the longest string with
by Theorem 1, as for
the upper and the lower bound on the ASI are the same,
(OEIS
A003313). This is the only case where the maximum ASI is not monotonically nondecreasing.
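The notion of a nearly balanced string and its Shannon entropy can be illustrated numerically. The Python sketch below computes the entropy of the symbol distribution of a string, assuming entropy measured in bits (logarithm to base 2); it does not reproduce the closed-form expression of Theorem 11. The function name `shannon_entropy` is hypothetical, and the bitstring 0001110 is one of the seven-bit strings described earlier (two clear triplets followed by the starting bit).

```python
from collections import Counter
from math import log2


def shannon_entropy(s: str) -> float:
    """Shannon entropy, in bits, of the distribution of symbols in a non-empty string."""
    n = len(s)
    counts = Counter(s)  # number of occurrences of each symbol
    return -sum((k / n) * log2(k / n) for k in counts.values())


if __name__ == "__main__":
    s = "0001110"
    print(Counter(s))          # four zeros and three ones: nearly balanced
    print(shannon_entropy(s))  # slightly below 1 bit
    print(shannon_entropy("000"))  # a single symbol carries no information
```

A perfectly balanced bitstring would reach 1 bit, while a string over a single symbol has zero entropy, which matches the remark that no information can be conveyed with only one symbol.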
Subsequently, we considered other strings for with the maximum ASI .
Theorem 12 (
string).
For all the longest length of a string that has the ASI of is given by or equivalently by
and
where is the number of occurrences of all but two symbols within the string, and its Shannon entropy is
The entropy for .
The proof of Theorem 12 is given in Appendix F.
string must contain only two copies of a doublet. Hence, a clear quadruplet (
) and a pattern binding different symbols adjoining this quadruplet, such as
,
, etc. must be present, so that any
string contains only one pair of repeated doublets
,
, or
. For example, for
, sixteen bitstrings
(an additional eight are given by swapping 0 with 1) have the ASI
, where the underlined string (
22) is the one that is created for
in
Appendix F.
Theorem 13 (
string).
For all the longest length of a string that has the ASI of is given by . or equivalently by
and
where is the number of occurrences of all but three symbols within the string, and its Shannon entropy is
The entropy for .
The proof of Theorem 13 is given in Appendix G.
string must contain only three copies of a doublet, two copies of a triplet, or two pairs of different doublets.
Theorem 14 (
string).
For all the longest length of a string that has the ASI of is given by . or equivalently by
and
where is the number of occurrences of all but four symbols within the string, and its Shannon entropy is
The entropy for .
Proof. W.l.o.g. an exemplary string of this kind is
where the starting quadruplet
provides only one doublet of the triplet
that is present in this string if it is generated by Method A (Appendix A) or Method B (Appendix B). An insertion of another symbol into the string (29) at any position will maintain or even decrease the ASI of this newly formed string. □
In general, the strings of Theorems 12-14 owe their properties to the following distributions of symbols
The bounds of Theorems 12-14 are illustrated in Figure 1 and listed in Table 1, which also lists the length
of a string that has exactly two copies of all doublets and no repeated triplets (cf. Appendix C). We conjecture that the length
, as well as the lack of lengths
for
may be related to the bounds
.