1. Introduction
Assembly theory (AT), formulated in 2017, introduced the concept of an initial pool [1].
Definition 1. We call a set of distinct basic symbols c the initial assembly pool.
The reader will find numerous results on AT in refs. [1,2,3,4,5,6,7,8,9,10], for example. Here, we extend the results of our previous study [9] concerning bitstrings to strings of any natural radix b. We consider the formation of strings of length N containing symbols from the initial assembly pool within the AT framework in consecutive assembly steps from basic symbols c and strings assembled in previous steps.
In fact, any embodiment of AT, with basic symbols representing LEGO® blocks, chemical bonds, graphs, monomers, etc. assembled in any n-dimensional space ( ) [11] corresponds to the string AT version. This is because in AT an assembly step always consists of joining exactly two parts, which can be thought of as the left and right fragments of the newly formed string. Put simply, AT explains and quantifies selection and evolution [7], but it is through the word (aka string or message), in particular a nucleotide sequence in the case of , that all AT things come into existence [12].
Definition 2. We call a set that contains basic symbols and strings assembled in previous steps the working assembly pool.
An assembly step
s may consist of
where
,
, and
. We note that the joining operator "∘", in general, does not commute. By Definitions 1 and 2, the assembly index (ASI) of a string is the minimal achievable difference between the cardinalities of the working and initial assembly pools leading to this string, since each assembly step increases the cardinality of the working assembly pool by one. Therefore, the working assembly pool (Definition 2) cannot be identified with the initial assembly pool (Definition 1); the initial assembly pool must not contain strings of basic symbols (see Appendix G).
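The definition of the ASI as the minimal growth of the working assembly pool can be illustrated by an exhaustive search. The following Python sketch is only an illustration and not the method used in this study; the function name `assembly_index` and the pruning rule (an assembled string is kept only if it occurs inside the target, since any other string cannot contribute to a minimal pathway) are assumptions made here. Each state is the set of strings assembled so far, so its size equals the number of assembly steps taken.

```python
from itertools import product


def assembly_index(target: str) -> int:
    """Brute-force ASI: fewest joins needed to assemble `target` (hypothetical helper)."""
    n = len(target)
    if n <= 1:
        return 0  # a basic symbol is already in the initial assembly pool
    basics = frozenset(target)  # initial assembly pool restricted to the symbols used
    # Assumption: only substrings of the target are useful intermediates.
    useful = {target[i:j] for i in range(n) for j in range(i + 2, n + 1)}

    def reachable(limit: int) -> bool:
        """Can the target be assembled in at most `limit` steps?"""
        seen = set()
        stack = [frozenset()]  # strings assembled so far
        while stack:
            assembled = stack.pop()
            if target in assembled:
                return True
            if len(assembled) == limit or assembled in seen:
                continue
            seen.add(assembled)
            pool = basics | assembled  # working assembly pool
            for left, right in product(pool, repeat=2):
                new = left + right  # joining is not commutative, so order matters
                if new in useful and new not in assembled:
                    stack.append(assembled | {new})
        return False

    steps = 1
    while not reachable(steps):
        steps += 1
    return steps
```

For instance, `assembly_index("0101")` gives 2, because the doublet 01 is reused, whereas `assembly_index("0001")` gives 3. The search is exponential in the string length and is meant for short strings only.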
2. Results
Theorems 1 and 2 were already stated in our previous study [9] for $b = 2$. We restate them here for clarity.
Theorem 1. A quadruplet is the shortest string that allows for more than one ASI for all b.
Proof.
$N = 2$ provides available doublets with unit ASI. $N = 3$ provides available triplets with ASI equal to two. Only $N = 4$ provides quadruplets that include b quadruplets and quadruplets with ASI equal to two, while the ASI of the remaining quadruplets is three. For example, to assemble the quadruplet , we need to assemble the doublet and reuse it from the first step pool , while there is nothing available to reuse, in the case of the quadruplet . □
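The counting behind Theorem 1 can be checked by enumeration. The sketch below is an illustration under one assumption read off the proof: a quadruplet has an ASI of two exactly when its second half repeats its first half, so that the doublet assembled in the first step can be reused, and an ASI of three otherwise. The names `quadruplet_asi` and `count_quadruplets` are hypothetical.

```python
from itertools import product


def quadruplet_asi(q) -> int:
    """ASI of a length-4 string: 2 if the first doublet can be reused, else 3."""
    return 2 if q[:2] == q[2:] else 3


def count_quadruplets(b: int) -> dict:
    """Classify all b**4 quadruplets over b symbols by ASI and doublet type."""
    counts = {"clear_asi2": 0, "mixed_asi2": 0, "asi3": 0}
    for q in product(range(b), repeat=4):
        if quadruplet_asi(q) == 3:
            counts["asi3"] += 1
        elif q[0] == q[1]:
            counts["clear_asi2"] += 1  # one symbol repeated four times
        else:
            counts["mixed_asi2"] += 1  # two different symbols, doublet repeated
    return counts
```

For `b = 2` this yields 2 clear and 2 mixed quadruplets with an ASI of two, and 12 quadruplets with an ASI of three; for `b = 3` the counts are 3, 6, and 72.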
Where the symbol value can be arbitrary, we write *, assuming that it is the same within the string. If we allow for a second symbol value different from *, we write ★. Furthermore, we consider the degenerate case of $b = 1$.
Theorem 2. The smallest ASI as a function of N corresponds to the shortest addition chain for N (OEIS A003313) for all b.
Proof. Strings for which $N = 2^s$ can be formed in subsequent steps s by joining the longest string assembled so far with itself until N is reached. Therefore, if , then . Only strings have such ASI if , including respectively b and strings
and the assembly pathway of each of the strings (2) is unique. At each assembly step, its length doubles.
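An addition chain for N having the shortest length s (commonly denoted as $l(N)$) is defined as a sequence $1 = a_0 < a_1 < \dots < a_s = N$ of integers such that, for every $j \geq 1$, $a_j = a_k + a_l$ for some $0 \leq l \leq k < j$. The first step in creating an addition chain for N is always $a_1 = a_0 + a_0 = 2$ and this corresponds to assembling a doublet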
or from the initial assembly pool . Thus, the lower bound for s of the addition chain for N, $s \geq \log_2 N$, is achieved for $N = 2^s$ by strings (2).
The second step in creating an addition chain can be or . Thus, finding the shortest addition chain for N corresponds to finding the ASI of a string containing basic symbols and/or doublets and/or triplets containing these doublets for since due to Theorem 1 only they provide the same assembly indices . □
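The correspondence in Theorem 2 can be explored numerically by computing the length of the shortest addition chain for small N. The following Python sketch uses iterative deepening over strictly increasing chains; it is one standard way to obtain these values and is not taken from this study, and the function name `addition_chain_length` is hypothetical.

```python
def addition_chain_length(n: int) -> int:
    """Length of the shortest addition chain for n (cf. OEIS A003313)."""
    if n < 1:
        raise ValueError("n must be a positive integer")

    def extend(chain, limit):
        last = chain[-1]
        if last == n:
            return True
        steps_left = limit - (len(chain) - 1)
        # Even repeated doubling cannot reach n in the remaining steps.
        if steps_left == 0 or (last << steps_left) < n:
            return False
        # Try sums of any two earlier elements, largest candidates first.
        for i in range(len(chain) - 1, -1, -1):
            for j in range(i, -1, -1):
                nxt = chain[i] + chain[j]
                if last < nxt <= n and extend(chain + [nxt], limit):
                    return True
        return False

    limit = 0
    while not extend([1], limit):
        limit += 1
    return limit
```

For example, `[addition_chain_length(n) for n in range(1, 9)]` evaluates to `[0, 1, 2, 2, 3, 3, 4, 3]`, where the values for N = 2, 4, 8 reflect the doubling of the string length at each assembly step.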
At least some of the following six simple theorems are useful for further consideration.
Theorem 3. The strings can contain at most two symbols if . Other minimal ASI strings of length can contain at most three symbols if .
Proof. Minimal ASI strings of length are formed by joining the newly assembled string to itself, where a clear or mixed doublet is created in the first step. Minimal ASI strings of other lengths admit a doublet and a triplet containing this doublet and an additional basic symbol.
To formally prove the first part, we can also use mathematical induction on the assembly step s. If , then the minimal strings are doublets of the form , where . If , the string contains one distinct symbol, and if , the string contains two distinct symbols. In both cases, the number of distinct symbols does not exceed two. Now assume that for some , all minimal strings contain at most two distinct symbols. We must show that also contains at most two distinct symbols. Consider constructing by joining two identical minimal strings with each other. By the inductive hypothesis, each contains at most two distinct symbols. Therefore, their concatenation also contains at most two distinct symbols. By induction, for all , the minimal string contains at most two distinct symbols.
We will now show that other minimal ASI strings of length can contain at most three distinct symbols if . We provide the construction of minimal ASI strings with three symbols. In the first step , we create a doublet where and . Next, we combine the existing doublet with a new symbol where . This forms a triplet , introducing a third distinct symbol and further increasing the ASI by 1. We continue assembling by joining the longest string formed so far with itself or with previously formed strings, maintaining the minimal increase in ASI.
Assume a contrario that there exists a minimal ASI string of length that contains four or more distinct symbols. To incorporate a fourth symbol, at least one additional assembly step is required beyond what is needed for the three symbols. This additional step implies an increase in ASI, which contradicts the minimality of . Thus, Theorem 3 is proven. □
Theorem 4. A string containing the same three doublets has the same ASI as a string containing two pairs of the same doublets, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. Without loss of generality (w.l.o.g.), consider the following two strings of the same length
with
and the same distributions of other repetitions (if there are any other repetitions)
where
. Creating a doublet takes one assembly step. Each appending of a doublet to an assembled string counts as another assembly step. Hence, in a general case (i.e., for strings
,
containing also other symbols), the string
requires six additional assembly steps, the same as the string
, which completes the proof. □
Theorem 5. A string containing the same three doublets has the same ASI as a string containing the same two triplets, provided that both strings have the same distributions of other repetitions.
Proof. W.l.o.g. consider the following two strings of the same length
with the same distributions of other repetitions
Creating a triplet takes two assembly steps. Hence, in the general case, the string requires four additional assembly steps, the same as the string , which completes the proof. □
Theorem 6. A string containing the same two quadruplets of the minimum ASI has the same ASI as a string containing the same three triplets, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. W.l.o.g. consider the following two strings of the same length
with the same distributions of other repetitions
Creating such a quadruplet takes two assembly steps. Hence, in a general case, the string requires five additional assembly steps, the same as the string , which completes the proof. □
Theorem 7. A string containing the same two quadruplets of the maximum ASI has the same ASI as a string containing a doublet and the same two triplets based on this doublet, provided that both strings have the same distributions of other repetitions.
Proof. W.l.o.g. consider the following two strings of the same length
with the same distributions of other repetitions
Creating such a quadruplet takes three assembly steps. Hence, in a general case, the string requires five additional assembly steps, the same as the string , which completes the proof. □
Theorem 8. A string containing the same two doublets and the same two triplets not based on this doublet has the same ASI as a string containing a doublet and the same two triplets based on this doublet, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. W.l.o.g. consider the following two strings of the same length
with the same distributions of other repetitions
where
. In a general case, the string
requires seven additional assembly steps, the same as the string
, which completes the proof. □
The seven-bit string is the longest bitstring that can have the maximum ASI of $N - 1 = 6$. There are four such bitstrings containing two clear triplets and the starting bit at the end or the ending bit at the start, that is
$$[0001110], \quad [1110001], \quad [1000111], \quad [0111000],$$
and their lengths cannot be increased without a repetition of a doublet, which keeps the ASI at the same level of 6.
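The claim about seven-bit strings can be checked by enumeration. The sketch below rests on an assumption made explicit here: a string of length N attains the ASI of N − 1 exactly when no doublet occurs at two non-overlapping positions, so that nothing can be reused. The helper names `has_repeated_doublet` and `strings_without_reuse` are hypothetical.

```python
from itertools import product


def has_repeated_doublet(s: str) -> bool:
    """True if some doublet occurs at two non-overlapping positions of s."""
    first_seen = {}
    for i in range(len(s) - 1):
        doublet = s[i:i + 2]
        # A second occurrence is reusable only if it does not overlap the first.
        if doublet in first_seen and i - first_seen[doublet] >= 2:
            return True
        first_seen.setdefault(doublet, i)
    return False


def strings_without_reuse(n: int, alphabet: str = "01") -> list:
    """All strings of length n over `alphabet` in which no doublet can be reused."""
    candidates = ("".join(p) for p in product(alphabet, repeat=n))
    return [s for s in candidates if not has_repeated_doublet(s)]
```

Under this assumption, `strings_without_reuse(7)` returns the four bitstrings built from two clear triplets and one extra bit, while `strings_without_reuse(8)` returns an empty list.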
This observation and Theorem 2 motivated us to develop a general method to construct the longest possible string having the ASI , as a function of the radix b. We denote the length of this string by or , and we call this string a string.
After some trial and error, we arrived at two stable methods (cf. Appendices, Method A and Method B). In both methods, we start with an initial balanced string of length $3b$ containing b clear triplets ordered as
The doublets that can be inserted into the initial string (10) can be arranged in a $b \times b$ matrix
where the crossed out entries on the diagonal cannot be reused, as they would create repetitions in this string. If we assume that we shall not insert doublets between the clear triplets of the string (10), we can also cross out the entries in the first superdiagonal of the matrix (11). The strings of odd lengths generated by these general methods are not only the longest but also the most balanced. This can be stated in the following theorem.
Theorem 9 (
string).
The longest length of a string that has the ASI of is given by
(OEIS A353887) and this string is nearly balanced, that is
where is the number of occurrences of all but one symbol within the string, and its Shannon entropy is
The proof of Theorem 9 is given in Appendix C. Although the case for $b = 1$ is degenerate, as no information can be conveyed using only one symbol ( in this case), the formula (12) yields the correct result; the string is the longest string with by Theorem 1, as for the upper and the lower bound on the ASI are the same, (OEIS A003313). Thus, AT subsumes information theory.
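The balance of a string can be quantified numerically through the Shannon entropy of its symbol frequencies. The following is a minimal Python sketch, assuming the entropy is taken over the relative frequencies of the symbols with base-2 logarithms; the function name `shannon_entropy` is hypothetical and the code does not reproduce the closed-form expression of Theorem 9.

```python
from collections import Counter
from math import log2


def shannon_entropy(s: str) -> float:
    """Shannon entropy (in bits per symbol) of the symbol frequencies of s."""
    n = len(s)
    return -sum((k / n) * log2(k / n) for k in Counter(s).values())
```

A perfectly balanced bitstring such as `"0011"` has an entropy of 1 bit, a string over a single symbol (the degenerate one-symbol case) has zero entropy, and a nearly balanced string such as `"0001110"` falls slightly below 1 bit.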
Subsequently, we considered other strings for with the maximum ASI .
Theorem 10 (
string).
For all the longest length of a string that has the ASI of is given by or equivalently by
and
where is the number of occurrences of all but two symbols within the string, and its Shannon entropy is
The entropy for .
The proof of Theorem 10 is given in Appendix E. In general,
string must contain a clear quadruplet (
) and a pattern binding the symbols adjoining this quadruplet, such as
,
, etc., so that any
string contains only one pair of repeated doublets
,
, or
. For example, for $b = 2$, sixteen bitstrings
(an additional eight are given by swapping 0 with 1) have the ASI
, where the underlined string (
18) is the one that is created for
in
Appendix E.
Theorem 11 (
string).
For all the longest length of a string that has the ASI of is given by or equivalently by
and
where is the number of occurrences of all but three symbols within the string, and its Shannon entropy is
The entropy for .
The proof of Theorem 11 is given in Appendix F.