2.1. Methods and Alorithms for Sequence Alignment
String alignment algorithms are widely used in bioinformatics to compare DNA sequences. The pattern of a particular DNA sequence in the form of a character string is compared or searched against other sequences to find similarity [
10]. Thus, the pattern is matched to the large amount of DNA sequences, which are sometimes very complex and not easy to extract. In order to obtain a result or matching pattern with greater accuracy, algorithms such as Knuth-Morris-Pratt, Boyer-Moore, Brute Force, Rabin-Karp [
11] and other algorithms are used.
A number of studies have been conducted on string matching algorithms, including exact matching and approximate string matching [
12,
13]. Their complexity is compared using DNA datasets to find the appropriate algorithm with high temporal quality and accuracy. Global alignment is a form of global optimization where the alignment spans the entire length of the sequences being examined. Local alignment identifies similar regions in long sequences that are generally too different.
In the field of bioinformatics, two main groups of algorithms are used to align pairs of sequences: exact algorithms and approximate (heuristic-based) algorithms [
14]. The exact algorithms are based on the dynamic programming methodology [
15,
16]. Such are Needleman-Wunsch and Smith-Waterman algorithms. FASTA [
17,
18] and BLAST are heuristic-based algorithms and are more widely used because they offer faster computational performance. The challenge in performing sequence alignment from biological data is the trade-off between accuracy and efficiency. Needleman-Wunsch and Smith-Waterman algorithms tend to be very computationally complex, but they manage to find the optimal arrangement between pairs of sequences.
Needleman-Wunsch algorithm provides a method for finding an optimal global alignment of two sequences, so it cannot be used to detect local regions of high similarity. This algorithm tries to reach the maximum of matches and the minimum of mismatches (gaps) when comparing protein or nucleotide sequences. It is based on the method of dynamic programming and guarantees the calculation of the maximum order.
Smith-Waterman algorithm implements local sequence alignment. Instead of aligning the entire length of the two sequences, this algorithm finds the region with the highest similarity. Its purpose is to identify similar regions between strings, nucleotides and proteins. This technology is potentially more applicable due to the fact that the ends of proteins are in most cases less conserved compared to the middle regions, resulting in higher mutation, deletion and insertion rates at the ends of the protein. Smith-Waterman algorithm allows to align proteins that can be quite different without having to align their ends. Smith-Waterman algorithm is a basic algorithm based on dynamic programming and provides high accuracy. The time complexity of this algorithm for comparing two sequences is O(MN), where M and N are the lengths of the two sequences being compared. As the database of genetic biological sequences grows exponentially, the complexity relevant to real applications is O(kMN), where k represents the growth of the volume of genetic databases.
The first developed and implemented popular heuristic algorithm for database similarity search is FASTA, which can be used for rapid comparison of protein or nucleotide sequences. The FASTA algorithm is most commonly used to search biological sequence databases. The short identical sections in the two sequences are located. A sequence of multiple matching regions that are observed in the same order in both sequences is used as the starting point for dynamic programming algorithm ordering. This algorithm achieves a high level of similarity search accuracy and high speed due to the use of a word matching model. The goal is to find a potential match before starting the time-consuming optimized search. The trade-off between speed and accuracy is governed by a parameter that determines the size of the word. The FASTA algorithm does not search for every word match, but instead searches for segments containing several adjacent matches. These segments are compared using a heuristic method to determine the segment with the best score.
The BLAST algorithm is perhaps the most widely used tool that has been developed for bioinformatics research purposes. It is also heuristic-based and is used to search for homologous sequences. Sequences that have k-fold matches are found in the database using the BLAST algorithm. BLAST creates a look-up table of all substrings of the given input length contained in the input sequence, as well as similar "neighboring" substrings. It is used to compare sequences with biological information, both for sequences containing amino acids of different proteins and for sequences containing nucleotides of DNA. The BLAST algorithm compares a given sequence to a database of sequences.
Different variants of the BLAST algorithm have been developed to accommodate different types of sequence alignment (MEGABLAST, PSIBLAST). As the volume of nucleotide and protein sequence databases increases, it is imperative to compare, search and analyze the data using parallel algorithms and high-performance computers. Several parallel versions of the BLAST alignment and search algorithm have been developed. mpiBLAST uses the database segmentation strategy. It can be used on different types of computer clusters and supercomputers. It is very popular among bioinformatics scientists in need of a high-throughput BLAST algorithm. Many scientists have reported experimental results of their research works using parallel implementations of the BLAST package [
19,
20,
21,
22,
23].
However, even parallel execution of sequence alignment algorithms faces limitations on hardware systems [
24,
25,
26]. A novel approach for sequence comparison is proposed by defining a heuristic pairwise alignment inside the database environment [
27]. This method takes advantage of the benefits provided by the database management system and presents a way to use similarities in datasets to speed up the alignment algorithm. Deep learning models are also used to predict miRNA binding site [
28]. An interpretation technique called imputation sequence alignment has been reported for miRNA target site prediction models that can interpret deep learning models on a two-dimensional representation of miRNA and putative target sequence. Another technique for detecting the similarity of DNA sequences is to define a generalized string editing distance that allows the addition or deletion of entire motifs in addition to single nucleotide edits. A dynamic programming implementation is developed for computing this distance between sequences [
29].
2.2. Method for DNA Sequence Alignmment Based on Trilateration
Finding a beginning point, or benchmark, against which the other data in the database can be compared is at the core of the concept of a favorite sequence. Alternatively, if we were to approach the problem mathematically, sequence favorite could be represented as a function of N unknowns (in the context of DNA, the unknowns being the 4 bases adenine, thymine, guanine, and cytosine), and the remaining database entries could then be represented once more as functions of the same variables. In this scenario, the distance between each individual sequence and the preferred sequence would be represented by the similarity comparison. In other words, determine the relationship between a point defined by the favorite sequence function and a point described by the sequence function.
A point is generated somewhere in the center of the cloud of points that is used as a reference (sequence favorite) when comparing to a set of points (the database entries) because there is no coordinate system. However, if a coordinate system or three or more reference points are found, it would be possible to use trilateration or elementary analytical geometry to determine the positions of the points relative to one another, which would reflect the degree to which the database records match one another. Moreover, to do away with the requirement for favorite computation sequence.
A novel approach, known as the CAT method (named after the first letter of the defined constant benchmarks C, A, and T), has been introduced for aligning DNA sequences, utilizing the trilateration technique [
30]. This method establishes three consistent reference points for trilateration application, resulting in a steadfast reference sequence. This sequence, comprising C-benchmark, A-benchmark, and T-benchmark, remains constant regardless of alterations in the database content.
Key point establishing the reference benchmark are:
- -
For A-benchmark and T- benchmark, we should never have matching position and base. On this way base from processed DNA could match on index and base on A or T, but not both.
- -
For C-benchmark, we want index and base of 25% of A-benchmark and 25% of T-benchmark to match on C-benchmark.
ACGTACGTACGTACGTACGTACGTACGTACGTACGTAC....... – A-Benchmark
GTACGTACGTACGTACGTACGTACGTACGTACGTACGT…… – T- Benchmark
AGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG....... – C- Benchmark
Once the three constant benchmarks for applying trilateration are established, issue (1) is eliminated. Constant sequence favorite – i.e. independent of the records in the database and to remain the same when the data set is changed.
Since the constant benchmark sequences are determined (i.e. they do not depend on either the data or their count), this allows comparisons to be made at the very beginning - when the sequences are uploaded into the database, and this is metadata information accompanying each sequence. This way, sequences won’t need to be compared during lookup (which is the slowest operation), but instead only the metadata information generated during the upload will be compared.
By establishing the benchmark sequences, issue (3) unification of favorite sequences for all databases is also eliminated. There are now unified sequences that are standardized for all databases using the described alignment algorithm. When two sequences have the same profiles, this means that they have regions with the same alignment, and one hundred percent complete matching of one sequence on the other can be expected.
For the evaluation of two sequences, it is necessary to calculate the distance of the segment S1S2 in
Figure 1.
For now, ∆AS1T is considered, then analogous calculations and reasoning are per- formed for AS2T. What is known about ∆AS1T is the sides AT = |1|, AS1 = distance from S1 to A-benchmark (it is known), S1T = distance from S1 to T-benchmark. Use the cosine theorem to find ∡TAC, then side AD1:
The calculations for triangle AS2T are analogous. After substituting the values found, a value for
S1S2 is obtained.
The smaller value obtained for the intercept, the greater the probability of a complete match, expressed as a percentage. This allows the database to be quickly searched for sequences with a certain percentage of similarity, which can later be aligned and compared with more accurate algorithms such as Needleman-Wunsch or Smith-Waterman.
It is possible to occur collisions in such proposed DNA sequence alignment method based on trilateration, i.e.:
-
More than one sequence of the same length to get the same values for AD and h:
Due to the nature of benchmark sequences and the fact that the real sequence projects at most a quarter of the bases onto the benchmark, i.e. with benchmark ACGT and projection of G at the second position, there are no matches, and no value is accumulated for match rate.
-
During S1S2 calculation, the same values are obtained:
Because of statistical errors accumulated when calculating AD and h.
Because of rounding in calculations due to the range of data types, this cannot be avoided even with the use of more precise types.
To minimize collisions, the precision in calculations for AD and h should be increased by adding the dependency on neighboring bases. Like in local alignment, when the current base of the benchmark sequence does not match the current base of the real sequence, additional points can be added or subtracted, depending on whether a neighboring base match is. After Needleman-Wunsch alignment, the places where the bases do not match appear as gaps “_” positions and are given different points accordingly.
A similar principle can be applied in the proposed method. If a base and index match is given value of 1. If there is a mismatch, a neighboring base from the benchmark sequence is checked and given a value of 0.6 or 0.4 depending on how far the neighbor is, i.e. when there is match with the left or right base from the benchmark then give 0.6, when is the far – 0.4. If we define baseDistance array for near matches it will looks like: [0, 0.6, 0.4, 0.6]. Left and right neighbors are with indexes 1 and 3, 0 index is for exact match and index 2 is the far neighbor.
Example of benchmark ACGT base at position 2 G
G at position 2 corresponds to C from the benchmark sequence and instead of 0 a match value of 0.6 can be given because it is adjacent to the right and in Needleman- Wunsch ordering it has the following alignment:
In this way, the precision of AD and h calculation is increased, the accumulation of statistical errors is reduced 1., thus reducing collisions 2. After finding a suitable sequence in the base, one with a minimum value for S1S2, a more accurate alignment algorithm can be applied. The comparison calculations in the direction calculate the CAT of two sections proposed in the presented method with a constant complexity that makes it to be applied as a first step suitable in the FASTA algorithm as well as for multiple alignments such as ClustalW.
Add dependency with a previous match and current position to increase uniqueness of the CAT profile. Like in positional numeral system contribution of a digit to the value of a number is the value of the digit multiplied by a factor determined by the position of the digit. It is needed something like this to be done here but considering the average length of the sequence exact same approach cannot be used. Instead, when a previous exact or near match is detected, a sum of current maximum theoretical sequence of matches is tracked, will be increased. The ratio of current maximum theoretical sequence of matches to current index will be added to the sum of previous matches, and this will be considered as kind of a bonus points. All given bonuses from previous matches must be tracked as well, since they are need in a later stage to correct the distances, so that a tringle can be formed. To be able to apply method of trilateration calculated distances should ensure that in any conditions a triangle can be formed. I.e. the sum of any two sides must be greater than the third one.