In computational phylogenetics, tree alignment is the problem of producing a multiple sequence alignment,which can be used to analyse a set of sequences with evolutionary relationship using a fixed tree. Essentially,tree alignment is an algorithm for optimizing phylogenetic tree by calculating the edit distance to achieve the minimum value.To be specific, phylogenetic tree shows an evolutionary relationship between different species and taxa joined together are assumed to have the same ancestor.
Contents
- Sequence alignment
- Edit distance
- The problem of Tree alignment
- Combinatorial Optimization Strategy
- The Keyword Tree Theory and Aho Corasick search algorithm
- Keyword Tree Theory
- Aho Corasick search algorithm
- Other strategies
- Heuristic Algorithm
- Tree Alignment Graph
- References
Formally, tree alignment is the following optimization problem.
Input: A set
Output: A labeling of the internal vertices of
The task is NP-hard
Sequence alignment
In bioinformatics, the basic method of information processing is to contrast the sequence data. It has a very important significance when biologists use it to discover the function, structure and evolutionary information in biological sequences. From the sequence assembly, the phylogenetic analysis, the haplotype comparison, and the prediction of RNA structure are all based on sequence alignment, so the efficiency of sequence alignment, especially multiple sequence alignment, will directly affect the effect of these problems’ solution. Therefore, to design a rational and efficient sequence alignment algorithm becomes a very important branch of research in the field of bioinformatics.
Generally, sequence alignment means constructing a string from two or more given strings with the greatest similarity by adding, deleting letters, or adding a space for each string. The multiple sequence alignment problem is generally based on pairwise sequence alignment and currently, for pairwise sequence alignment problem, biologists can use dynamic programming approach to obtain its optimal solution. However, the multiple sequence alignment problem is still one of the intractable problems in bioinformatics, because finding the optimal solution of multiple sequence alignment has been proved as a NP-complete problem so that only approximate optimal solution can be obtained.
Edit distance
Edit distance measures the minimum operation number of character insertions, deletions and substitutions that are required to transform one sequence u to the other sequence v when being operated on a pair of strings. The calculation of edit distance can be based on dynamic programming,and the equation is in O(|u|∗|v|) time, where |u| and |v| are the lengths of u and v Edit distance are the basic principle in computational biology, thus an efficient estimation of edit distance is very essential . There are some functions to calculate edit distance, including “symmetrization” used for functions of hereditary properties. Because there are a series of functions being used to calculate edit distance, different functions may result in distinct results. Finding an optimal edit distance function seems essential for further explanation.
The problem of Tree alignment
Tree alignment problem is a NP-hard problem when we restrict its scoring mode and alphabet size, and it can be found an algorithm, which uses to find the optimized solution. However, there is an exponential relationship between its efficiency and the number of sequence, it means when the number of sequence is very large, the runtime before getting results is an enormous figure and it is unacceptable. Using star alignment is faster than tree alignment to get the approximate optimized solution. However, whatever the degree of multiple-sequence similarity is, the time complexity of star alignment has a proportional relationship with the square of sequence number and the square of the sequence average length. In usual, the sequence in MSA is so long that it is also inefficient or even unacceptable. Therefore, how to reduce the time complexity to linear is one of the core issues in the Tree alignment.
Combinatorial Optimization Strategy
Combinatorial optimization is a good strategy to solve MSA problem. The idea of combinatorial optimization strategy is to transform the multiple sequence alignment into pair sequence alignment to solve this problem. Depending on its transformation strategy, the combinatorial optimization strategy can be divided into the tree alignment algorithm and the star alignment algorithm. For a given multi sequences set
The Keyword Tree Theory and Aho-Corasick search algorithm
When we use combinatorial optimization strategy to transform the multiple sequence alignment into pair sequence alignment, the main problem is changed from how to improve the efficiency of multiple sequence alignment to how to improve the efficiency of pairwise sequence alignment. The Keyword Tree Theory and Aho-Corasick search algorithm is an efficient approach to solve the pairwise sequence alignment problem. The aim of combining the keyword tree theory and Aho-Corasick search algorithm is to solve this kind of problem: for a given long string T and a short strings set
Keyword Tree Theory
The keyword tree of the set
And We use
For example, The set
To establish failure link is the key to improve the time complexity of Aho-Corasick algorithm. It can reduced the original polynomial time to the linear time for searching. Therefore, the core of keyword tree theory is to find all failure links(also means find all
Aho-Corasick search algorithm
After establishing all failure links in the keyword tree, we use Aho-Corasick search algorithm to find the locations of all
Other strategies
In MSA ,DNA,RNA, and proteins sequences are usually generated and they are assumed to have evolutionary relationship .By comparing generated maps of RNA,DNA and sequences from evolutionary family ,people can assess conservation of protein ,find functional gene domains by comparing differences between evolutionary sequences. Generally ,heuristic algorithm and tree alignment graph are also adopted to solve multiple sequence alignment problems.
Heuristic Algorithm
Generally heuristic algorithm relies on the iterative strategy, scilicet based on a comparison method, optimizing the results of multiple sequence alignment by the iterative process. Davie M proposed using particle swarm optimization algorithm to solve the multiple sequence alignment problem; Ikeda T proposed a heuristic algorithm which is based on A* search algorithm; Bimey E first proposed using hidden Markov model to solve the multiple sequence alignment problem; and many other biologists use genetic algorithm to solve it. All these algorithms generally are robust and insensitive to the number of sequences, but they also have shortcoming, for example, the result got from particle swarm optimization algorithm is unstable and its merits depend on the selection of random numbers, the runtime of A * search algorithm is too long and the genetic algorithm is easy to fall into local excellent.
Tree Alignment Graph
Roughly ,tree alignment graph aims to align trees into a graph and finally synthesis them to develop statistics.For biologist,tree alignment graph (TAGs) are used to remove the evolutionary conflicts or overlapping taxa from sets of trees and can be queried to explore uncertainty and conflict.By integrating methods of aligning ,synthsizing and analyzing ,the TAG aims to solve the conflicting relationships and partial overlapping taxon sets obtained from a wide range of sequence.Also ,tree alignment graph serves as a fundamental approach for supertree and grafting exercise,which have been successfully tested to construct supertrees by Berry et al. Because the transformation from trees to a graph contain similar nodes and edges from their source trees ,TAGs also can provide extraction of original source trees for further analysis . TAG is a combination of a set of aligning trees,it can store conflicting hypotheses evolutionary relationship and synthesize the source trees to develop evolutionary hypotheses ,therefore ,it is a basic method to solve other alignment problems.