Probabilistic Consistency-based Multiple Alignment of Amino Acid Sequences

	RUN		ABOUT		ALGORITHM		DOWNLOAD		HELP

PROBCONS is a consistency based multiple aligner.

Traditional alignment methods based on edit distance score an alignment as the sum of similarity values for aligned residues and length dependent gap penalties for unaligned positions. Scoring a multiple alignment in a probabilistically rigorous and biologically motivated manner, and finding the optimal alignment once a scoring scheme has been specified, are not straightforward tasks. In practice, the ad hoc sum-of-pairs measure, which combines the projected pairwise distances of all pairs of sequences in the alignment are commonly used for scoring. Many heuristic strategies use progressive alignment based on an evolutionary tree or iterative approaches, but are prone to errors in early stages of alignment. To combat this, post-processing steps such as iterative refinement are applied.

Consistency-based schemes take the alternative view that "prevention is the best medicine." For any multiple alignment, the induced pairwise alignments are necessarily consistent, i.e., given a multiple alignment of x, y, and z, if position x_i aligns with position z_k and position z_k aligns with y_j in the projected x-z and z-y alignments respectively, then, x_i must align with y_j in the projected x-y alignment. Consistency-based techniques apply this principle in reverse, using alignments to intermediate sequences as evidence to guide the pairwise alignment of x and y.

This principle forms the basis of the PROBCONS aligner.

Formally, the algorithm is described below:

ProbCons algorithm

1. (Initial alignment) For every pair of sequences x and y,

a. Compute a posterior probability table using the HMM specified in the figure below, containing the posterior probabilities P(xi ~ yj | x, y) for matching each letter xi of one sequence against each letter yj of the other.

b. Compute the expected accuracy of the alignment, E(x, y), defined to be the sum of posterior match probabilities along the highest summing path divided by the length of the shorter sequence.

2. (Consistency transformation) Simultaneously update all posterior probability matrices using the transformation:

Repeat this step for a total of two iterations.

3. (Guide tree) Given the expected accuracies for each pair-wise alignment, compute a guide tree T using the following greedy hierarchical clustering procedure:

a. Initially, place each sequence in its own cluster.

b. Merge clusters x and y with maximum expected alignment accuracy. When the new cluster xy is formed, define its expected accuracy with any other cluster z to be E(x, y)(E(x, z) + E(y, z)) / 2.

c. Repeat until only one cluster remains.

4. (Progressive alignment) Perform progressive multiple alignment according to the guide tree T using a sum-of-pairs objective function consisting of the sum of the re-estimated P(xi ~ yj | x, y) terms for all aligned residue pairs; like in step 2, no insertion penalties are used in calculating the highest summing path.

5. (Iterative refinement) Randomly partition the sequences in the current multiple alignment into two groups and realign. Repeat this step for a total of 100 iterations.

Some of the steps are described in detail below:

Progressive alignment

In the pairwise alignment case, an optimal expected accuracy alignment is calculated by applying the maximum weight trace algorithm to a matrix consisting of the P(x_i ~ y_j | x, y) values. For the multiple sequence case, we rely on a progressive alignment scheme using the guide tree calculated by hierarchical clustering of sequences. Initially, progressive alignment proceeds by assigning each sequence to its corresponding leaf in the tree. If a node has exactly two leaves as children, then the sequences assigned to the children are aligned, and this alignment is assigned to the node. Finally, the leaves of the node are removed so that the node becomes a leaf, and the process iterates until only a single node remains and all sequences have been aligned.

When aligning groups of sequences, we use a sum-of-pairs scheme in which the score of a multiple alignment is given by summing the expected accuracy for each possible projection of the multiple alignment to two sequences. Thus, we can find the optimal expected accuracy alignment of a group of sequences by a straightforward extension of the pairwise dynamic programming computation.

Guide tree calculation

Guide trees are computed in a greedy hierarchical manner. Given a set S of sequences to be aligned, we denote the expected accuracy for aligning two sequences x and y as E(x, y). Initially, each sequence is placed in its own cluster. Then, the two clusters x and y with the highest expected accuracy are merged to form a new cluster xy; we then define the expected accuracy of aligning xy with any other cluster z as E(x, y)(E(x, z) + E(y, z)) / 2. This process is repeated until only a single cluster remains.

The probabilistic consistency transformation

When aligning two sequences x and y with a set of multiple sequences, S, we ideally would like to provide an estimate for P(x_i ~ y_j | S). In practice, we use the following heuristic decomposition:

where we set P(x_i ~ x_j | x) to 1 if i = j and 0 otherwise.

In this sense, the approximate probabilistic consistency calculation may be viewed as a transformation that, given a set of all-pairs pairwise posterior probabilities, produces a new set of all-pairs pairwise posterior probabilities that have been adjusted to account for a single intermediate sequence. By iterated applications of the transformation, then, we can approximate the effect of accounting for more than one intermediate sequence at a time.