Have an Android phone?

Try our new game!

Smith & Waterman

This is an advanced topic. It assumes that your are familiar with Needleman & Wunsch algorithm

The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

Background

The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981.[1] Like the Needleman & Wunsch, of which it is a variation, Smith-Waterman uses dynamic programming table. As such, it is guaranteed to find the optimal local alignment for the given substitution matrix and the gap-scoring scheme (defined below as a single two parameter function w). The main difference to the Needleman-Wunsch algorithm is that negative scoring matrix cells are set to zero, which renders the (thus positively scoring) local alignments visible. Backtracking starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment. One does not actually implement the algorithm as described because improved alternatives are now available that have better scaling (Gotoh, 1982) and are more accurate (Altschul and Erickson, 1986).[2]

Algorithm Explanation

A matrix is built as follows:

Where:

  • a, b = the two strings for that the best local alignment must be found. Underscore index means character value at the given location.
  • m = the length of the first string, a.
  • n = the length of the second string, b.
  • D(i,j) - the maximum similarity score between a suffix of a[1...i] and a suffix of b[1...j]
  • w(c,d) - the scoring matrix, how much the given pair of symbols is worth from the scope of the similarity score. Dash means the score of another symbol over space (gap penalty). The applet allows to have individual value for every position ("gap penalty array") but this seldom required.

Example

  • Sequence 1 = ACACACTA
  • Sequence 2 = AGCACACA
  • w(match) = +2
  • w(a,-) = w(-,b) = w(mismatch) = -1

To obtain the optimum local alignment, we start with the highest value in the matrix (i,j). Then, we go backwards to one of positions (i-1,j), (i,j-1), and (i-1,j-1) depending on the direction of movement used to construct the matrix. We keep the process until we reach a matrix cell with zero value, or the value in position (0,0).

In the example, the highest value corresponds to the cell in position (8,8). The walk back corresponds to (8,8), (7,7), (7,6), (6,5), (5,4), (4,3), (3,2), (2,1), (1,1), and (0,0),

Once we've finished, we reconstruct the alignment as follows: Starting with the last value, we reach (i,j) using the previously-calculated path. A diagonal jump implies there is an alignment (either a match or a mismatch). A top-down jump implies there is a deletion. A left-right jump implies there is an insertion.

For the example, we get:

  • Sequence 1 = A-CACACTA
  • Sequence 2 = AGCACAC-A

Motivation

One motivation for local alignment is the difficulty of obtaining correct alignments in regions of low similarity between distantly related biological sequences, because mutations have added too much 'noise' over evolutionary time to allow for a meaningful comparison of those regions. Local alignment avoids such regions altogether and focuses on those with a positive score, i.e. those with an evolutionary conserved signal of similarity. A prerequisite for local alignment is a negative expectation score. The expectation score is defined as the average score that the scoring system (substitution matrix and gap penalties) would yield for a random sequence.

Another motivation for using local alignments is that there is a reliable statistical model (developed by Karlin and Altschul) for optimal local alignments. The alignment of unrelated sequences tends to produce optimal local alignment scores which follow an extreme value distribution. This property allows programs to produce an expectation value for the optimal local alignment of two sequences, which is a measure of how often two unrelated sequences would produce an optimal local alignment whose score is greater than or equal to the observed score. Very low expectation values indicate that the two sequences in question might be homologous, meaning they might share a common ancestor.

However, the Smith-Waterman algorithm is fairly demanding of time and memory resources: in order to align two sequences of lengths m and n, O(mn) time and space are required. As a result, it has largely been replaced in practical use by the BLAST algorithm; although not guaranteed to find optimal alignments, BLAST is much more efficient.

An implementation of the Smith-Waterman Algorithm, SSEARCH, is available in the FASTA sequence analysis package from http://fasta.bioch.virginia.edu/fasta_www2/fasta_down.shtml.


Acknowledgements

This web page reuses material from Wikipedia page http://en.wikipedia.org/wiki/Smith-Waterman under the rights of CC-BY-SA license. As a result, the content of this page is and will stay available under the rights of this license regardless of restrictions that apply to other pages of this website.


See also

  • A lecture devoted to the Smith & Waterman algorithm.
  • JAligner - open source Java implementation of the Smith-Waterman algorithm with Gotoh's improvement for biological local pairwise sequence alignment using the affine gap penalty model.

References

  1. 1 Smith, Temple F.; and Waterman, Michael S. (1981). Identification of Common Molecular Subsequences. Journal of Molecular Biology 195–197.
  2. 2 Stephen F. Altschul; and Bruce W. Erickson (1986). Identification of Common Molecular Subsequences. Bulletin of Mathematical Biology 603-616.