Sequence alignment

Sequence alignment is a way of arranging the sequences of letters (or other symbols) identify regions of similarity. The algorithms of sequence alignment are very important in bioinformatics where the sequences of different DNA, RNA or protein are alignmed. Detected regions of high similarity often indicate functional, structural, or evolutionary relationships between the sequences than may represent, for instance, two different proteins, or the same protein in two different species. Sequence alignment is also used as an advanced search method to compare non biological data. Most often the two sequences are compared one against another, but the algorithms exist to compare multiple sequences at once, detecting regions that are common to all or majority of them.

Frequently one of the strings being compared is shorter, less known in advance and may include generic wildcards. This sequence is called pattern. Another, usually much longer sequence where it is needed to find the pattern is called text. For instance, when researcher enters a sequence of the short new protein into search engine, this sequence is a pattern. The search engine known the sequences of thousands of proteins, maybe from many species. This search target is a text.

The trivial (brute force) algorithms are way too slow for the sequences they should be used, and various advanced algorithms are proposed to the same or similar result in acceptable time. The notable sequence alignment algorithms are Dynamic programming table, Needleman & Wunsch, Smith & Waterman, Four Russians and others.

Gaps and mutations

Sequence alignment algorithms are created to detect sequences that are just similar but not necessarily the same. The sequences may differ in one or more individual letters (arranging versus erranging) or some missing or additional letters (arranging versis arangning). If the mismatching reqion is longer, it is usually more efficient to count it as a single mismatch and not as a bunch of individual mismatching letters. Mismatching letters are usually called mismatches or substitutions. The longer mismatching reqions are called gaps. Letters that are missing in particular sequence (in comparison to another) are called deletions, and additional letters are called insertions. In protein and DNA sequences all these changes arise from mutations, the primary source of both evolution and hereditary diseases.

Global and local alignment

Alignment that tries to match one full sequence (like protein sequence, or a complete sentence, for instance) against another is called a global alignment. This kind or alignment is most useful when the sequences are similar and of roughly equal size. However, for instance, comparing two sentences with global alignment will not detect much similarity if only one or two words match, especially when the matching words appear in different order. Similarly, protein and DNA sequences often contain relatively short similar sequences with more variable, non similar regions in between. These sequences may represent active, functional parts of the protein or important regulatory commands in the DNA. They are highly interesting to bioinformaticians and many algorithms have been proposed to detect the small similar regions in sequences that are otherwise non similar and may be highly different in length. This second type of search is called local alignment. A certain intermediate way of search is global alignment without end gap penalty that searches for the single similar region that can be bounded by arbitrary mismatching tails.

The edit distance

The edit distance is the minimal required number of single letter modifications (insertions, replacements and deletions) that are required to transform one string into another. It is also sometimes called Levenshtein distance.[1]. The edit distance concept is often used when describing various algorithms.

Scoring matrix (or substitution matrix)

In the simplest case any two mismatching letters just count as mismatch without caring much which two letters. This seems intuitive but in bioinformatics every letter may represent one amino acid in the protein chain, and there are small groups of several amino acids that are quite similar to each other and different from amino acids in another group. In this case it is possible to define the score matrix - an array that indicates the level of dissimilarity between letters. This matrix has zero in diagonal, as every letter is equal to itself. In all other places it contains negative penalty value that should be counted for mismatch.

The term "substitution matrix" comes from bioinformatics, where the alignment algorithms are used to compare DNA, amino acid or similar sequences. Through the evolution, nucleotides and especially amino acids mutate (hence "are substituted") but probability of mutations from one amino acid or nucleotide into another is not the same for any taken pair. This is largely because Genetic code maps some different codes into the same amino acid.

This matrix basically shows the probabilities of transformation from one symbol in another over the evolution of the sequence. These probabilities are defined as

where is the probability that amino acid mutates into amino acid and is the frequency of amino acid i. The base of the logarithm is not important: the same substitution matrix can be expressed in different bases.

PAM

One of the first amino acid substitution matrices, the PAM (Point accepted mutation) matrix was developed in the 1970s. This matrix is calculated by observing the differences in similar proteins. The PAM1 matrix estimates the rate of substitution that would be expected if 1% of the amino acids had changed. The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations would follow, and multiple substitutions can occur at the same site. Using this logic, researchers derived matrices as high as PAM250. Usually the PAM30 and the PAM70 are used.

BLOSUM

PAM matrix does not work well when comparing closely related species, for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time are not well approximated by compounding small changes that occur over short time. The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem. It is built using multiple alignments of evolutionarily divergent proteins. The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple alignments. These conserved sequences are assumed to be of functional importance. To reduce bias from closely related sequences, segments in a block with a sequence identity above a certain threshold were clustered giving weight 1 to each cluster. For the BLOSUM62 matrix, this threshold is set at 62%. Pairs frequencies are then counted between clusters, hence pairs are only counted between segments less than 62% identical. Higher numbered BLOSUM matrices are more appropriate for aligning two closely related sequences, and a lower number matrices for more divergent sequences.

BLOSUM62 matrix does an excellent job detecting similarities in distant sequences, and this is the matrix used by default in most recent alignment applications such as BLAST.

Scoring

The same two longer sequences can usually be aligned in a number of alternative ways that may indicated different levels of similarity. The level of similarity, detected by particular alignment, is called the alignment score. Usually the alignment with the best score is the most interesting, but in some cases (like comparing the sequence with itself for self similarity) other, alternative alignments may be more interesting. The score is not necessarily the same as edit distance as many algorithms count multiple mismatching characters (gaps) differently, for instance.

To get the score, it is necessary to have some scoring schema. The scoring can be configured in various ways, and result varies significantly depending on these settings. The simplest way of scoring is to cout some positive scores for matching letters and negative score for mismatching letters, deletions and insertions. If the gap concept is used, a separate scores for the gap must be included. The negative score for the gap can be constant or depend of the gap length. In addition, it must be defined how long the mismatch, insertion or deletion should be to start counting it as a gap. The previously mentioned 'end gap penalty' is a negative score that may be added for the presence completely mismatching tail.



See also Suffix tree.