All notes
TextAlignm

Basics

Global, Local, Glocal alignments

A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming.

The Smith–Waterman algorithm is a general local alignment method also based on dynamic programming.

Hybrid methods, known as semiglobal or "glocal" (short for global-local) methods, attempt to find the best possible alignment that includes the start and end of one or the other sequence. It is used when one sequence is short (for example a gene sequence) and the other is very long (for example a chromosome sequence). In that case, the short sequence should be globally aligned but only a local alignment is desired for the long sequence.

Edit distance

Wikipedia: edit distance.

Hamming Distance

Wikipedia: hamming distance.

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.

The Hamming distance between:

    "karolin" and "kathrin" is 3.
    "karolin" and "kerstin" is 3.
    1011101 and 1001001 is 2.
    2173896 and 2233796 is 3.

Levenstein Distance

wikipedia.

It is named after Vladimir Levenshtein, who considered this distance in 1965.

$$ \qquad\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{ if} \min(i,j)=0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{ otherwise.} \end{cases} $$

where $1_{(a_{i}\neq b_{j})}$ is the indicator function equal to 0 when $ a_{i}=b_{j}$ and equal to 1 otherwise, and $\operatorname{lev}_{a,b}(i,j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.

    It is always at least the difference of the sizes of the two strings.
    It is at most the length of the longer string.
    It is zero if and only if the strings are equal.
    If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
    The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).

An example where the Levenshtein distance between two strings of the same length is strictly less than the Hamming distance is given by the pair "flaw" and "lawn". Here the Levenshtein distance equals 2 (delete "f" from the front; insert "n" at the end). The Hamming distance is 4.