Have an Android phone?

Try our new game!

Suffix tree

Suffix tree is a data tree that encodes the given text in a form that is highly efficient for the later pattern search inside that text.

Suffix tree for the given string (below). If part of the input string is selected (highlighted), the tree is built for the selected part only. Character positions in this tree are numbered from zero.

If the substring being searched is in the text for that the tree has been created, it is possible to spell that substring while walking from the root and turning towards the matching characters at branches. No two edges out of the node can lead to the same character. This allows to perform search in linear ( O(n) where n is the length of the pattern ) time, without preprocessing the pattern and only preprocessing the text once (the same tree can be used for searching different patterns). The search time is independent from the length of the text that can be very big. Suffix trees are successfully used in search over human and other genome sequences that may contain millions of nucleotides. This is not achievable with the known alternative algorithms[1].

A suffix tree for a given string m-character string has m leaves, numbered from 0 to m-1 (or from 1 to m, if they are numbered from 1). These numbers tell the start position of the suffix that is spelled while walking from the tree root to that leaf. Numbers are used by various algorithms that also need to tell the location, not just that the search pattern is present in the text. There are some string for that the suffix tree theoretically cannot be built as the path to the suffix would not always end to the leaf. However the tree can always be built for the text that is terminated by the end character, not present anywhere else in the input. In suffix tree literature it is often assumed that the input is terminated by such a character (applet uses the black rectangle as a terminator sign).

Not only suffix trees are efficient solution for the exact matching problems but they are also used by algorithms that search for the inexact matches or inside the set of sequences (with requirement to identify the set). Such tasks are not solvable by this structure alone but are solvable by algorithms that use the suffix trees.

Search

To know if your pattern is somewhere in the tree, you only need to walk from the root, turning at branches in correspondence with your pattern. If this is possible, the patterns has been present in the text from that the tree was constructed. However if you also need to tell where the pattern is located and how many times does it occur, you need then to scan recursively the fragment of the tree that spans past the end of your walk. You will find the required position indexes at the end of every leaf of this subtree. The number of leafs is equal to the number of occurences (test on the applet with some trivial input).

The sequence that is the most common in the tree can be found in the deepest for node while measuring the depth as the number of characters traversed from the root.

Building

There are two known algorithms (Weiner and Ukkonen) that can build suffix trees in linear time, proportional to the length of the search target. The trivial (naive) algorithm, used inside the applet, builds the tree in quadratic time. It starts from the empty tree, then "enters" into it the text sequence (t[0..n]), creating separate node for every letter. Then, in a loop, enters all suffixes (t[1..n], t[2..n], etc). When entering sequence into non empty tree, the algorithm "walks" from the root following characters that match the sequence being entered. When it is no longer possible to follow the existing path, the algorithm starts a new branch in the tree. At the end, algorithm compacts the tree, concatenating parent and child nodes if the parent node has only this one child.

The "true" (linear time) building algorithm is described in [2],[3] or you can try the original Ukkonen's article while it is not very easy to read. The C source code can be found at [4] and Java code at [5].

Compacting

Suffix links

If no any optimization has been done, the tree usually contains multiple, identical "tails" (in our example, for instance, yaza). Such tree can easily us tens times more space than the original text. The space to store the tree can be reduced by introducing "suffix links" so that such tails would only be allocated in memory once. This optimization is part of the Ukkonen tree building algorithm that deals with simplified, intermediate "implicit trees" without position information. A structure with such suffix links is no longer a tree but a directed graph. This graph still has no loops, you can only walk one way through it.

Creating links. Detecting and creating suffix links may look computational intensive on its own but they actually save the time, being one of optimizations required to build the tree in linear time. We can use the link table that maps between the start position of the suffix and the location and the pointer to the node that has been created for the symbol at that position (first symbol in the suffix) inside the tree.

For instance, building a tree for zabzayaza, after entering the first suffix (the word itself), we have the following table:

Now, when entering abzayaza from the position 1, we

  • start from the root and see that the only path leading out of there is the path through "z" - that does not fit.
  • check our table, if we have link for a sequence at the position 1 - yes, we do.
  • so we link the root to the position 1 (do not even need to do any character comparisons here) - and it is done with the second suffix. No need to walk through the suffix or through the tree:

Here and below, overbrace means a link from left to rights and the text under it matches the beginning of the whole suffix being entered. Same way we easily enter the next suffix bzayaza by just creating the link from root to "b":

Such "link and forget" optimization is known as skip-count trick and is required to run in linear time.

However we cannot insert zayaza by just creating the link from the root to the node 3. "z" is already attached to the root (node 0) hence linking the node 3 as well would create two pathes from the root that lead through the same letter, "z". Such structure would no longer be a valid suffix tree.

Hence, when inserting zayaza, we need to step from the root to "z" at node 0 (there is a path), and then to "a" at node 1 (there is a path). However the only path from the node 1 currently goes through "b" (node 2). Hence we can create a path through the required "y" by creating a suffix link from the node 1 to the node 5:

How at this stage to know that the required yaza sequence starts at the node 5? abzayaza is a suffix from the position 2. Also, we have already stepped through the two letters (z, a) while inserting it and are currently dealing with the position 3. Hence we can find the appropriate suffix at position 2+3 = 5 in our table.

So far we were not creating any new nodes, just links. This is sufficient for many cases but not for all - in some cases we do need to create a new nodes, otherwise our structure may report false presence of some strings that were not part of the initial text.

So, the general rule would be

  • walk from the root till it is possible
  • then there is no path any longer, link the current node with the node indexed as suffix_start_position + our_current_position in our suffix table.

Suffix links can be efficiently used if it is not required to tell the position of the pattern in the text and is enough to know that it is somewhere present. There are also published works on creating the fully functional compacted suffix trees[6]. Implicit suffix tree can be converted into "standard" suffix tree in linear time.

Once a leaf, a leaf

The tree under construction cannot be extended past the existing leaf, as this would mean extending past the end line terminator. Also, at most one leaf can be attached to any node, as attaching two would mean that there are two ways through the same symbol (end terminator) out of the node. Hence leafs can be more efficiently implemented as some alternative structures without children support or even as flags in the parent nodes. In the applet view, "end of text" leafs are merged with the parent nodes.

On line construction

Ukkonen also describes another algorithm that processes the string symbol by symbol from left to right, and has always the suffix tree for the scanned part of the string ready ("on line construction")[7]. It is is based on the observation that the suffixes of a string t[0..n] can be obtained from the suffixes of string t[0 .. n-1 by appending symbol t[n] at the end of each suffix of t[0..n] and by adding the empty suffix, in other words:

  • start from the root with one empty suffix
  • for each symbol in the text, from left to right:
  1. append this symbol to the end of every suffix.
  2. add new empty suffix to the root.

The growing tree is the suffix tree, but without suffix links or other means of compression it is a very bulky structure. As all idea is to use algorithm to search over really long text, usually some kind of compacting is required anyway and most of this article is devoted on how to reduce the space requirements from quadratic to linear.

References

  1. 1 Dan Gusfield (1997). Algorithms of on Strings, Trees and Sequences. Cambridge University Press, ISBM 0 521 58519 8, 534p.
  2. 2 Lloyd Allison. Suffix Trees
  3. 3 Suffix trees
  4. 4 Dotan Tsadok, Shlomo Yona. ANSI C code to build suffix tree in linear time using Ukkonen's algorithm], Free software
  5. 5 Illya Havsiyevych. Suffix Trees: Java Ukkonen's Algorithm (JUnit tests questionable; see the URL for more details)
  6. 6 N. Valimaki, W.Gerlach, K.Dixit, V.Makinen Engineering a Compressed Suffix Tree Implementation
  7. 7 Esko Ukkonen. On–line construction of suffix trees

See also