Average Worst case | ||
O ( n ) {displaystyle {mathcal {O}}(n)} O ( n ) {displaystyle {mathcal {O}}(n)} O ( n ) {displaystyle {mathcal {O}}(n)} O ( n ) {displaystyle {mathcal {O}}(n)} |
In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used, among others, in full text indices, data compression algorithms and within the field of bioinformatics.
Contents
- Definition
- Example
- Correspondence to suffix trees
- Space Efficiency
- Construction Algorithms
- Applications
- References
Suffix arrays were introduced by Manber & Myers (1990) as a simple, space efficient alternative to suffix trees. They have independently been discovered by Gaston Gonnet in 1987 under the name PAT array (Gonnet, Baeza-Yates & Snider 1992).
Definition
Let
The suffix array
Example
Consider the text banana$
to be indexed:
The text ends with the special sentinel letter $
that is unique and lexicographically smaller than any other character. The text has the following suffixes:
These suffixes can be sorted in ascending order:
The suffix array
The suffix array with the suffixes written out vertically underneath for clarity:
So for example, ana$
.
Correspondence to suffix trees
Suffix arrays are closely related to suffix trees:
It has been shown that every suffix tree algorithm can be systematically replaced with an algorithm that uses a suffix array enhanced with additional information (such as the LCP array) and solves the same problem in the same time complexity. Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms (e.g., compared to Ukkonen's algorithm) and improved cache locality.
Space Efficiency
Suffix arrays were introduced by Manber & Myers (1990) in order to improve over the space requirements of suffix trees: Suffix arrays store
However, in certain applications, the space requirements of suffix arrays may still be prohibitive. Analyzed in bits, a suffix array requires
Such discrepancies motivated a trend towards compressed suffix arrays and BWT-based compressed full-text indices such as the FM-index. These data structures require only space within the size of the text or even less.
Construction Algorithms
A suffix tree can be built in
A naive approach to construct a suffix array is to use a comparison-based sorting algorithm. These algorithms require
More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related to each other. These algorithms strive to achieve the following goals:
One of the first algorithms to achieve all goals is the SA-IS algorithm of Nong, Zhang & Chan (2009). The algorithm is also rather simple (< 100 LOC) and can be enhanced to simultaneously construct the LCP array. The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A careful implementation by Yuta Mori outperforms most other linear or super-linear construction approaches.
Beside time and space requirements, suffix array construction algorithms are also differentiated by their supported alphabet: constant alphabets where the alphabet size is bound by a constant, integer alphabets where characters are integers in a range depending on
Most suffix array construction algorithms are based on one of the following approaches:
A well-known recursive algorithm for integer alphabets is the DC3 / skew algorithm of Kärkkäinen & Sanders (2003). It runs in linear time and has successfully been used as the basis for parallel and external memory suffix array construction algorithms.
Recent work by Salson et al. (2009) proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is
Applications
The suffix array of a string can be used as an index to quickly locate every occurrence of a substring pattern
Finding the substring pattern
Suffix sorting algorithms can be used to compute the Burrows–Wheeler transform (BWT). The BWT requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other character (i.e., $), then the order of the sorted rotated BWT matrix corresponds to the order of suffixes in a suffix array. The BWT can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the BWT string:
Suffix arrays can also be used to look up substrings in Example-Based Machine Translation, demanding much less storage than a full phrase table as used in Statistical machine translation.
Many additional applications of the suffix array require the LCP array. Some of these are detailed in the application section of the latter.