Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree, which is a versatile data structure that is used in a number of applications that involve searching string and sequence database (such as text corpora or biological sequence datasets). Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging.
Researchers at the University of Michigan have developed an efficient algorithm for constructing suffix trees on large data sets. In such cases the suffix trees are often larger than the amount of main memory and existing suffix tree algorithms are often prohibitively expensive. This algorithm is top down disk-based technique, and pays careful attention to the disk I/O behavior and the processor cache utilization during the suffix tree construction process to produce and algorithm that is significantly faster than previous approaches.
Applications and Advantages
- Construction of suffix trees for queries over large sequences
- Scales to sizes much larger than previously-nl-described
- Out-performs the best known disk-based-nl-construction algorithms by a factor of 5 to 10