Office of Technology Transfer – University of Michigan

An Efficient Method for Constructing Large Suffix Trees

Technology #3139

Questions about this technology? Ask a Technology Manager

Download Printable PDF

Categories
Researchers
Jignesh M. Patel
Managed By
Drew Bennett
Associate Director - Software Licensing 734-615-4004
Publications
Practical Suffix Tree Construction
Proc 30th Int Conf VLDB, Page 36. 2004

Background

Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree, which is a versatile data structure that is used in a number of applications that involve searching string and sequence database (such as text corpora or biological sequence datasets). Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging.

Technology

Researchers at the University of Michigan have developed an efficient algorithm for constructing suffix trees on large data sets. In such cases the suffix trees are often larger than the amount of main memory and existing suffix tree algorithms are often prohibitively expensive. This algorithm is top down disk-based technique, and pays careful attention to the disk I/O behavior and the processor cache utilization during the suffix tree construction process to produce and algorithm that is significantly faster than previous approaches.

Applications and Advantages

Applications

  • Construction of suffix trees for queries over large sequences

Advantages

  • Scales to sizes much larger than previously-nl-described
  • Out-performs the best known disk-based-nl-construction algorithms by a factor of 5 to 10