Harman Patil (Editor)

TRE (computing)

Original author(s): Ville Laurikari
Website: laurikari.net/tre/
Written in: C
Type: Approximate string matching
License: 2-clause BSD-like license
TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching. It is developed by Ville Laurikari and distributed under a 2-clause BSD-like license.

The library is written in C and provides functions for searching input text with regular expressions. The main difference from other regular expression engines is that TRE can match text fragments approximately, that is, allowing for a certain number of typos in the text.

Features

TRE uses extended regular expression syntax with the addition of "directions" for matching the preceding fragment approximately. Each such direction specifies how many typos are allowed for that fragment.

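Approximate matching is performed in a way similar to Levenshtein distance, which means that three types of typos are recognized:

  • insertion of an extra character;
  • deletion (omission) of a character;
  • substitution of one character for another.

TRE allows the cost of each of the three typo types to be specified independently.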
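The idea of independent per-typo costs can be sketched with a weighted variant of the classic Levenshtein dynamic programme. This is only an illustration of the cost model, not TRE's implementation; the function name `weighted_edit_distance` and its parameters are hypothetical, and it assumes that an "insertion" means an extra character in the text and a "deletion" means a pattern character missing from the text.

```python
def weighted_edit_distance(pattern, text, ins_cost=1, del_cost=1, sub_cost=1):
    """Cheapest way to turn `pattern` into `text` with per-typo costs.

    Hypothetical helper for illustration; not part of TRE.
    """
    # Row for the empty pattern prefix: every text character is an insertion.
    prev = [j * ins_cost for j in range(len(text) + 1)]
    for i, p in enumerate(pattern, start=1):
        # Empty text prefix: every pattern character so far is a deletion.
        curr = [i * del_cost]
        for j, t in enumerate(text, start=1):
            curr.append(min(
                prev[j - 1] + (0 if p == t else sub_cost),  # match or substitution
                prev[j] + del_cost,                         # p is missing from the text
                curr[j - 1] + ins_cost,                     # t is extra in the text
            ))
        prev = curr
    return prev[-1]


# With unit costs this is the ordinary Levenshtein distance.
print(weighted_edit_distance("kitten", "sitting"))          # 3
# Making insertions expensive raises the cost of an extra character.
print(weighted_edit_distance("cat", "cart", ins_cost=5))    # 5
```

Raising one cost relative to the others makes the corresponding kind of typo less acceptable without affecting the other two.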

The project comes with a command-line utility, a reimplementation of agrep.

Approximate matching requires a syntax extension, but when the feature is not used, TRE behaves like most other regular expression engines. This means that

  • it implements ordinary regular expressions written for exact matching;
  • programmers familiar with POSIX-style regular expressions need little study to start using TRE.

Predictable time and memory consumption

The library's author states that matching time grows linearly with the length of the input text, while memory use stays constant during matching and depends only on the pattern, not on the input.
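To make the "linear in the text, memory bounded by the pattern" behaviour concrete, here is a minimal sketch of Sellers' algorithm for approximate substring search on a plain string pattern. It is a different and much simpler technique than the automaton-based matcher TRE uses for full regular expressions, and the function name `best_substring_cost` is hypothetical; the point is only that a single column of `len(pattern) + 1` numbers is kept while the text is read once.

```python
def best_substring_cost(pattern, text):
    """Smallest unit-cost edit distance between `pattern` and any substring of `text`.

    Sellers' algorithm, shown for illustration only; not TRE's algorithm.
    """
    m = len(pattern)
    # column[i] = cost of matching pattern[:i] against a substring ending here.
    column = list(range(m + 1))
    best = column[m]
    for ch in text:                      # one pass over the text
        diag = column[0]                 # value from the previous column, row i-1
        column[0] = 0                    # a match may start at any text position
        for i in range(1, m + 1):
            above_left = diag
            diag = column[i]             # remember the old value for the next row
            column[i] = min(
                above_left + (pattern[i - 1] != ch),  # match or substitution
                diag + 1,                             # extra character in the text
                column[i - 1] + 1,                    # pattern character missing
            )
        best = min(best, column[m])
    return best


print(best_substring_cost("regular", "a rogular expression"))  # 1
```

The work per input character depends only on the pattern length, so for a fixed pattern the running time grows linearly with the text, and the working storage never grows with it.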

Other

Other features common to most regular expression engines can be checked in regex engine comparison tables or in the feature list on the TRE web page.

Usage example

Approximate matching directions are written in curly brackets and must be distinguishable from repetition quantifiers (if necessary, by inserting a space after the opening bracket):

  • (regular){~1}\s+(expression){~2} matches variants of the phrase "regular expression" in which "regular" has no more than one typo and "expression" no more than two; as in ordinary regular expressions, "\s+" means one or more whitespace characters. For example, "rogular ekspression" would pass the test;
  • (expression){ 5i + 3d + 2s < 11} matches the word "expression" if the total cost of typos is less than 11, with the insertion cost set to 5, deletion to 3 and substitution to 2. For example, "ekspresson" has a cost of 10.
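The arithmetic behind the second example can be checked with a short sketch. Turning "expression" into "ekspresson" takes one substitution (x becomes k), one insertion (the extra s) and one deletion (the missing i). The toy parser below is hypothetical and handles only the simple form of cost equation shown above; it is not TRE's parser, and the names `parse_cost_equation` and `total_cost` are made up for illustration.

```python
import re


def parse_cost_equation(spec):
    """Parse a string like '{ 5i + 3d + 2s < 11}' into (weights, limit).

    Toy parser for illustration; covers only this simple form.
    """
    left, limit = spec.strip("{} ").split("<")
    weights = {"i": 0, "d": 0, "s": 0}
    for coeff, kind in re.findall(r"(\d+)\s*([ids])", left):
        weights[kind] = int(coeff)
    return weights, int(limit)


def total_cost(weights, insertions, deletions, substitutions):
    """Weighted sum of the typo counts."""
    return (weights["i"] * insertions
            + weights["d"] * deletions
            + weights["s"] * substitutions)


weights, limit = parse_cost_equation("{ 5i + 3d + 2s < 11}")
# "ekspresson": one insertion, one deletion, one substitution.
cost = total_cost(weights, insertions=1, deletions=1, substitutions=1)
print(cost, cost < limit)  # 10 True
```

A second insertion would bring the total to 15, which exceeds the limit, so such a variant would no longer match under these settings.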
Language bindings

Apart from C, TRE is usable through bindings for Perl, Python and Haskell. However, a cross-platform project would need a separate interface for each target platform.

Disadvantages

Since other regular expression engines usually do not provide approximate matching, there is almost no competing implementation with which TRE can be compared. However, there are a few things that programmers may wish to see implemented in future releases:

  • a replacement mechanism for substituting matched text fragments (as in the sed stream editor and many modern regular expression implementations, including those built into Perl or Java);
  • the option of using an approximate matching algorithm other than Levenshtein's for better assessment of typos (for example Soundex), or at least an extension of the current algorithm to allow typos of the "swap" type (see Damerau–Levenshtein distance).