TRE: A Regex Engine with Approximate Matching

November 4, 2018

A Fuzzy Wuzzy Bear

TRE is a regex engine that allows for approximate matching. It does this by using calculating the Levenshtein Distance (number of insertions, deletions, or substitutions it would take to make the strings equal), as it searchs for a match.

The re::engine::TRE is a Perl wrapper around the TRE engine. It swaps out the default Perl regex engine with the TRE engine within the lexical scope that it is used.

Before we dive into the nitty gritty, here are a few examples.

S=Substitution, I=Insertion, D=Deletion

Fuzzy matching:

0> use re::engine::TRE max_cost => 2
1> say $& if 'pesarl' =~ /perl/x # 2I
pesa
2> say $& if 'prjl' =~ /perl/x # 2S
prjl

Fuzzy captures:

0> use re::engine::TRE max_cost => 6
1> say "\$1 = $1, \$2 = $2" if 'Fussy wuzzy has a beer' =~ /(fuzzy) wuzzy was a (bear)/xi
$1 = Fussy, $2 = beer

Perl regex variables returned (this is so handy):

0> use re::engine::TRE max_cost => 2
1> say "$1 starts at $-[1] and ends at $+[1]" if 'GATACA' =~ /GA(TCC)A/x
TAC starts at 2 and ends at 5

Implementation details

So now that you’ve seen the basic flavor of what the TRE is, let’s clear up some of the details that I found really confusing when I started with it.

The ERE vs the BRE Syntax

They syntax that TRE uses can be switched to either use the POSIX Extended Regular Expression syntax, ERE, or the Basic Regular Expression syntax, BRE. The docs page, however, does not make this clear. The re::engine::TRE implementation looks for the /x flag in order to switch on ERE, which is almost certainly what you want.

The Approximate match specifier

TRE does mix in some of its own syntax as well though. The Approximate Match Settings’ are the primary example of this. They allow you to set the number of mismatches, in total or specific type, for an atom. They also allow you to set the score of each type of mismatch for that atom.

EX:

0> use re::engine::TRE max_cost => 6
1> say "$1 starts at $-[1] and ends at $+[1]" if 'GATTACA' =~ /GA(TCC){ #1+1~2 }A/x
TTAC starts at 2 and ends at 6
2> say "$1 starts at $-[1] and ends at $+[1]" if 'GATGGGGA' =~ /GA(TACAC){ #1+1-1~2 }A/x
# No match

The last example says that my capture group can have only one insertion, 1 deletion, or 1 substitution, but that it can’t have more than 2 errors in total. This brings up one of the gotchas though. It is sometimes hard to see’ the edits that lead to a match. For example, GATCCCCA would have matched and given TCCC as my capture, which is one substitution, and 1 deletion in the capture group, and 1 insertion outside of it.

Pragmas

When importing re::engine::TRE, you can set various global’ pragma, such as max_cost’, which specify the overall settings for TRE to use. The defaults can be found in the re::engine::TRE docs. If you set the global max cost to 2, but have an atom allowing up the 3 errors, the global cost will overrull that and no match will be returned.

I should note that I don’t know the details of how Perl’s use statement works well enough to speak to what kind of overhead throwing use re::engine::TRE max_cost => 2 into a function result in. I think that use is a compile time thingy, and therefor you can put it in a function had have low to no overhead from the import’ each time the function is called. But I could be totally wrong about that.

Other places TRE can be found

Other Fuzzy Matching Options

Excellent Articles

A Bunch of Example Test Cases

Some day I will hopefully PR more test cases into re::engine::TRE

#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
plan tests => 66;

# Approx match a basic string
{
    use re::engine::TRE max_cost => 1;
    ok( 'pearl' =~ /perl/, 'Test Approx match with 1 insertion' );
    ok( 'per' =~ /perl/,   'Test Approx match with 1 deletion' );
    ok( 'perd' =~ /perl/,  'Test Approx match with 1 substitution' );
}

# Approx match a basic string
{
    use re::engine::TRE max_cost => 2;
    ok( 'pesarl' =~ /perl/, 'Test Approx match with 2 insertion' );
    ok( 'er' =~ /perl/,     'Test Approx match with 2 deletion' );
    ok( 'berd' =~ /perl/,   'Test Approx match with 2 substitution' );
    ok( 'prjl' =~ /perl/,   'Test Approx match with 1I1D' );
    ok( 'ped' =~ /perl/,    'Test Approx match with 1D1S' );
    ok( 'derJl' =~ /perl/,  'Test Approx match with 1I1S' );
    ok( 'terd' =~ /perl/,   'Test Approx match with 2S' );
}

# Perl magic regex vars
# Note that for clasical 'bounds' {} and 'capture groups' () must be escaped
# Note also that the local approx match {} also need escaping
# All esamples with capture groups are locally scoped to avoid $1 issues
# qr strings specifically seem to not work at the moment
{
    use re::engine::TRE max_cost => 2;
    {
        ok( 'GATAAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex normal match' );
        is( $1,    'AAAA', 'Test capture group' );
        is( $-[1], 3,      'Test @- capture group 1 index' );
        is( $+[1], 7,      'Test @+ capture group 1 index' );
    }
    {

        ok( 'GAAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex normal match non-greedy match'
        );
        is( $1,    'AAAA', 'Test capture group - non-greedy' );
        is( $-[1], 2,      'Test @- capture group 1 index - non-greedy' );
        is( $+[1], 6,      'Test @+ capture group 1 index - non-greedy' );
    }

    # 1 Insertion/Deletion/Substitution
    {
        ok( 'GTATAAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex mutated match - 1I' );
        is( $1,    'TAAA', 'Test capture group mutated - 1I' );
        is( $-[1], 3,      'Test @- capture group 1 index - 1I' );
        is( $+[1], 7,      'Test @+ capture group 1 index - 1I' );
    }
    {
        ok( 'GATAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex mutated match - 1D'
        );
        is( $1,    'AAAA', 'Test capture group - 1D' );
        is( $-[1], 3,      'Test @- capture group 1 index - 1D' );
        is( $+[1], 7,      'Test @+ capture group 1 index - 1D' );
    }
    {
        ok( 'GATAAGAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex mutated match - 1S' );
        is( $1,    'AAGA', 'Test capture group - 1S' );
        is( $-[1], 3,      'Test @- capture group 1 index - 1S' );
        is( $+[1], 7,      'Test @+ capture group 1 index - 1S' );
    }

    # 2 Insertion/Deletion/Substitution
    {
        ok( 'GACTAAAAAT' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex mutated match - 2I' );
        is( $1,    'AAAA', 'Test capture group - 2I' );
        is( $-[1], 4,      'Test @- capture group 1 index - 2I' );
        is( $+[1], 8,      'Test @+ capture group 1 index - 2I' );
    }
    {
        ok( 'ATAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex mutated match - 2D'
        );
        is( $1,    'AAAA', 'Test capture group - 2D' );
        is( $-[1], 2,      'Test @- capture group 1 index - 2D' );
        is( $+[1], 6,      'Test @+ capture group 1 index - 2D' );
    }
    {
        ok( 'GATAAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex mutated match - 2S' );
        is( $1,    'AAAA', 'Test capture group - 2S' );
        is( $-[1], 3,      'Test @- capture group 1 index - 2S' );
        is( $+[1], 7,      'Test @+ capture group 1 index - 2S' );

    }
    {
        ok( 'CATAATAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex mutated match - 1I1S'
        );
        is( $1,    'AATA', 'Test capture group - 1I1S' );
        is( $-[1], 3,      'Test @- capture group 1 index - 1I1S' );
        is( $+[1], 7,      'Test @+ capture group 1 index - 1I1S' );
    }
    {
        ok( 'GAAAAAAT' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex mutated match - 1I1D'
        );
        is( $1,    'AAAA', 'Test capture group - 1I1D' );
        is( $-[1], 3,      'Test @- capture group 1 index - 1I1D' );
        is( $+[1], 7,      'Test @+ capture group 1 index - 1I1D' );
    }
    {
        ok( 'ATAAAAG' =~ /^G[AT]{2}?([GATC]{4})A/x,
            'Test complex mutated match - 1D1S'
        );
        is( $1,    'AAAG', 'Test capture group - 1D1S' );
        is( $-[1], 3,      'Test @- capture group 1 index - 1D1S' );
        is( $+[1], 7,      'Test @+ capture group 1 index - 1D1S' );
    }
}

# Use the local approximate match settings
# You can specify a max for a local subpattern, that counts toward global max
# local max total can't exceed global
{
    use re::engine::TRE max_cost => 2;

    # Show that everything is normal
    {
        ok( 'ATACTAC' =~ /^A(TACT)AC/x,
            'Local approx match settings - exact' );
        is( $1, 'TACT', 'Local approx match capture group 1 - exact' );
    }

    # Mutate 2 Bases in the capture group and still capture
    {
        ok( 'ATGTTAC' =~ /^A(TACT)AC/x,
            'Local approx match settings - 2S, no local' );
        is( $1, 'TGTT', 'Local approx match capture group 1 - 2S, no local' );
    }

    # Mutate 2 Bases in capture group, but set local max to 1 (no match)
    {
        ok( 'ATGTTAC' !~ /^A(TACT){~1}AC/x,
            'Local approx match settings - 2S, local 1'
        );
        is( $1, undef, 'Local approx match capture group 1 - 2S, local 1' );
    }

    # Mutate only 1 base in the capture group and 1 outside it
    {
        ok( 'ATATTAG' =~ /^A(TACT){~1}AC/x,
            'Local approx match settings - 1S global, local 1'
        );
        is( $1, 'TATT', 'Local approx match capture group 1 - 1S global, local 1' );
    }

    # Mutate only 1 base in the capture group and 2 outside it
    {
        ok( 'CTATTAG' !~ /^A(TACT){~1}AC/x,
            'Local approx match settings - 2S global, local 1'
        );
        is( $1, undef, 'Local approx match capture group 1 - 2S global, local 1' );
    }

    # Mutate only 3 base in the capture group and 0 outside it to test global limit
    {
        ok( 'ATGTGAG' !~ /^A(TACT){~3}AC/x,
            'Local approx match settings - 0S global, local 3'
        );
        is( $1, undef, 'Local approx match capture group 1 - 0 global, local 3' );
    }
}