TRE: A Regex Engine with Approximate Matching
TRE is a regex engine that allows for approximate matching. It does this by using calculating the Levenshtein Distance (number of insertions, deletions, or substitutions it would take to make the strings equal), as it searchs for a match.
The re::engine::TRE is a Perl wrapper around the TRE engine. It swaps out the default Perl regex engine with the TRE engine within the lexical scope that it is used.
Before we dive into the nitty gritty, here are a few examples.
S=Substitution, I=Insertion, D=Deletion
Fuzzy matching:
0> use re::engine::TRE max_cost => 2
1> say $& if 'pesarl' =~ /perl/x # 2I
pesa
2> say $& if 'prjl' =~ /perl/x # 2S
prjl
Fuzzy captures:
0> use re::engine::TRE max_cost => 6
1> say "\$1 = $1, \$2 = $2" if 'Fussy wuzzy has a beer' =~ /(fuzzy) wuzzy was a (bear)/xi
$1 = Fussy, $2 = beer
Perl regex variables returned (this is so handy):
0> use re::engine::TRE max_cost => 2
1> say "$1 starts at $-[1] and ends at $+[1]" if 'GATACA' =~ /GA(TCC)A/x
TAC starts at 2 and ends at 5
Implementation details
So now that you’ve seen the basic flavor of what the TRE is, let’s clear up some of the details that I found really confusing when I started with it.
The ERE vs the BRE Syntax
They syntax that TRE uses can be switched to either use the POSIX Extended Regular Expression syntax, ERE, or the Basic Regular Expression syntax, BRE. The docs page, however, does not make this clear. The re::engine::TRE implementation looks for the /x
flag in order to switch on ERE, which is almost certainly what you want.
The Approximate match specifier
TRE does mix in some of its own syntax as well though. The ‘Approximate Match Settings’ are the primary example of this. They allow you to set the number of mismatches, in total or specific type, for an atom. They also allow you to set the score of each type of mismatch for that atom.
EX:
0> use re::engine::TRE max_cost => 6
1> say "$1 starts at $-[1] and ends at $+[1]" if 'GATTACA' =~ /GA(TCC){ #1+1~2 }A/x
TTAC starts at 2 and ends at 6
2> say "$1 starts at $-[1] and ends at $+[1]" if 'GATGGGGA' =~ /GA(TACAC){ #1+1-1~2 }A/x
# No match
The last example says that my capture group can have only one insertion, 1 deletion, or 1 substitution, but that it can’t have more than 2 errors in total. This brings up one of the gotchas though. It is sometimes hard to ‘see’ the edits that lead to a match. For example, GATCCCCA would have matched and given ‘TCCC’ as my capture, which is one substitution, and 1 deletion in the capture group, and 1 insertion outside of it.
Pragmas
When importing re::engine::TRE, you can set various ‘global’ pragma, such as ‘max_cost’, which specify the overall settings for TRE to use. The defaults can be found in the re::engine::TRE docs. If you set the global max cost to 2, but have an atom allowing up the 3 errors, the global cost will overrull that and no match will be returned.
I should note that I don’t know the details of how Perl’s use
statement works well enough to speak to what kind of overhead throwing use re::engine::TRE max_cost => 2
into a function result in. I think that use
is a compile time thingy, and therefor you can put it in a function had have low to no overhead from the ‘import’ each time the function is called. But I could be totally wrong about that.
Other places TRE can be found
Other Fuzzy Matching Options
- Text::Fuzzy
- String::Approx
- Text::Levenshtein::XS
- Python’s alt regex module
- Python’s Fuzzywuzzy
Excellent Articles
- How to Write a Spelling Corrector by Peter Norvig
- SymSpell by Wolf Garbe
A Bunch of Example Test Cases
Some day I will hopefully PR more test cases into re::engine::TRE
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
plan tests => 66;
# Approx match a basic string
{
use re::engine::TRE max_cost => 1;
ok( 'pearl' =~ /perl/, 'Test Approx match with 1 insertion' );
ok( 'per' =~ /perl/, 'Test Approx match with 1 deletion' );
ok( 'perd' =~ /perl/, 'Test Approx match with 1 substitution' );
}
# Approx match a basic string
{
use re::engine::TRE max_cost => 2;
ok( 'pesarl' =~ /perl/, 'Test Approx match with 2 insertion' );
ok( 'er' =~ /perl/, 'Test Approx match with 2 deletion' );
ok( 'berd' =~ /perl/, 'Test Approx match with 2 substitution' );
ok( 'prjl' =~ /perl/, 'Test Approx match with 1I1D' );
ok( 'ped' =~ /perl/, 'Test Approx match with 1D1S' );
ok( 'derJl' =~ /perl/, 'Test Approx match with 1I1S' );
ok( 'terd' =~ /perl/, 'Test Approx match with 2S' );
}
# Perl magic regex vars
# Note that for clasical 'bounds' {} and 'capture groups' () must be escaped
# Note also that the local approx match {} also need escaping
# All esamples with capture groups are locally scoped to avoid $1 issues
# qr strings specifically seem to not work at the moment
{
use re::engine::TRE max_cost => 2;
{
ok( 'GATAAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex normal match' );
is( $1, 'AAAA', 'Test capture group' );
is( $-[1], 3, 'Test @- capture group 1 index' );
is( $+[1], 7, 'Test @+ capture group 1 index' );
}
{
ok( 'GAAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex normal match non-greedy match'
);
is( $1, 'AAAA', 'Test capture group - non-greedy' );
is( $-[1], 2, 'Test @- capture group 1 index - non-greedy' );
is( $+[1], 6, 'Test @+ capture group 1 index - non-greedy' );
}
# 1 Insertion/Deletion/Substitution
{
ok( 'GTATAAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex mutated match - 1I' );
is( $1, 'TAAA', 'Test capture group mutated - 1I' );
is( $-[1], 3, 'Test @- capture group 1 index - 1I' );
is( $+[1], 7, 'Test @+ capture group 1 index - 1I' );
}
{
ok( 'GATAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex mutated match - 1D'
);
is( $1, 'AAAA', 'Test capture group - 1D' );
is( $-[1], 3, 'Test @- capture group 1 index - 1D' );
is( $+[1], 7, 'Test @+ capture group 1 index - 1D' );
}
{
ok( 'GATAAGAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex mutated match - 1S' );
is( $1, 'AAGA', 'Test capture group - 1S' );
is( $-[1], 3, 'Test @- capture group 1 index - 1S' );
is( $+[1], 7, 'Test @+ capture group 1 index - 1S' );
}
# 2 Insertion/Deletion/Substitution
{
ok( 'GACTAAAAAT' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex mutated match - 2I' );
is( $1, 'AAAA', 'Test capture group - 2I' );
is( $-[1], 4, 'Test @- capture group 1 index - 2I' );
is( $+[1], 8, 'Test @+ capture group 1 index - 2I' );
}
{
ok( 'ATAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex mutated match - 2D'
);
is( $1, 'AAAA', 'Test capture group - 2D' );
is( $-[1], 2, 'Test @- capture group 1 index - 2D' );
is( $+[1], 6, 'Test @+ capture group 1 index - 2D' );
}
{
ok( 'GATAAAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex mutated match - 2S' );
is( $1, 'AAAA', 'Test capture group - 2S' );
is( $-[1], 3, 'Test @- capture group 1 index - 2S' );
is( $+[1], 7, 'Test @+ capture group 1 index - 2S' );
}
{
ok( 'CATAATAAA' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex mutated match - 1I1S'
);
is( $1, 'AATA', 'Test capture group - 1I1S' );
is( $-[1], 3, 'Test @- capture group 1 index - 1I1S' );
is( $+[1], 7, 'Test @+ capture group 1 index - 1I1S' );
}
{
ok( 'GAAAAAAT' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex mutated match - 1I1D'
);
is( $1, 'AAAA', 'Test capture group - 1I1D' );
is( $-[1], 3, 'Test @- capture group 1 index - 1I1D' );
is( $+[1], 7, 'Test @+ capture group 1 index - 1I1D' );
}
{
ok( 'ATAAAAG' =~ /^G[AT]{2}?([GATC]{4})A/x,
'Test complex mutated match - 1D1S'
);
is( $1, 'AAAG', 'Test capture group - 1D1S' );
is( $-[1], 3, 'Test @- capture group 1 index - 1D1S' );
is( $+[1], 7, 'Test @+ capture group 1 index - 1D1S' );
}
}
# Use the local approximate match settings
# You can specify a max for a local subpattern, that counts toward global max
# local max total can't exceed global
{
use re::engine::TRE max_cost => 2;
# Show that everything is normal
{
ok( 'ATACTAC' =~ /^A(TACT)AC/x,
'Local approx match settings - exact' );
is( $1, 'TACT', 'Local approx match capture group 1 - exact' );
}
# Mutate 2 Bases in the capture group and still capture
{
ok( 'ATGTTAC' =~ /^A(TACT)AC/x,
'Local approx match settings - 2S, no local' );
is( $1, 'TGTT', 'Local approx match capture group 1 - 2S, no local' );
}
# Mutate 2 Bases in capture group, but set local max to 1 (no match)
{
ok( 'ATGTTAC' !~ /^A(TACT){~1}AC/x,
'Local approx match settings - 2S, local 1'
);
is( $1, undef, 'Local approx match capture group 1 - 2S, local 1' );
}
# Mutate only 1 base in the capture group and 1 outside it
{
ok( 'ATATTAG' =~ /^A(TACT){~1}AC/x,
'Local approx match settings - 1S global, local 1'
);
is( $1, 'TATT', 'Local approx match capture group 1 - 1S global, local 1' );
}
# Mutate only 1 base in the capture group and 2 outside it
{
ok( 'CTATTAG' !~ /^A(TACT){~1}AC/x,
'Local approx match settings - 2S global, local 1'
);
is( $1, undef, 'Local approx match capture group 1 - 2S global, local 1' );
}
# Mutate only 3 base in the capture group and 0 outside it to test global limit
{
ok( 'ATGTGAG' !~ /^A(TACT){~3}AC/x,
'Local approx match settings - 0S global, local 3'
);
is( $1, undef, 'Local approx match capture group 1 - 0 global, local 3' );
}
}