Matching expressions are potentially composed of ‘literals’, ‘character sets’, ‘alternators’, ‘character classes’, ‘quantifiers’,
‘anchors’ and ‘lookarounds’, along with ‘groups’ and ‘backreferences’.
A literal is simply a character that must be found for the match as a whole to succeed. A matching expression such as
'red' is composed solely of literals, and when used, turns the regular expression engine into a glorified Pos function:
I := Pos('red', 'the colour is red'); //15
I := TRegEx.Match('the colour is red', 'red').Index; //15
Because they have special meaning, certain characters (called ‘metacharacters’ in the regex jargon) must be ‘escaped’
with a preceding backslash to be used as a literal: *, +, ?, |, {, [, (, ), ^, $, ., # and the backslash itself, \. Consequently,
TRegEx.IsMatch('Is that a question?', '?')
will raise an exception, but
TRegEx.IsMatch('Is that a question?', '\?')
will return True.
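The difference can be seen in a small console program. This is only a sketch: it assumes the System.RegularExpressions unit, and it catches the general Exception class rather than naming the specific exception type the engine raises.

```pascal
program EscapeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

begin
  try
    //an unescaped ? has nothing to quantify, so the expression is invalid
    TRegEx.IsMatch('Is that a question?', '?');
  except
    on E: Exception do
      WriteLn('Invalid expression: ', E.Message);
  end;
  //escaped, the ? is treated as a literal question mark
  WriteLn(TRegEx.IsMatch('Is that a question?', '\?')); //True
  //the backslash must itself be escaped to be matched literally
  WriteLn(TRegEx.IsMatch('C:\Temp', 'C:\\Temp'));       //True
end.
```

Since a backslash has no special meaning inside a Delphi string literal, the matching expression can be typed exactly as the regex engine should see it.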
Character sets enable matching to one of a group of characters. A set is defined by placing the characters concerned inside square brackets. Thus, [ab] will match against both 'a' and 'b'. A hyphen can be used to specify a range, for example [1-9] will match against any non-zero digit, and multiple ranges can be included in the same set, e.g. [a-cM-O] will match any one of 'a', 'b', 'c', 'M', 'N' and 'O'.
Placing the ^ symbol immediately after the opening bracket negates the set, requiring any character except one in the set: [^euioa] therefore matches any character that is not a lower-case vowel.
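A short sketch of character sets in use, with input strings made up purely for illustration:

```pascal
program CharSetDemo;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

begin
  //a range: the 0 is skipped since it lies outside 1-9
  WriteLn(TRegEx.Match('Room 07', '[1-9]').Value);  //7
  //two ranges in one set
  WriteLn(TRegEx.Match('xyzN', '[a-cM-O]').Value);  //N
  //a negated set: the first character that is not a lower-case vowel
  WriteLn(TRegEx.Match('audio', '[^euioa]').Value); //d
  //nothing but lower-case vowels, so the negated set finds no match
  WriteLn(TRegEx.IsMatch('aeiou', '[^euioa]'));     //False
end.
```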
Alternators have a similar effect to character sets, only with respect to allowing for alternate sub-strings rather than alternate characters. The syntax is to put the acceptable variants in round brackets, delimited by a vertical bar. Thus,
(red|blue) matches against both 'red' and 'blue'.
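As a quick illustration (the input strings are invented for the purpose), an alternator can stand alone or sit in the middle of a longer expression:

```pascal
program AlternatorDemo;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

begin
  //either variant is acceptable
  WriteLn(TRegEx.Match('a blue sky', '(red|blue)').Value);  //blue
  //neither variant appears in the input
  WriteLn(TRegEx.IsMatch('a green field', '(red|blue)'));   //False
  //alternation inside a longer expression
  WriteLn(TRegEx.IsMatch('grey', 'gr(a|e)y'));              //True
  WriteLn(TRegEx.IsMatch('gray', 'gr(a|e)y'));              //True
end.
```

For single-character variants such as the last two, a character set (gr[ae]y) would do the same job.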
A quantifier requires the preceding item to appear a certain number of times:
• ? matches zero or one time
• * matches zero or more times
• + matches 1 or more times
• {n} (where n is an integer) requires the item to repeat the specified number of times
• {n,} requires the item to repeat at least n times
• {n,m} requires the item to repeat between n and m times
TRegEx.IsMatch('favorite', 'favou?rite'); //True
TRegEx.IsMatch('favorite', 'favou+rite'); //False
TRegEx.IsMatch('He snoozed: zzz...', ' z{4,5}\.'); //False
TRegEx.IsMatch('He snoozed: zzzz...', ' z{4,5}\.'); //True
TRegEx.IsMatch('He snoozed: zzzzzz...', ' z{4,5}\.'); //False
If the ‘item’ should be a substring rather than just a single character, enclose it in round brackets:
TRegEx.IsMatch('They are playing', 'play(ing)?'); //True
TRegEx.IsMatch('They have played', 'play(ing)?'); //True
TRegEx.IsMatch('She was wearing plaits', 'play(ing)?'); //False
Quantifiers are by default ‘greedy’ in that they will chomp through as much as they can:
S := TRegEx.Match('The cow went moooo!', 'o{2,3}').Value; //ooo
S := TRegEx.Match('The cow went moooo!', 'o{2,}').Value; //oooo
The converse of a greedy quantifier is a ‘lazy’ one, in which the smallest possible match is taken. To make a specific quantifier lazy, suffix it with a question mark:
S := TRegEx.Match('The cow went mooooo!', 'o{2,}?').Value; //oo
A character class is in effect a predefined character set:
• . (i.e., a full stop/period) matches any character but for #13 and #10:
S := TRegEx.Match('1st'#13#10'2nd', '.*').Value; //1st
An easy mistake to make with . is to forget quantifiers are greedy by default:
S := TRegEx.Match('stuff! more stuff!', '.*\!').Value;
//stuff! more stuff!
S := TRegEx.Match('stuff! more stuff!', '.*?\!').Value;
//stuff!
• \d matches any digit — this has the same effect as [0-9]
• \s matches any ASCII whitespace character
• \n matches the ASCII ‘line feed’ character, which is equivalent to embedding #10 in the matching expression, i.e.
IsMatch(S, '.*'#10) is equivalent to IsMatch(S, '.*\n').
• \r matches the ASCII ‘carriage return’ character, which is equivalent to embedding #13 in the matching expression.
• \t matches the ASCII ‘tab’ character, which is equivalent to embedding #9 in the matching expression.
• \w matches a ‘word’ character, defined as [0-9a-zA-Z_] (i.e. an ASCII letter, a digit or an underscore)
TRegEx.IsMatch('A1!', '\w\d.'); //True
TRegEx.IsMatch('A1', '\w\d.'); //False (no 3rd char)
TRegEx.IsMatch('AA!', '\w\d.'); //False (2nd char not a digit)
TRegEx.IsMatch('-1!', '\w\d.'); //False (1st char neither a letter nor a digit)
• \p{c} matches a character of the specified Unicode category, where c is L for letters, N for numbers, P for punctuation, and so on (a fuller list will be given shortly). Where \w will not pick up accented letters, Cyrillic letters, and so on, \p{L} will.
For any character class specified with a backslash followed by a letter code, putting the letter in upper case causes the match to be negated, e.g. \D means anything but a digit:
Success := TRegEx.IsMatch('8', '\D'); //False
The category code casing for \p is fixed however, so that \p{N} picks up a number and \P{N} picks up anything but a number.
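The casing rules can be put side by side in a short sketch. The EAcute constant is introduced here only to avoid depending on the source file's encoding, and the last result is the one claimed in the text above rather than something guaranteed for every compiler version:

```pascal
program ClassNegationDemo;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  EAcute = #$00E9; //the accented letter e-acute (hypothetical helper constant)

begin
  WriteLn(TRegEx.IsMatch('8', '\d'));       //True
  WriteLn(TRegEx.IsMatch('x', '\D'));       //True (anything but a digit)
  WriteLn(TRegEx.IsMatch(' ', '\S'));       //False (anything but whitespace)
  WriteLn(TRegEx.IsMatch('8', '\p{N}'));    //True
  WriteLn(TRegEx.IsMatch('8', '\P{N}'));    //False (anything but a number)
  //per the text above, \p{L} should accept an accented letter
  WriteLn(TRegEx.IsMatch(EAcute, '\p{L}'));
end.
```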
Anchors restrict matches to certain positions:
• ^ matches to the start of the input string, assuming multiline mode is not set.
• $ matches to the end of the input string, assuming multiline mode is not set.
• \A matches to the start of the input string regardless of multiline being set.
• \Z matches to the end of the input string (or immediately before a final line break) regardless of multiline being set.
• \b matches to the start or end of a ‘word’. Like \w, this is ASCII only, so that TRegEx.Match('café', '\b.+\b').Value will return just 'caf' given 'é' is outside the ASCII range.
• \G restricts multiple matches so that a subsequent match is only valid if it begins immediately after its predecessor in the input string:
Num := TRegEx.Matches('Baden-Baden', 'Baden').Count; //2
Num := TRegEx.Matches('Baden-Baden', '\GBaden').Count; //1
When the ‘multiline’ option is set, the behaviour of ^ and $ changes to require matching to the start and end of a line respectively:
const
TestStr = 'First' + SLineBreak + 'Second';
var
Match: TMatch;
begin
Write('Without setting roMultiLine:');
for Match in TRegEx.Matches(TestStr, '^.*') do Write(' ' + Match.Value); //output: First
WriteLn;
Write('With roMultiLine set:');
for Match in TRegEx.Matches(TestStr, '^.*', [roMultiLine]) do Write(' ' + Match.Value); //output: First Second
\A and \Z are unaffected however.
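A minimal sketch of the contrast, reusing the same two-line test string. The match counts in the comments are what the description above implies rather than independently verified output:

```pascal
program AnchorDemo;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  TestStr = 'First' + SLineBreak + 'Second';

begin
  //^ matches at the start of each line once roMultiLine is set,
  //so two matches are expected ('First' and 'Second')
  WriteLn(TRegEx.Matches(TestStr, '^.*', [roMultiLine]).Count);
  //\A only ever matches at the start of the whole input,
  //so a single match is expected ('First')
  WriteLn(TRegEx.Matches(TestStr, '\A.*', [roMultiLine]).Count);
  //\Z only ever matches at the end of the whole input
  WriteLn(TRegEx.IsMatch(TestStr, 'Second\Z', [roMultiLine])); //True
  WriteLn(TRegEx.IsMatch(TestStr, 'First\Z', [roMultiLine]));  //False
end.
```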
Lookaround assertions (also called ‘zero-width assertions’) are a generalisation of anchors in which the ‘anchor’ is custom defined. In each case, what is returned as the match is not altered — the lookaround assertion just restricts the context in which a match is valid:
• (?=expr) where expr is a valid matching expression defines a ‘positive lookahead’, in which the secondary expression must be found after the main one —
var
Match: TMatch;
begin
//without a lookahead
Match := TRegEx.Match('test test!', 'test');
WriteLn(Match.Value); //output: test
WriteLn(Match.Index); //output: 1
//require a trailing exclamation mark
Match := TRegEx.Match('test test!', 'test(?=\!)');
WriteLn(Match.Value); //output: test
WriteLn(Match.Index); //output: 6
• (?!expr) where expr is a valid matching expression defines a ‘negative lookahead’, in which the secondary expression must not be found after the main one —
var
Match: TMatch;
begin
{ Without a lookahead }
Match := TRegEx.Match('loser loses', 'lose');
WriteLn(Match.Value); //output: lose
WriteLn(Match.Index); //output: 1
{ With a lookahead (no 'r' may immediately follow) }
Match := TRegEx.Match('loser loses', 'lose(?!r)');
WriteLn(Match.Value); //output: lose
WriteLn(Match.Index); //output: 7
• (?<=expr) where expr is a valid, fixed length matching expression defines a ‘positive lookbehind’, in which the secondary expression must be found before the main one.
• (?<!expr) where expr is a valid, fixed length matching expression defines a ‘negative lookbehind’, in which the secondary expression must not be found before the main one —
TRegEx.IsMatch('Joe has no hope', 'hope'); //True
TRegEx.IsMatch('Joe has no hope', '(?<!no )hope'); //False
The fact lookbehind (unlike lookahead) matching expressions must be fixed length means you are restricted to literals, character sets, character classes and the {n} quantifier within the sub-expression.
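The following sketch shows a positive lookbehind alongside the fixed-length restriction. As an assumption, the invalid expression is caught with the general Exception class rather than the engine's own exception type:

```pascal
program LookbehindDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

begin
  //positive lookbehind: 'hope' is only valid when preceded by 'no '
  WriteLn(TRegEx.IsMatch('Joe has no hope', '(?<=no )hope'));     //True
  WriteLn(TRegEx.IsMatch('Joe has hope', '(?<=no )hope'));        //False
  //the lookbehind text is not included in what is returned
  WriteLn(TRegEx.Match('Joe has no hope', '(?<=no )hope').Value); //hope
  //a + quantifier makes the sub-expression variable length,
  //which the engine refuses to compile
  try
    TRegEx.IsMatch('Joe has no  hope', '(?<=no +)hope');
  except
    on E: Exception do
      WriteLn('Rejected: ', E.Message);
  end;
end.
```

Replacing the + with a fixed count such as {2} keeps the sub-expression fixed length and so acceptable to the engine.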