Elements of a regular expression - Delphi XE2 Foundations - Part 2

Matching expressions are potentially composed of ‘literals’, ‘character sets’, ‘alternators’, ‘character classes’, ‘quantifiers’, ‘anchors’ and ‘lookarounds’, along with ‘groups’ and ‘backreferences’.

A literal is simply a character that must be found for the match as a whole to succeed. A matching expression such as 'red' is composed solely of literals, and when used, turns the regular expression engine into a glorified Pos function:

I := Pos('red', 'the colour is red'); //15
I := TRegEx.Match('the colour is red', 'red').Index; //15

Because they have special meaning, certain characters (called ‘metacharacters’ in the regex jargon) must be ‘escaped’ with a preceding backslash to be used as a literal: *, +, ?, |, {, [, (, ), ^, $, ., # and the backslash itself, \. Consequently,

TRegEx.IsMatch('Is that a question?', '?')

will raise an exception, but

TRegEx.IsMatch('Is that a question?', '\?')

will return True.

Character sets enable matching to one of a group of characters. A set is defined by placing the characters concerned inside square brackets. Thus, [ab] will match against both 'a' and 'b'. A hyphen can be used to specify a range, for example [1-9] will match against any non-zero digit, and multiple ranges can be included in the same set, e.g. [a-cM-O] will match any one of 'a', 'b', 'c', 'M', 'N' and 'O'.

A ^ symbol placed immediately after the opening bracket negates the set, requiring any character except one in the set: '[^euioa]' therefore matches any character that is not a lower-case vowel.
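As a rough illustration of the set syntax just described, the following console program is a minimal sketch. The program name and the sample strings are invented for the purpose, and the results in the comments assume the default case-sensitive matching:

```delphi
program CharSetSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

begin
  //[ab] accepts either of the listed characters
  WriteLn(TRegEx.IsMatch('b', '[ab]'));              //TRUE
  //[1-9] is a range that leaves out zero
  WriteLn(TRegEx.IsMatch('0', '[1-9]'));             //FALSE
  //two ranges in one set: a to c, plus M to O
  WriteLn(TRegEx.Match('xyzN', '[a-cM-O]').Value);   //N
  //a leading ^ negates the set, so the first non-vowel is returned
  WriteLn(TRegEx.Match('audio', '[^euioa]').Value);  //d
end.
```

Note that a set always consumes exactly one character; repeating it requires a quantifier.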

Alternators have a similar effect to character sets, only with respect to allowing for alternate sub-strings rather than alternate characters. The syntax is to put the acceptable variants in round brackets, delimited by a vertical bar. Thus, (red|blue) matches against both 'red' and 'blue'.
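By way of a sketch, the alternator above might be exercised as follows. The program name and the input strings are made up for illustration:

```delphi
program AlternatorSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

begin
  //either variant satisfies the expression
  WriteLn(TRegEx.Match('a blue door', '(red|blue)').Value);   //blue
  //neither variant is present
  WriteLn(TRegEx.IsMatch('a green door', '(red|blue)'));      //FALSE
  //the leftmost match in the input wins, whichever variant it is
  WriteLn(TRegEx.Match('blue and red', '(red|blue)').Value);  //blue
end.
```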

A quantifier requires the preceding item to appear a certain number of times:

• ? matches zero or one time

• * matches zero or more times

• + matches 1 or more times

• {n} (where n is an integer) requires the item to repeat the specified number of times

• {n,} requires the item to repeat at least n times

• {n,m} requires the item to repeat between n and m times

TRegEx.IsMatch('favorite', 'favou?rite'); //True
TRegEx.IsMatch('favorite', 'favou+rite'); //False
TRegEx.IsMatch('He snoozed: zzz...', ' z{4,5}\.'); //False
TRegEx.IsMatch('He snoozed: zzzz...', ' z{4,5}\.'); //True
TRegEx.IsMatch('He snoozed: zzzzzz...', ' z{4,5}\.'); //False

If the ‘item’ should be a substring rather than just a single character, enclose it in brackets:

TRegEx.IsMatch('They are playing', 'play(ing)?'); //True
TRegEx.IsMatch('They have played', 'play(ing)?'); //True
TRegEx.IsMatch('She was wearing plaits', 'play(ing)?'); //False

Quantifiers are by default ‘greedy’ in that they will chomp through as much as they can:

S := TRegEx.Match('The cow went moooo!', 'o{2,3}').Value; //ooo
S := TRegEx.Match('The cow went moooo!', 'o{2,}').Value; //oooo

The converse of a greedy quantifier is a ‘lazy’ one, in which the smallest possible match is taken. To make a specific quantifier lazy, suffix it with a question mark:

S := TRegEx.Match('The cow went mooooo!', 'o{2,}?').Value; //oo

A character class is in effect a predefined character set:

• . (i.e., a full stop/period) matches any character but for #13 and #10:

S := TRegEx.Match('1st'#13#10'2nd', '.*').Value; //1st

An easy mistake to make with . is to forget quantifiers are greedy by default:

S := TRegEx.Match('stuff! more stuff!', '.*\!').Value; //stuff! more stuff!
S := TRegEx.Match('stuff! more stuff!', '.*?\!').Value; //stuff!

• \d matches any digit — this has the same effect as [0-9]

• \s matches any ASCII whitespace character

• \n matches the ASCII ‘line feed’ character, which is equivalent to embedding #10 in the matching expression, i.e. IsMatch(S, '.*'#10) is equivalent to IsMatch(S, '.*\n').

• \r matches the ASCII ‘carriage return’ character, which is equivalent to embedding #13 in the matching expression.

• \t matches the ASCII ‘tab’ character, which is equivalent to embedding #9 in the matching expression.

• \w matches a ‘word’ character, defined as [0-9a-zA-Z_] (i.e., ASCII letters, digits and the underscore)

TRegEx.IsMatch('A1!', '\w\d.'); //True
TRegEx.IsMatch('A1', '\w\d.'); //False (no 3rd char)
TRegEx.IsMatch('AA!', '\w\d.'); //False (2nd char not a digit)
TRegEx.IsMatch('-1!', '\w\d.'); //False (1st char neither a letter nor a digit)

• \p{c} matches a character of the specified Unicode category, where c is L for letters, N for numbers, P for punctuation, and so on (a fuller list will be given shortly). Where \w will not pick up accented letters, Cyrillic letters, and so on, \p{L} will.

For any character class specified with a backslash followed by a letter code, putting the letter in upper case causes the match to be negated, e.g. \D means anything but a digit:

Success := TRegEx.IsMatch('8', '\D'); //False

The category code casing for \p is fixed however, so that \p{N} picks up a number and \P{N} picks up anything but a number.
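The following is a small sketch of \p and \P in use. The program and constant names are invented, and the non-ASCII characters are written as character codes purely so that the encoding of the source file does not come into it:

```delphi
program UnicodeClassSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  EAcute: string = #$00E9; //Latin small letter e with acute
  Zhe: string = #$0416;    //Cyrillic capital letter zhe

begin
  //\p{L} accepts letters from outside the ASCII range
  WriteLn(TRegEx.IsMatch(EAcute, '\p{L}')); //TRUE
  WriteLn(TRegEx.IsMatch(Zhe, '\p{L}'));    //TRUE
  //lower case p requires the category, upper case P excludes it
  WriteLn(TRegEx.IsMatch('7', '\p{N}'));    //TRUE
  WriteLn(TRegEx.IsMatch('7', '\P{N}'));    //FALSE
  WriteLn(TRegEx.IsMatch('x', '\P{N}'));    //TRUE
end.
```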

Anchors restrict matches to certain positions:

• ^ matches to the start of the input string, assuming multiline mode is not set.

• $ matches to the end of the input string, assuming multiline mode is not set.

• \A matches to the start of the input string regardless of multiline being set.

• \Z matches to the end of the input string regardless of multiline being set.

• \b matches to the start or end of a ‘word’. Like \w, this is ASCII only, so that TRegEx.Match('café', '\b.+\b').Value will return just 'caf' given 'é' is outside the ASCII range.

• \G restricts multiple matches so that a subsequent match is only valid if it begins immediately after its predecessor in the input string:

Num := TRegEx.Matches('Baden-Baden', 'Baden').Count; //2
Num := TRegEx.Matches('Baden-Baden', '\GBaden').Count; //1
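To make the \b anchor from the list above a little more concrete, here is a minimal sketch of whole-word matching. The program name and test strings are invented, and only ASCII input is used so that the ASCII-only caveat does not apply:

```delphi
program WordBoundarySketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

begin
  //'cat' stands alone as a word here
  WriteLn(TRegEx.IsMatch('the cat sat', '\bcat\b'));           //TRUE
  //here it is buried inside a longer word
  WriteLn(TRegEx.IsMatch('concatenate', '\bcat\b'));           //FALSE
  //the 'cat' at the start of 'catalog' has no trailing boundary
  WriteLn(TRegEx.Matches('cat catalog cat', '\bcat\b').Count); //2
end.
```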

When the ‘multiline’ option is set, the behaviour of ^ and $ changes to require matching to the start and end of a line respectively:

const
  TestStr = 'First' + SLineBreak + 'Second';
var
  Match: TMatch;
begin
  Write('Without setting roMultiLine:');
  for Match in TRegEx.Matches(TestStr, '^.*') do
    Write(' ' + Match.Value); //output: First
  WriteLn;
  Write('With roMultiLine set:');
  for Match in TRegEx.Matches(TestStr, '^.*', [roMultiLine]) do
    Write(' ' + Match.Value); //output: First Second

\A and \Z are unaffected however.
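A short sketch of that last point, reusing the same two-line test string: with \A in place of ^, setting roMultiLine should make no difference, so only the first line is found. The program name is invented, and the results in the comments rely on . not matching #13 or #10, as described earlier:

```delphi
program StartAnchorSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  TestStr = 'First' + sLineBreak + 'Second';

begin
  //\A only ever matches at the very start of the input string
  WriteLn(TRegEx.Matches(TestStr, '\A.*', [roMultiLine]).Count); //1
  WriteLn(TRegEx.Match(TestStr, '\A.*', [roMultiLine]).Value);   //First
end.
```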

Lookaround assertions (also called ‘zero-width assertions’) are a generalisation of anchors in which the ‘anchor’ is custom defined. In each case, what is returned as the match is not altered — the lookaround assertion just restricts the context in which a match is valid:

• (?=expr) where expr is a valid matching expression defines a ‘positive lookahead’, in which the secondary expression must be found after the main one —

var
  Match: TMatch;
begin
  //without a lookahead
  Match := TRegEx.Match('test test!', 'test');
  WriteLn(Match.Value); //output: test
  WriteLn(Match.Index); //output: 1
  //require a trailing exclamation mark
  Match := TRegEx.Match('test test!', 'test(?=\!)');
  WriteLn(Match.Value); //output: test
  WriteLn(Match.Index); //output: 6

• (?!expr) where expr is a valid matching expression defines a ‘negative lookahead’, in which the secondary expression must not be found after the main one —

var
  Match: TMatch;
begin
  { Without a lookahead }
  Match := TRegEx.Match('loser loses', 'lose');
  WriteLn(Match.Value); //output: lose
  WriteLn(Match.Index); //output: 1
  { With a lookahead (no 'r' may immediately follow) }
  Match := TRegEx.Match('loser loses', 'lose(?!r)');
  WriteLn(Match.Value); //output: lose
  WriteLn(Match.Index); //output: 7

• (?<=expr) where expr is a valid, fixed length matching expression defines a ‘positive lookbehind’, in which the secondary expression must be found before the main one.

• (?<!expr) where expr is a valid, fixed length matching expression defines a ‘negative lookbehind’, in which the secondary expression must not be found before the main one —

TRegEx.IsMatch('Joe has no hope', 'hope'); //True
TRegEx.IsMatch('Joe has no hope', '(?<!no )hope'); //False

The fact that lookbehind (unlike lookahead) matching expressions must be fixed length means you are restricted to literals, character sets, character classes and the {n} quantifier within the sub-expression.
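The positive lookbehind has no example of its own above, so here is a minimal sketch, with an invented program name and input string. The sub-expression \$ is a single escaped literal and so satisfies the fixed length requirement:

```delphi
program LookbehindSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

var
  Match: TMatch;
begin
  //only accept digits that come straight after a dollar sign
  Match := TRegEx.Match('cost: $42 (was 50)', '(?<=\$)\d+');
  WriteLn(Match.Value); //output: 42
  WriteLn(Match.Index); //output: 8
end.
```

As with the lookahead examples, the dollar sign must be present for the match to succeed, but it does not form part of the value returned.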

In document Delphi XE2 Foundations - Part 2 - Rolliston, Chris.pdf (Page 40-43)