
Elements of a regular expression

Matching expressions are potentially composed of ‘literals’, ‘character sets’, ‘alternators’, ‘character classes’, ‘quantifiers’, ‘anchors’ and ‘lookarounds’, along with ‘groups’ and ‘backreferences’.

A literal is simply a character that must be found for the match as a whole to succeed. A matching expression such as 'red' is composed solely of literals, and when used, turns the regular expression engine into a glorified Pos function:

I := Pos('red', 'the colour is red'); //15
I := TRegEx.Match('the colour is red', 'red').Index; //15

Because they have special meaning, certain characters (called ‘metacharacters’ in the regex jargon) must be ‘escaped’ with a preceding backslash to be used as a literal: *, +, ?, |, {, [, (, ), ^, $, ., # and the backslash itself, \. Consequently,

TRegEx.IsMatch('Is that a question?', '?')

will raise an exception, but

TRegEx.IsMatch('Is that a question?', '\?')

will return True.
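The escaping rule can be tried out with a small console program. This is only a sketch: the program name is made up, the System.* unit scope names assume Delphi XE2 or later, and the exception is caught as a plain Exception because the text does not name the class the engine raises.

```delphi
program EscapeDemo; // hypothetical program name

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

begin
  try
    // an unescaped ? has nothing to quantify, so the pattern is rejected
    TRegEx.IsMatch('Is that a question?', '?');
    WriteLn('No exception raised');
  except
    on E: Exception do
      WriteLn('Invalid pattern: ', E.ClassName);
  end;
  // escaped, the ? is treated as an ordinary literal
  WriteLn(TRegEx.IsMatch('Is that a question?', '\?')); // TRUE
  // the dot is a metacharacter too: unescaped, it stands for any character
  WriteLn(TRegEx.IsMatch('3x14', '3.14'));  // TRUE
  WriteLn(TRegEx.IsMatch('3x14', '3\.14')); // FALSE
end.
```

The last two lines show why forgetting to escape can go unnoticed: the unescaped pattern is still valid, it just matches more than intended.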

Character sets enable matching to one of a group of characters. A set is defined by placing the characters concerned inside square brackets. Thus, [ab] will match against both 'a' and 'b'. A hyphen can be used to specify a range, for example [1-9] will match against any non-zero digit, and multiple ranges can be included in the same set, e.g. [a-cM-O] will match any one of 'a', 'b', 'c', 'M', 'N' and 'O'.

The ^ symbol is also available to require any character except one in the set: '[^euioa]' therefore matches any character that is not a lower-case vowel.

Alternators have a similar effect to character sets, only with respect to allowing for alternate sub-strings rather than alternate characters. The syntax is to put the acceptable variants in round brackets, delimited by a vertical bar. Thus,

(red|blue) matches against both 'red' and 'blue'.
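Sets, negated sets and alternators can be compared side by side in a short console program. The following is a sketch only; the program name and test strings are invented for illustration, and the unit scope name assumes Delphi XE2 or later.

```delphi
program SetsDemo; // hypothetical program name

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

begin
  // [a-cM-O] accepts exactly one character drawn from either range
  WriteLn(TRegEx.Match('xyzNb', '[a-cM-O]').Value);         // N
  // [^euioa] passes over the lower-case vowels and takes the first other character
  WriteLn(TRegEx.Match('audio', '[^euioa]').Value);         // d
  // an alternator chooses between whole substrings rather than single characters
  WriteLn(TRegEx.Match('a blue door', '(red|blue)').Value); // blue
  WriteLn(TRegEx.IsMatch('a green door', '(red|blue)'));    // FALSE
end.
```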

A quantifier requires the preceding item to appear a certain number of times:

? matches zero or one time

* matches any number of times, including zero

+ matches 1 or more times

{n} (where n is an integer) requires the item to repeat the specified number of times

{n,} requires the item to repeat at least n times

{n,m} requires the item to repeat between n and m times

TRegEx.IsMatch('favorite', 'favou?rite'); //True
TRegEx.IsMatch('favorite', 'favou+rite'); //False
TRegEx.IsMatch('He snoozed: zzz...', ' z{4,5}\.'); //False
TRegEx.IsMatch('He snoozed: zzzz...', ' z{4,5}\.'); //True
TRegEx.IsMatch('He snoozed: zzzzzz...', ' z{4,5}\.'); //False

If the ‘item’ should be a substring rather than just a single character, enclose it in brackets:

TRegEx.IsMatch('They are playing', 'play(ing)?'); //True
TRegEx.IsMatch('They have played', 'play(ing)?'); //True
TRegEx.IsMatch('She was wearing plaits', 'play(ing)?'); //False

Quantifiers are by default ‘greedy’ in that they will chomp through as much as they can:

S := TRegEx.Match('The cow went moooo!', 'o{2,3}').Value; //ooo
S := TRegEx.Match('The cow went moooo!', 'o{2,}').Value; //oooo

The converse of a greedy quantifier is a ‘lazy’ one, in which the smallest possible match is taken. To make a specific quantifier lazy, suffix it with a question mark:

S := TRegEx.Match('The cow went mooooo!', 'o{2,}?').Value; //oo

A character class is in effect a predefined character set:

. (i.e., a full stop/period) matches any character but for #13 and #10:

S := TRegEx.Match('1st'#13#10'2nd', '.*').Value; //1st

An easy mistake to make with . is to forget quantifiers are greedy by default:

S := TRegEx.Match('stuff! more stuff!', '.*\!').Value; //stuff! more stuff!
S := TRegEx.Match('stuff! more stuff!', '.*?\!').Value; //stuff!

\d matches any digit — this has the same effect as [0-9]

\s matches any ASCII whitespace character

\n matches the ASCII ‘line feed’ character, which is equivalent to embedding #10 in the matching expression, i.e. IsMatch(S, '.*'#10) is equivalent to IsMatch(S, '.*\n').

\r matches the ASCII ‘carriage return’ character, which is equivalent to embedding #13 in the matching expression.

\t matches the ASCII ‘tab’ character, which is equivalent to embedding #9 in the matching expression.

\w matches a ‘word’ character, defined as [0-9a-zA-Z_] (note that the underscore is included)

TRegEx.IsMatch('A1!', '\w\d.'); //True
TRegEx.IsMatch('A1', '\w\d.'); //False (no 3rd char)
TRegEx.IsMatch('AA!', '\w\d.'); //False (2nd char not a digit)
TRegEx.IsMatch('-1!', '\w\d.'); //False (1st char is not a word character)

\p{c} matches a character of the specified Unicode category, where c is L for letters, N for numbers, P for punctuation, and so on (a fuller list will be given shortly). Where \w will not pick up accented letters, Cyrillic letters, and so on, \p{L} will.

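For the character classes \d, \s, \w and \p, which are specified with a backslash followed by a letter code, putting the letter in upper case causes the match to be negated, e.g. \D means anything but a digit (this does not apply to \n, \r and \t, which name specific characters):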

Success := TRegEx.IsMatch('8', '\D'); //False

The category code casing for \p is fixed however, so that \p{N} picks up a number and \P{N} picks up anything but a number.
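The classes and their negated forms can be exercised together as below. Treat this as a sketch rather than a reference: the program name, the Cafe constant and the test strings are invented, the unit scope name assumes Delphi XE2 or later, and the \p{L} line simply checks the behaviour the text describes instead of asserting a particular output.

```delphi
program ClassDemo; // hypothetical program name

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  // 'café', with the accented letter given as a character code so that the
  // example does not depend on the encoding of the source file
  Cafe = 'caf'#$00E9;
begin
  WriteLn(TRegEx.Match('Room 101', '\d+').Value);               // 101
  WriteLn('[', TRegEx.Match('Room 101', '\D+').Value, ']');     // [Room ]
  WriteLn(TRegEx.Match('x  =  1', '\s+').Length);               // 2
  // per the text, \p{L} should accept the accented letter
  WriteLn(TRegEx.Match(Cafe, '\p{L}+').Value = Cafe);
  // \P{N} is the negated category: anything that is not a number
  WriteLn('[', TRegEx.Match('42 apples', '\P{N}+').Value, ']'); // [ apples]
end.
```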

Anchors restrict matches to certain positions:

^ matches to the start of the input string, assuming multiline mode is not set.

$ matches to the end of the input string, assuming multiline mode is not set.

\A matches to the start of the input string regardless of multiline being set.

\Z matches to the end of the input string regardless of multiline being set (a single trailing line break is tolerated; use \z to insist on the very end).

\b matches to the start or end of a ‘word’. Like \w, this is ASCII only, so that TRegEx.Match('café', '\b.+\b').Value will return just 'caf' given 'é' is outside the ASCII range.

\G restricts multiple matches so that a subsequent match is only valid if it begins immediately after its predecessor in the input string:

Num := TRegEx.Matches('Baden-Baden', 'Baden').Count; //2
Num := TRegEx.Matches('Baden-Baden', '\GBaden').Count; //1

When the ‘multiline’ option is set, the behaviour of ^ and $ changes to require matching to the start and end of a line respectively:

const
  TestStr = 'First' + SLineBreak + 'Second';
var
  Match: TMatch;
begin
  Write('Without setting roMultiLine:');
  for Match in TRegEx.Matches(TestStr, '^.*') do
    Write(' ' + Match.Value); //output: First
  WriteLn;
  Write('With roMultiLine set:');
  for Match in TRegEx.Matches(TestStr, '^.*', [roMultiLine]) do
    Write(' ' + Match.Value); //output: First Second
  WriteLn;
end;

\A and \Z are unaffected however.
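To see the difference in practice, the sketch below counts matches for ^ and \A against the same two-line string with roMultiLine set in both cases. The program name is hypothetical, the unit scope name assumes Delphi XE2 or later, and the line break is written out as #13#10 so that the result does not depend on the platform's SLineBreak.

```delphi
program AnchorDemo; // hypothetical program name

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  TestStr = 'First'#13#10'Second';
begin
  // in multiline mode ^ matches at the start of each line
  WriteLn(TRegEx.Matches(TestStr, '^\w+', [roMultiLine]).Count);  // 2
  // \A still only matches at the very start of the input
  WriteLn(TRegEx.Matches(TestStr, '\A\w+', [roMultiLine]).Count); // 1
  // likewise, \Z only matches at the end of the input
  WriteLn(TRegEx.Match(TestStr, '\w+\Z', [roMultiLine]).Value);   // Second
end.
```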

Lookaround assertions (also called ‘zero-width assertions’) are a generalisation of anchors in which the ‘anchor’ is custom defined. In each case, what is returned as the match is not altered — the lookaround assertion just restricts the context in which a match is valid:

(?=expr) where expr is a valid matching expression defines a ‘positive lookahead’, in which the secondary expression must be found after the main one —

var
  Match: TMatch;
begin
  //without a lookahead
  Match := TRegEx.Match('test test!', 'test');
  WriteLn(Match.Value); //output: test
  WriteLn(Match.Index); //output: 1
  //require a trailing exclamation mark
  Match := TRegEx.Match('test test!', 'test(?=\!)');
  WriteLn(Match.Value); //output: test
  WriteLn(Match.Index); //output: 6
end;

(?!expr) where expr is a valid matching expression defines a ‘negative lookahead’, in which the secondary expression must not be found after the main one —

var
  Match: TMatch;
begin
  { Without a lookahead }
  Match := TRegEx.Match('loser loses', 'lose');
  WriteLn(Match.Value); //output: lose
  WriteLn(Match.Index); //output: 1
  { With a lookahead (no 'r' may immediately follow) }
  Match := TRegEx.Match('loser loses', 'lose(?!r)');
  WriteLn(Match.Value); //output: lose
  WriteLn(Match.Index); //output: 7
end;

(?<=expr) where expr is a valid, fixed length matching expression defines a ‘positive lookbehind’, in which the secondary expression must be found before the main one.

(?<!expr) where expr is a valid, fixed length matching expression defines a ‘negative lookbehind’, in which the secondary expression must not be found before the main one:

TRegEx.IsMatch('Joe has no hope', 'hope'); //True
TRegEx.IsMatch('Joe has no hope', '(?<!no )hope'); //False

The fact lookbehind (unlike lookahead) matching expressions must be fixed length means you are restricted to literals, character sets, character classes and the {n} quantifier within the sub-expression.
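A positive lookbehind, and the fixed length restriction, might be demonstrated along the following lines. This is a sketch with an invented program name and test string; the unit scope names assume Delphi XE2 or later, and the exception is caught as a plain Exception since the text does not name the class the engine raises.

```delphi
program LookbehindDemo; // hypothetical program name

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

var
  Match: TMatch;
begin
  // without a lookbehind, the first run of digits wins
  Match := TRegEx.Match('Was 30, now $42', '\d+');
  WriteLn(Match.Value, ' at ', Match.Index); // 30 at 5
  // positive lookbehind: the digits must directly follow a dollar sign
  Match := TRegEx.Match('Was 30, now $42', '(?<=\$)\d+');
  WriteLn(Match.Value, ' at ', Match.Index); // 42 at 14
  // a + inside the lookbehind makes its length variable, so the pattern is rejected
  try
    Match := TRegEx.Match('Was 30, now $42', '(?<=\$+)\d+');
    WriteLn('Pattern accepted');
  except
    on E: Exception do
      WriteLn('Pattern rejected: ', E.ClassName);
  end;
end.
```

As with lookaheads, the dollar sign is checked but not returned: Match.Value holds only the digits.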
