Perl Compatible Regular Expressions (normally abbreviated as “PCRE”) offer a very powerful string-matching and replacement mechanism that far surpasses anything we have examined so far.
Regular expressions are often thought of as very complex, and they can be at times. However, properly used, they are relatively simple to understand and fairly easy to use. Given their complexity, of course, they are also much more computationally intensive than the simple search-and-replace functions we examined earlier in this chapter. Therefore, you should use them only when appropriate, that is, when using the simpler functions is either impossible or so complicated that it’s not worth the effort.
A regular expression is a string that describes a set of matching rules. The simplest possible regular expression is one that matches only one string; for example, Davey matches only the string “Davey”. In fact, such a simple regular expression would be pointless, as you could just as easily perform the match using strpos(), which is a much faster alternative.
The real power of regular expressions comes into play when you don’t know the exact string that you want to match. In this case, you can specify one or more metacharacters and quantifiers, which do not have a literal meaning, but instead are interpreted in a special way.
In this chapter, we will discuss the basics of regular expressions that are required by the exam. More thorough coverage is provided by the PHP manual, or by one of the many regular expression books available (most notably, Mastering Regular Expressions, by Jeffrey Friedl, published by O’Reilly Media).
Delimiters
A regular expression is always delimited by a starting and ending character. Any character that is not alphanumeric, a backslash, or whitespace can be used for this purpose (as long as the beginning and ending delimiter match); since any occurrence of this character inside the expression itself must be escaped, it’s usually a good idea to pick a delimiter that isn’t likely to appear inside the expression. By convention, the forward slash is used for this purpose, although another character, such as the octothorpe (#), is sometimes used when dealing with pathnames or URLs.
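As a small sketch of why the choice of delimiter matters, the two patterns below describe the same rule. The path is a made-up sample value, and the variable names are only illustrative.

```php
<?php
// Hypothetical path, used only for illustration.
$path = '/usr/local/bin/php';

// With "/" as the delimiter, every slash inside the pattern must be escaped.
$withSlashes = '/^\/usr\/local\//';

// With "#" as the delimiter, the slashes can be written as they are.
$withHash = '#^/usr/local/#';

$slashResult = preg_match($withSlashes, $path);
$hashResult  = preg_match($withHash, $path);

// Both patterns match the same subject.
var_dump($slashResult, $hashResult);
```

Both calls report a match; the second pattern is simply easier to read.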
Metacharacters
The term “metacharacter” is a bit of a misnomer, as a metacharacter can actually be composed of more than one character. However, most metacharacters represent a single character in the matched string, while anchors such as ^ and $ match a position rather than a character. Here are the most common ones:
. Match any character (except a newline, by default)
^ Match the start of the string
$ Match the end of the string
\s Match any whitespace character
\d Match any digit
\w Match any “word” character
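The metacharacters listed above can be tried out with a short script. This is only a sketch: the patterns and subjects are invented samples, and each pattern is anchored with ^ and $ so that the whole subject has to match.

```php
<?php
// Each entry is a pattern followed by a sample subject (both made up).
$cases = array(
    array('/^a.c$/', 'abc'),    // "." stands for exactly one character
    array('/^a.c$/', 'ac'),     // fails: nothing is left for "." to match
    array('/^\d\d$/', '42'),    // two digits
    array('/^\w\s\w$/', 'a b'), // word character, whitespace, word character
    array('/^\w$/', '-'),       // fails: "-" is not a word character
);

$results = array();
foreach ($cases as $case) {
    list($pattern, $subject) = $case;
    $result = preg_match($pattern, $subject);
    $results[] = $result;
    echo $pattern, ' against "', $subject, '": ',
        ($result === 1 ? 'match' : 'no match'), "\n";
}
```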
Metacharacters can also be expressed using grouping expressions. For example, a series of valid alternatives for a character can be provided by using square brackets:
/ab[cd]e/
The expression above will match both abce and abde. You can also use other metacharacters, and provide ranges of valid characters inside a grouping expression:
/ab[c-e\d]/
This will match abc, abd, abe and any combination of ab followed by a digit.
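The two grouping expressions above can be checked as follows. This is a minimal sketch; the helper name matches() and the test subjects are assumptions made for the example.

```php
<?php
// Hypothetical helper: true when $pattern matches somewhere in $subject.
function matches($pattern, $subject)
{
    return preg_match($pattern, $subject) === 1;
}

// One of the listed alternatives must appear between "ab" and "e".
$abce = matches('/ab[cd]e/', 'abce');
$abde = matches('/ab[cd]e/', 'abde');
$abfe = matches('/ab[cd]e/', 'abfe'); // "f" is not in the list

// A range (c to e) combined with another metacharacter (\d).
$abe = matches('/ab[c-e\d]/', 'abe');
$ab7 = matches('/ab[c-e\d]/', 'ab7');
$abz = matches('/ab[c-e\d]/', 'abz'); // "z" is outside the range and not a digit

var_dump($abce, $abde, $abfe, $abe, $ab7, $abz);
```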
Quantifiers
A quantifier allows you to specify the number of times a particular character or metacharacter can appear in a matched string. There are four types of quantifiers:
* The character can appear zero or more times
+ The character can appear one or more times
? The character can appear zero or one times
{n,m} The character can appear at least n times, and no more than m. The maximum can be omitted ({n,}) to indicate a minimum limit with no maximum. To indicate a maximum limit without a minimum, write {0,m}, since older PCRE versions treat {,m} as literal text.
Thus, for example, the expression ab?c matches both ac and abc, while ab{1,3}c matches abc, abbc and abbbc.
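The quantifiers described above can be compared side by side. The sketch below anchors every pattern with ^ and $ so that the entire subject must match; the helper name fullMatch() is an assumption made for the example.

```php
<?php
// Hypothetical helper: true when $pattern matches $subject.
function fullMatch($pattern, $subject)
{
    return preg_match($pattern, $subject) === 1;
}

$optional = array(
    fullMatch('/^ab?c$/', 'ac'),   // zero occurrences of "b"
    fullMatch('/^ab?c$/', 'abc'),  // one occurrence of "b"
    fullMatch('/^ab?c$/', 'abbc'), // two occurrences are too many
);

$bounded = array(
    fullMatch('/^ab{1,3}c$/', 'ac'),     // fewer than the minimum
    fullMatch('/^ab{1,3}c$/', 'abc'),
    fullMatch('/^ab{1,3}c$/', 'abbbc'),
    fullMatch('/^ab{1,3}c$/', 'abbbbc'), // more than the maximum
);

$starAndPlus = array(
    fullMatch('/^ab*c$/', 'ac'), // "*" accepts zero occurrences
    fullMatch('/^ab+c$/', 'ac'), // "+" requires at least one
);

var_dump($optional, $bounded, $starAndPlus);
```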
Sub-Expressions
A sub-expression is a regular expression contained within the main regular expression (or another sub-expression); you define one by encapsulating it in parentheses:
/a(bc.)e/
This expression will match the letter a, followed by the letters b and c, followed by any character and, finally, the letter e. As you can see, sub-expressions by themselves do not have any influence on the way a regular expression is executed; however, you can use them in conjunction with quantifiers to allow for complex expressions to happen more than once. For example:
/a(bc.)+e/
This expression will match the letter a, followed by the expression bc. repeated one or more times, followed by the letter e.
Sub-expressions can also be used as capturing patterns, which we will examine in the next section.
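To illustrate how a quantifier applies to a whole sub-expression, here is a short sketch. The subjects are invented samples, and the pattern is anchored with ^ and $ so that the entire subject has to match.

```php
<?php
// The group "bc." (b, c, then any character) may repeat one or more times.
$pattern = '/^a(bc.)+e$/';

$once  = preg_match($pattern, 'abcde');    // one repetition: "bcd"
$twice = preg_match($pattern, 'abcxbcye'); // two repetitions: "bcx", "bcy"
$none  = preg_match($pattern, 'ae');       // the group must appear at least once

var_dump($once, $twice, $none);
```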
Matching and Extracting Strings
The preg_match() function can be used to match a regular expression against a given string. The function returns 1 if the match is successful (0 if it is not, and false if an error occurs), and can return all the captured subpatterns in an array if an optional third parameter is passed by reference. Here’s an example:
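$name = "Davey Shafik";

// Simple match: the anchors and the + quantifier ensure that the whole
// string consists only of letters and whitespace
$regex = '/^[a-zA-Z\s]+$/';
if (preg_match($regex, $name)) {
    // Valid Name
}

// Match with captured subpatterns
$regex = '/^(\w+)\s(\w+)/';
$matches = array();
if (preg_match($regex, $name, $matches)) {
    var_dump($matches);
}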
If you run the second example, you will notice that the $matches array is populated, on return, with the following values:
array(3) {
  [0]=>
  string(12) "Davey Shafik"
  [1]=>
  string(5) "Davey"
  [2]=>
  string(6) "Shafik"
}
As you can see, the first element of the array contains the entire matched string, while the second element (index 1) contains the first captured subpattern, and the third element contains the second captured subpattern.
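Because preg_match() reports the outcome through its return value, it can help to compare that value strictly. The sketch below uses a made-up date string and illustrative variable names.

```php
<?php
// Hypothetical input, used only for illustration.
$date = '2024-01-15';

// Three captured subpatterns: year, month and day.
$regex = '/^(\d+)-(\d+)-(\d+)$/';

$parts = array();
$found = preg_match($regex, $date, $parts);

if ($found === 1) {
    // $parts[0] holds the whole match; indexes 1 to 3 hold the captures.
    echo "Year: {$parts[1]}, month: {$parts[2]}, day: {$parts[3]}\n";
}

// A subject that does not fit the pattern yields 0, not false.
$notFound = preg_match($regex, 'not a date');
```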
Performing Multiple Matches
The preg_match_all() function allows you to perform multiple matches on a given string based on a single regular expression. For example:
$string = "a1bb b2cc c2dd";
$regex = '#([abc])\d#';
$matches = array();
if (preg_match_all($regex, $string, $matches)) {
    var_dump($matches);
}
This script outputs the following:
array(2) {
  [0]=>
  array(3) {
    [0]=>
    string(2) "a1"
    [1]=>
    string(2) "b2"
    [2]=>
    string(2) "c2"
  }
  [1]=>
  array(3) {
    [0]=>
    string(1) "a"
    [1]=>
    string(1) "b"
    [2]=>
    string(1) "c"
  }
}
As you can see, all the whole-pattern matches are stored in the first sub-array of the result, while the first captured subpattern of every match is stored in the corresponding slot of the second sub-array.
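The layout described above is the default ordering. As a sketch of an alternative, preg_match_all() also accepts the PREG_SET_ORDER flag as a fourth argument, which groups the results by match instead of by subpattern. The variable names are illustrative.

```php
<?php
$string = 'a1bb b2cc c2dd';
$regex  = '#([abc])\d#';

$sets  = array();
$count = preg_match_all($regex, $string, $sets, PREG_SET_ORDER);

// Each element of $sets now describes one match:
// index 0 is the whole match, index 1 is the first captured subpattern.
foreach ($sets as $set) {
    echo $set[0], ' starts with ', $set[1], "\n";
}
```

The return value is the number of complete matches found, which is 3 for this subject.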
Using PCRE to Replace Strings
While str_replace() is quite flexible, it only works on “whole” strings, that is, where you know the exact text to search for. Using preg_replace(), however, you can replace text that matches a pattern you specify. It is even possible to reuse captured subpatterns directly in the substitution string by prefixing their index with a dollar sign. In the example below, we use this technique to replace the entire matched pattern with a string that is composed using the first captured subpattern ($1).
$body = "[b]Make Me Bold![/b]";
$regex = '@\[b\](.*?)\[/b\]@i';
$replacement = '<b>$1</b>';
$body = preg_replace($regex, $replacement, $body);
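More than one captured subpattern can be referenced in the same substitution string. The sketch below swaps two words; the variable names are illustrative, and the replacement is written in single quotes so that PHP leaves $1 and $2 for preg_replace() to interpret.

```php
<?php
$name = 'Davey Shafik';

// Capture the first and second word separately.
$regex = '/^(\w+)\s(\w+)$/';

// $2 and $1 refer to the second and first captured subpatterns.
$swapped = preg_replace($regex, '$2, $1', $name);

echo $swapped, "\n"; // Shafik, Davey
```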
Just like with str_replace(), we can pass arrays of search and replacement arguments; however, unlike str_replace(), we can also pass in an array of subjects on which to perform the search-and-replace operation. This can speed things up considerably, since the regular expression (or expressions) are compiled once and reused multiple times. Here’s an example:
$subjects = array();
$subjects['body'] = "[b]Make Me Bold![/b]";
$subjects['subject'] = "[i]Make Me Italics![/i]";

$regex = array();
$regex[] = '@\[b\](.*?)\[/b\]@i';
$regex[] = '@\[i\](.*?)\[/i\]@i';

$replacements = array();
$replacements[] = '<b>$1</b>';
$replacements[] = '<i>$1</i>';

$results = preg_replace($regex, $replacements, $subjects);
When you execute the code shown above, you will end up with an array that looks like this:
array(2) {
  ["body"]=>
  string(20) "<b>Make Me Bold!</b>"
  ["subject"]=>
  string(23) "<i>Make Me Italics!</i>"
}
Notice how the resulting array maintains the structure of the $subjects array that we passed in. The $subjects array itself is not passed by reference, and it is not modified.
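As a further sketch, the script below shows that every pattern is applied in turn to each subject, that subjects without a match come back unchanged, and that the input array is left alone. The array keys and sample strings are made up for the example.

```php
<?php
$subjects = array(
    'title' => '[b][i]Hello[/i][/b]', // contains both tags
    'note'  => 'plain text',          // contains neither tag
);
$original = $subjects;

$patterns     = array('@\[b\](.*?)\[/b\]@i', '@\[i\](.*?)\[/i\]@i');
$replacements = array('<b>$1</b>', '<i>$1</i>');

$results = preg_replace($patterns, $replacements, $subjects);

var_dump($results);

// The input array still holds its original values.
var_dump($subjects === $original);
```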
Summary
This chapter covered what is most likely going to be the bulk of your work as a developer: manipulating strings. While regular expressions may be complex, they are extremely powerful. Just remember: with great power comes great responsibility. In this case, don’t use them if you don’t have to. Never underestimate the power of the string functions and regular expressions.