POSIX-Style Regular Expressions

Subpatterns

You can use parentheses to group bits of a regular expression together to be treated as a single unit called a subpattern:

ereg('a (very )+big dog', 'it was a very very big dog'); // returns true
ereg('^(cat|dog)$', 'cat'); // returns true
ereg('^(cat|dog)$', 'dog'); // returns true

The parentheses also cause the substring that matches the subpattern to be captured. If you pass an array as the third argument to a match function, the array is populated with any captured substrings:

ereg('([0-9]+)', 'You have 42 magic beans', $captured); // returns true and populates $captured

The zeroth element of the array is set to the portion of the string that matched the entire pattern (here, 42). The first element is the substring that matched the first subpattern (if there is one), the second element is the substring that matched the second subpattern, and so on.
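The ereg family was deprecated in PHP 5.3 and removed in PHP 7.0, so the capture behavior described above can only be tried on a current interpreter through the Perl-compatible functions. The following is a minimal sketch under that assumption: preg_match() stands in for ereg(), the / delimiters are PCRE syntax rather than POSIX, and the pattern and sample string are illustrative only.

```php
<?php
// Assumption: preg_match() substitutes for ereg(), which is unavailable in PHP 7+.
// The pattern has two subpatterns: a run of digits and an alternation.
$matched = preg_match(
    '/([0-9]+) magic (beans|peas)/',
    'You have 42 magic beans',
    $captured
);

// $captured[0] is the text matched by the whole pattern;
// $captured[1] and $captured[2] are the texts matched by each subpattern.
var_dump($matched === 1);
print_r($captured);
```

Each pair of parentheses adds one element to the array, numbered by the position of its opening parenthesis.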

Now that you understand the basics of regular expressions, we can explore the details. POSIX-style regular expressions use the Unix locale system. The locale system provides functions for sorting and identifying characters that let you intelligently work with text from languages other than English. In particular, what constitutes a “letter” varies from language to language (think of à and ç), and there are character classes in POSIX regular expressions that take this into account.

However, POSIX regular expressions are designed for use with only textual data. If your data has a NUL-byte (\x00) in it, the regular expression functions will interpret it as the end of the string, and matching will not take place beyond that point. To do matches against arbitrary binary data, you’ll need to use Perl-compatible regular expressions, which are discussed later in this chapter. Also, as we already mentioned, the Perl-style regular expression functions are often faster than the equivalent POSIX-style ones.

Character Classes

As shown in Table 4-7, POSIX defines a number of named sets of characters that you can use in character classes. The expansions given in Table 4-7 are for English. The actual letters vary from locale to locale.

Table 4-7. POSIX character classes

Class        Description                Expansion
[:alnum:]    Alphanumeric characters    [0-9a-zA-Z]

Each [:something:] class can be used in place of a character in a character class. For instance, to find any character that’s a digit, an uppercase letter, or an at sign (@), use the following regular expression:

[@[:digit:][:upper:]]

However, you can’t use a character class as the endpoint of a range:

ereg('[A-[:lower:]]', 'string'); // invalid regular expression
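The way named classes combine with literal characters inside one bracket expression can be sketched as follows. Because ereg() is not available in PHP 7 and later, this sketch assumes preg_match_all() as a stand-in; PCRE accepts the same [:digit:] and [:upper:] names inside brackets. The sample string is made up for illustration.

```php
<?php
// Assumption: preg_match_all() stands in for the removed ereg functions.
// The bracket expression matches an at sign, any digit, or any uppercase letter.
$count = preg_match_all(
    '/[@[:digit:][:upper:]]/',
    'Mail bob@Example.com by 5pm',
    $matches
);

// $matches[0] lists every single character that matched, in order.
print_r($matches[0]);
```

For this input the matching characters are M, @, E, and 5; everything else is a lowercase letter, a space, or a period.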

Some locales consider certain character sequences as if they were a single character; these are called collating sequences. To match one of these multicharacter sequences in a character class, enclose it with [. and .]. For example, if your locale has the collating sequence ch, you can match s, t, or ch with this character class:

[st[.ch.]]

The final POSIX extension to character classes is the equivalence class, specified by enclosing the character in [= and =]. Equivalence classes match characters that have the same collating order, as defined in the current locale. For example, a locale may define a, á, and ä as having the same sorting precedence. To match any one of them, the equivalence class is [=a=].

Anchors

An anchor limits a match to a particular location in the string (anchors do not match actual characters in the target string). Table 4-8 lists the anchors supported by POSIX regular expressions.

Table 4-7. POSIX character classes (continued)

Class        Description                                                Expansion
[:ascii:]    7-bit ASCII                                                [\x01-\x7F]
[:blank:]    Horizontal whitespace (space, tab)                         [ \t]
[:cntrl:]    Control characters                                         [\x01-\x1F]
[:digit:]    Digits                                                     [0-9]
[:graph:]    Characters that use ink to print (non-space, non-control)  [^\x01-\x20]
[:lower:]    Lowercase letter                                           [a-z]
[:print:]    Printable character (graph class plus space and tab)       [\t\x20-\xFF]
[:punct:]    Any punctuation character, such as the period (.)          [-!"#$%&'()*+,./:;<=>?@[\\]^_`{|}~]
             and the semicolon (;)
[:space:]    Whitespace (newline, carriage return, tab, space,          [\n\r\t \x0B]
             vertical tab)
[:upper:]    Uppercase letter                                           [A-Z]
[:xdigit:]   Hexadecimal digit                                          [0-9a-fA-F]


A word boundary is defined as the point between a whitespace character and an identifier (alphanumeric or underscore) character:

ereg('[[:<:]]gun[[:>:]]', 'the Burgundy exploded'); // returns false
ereg('gun', 'the Burgundy exploded'); // returns true

Note that the beginning and end of a string also qualify as word boundaries.
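The word-boundary behavior can be checked on a current PHP interpreter with a rough equivalent. This sketch assumes PCRE's \b assertion as a substitute for the POSIX [[:<:]] and [[:>:]] anchors; \b does not distinguish the start of a word from the end, so it is an approximation rather than an exact translation.

```php
<?php
// Assumption: preg_match() with \b stands in for ereg() with [[:<:]] and [[:>:]].
$text = 'the Burgundy exploded';

// "gun" appears only inside "Burgundy", so it is not a word on its own.
$asWord = preg_match('/\bgun\b/', $text);

// Without anchors, the substring is found.
$anywhere = preg_match('/gun/', $text);

// The start and end of the string also count as word boundaries.
$wholeString = preg_match('/\bgun\b/', 'gun');

var_dump($asWord, $anywhere, $wholeString);
```

Note that preg_match() returns 1 or 0 rather than true or false, so the results are compared as integers.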

Functions

There are three categories of functions for POSIX-style regular expressions: matching, replacing, and splitting.

Matching

The ereg( ) function takes a pattern, a string, and an optional array. It populates the array, if given, and returns true or false depending on whether a match for the pattern was found in the string:

$found = ereg(pattern, string [, captured ]);

For example:

ereg('y.*e$', 'Sylvie'); // returns true

ereg('y(.*)e$', 'Sylvie', $a); // returns true, $a is array('ylvie', 'lvi')

The zeroth element of the array is set to the portion of the string that matched the entire pattern (here ylvie, because the match begins at the y). The first element is the substring that matched the first subpattern, the second element is the substring that matched the second subpattern, and so on.

The eregi( ) function is a case-insensitive form of ereg( ). Its arguments and return values are the same as those for ereg( ).
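The difference between case-sensitive and case-insensitive matching can be sketched on PHP 7 and later, where ereg( ) and eregi( ) no longer exist. The sketch assumes preg_match() as the replacement, with the PCRE i modifier playing the role of eregi( ); the uppercase sample string is illustrative.

```php
<?php
// Assumption: preg_match() replaces ereg(); the trailing "i" modifier replaces eregi().

// Case-sensitive: the lowercase pattern does not match the uppercase string.
$caseSensitive = preg_match('/y(.*)e$/', 'SYLVIE');

// Case-insensitive: the same pattern now matches, and the captures keep
// the original case of the subject string.
$caseInsensitive = preg_match('/y(.*)e$/i', 'SYLVIE', $a);

var_dump($caseSensitive, $caseInsensitive);
print_r($a);
```

The captured text is copied from the subject string, so it comes back as YLVIE and LVI even though the pattern is written in lowercase.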

Example 4-1 uses pattern matching to determine whether a credit-card number passes the Luhn checksum and whether the digits are appropriate for a card of a specific type.

Table 4-8. POSIX anchors

Anchor     Matches
^          Start of string
$          End of string
[[:<:]]    Start of word
[[:>:]]    End of word

Example 4-1. Credit-card validator

// The Luhn checksum determines whether a credit-card number is syntactically
// correct; it cannot, however, tell if a card with the number has been issued,
// is currently active, or has enough space left to accept a charge.

function IsValidCreditCard($inCardNumber, $inCardType) {
  // Assume it's okay
  $isValid = true;

  // Strip all non-numbers from the string
  $inCardNumber = ereg_replace('[^[:digit:]]', '', $inCardNumber);

  // Make sure the card number and type match
  switch($inCardType) {
    case 'mastercard':
      $isValid = ereg('^5[1-5].{14}$', $inCardNumber);
      break;

    case 'visa':
      $isValid = ereg('^4.{15}$|^4.{12}$', $inCardNumber);
      break;

    case 'amex':
      $isValid = ereg('^3[47].{13}$', $inCardNumber);
      break;

    case 'discover':
      $isValid = ereg('^6011.{12}$', $inCardNumber);
      break;

    case 'diners':
      $isValid = ereg('^30[0-5].{11}$|^3[68].{12}$', $inCardNumber);
      break;

    case 'jcb':
      // Group the four-digit prefixes so both share the start anchor and length
      $isValid = ereg('^3.{15}$|^(2131|1800).{11}$', $inCardNumber);
      break;
  }

  // It passed the rudimentary test; let's check it against the Luhn this time
  if($isValid) {
    // Work in reverse
    $inCardNumber = strrev($inCardNumber);

    // Total the digits in the number, doubling those in odd-numbered positions
    $theTotal = 0;

    for ($i = 0; $i < strlen($inCardNumber); $i++) {
      $theAdder = (int) $inCardNumber[$i];

      // Double the numbers in odd-numbered positions
      if($i % 2) {
        $theAdder <<= 1;
        if($theAdder > 9) { $theAdder -= 9; }
      }

      $theTotal += $theAdder;
    }
