IT 3203
Introduction to Web Development
Regular Expressions
October 12
Copyright © 2007 by Bob Brown
Notice: This session is being recorded.
University Convocation
• Tuesday, October 13, 11:00 AM – 12:15 PM
• Student Center Theatre
• Convocation Speaker: Dr. John Palfrey
• Speaking on “Born Digital in a Network Society”
• Professor at Harvard Law School
• Vice Dean for Library and Information Resources
• Co-author of Born Digital: Understanding the First Generation of Digital Natives and also Access Denied: The Practice and Politics of Internet Filtering
Pattern Matching
Pattern matching in JavaScript is based on regular expressions. Regular expressions are patterns that are compared with strings or substrings.
In reality, regular expressions are a small formal language. Two approaches in JavaScript:
• the RegExp object
• methods of the String object
Why Match Patterns?
• Most data validation that can be done on the client-side consists of testing data for conformance to a pattern.
• Telephone numbers
• Email addresses
• Dates
• Money amounts
• … what else?
The Search Method
search is a method of the String object.
var my_string = "Abernathy";
var my_pos = my_string.search(/er/);
/er/ is a pattern. The search method searches for the pattern in the string.
my_pos becomes 2.
Returns -1 if there is no match.
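A minimal sketch of the two outcomes of search, using the string from the slide. The second pattern, /xy/, is an illustrative assumption chosen because it does not occur in the string.

```javascript
// search returns the index of the first match, or -1 when nothing matches.
const myString = "Abernathy";
const found = myString.search(/er/);    // "er" starts at index 2
const missing = myString.search(/xy/);  // "xy" does not occur: -1
console.log(found, missing);
```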
The Replace Method
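var bobs = "Bob, Bobbie";
bobs = bobs.replace(/Bob/g, "Bill");
The string bobs now contains “Bill, Billbie”.
replace returns a new string and leaves the original unchanged, so the result must be assigned.
/Bob/ is a pattern, but “Bill” is just a string.
The “g” means “global”: replace every match, not only the first.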
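A short sketch contrasting replace with and without the g modifier. Note that replace returns a new string; the variable names here are illustrative.

```javascript
// Strings are immutable: replace returns a new string.
const bobs = "Bob, Bobbie";
const allReplaced = bobs.replace(/Bob/g, "Bill"); // every match replaced
const firstOnly = bobs.replace(/Bob/, "Bill");    // only the first match replaced
console.log(allReplaced); // "Bill, Billbie"
console.log(firstOnly);   // "Bill, Bobbie"
console.log(bobs);        // unchanged: "Bob, Bobbie"
```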
The Match Method
Match is the most general of the methods.
var fruit = "4 apples 3 oranges";
var my_nbrs = fruit.match(/\d/g);
my_nbrs contains ["4", "3"] (it’s an array of strings)
With g: all matches
Without g: first match, plus parenthesized subpatterns
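The difference between the two forms of match can be sketched as follows. The second pattern, /(\d) apples/, is an assumption added here to show a parenthesized subpattern.

```javascript
const fruit = "4 apples 3 oranges";

// With g: an array of every matching substring (as strings, not numbers).
const all = fruit.match(/\d/g);

// Without g: the first match at [0], then each parenthesized subpattern.
const first = fruit.match(/(\d) apples/);

console.log(all);       // ["4", "3"]
console.log(first[0]);  // "4 apples"
console.log(first[1]);  // "4"
```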
\d matches digits (and \D matches non-digits).
Forming Regular Expressions
/ / enclose patterns
“Normal” characters match themselves (e.g. “rabbit”)
Metacharacters have special meanings:
\ | ( ) [ ] { } ^ $ * + ? .
Metacharacters can be included in patterns by escaping with a backslash, like \$ (a “real” dollar sign)
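A brief sketch of escaping. The sample strings are assumptions for illustration; test is a RegExp method that returns true or false.

```javascript
// \$ matches a literal dollar sign.
const price = "Total: $25";
const dollarAt = price.search(/\$/);       // index of the "$" character

// An unescaped period is a metacharacter; an escaped one is a literal period.
const looseHit = /a.c/.test("abc");        // true: . stands for "b"
const strictHit = /a\.c/.test("abc");      // false: needs a real period
const strictDot = /a\.c/.test("a.c");      // true
console.log(dollarAt, looseHit, strictHit, strictDot);
```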
Wildcard Matching
. (period) matches any character except newline
/snow./ matches snows, snowy
/snow./ matches “snowi” in “snowing”
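The wildcard cases above can be checked with a small sketch. The last two inputs are assumptions added to show that the period must match exactly one character and that the character cannot be a newline.

```javascript
const pattern = /snow./;
const inSnows = "snows".match(pattern)[0];     // "snows"
const inSnowing = "snowing".match(pattern)[0]; // "snowi"
const bareSnow = pattern.test("snow");         // false: no character after "snow"
const newline = pattern.test("snow\n");        // false: . does not match newline
console.log(inSnows, inSnowing, bareSnow, newline);
```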
Classes
[ ] (brackets) define classes
/[abc]/ matches a or b or c
/[a-h]/ matches lower-case a through h
^ (circumflex) inverts a class
/[^aeiou]/ matches any character except a, e, i, o, u
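A small sketch of the three kinds of class. The sample words are illustrative assumptions.

```javascript
const hasAbc = /[abc]/.test("cat");               // true: "c" and "a" are in the class
const firstAtoH = "zebra".search(/[a-h]/);        // 1: "e" is the first letter in a-h
const nonVowels = "hello".match(/[^aeiou]/g);     // every character that is not a vowel
console.log(hasAbc, firstAtoH, nonVowels);
```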
Predefined Classes
\x: a backslash followed by a class abbreviation (x stands for the abbreviation letter)
See your textbook or a JavaScript reference
\d matches a digit: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
/\d+\.\d*/ matches one or more digits, a period, then zero or more digits
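A sketch of what /\d+\.\d*/ accepts and rejects. The helper name firstMatch and the sample inputs are assumptions for illustration.

```javascript
// Returns the matched text, or null when the pattern does not match.
function firstMatch(pattern, text) {
  const m = text.match(pattern);
  return m === null ? null : m[0];
}

const decimal = /\d+\.\d*/;
const a = firstMatch(decimal, "3.14"); // "3.14"
const b = firstMatch(decimal, "7.");   // "7." (zero digits after the period is allowed)
const c = firstMatch(decimal, ".5");   // null (needs at least one digit before the period)
const d = firstMatch(decimal, "42");   // null (the period is required)
console.log(a, b, c, d);
```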
Word and Space Characters
Word characters: \w [a-zA-Z0-9_]
Non-word characters: \W [^a-zA-Z0-9_]
Space characters (space, tab, new line): \s
Non-space characters: \S
Capitalization reverses the sense of the predefined class names.
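The four classes can be tried out in a short sketch. The sample strings are assumptions.

```javascript
const words = "big_dog 42!".match(/\w+/g);        // runs of word characters
const nonWord = "a-b".match(/\W/)[0];             // first non-word character
const squeezed = "a b\tc\nd".replace(/\s/g, "");  // remove space, tab, and newline
const firstInk = "  x".search(/\S/);              // index of first non-space character
console.log(words, nonWord, squeezed, firstInk);
```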
Boundary Matches
\b matches the boundary between a word character and a non-word character. It is a zero-length match: in “Foo baz” there is a boundary between “Foo” and the space, but no character is consumed.
/Fred\b/ matches “Fred is” but not “Frederick is…”
/Fred\B/ matches “Frederick is” but not “Fred is…”
/\bis\b/ matches “is” in: This island is beautiful
This allows a whole-words-only search.
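The boundary examples can be verified with a sketch like this one, which reuses the strings from the slide.

```javascript
const fredWord = /Fred\b/.test("Fred is");         // true: boundary after "Fred"
const fredPrefix = /Fred\b/.test("Frederick is");  // false: "e" follows "Fred"
const notWord = /Fred\B/.test("Frederick is");     // true: no boundary after "Fred"
const notPrefix = /Fred\B/.test("Fred is");        // false

// Only the standalone word "is" matches, not the "is" inside "This" or "island".
const isAt = "This island is beautiful".search(/\bis\b/);
console.log(fredWord, fredPrefix, notWord, notPrefix, isAt);
```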
Repetition
* zero or more
+ one or more
? one or none
{ } a count (applies to the pattern character on its left)
/xy{4}z/ == /xyyyyz/
/X*y+z?/
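A quick sketch showing that the count applies only to the character on its left (the y), and that it means exactly four. The test strings are assumptions.

```javascript
const counted = /xy{4}z/;
const exact = counted.test("xyyyyz");     // true: exactly four y's
const tooFew = counted.test("xyyyz");     // false: three y's
const tooMany = counted.test("xyyyyyz");  // false: five y's
const spelled = /xyyyyz/.test("xyyyyz");  // true: the equivalent spelled-out pattern
console.log(exact, tooFew, tooMany, spelled);
```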
Repetition Examples
* zero or more
+ one or more
? one or none
/\d*\.\d+/
/\d*\.?\d*/
Repetition Exercise
/\d*\.\d+/
1. 0.0  2. .25  3. 137  4. 137.  5. 4.5678  6. xyz.123
Can We Fix The Pattern?
/\d+\.?\d*/
1. 0.0  2. .25  3. 137  4. 137.  5. 4.5678  6. xyz.123
Assume we are trying to match “valid” numbers in various combinations with decimal point. Is this any better? (Not much!)
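One way to work the exercise is to run both patterns over the six test cases and look at what each one actually matched. This is a sketch; the helper firstMatch is a hypothetical name, not part of the slides.

```javascript
// Returns the matched text, or null when the pattern does not match.
function firstMatch(pattern, text) {
  const m = text.match(pattern);
  return m === null ? null : m[0];
}

const cases = ["0.0", ".25", "137", "137.", "4.5678", "xyz.123"];
const original = cases.map((c) => firstMatch(/\d*\.\d+/, c));
const revised = cases.map((c) => firstMatch(/\d+\.?\d*/, c));

console.log(original); // ["0.0", ".25", null, null, "4.5678", ".123"]
console.log(revised);  // ["0.0", "25", "137", "137.", "4.5678", "123"]
```

The revised pattern now accepts 137 and 137., but it matches only part of .25 and still finds a match inside xyz.123, which is why it is "not much" better.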
Repetition Exercise: Case 2
/\d+\.?\d*/
2. .25
This expression does match test case 2 at position 1, the digit 2. But…
• the leading decimal point is skipped, because \d+ cannot match it; \d+ then matches 25
• \.? makes (another) decimal point optional
• \d* matches nothing
It also matches within: .25.67! Why?
What about: .25.67.89?
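The two follow-up questions can be explored with a sketch like this. The use of the g modifier in the last line is an addition for illustration.

```javascript
const pattern = /\d+\.?\d*/;

const two = ".25.67".match(pattern);
// \d+ takes "25", \.? takes the second ".", \d* takes "67".
console.log(two[0], two.index);     // "25.67" at index 1

const three = ".25.67.89".match(pattern);
// The pattern allows only one period, so the match stops before ".89".
console.log(three[0], three.index); // "25.67" at index 1

// With g, the leftover "89" is found as a separate match.
const allThree = ".25.67.89".match(/\d+\.?\d*/g);
console.log(allThree);              // ["25.67", "89"]
```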
Repetition Exercise: Case 6
/\d+\.?\d*/
6. xyz.123
This expression does match test case 6 at position 4, the digit 1. But…
• the letters and the decimal point are skipped, because \d+ cannot match them; \d+ then matches 123
• \.? makes (another) decimal point optional
• \d* matches nothing
Another Repetition Exercise
/X*y+z?/
1. Xyyyz  2. Xzzy  3. yyyyz  4. yyyy  5. wxyzz  6. zzzXyzz
Anchors
Specify where to start matching
/^pearl/ Match starts at beginning of string: matches “pearls are...” but not “my pearls...”
^ is the same character as class inversion, but different context, different meaning.
/gold$/ Anchors to end of string: matches “I like gold” but not “sunset is golden”
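The anchor examples can be checked directly. This sketch reuses the strings from the slide.

```javascript
const startsHit = /^pearl/.test("pearls are...");  // true: "pearl" is at the start
const startsMiss = /^pearl/.test("my pearls...");  // false: "pearl" is not at the start
const endsHit = /gold$/.test("I like gold");       // true: "gold" is at the end
const endsMiss = /gold$/.test("sunset is golden"); // false: "en" follows "gold"
console.log(startsHit, startsMiss, endsHit, endsMiss);
```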
Grouping and Alternatives
Parentheses group items.
The pipe or vertical bar matches one of two or more alternatives.
/abc(def|xyz)/ matches abcdef or abcxyz
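A sketch of grouping with alternatives. The third input is an assumption added to show a string that fits neither alternative; the last line shows that the group also records which alternative matched.

```javascript
const pattern = /abc(def|xyz)/;
const first = pattern.test("abcdef");    // true
const second = pattern.test("abcxyz");   // true
const neither = pattern.test("abcdxy");  // false: neither "def" nor "xyz" follows "abc"
const which = "abcxyz".match(pattern)[1]; // the text matched by the parenthesized group
console.log(first, second, neither, which);
```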
Now We Can Fix The Pattern
/^\d*(|\.\d*)?$/
Almost! We are trying to match either a digit or a decimal point:
• If a decimal point, then one or more digits
• Otherwise, an optional decimal point followed by zero or more digits.
Problem: This matches a decimal point all by itself (and, because every part is optional, the empty string too). One way to fix this is with conditional expressions, which are beyond the scope of the course because conditionals are not supported in JavaScript.
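The problem described above can be confirmed by testing the pattern against the earlier cases plus a lone decimal point and an empty string. The second pattern below is not from the slides: it is an assumed alternative that uses alternation instead of conditionals, offered only as a sketch of one possible fix.

```javascript
const cases = ["0.0", ".25", "137", "137.", "4.5678", "xyz.123", ".", ""];

// Pattern from the slide.
const slidePattern = /^\d*(|\.\d*)?$/;

// Assumed alternative: digits with an optional fraction, OR a period
// that must be followed by at least one digit.
const alternative = /^(\d+\.?\d*|\.\d+)$/;

const slideResults = cases.map((c) => slidePattern.test(c));
const altResults = cases.map((c) => alternative.test(c));

console.log(slideResults); // [true, true, true, true, true, false, true, true]
console.log(altResults);   // [true, true, true, true, true, false, false, false]
```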
A Closer Look
• Anchored at the beginning of the string
• Zero or more digits
• A group containing either nothing, or a
decimal point and zero or more digits,
• Repeated zero or one times.
• Anchored at the end of the string
/^\d*(|\.\d*)?$/
Did That Work?
/^\d*(|\.\d*)?$/
1. 0.0  2. .25  3. 137  4. 137.  5. 4.5678  6. xyz.123  7. .
Modifiers
Follow the pattern:
g global
i case-insensitive
/buffalo/i
Matches “Buffalo” and “buffalo”
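A sketch of the i modifier alone and combined with g. The sentence used in the last line is an illustrative assumption.

```javascript
const insensitive = /buffalo/i.test("Buffalo");  // true: case is ignored
const sensitive = /buffalo/.test("Buffalo");     // false: "B" is not "b"

// Modifiers can be combined: gi finds every match regardless of case.
const herd = "Buffalo buffalo BUFFALO".match(/buffalo/gi);
console.log(insensitive, sensitive, herd.length);
```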
The Split Method
Splits a string into substrings
Returns an array of substrings
var my_str = "grapes:apples:oranges";
var fruit = my_str.split(":");
fruit is ["grapes", "apples", "oranges"]
Split can take a regular expression as a delimiter
Split with a Regular Expression
Splitting a comma-delimited string:
var my_nbrs = "12,34,56";
var nbr_array = my_nbrs.split(",");
What about this?
var my_nbrs = "12, 3,4, 56";
var nbr_array = my_nbrs.split(/\s*,\s*/);
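The difference between a plain-string delimiter and a regular-expression delimiter shows up when the input has uneven spacing. A minimal sketch, with illustrative variable names:

```javascript
const myNbrs = "12, 3,4, 56";

// A plain comma leaves the stray spaces attached to the pieces.
const byComma = myNbrs.split(",");

// /\s*,\s*/ treats a comma plus any surrounding whitespace as the delimiter.
const byPattern = myNbrs.split(/\s*,\s*/);

console.log(byComma);   // ["12", " 3", "4", " 56"]
console.log(byPattern); // ["12", "3", "4", "56"]
```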
A 7-Digit Phone Number
How does this work?
var ok = phNum.search(/\d{3}-\d{4}/);
What does the search method return for this?
555-1212
What about this?
444555-12123456
“Anchoring” the beginning and end gives an expression that works:
var ok = phNum.search(/^\d{3}-\d{4}$/);
No match here!
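The two questions above can be answered by running both patterns on both inputs. This is a sketch; the variable names are illustrative.

```javascript
const loose = /\d{3}-\d{4}/;
const anchored = /^\d{3}-\d{4}$/;

const goodLoose = "555-1212".search(loose);            // 0: match at the start
const badLoose = "444555-12123456".search(loose);      // 3: finds "555-1212" inside
const goodAnchored = "555-1212".search(anchored);      // 0: the whole string fits
const badAnchored = "444555-12123456".search(anchored); // -1: extra digits rejected
console.log(goodLoose, badLoose, goodAnchored, badAnchored);
```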
10-Digit Phone Number
Can it be extended for Atlanta-style phone numbers?
var ok = phNum.search(/^\d{3}-\d{3}-\d{4}$/);
10-Digit Phone Number
Can the format be made less rigid? (Yes!)
/^\(?\d{3}\D*\d{3}\D*\d{4}$/
• Anchor at the beginning of the string
• Optional left parenthesis
• Three digits
• Optional non-digits
• Three digits
• Optional non-digits
• Four digits
• Anchored at the end of the string.
Accepting Free-Form Phone Numbers
Parentheses act as grouping and storage operators.
var ok = datum.search(/^\(?\d{3}\D*\d{3}\D*\d{4}$/);
if (ok == 0) {
  // parts[1], parts[2], parts[3] hold the three parenthesized groups
  var parts = datum.match(/^\(?(\d{3})\D*(\d{3})\D*(\d{4})$/);
  output.value = '(' + parts[1] + ') ' + parts[2] + '-' + parts[3];
}
Accepts: 404-555-1234, 4045551234, (404) 555-1234, etc.
Returns: (404) 555-1234
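The same idea can be packaged as a standalone function that does not depend on form fields. This is a sketch: formatPhone is a hypothetical name, and returning null for bad input is a design choice made here, not something the slides specify.

```javascript
// Normalizes a free-form 10-digit phone number to "(AAA) PPP-NNNN".
// Returns null when the input does not fit the pattern.
function formatPhone(datum) {
  const parts = datum.match(/^\(?(\d{3})\D*(\d{3})\D*(\d{4})$/);
  if (parts === null) {
    return null;
  }
  return "(" + parts[1] + ") " + parts[2] + "-" + parts[3];
}

console.log(formatPhone("404-555-1234"));   // "(404) 555-1234"
console.log(formatPhone("4045551234"));     // "(404) 555-1234"
console.log(formatPhone("(404) 555-1234")); // "(404) 555-1234"
console.log(formatPhone("555-1234"));       // null: only seven digits
```

Because match returns null on failure, a single call replaces the separate search and match steps.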
Regular Expressions as NFAs
• “Nondeterministic Finite Automata”
• Nondeterministic is not the same as “random”
• Each part of a regular expression will match as much as it can.
• .* matches to end of string!
• The regular expression engine backtracks when necessary, i.e. when a match would otherwise fail.
Regular Expressions are Greedy
A regular expression will match as much of the target string as possible.
/2.*2/
19202122232425252627282930313233
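A sketch of the /2.*2/ example on the digit string from the slide. The match runs from the first 2 to the last 2, because .* first takes everything to the end of the string and then backtracks only as far as needed.

```javascript
const target = "19202122232425252627282930313233";
const m = target.match(/2.*2/);

// The match starts at the first "2" and ends at the last "2" in the string.
const firstTwo = target.indexOf("2");
const lastTwo = target.lastIndexOf("2");
console.log(m.index === firstTwo);                     // true
console.log(m.index + m[0].length - 1 === lastTwo);    // true
console.log(m[0]);
```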
Regular Expressions are Greedy
Consider parsing HTML with a regular expression.
/<b>.*<\/b>/
Stars by the <b>billions</b> and <b>billions</b>.
Friedl, J. Mastering Regular Expressions
Regular Expressions are Greedy
Consider parsing HTML with a regular expression.
The ? is also the “lazy” modifier:
/<b>.*<\/b>/ (greedy)
/<b>.*?<\/b>/ (lazy)
Friedl, J. Mastering Regular Expressions
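The greedy and lazy forms can be compared on the sample sentence. This is a sketch; the use of g in the last line is an addition to show that the lazy form finds each bold element separately.

```javascript
const html = "Stars by the <b>billions</b> and <b>billions</b>.";

// Greedy: .* runs to the last </b> in the string.
const greedy = html.match(/<b>.*<\/b>/)[0];

// Lazy: .*? stops at the first </b> it can.
const lazy = html.match(/<b>.*?<\/b>/)[0];

// Lazy plus g: one match per bold element.
const each = html.match(/<b>.*?<\/b>/g);

console.log(greedy); // "<b>billions</b> and <b>billions</b>"
console.log(lazy);   // "<b>billions</b>"
console.log(each.length); // 2
```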