Regular Expressions in Java
2010-09-22
Birgit Grohe
Content of this lecture
• A very small Java program
• Regular expressions in Java
• Metacharacters
• Character classes and boundaries
• Quantifiers
• Backreferences
• Flag Expressions and Modifiers
• Summary
Programming in Java
• Java is an object-oriented programming language
• In some languages, the first step is to write small
programs from scratch (e.g. Perl).
• Learning Java is largely about learning how to use objects, classes and packages, often before you write your own.
• A Java program is first compiled into a .class file, then you can run the program (remember lab 1!)
• This differs from Perl, where an interpreter takes care of both compilation and execution.
"Hello, world!" in Java
public class Hello {
    public static void main(String[] args) {
        // Printing to a terminal window
        System.out.println("Hello, world!");
    }
}
>javac Hello.java
>java Hello
Hello, world!
The code shows a class definition, a comment and a method.
Regular Expressions in Java
• The package java.util.regex consists of the classes Pattern, Matcher and PatternSyntaxException.
• A Pattern object is a compiled representation of a regular expression.
• A Matcher object is the engine that interprets the pattern and performs match operations against an input string.
• A PatternSyntaxException signals a syntax error in a regular expression.
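The interplay of the three classes can be sketched as follows. This is only a minimal illustration: the class name RegexBasics and the helpers findAll and isValid are made up for this example and are not part of java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of the lecture material.
class RegexBasics {
    // Collects every match of regex in input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);  // may throw PatternSyntaxException
        Matcher matcher = pattern.matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // Returns true if the regex compiles, false on a syntax error.
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(findAll("cat.", "cats and cat5")); // [cats, cat5]
        System.out.println(isValid("(unclosed"));              // false
    }
}
```

Pattern.compile is the only place where a syntax error can surface, so that is the call to guard when the regex comes from a user.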
Example
• The next slide shows the Java code of a class for regular expression processing.
• It reads a regular expression and an input string from the user.
• The output is the list of matches, if any.
• The class is taken from a Java regular expression tutorial: http://download.oracle.com/javase/tutorial/essential/regex/index.html
The class will be used in lab 5!
import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {
    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) { System.err.println("No console."); System.exit(1); }
        while (true) {
            // Compile the regex typed by the user, then match it against the input string
            Pattern pattern = Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher = pattern.matcher(console.readLine("Enter input string to search: "));
            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text \"%s\" starting at index %d and ending at index %d.%n",
                        matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if (!found) { console.format("No match found.%n"); }
        }
    }
}
From a Java regexp tutorial, see references.
Pattern pattern = Pattern.compile(console.readLine("%nEnter your regex: "));
Matcher matcher = pattern.matcher(console.readLine("Enter input string to search: "));
boolean found = false;
while (matcher.find()) {
    console.format("I found the text \"%s\" starting at index %d and ending at index %d.%n",
            matcher.group(), matcher.start(), matcher.end());
}

Enter your regex: foo
Enter input string to search: foo
I found the text "foo" starting at index 0 and ending at index 3.

Enter your regex: cat.
Enter input string to search: cats
I found the text "cats" starting at index 0 and ending at index 4.

Format specifiers: %n = newline, %s = string, %d = number
Metacharacters
There are characters with a special meaning within regular expressions in Java:
. * ? + [ ] ( ) { } ^ $ | \ -
To use their literal meanings:
• use the escape symbol \
• or the escape sequence \Q <text> \E
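Both ways of escaping can be tried out with a short sketch. The class name EscapeDemo and the helper found are invented for this illustration. Note that inside a Java string literal the backslash itself must be doubled, so the regex \. is written "\\.".

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lecture material.
class EscapeDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Unescaped: '.' matches any character, so "1x5" is accepted.
        System.out.println(found("1.5", "1x5"));             // true
        // Escaped with \ : only a literal dot is accepted.
        System.out.println(found("1\\.5", "1x5"));           // false
        System.out.println(found("1\\.5", "1.5"));           // true
        // Unquoted, ( + ) are metacharacters: a group with "one or more a".
        System.out.println(found("(a+b)", "aab"));           // true
        // Quoted with \Q ... \E: everything in between is literal.
        System.out.println(found("\\Q(a+b)\\E", "(a+b)"));   // true
        // Pattern.quote builds the same kind of literal pattern.
        System.out.println(found(Pattern.quote("(a+b)"), "aab")); // false
    }
}
```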
Character Classes
• Simple character classes: [abc]
• Negation: [^abc]
• Ranges: [a-d]
• Union: [a-d[m-p]] (same as [a-dm-p])
• Intersection: [a-z&&[def]] (d, e or f)
• Subtraction: [a-z&&[^bc]] (same as [ad-z])
Predefined Character Classes
• Digit: [0-9] or \d
• Non-digit: [^0-9] or \D
• Whitespace character: [ \t\n\x0B\f\r] or \s
• Word character: [a-zA-Z_0-9] or \w
• Other negations: \S \W
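The character classes above can be checked with a small sketch. The class name CharClassDemo and the helper matches are made up for this example; Pattern.matches tests whether the whole input matches the regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lecture material.
class CharClassDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-d[m-p]]", "n"));     // union: true
        System.out.println(matches("[a-z&&[def]]", "e"));   // intersection: true
        System.out.println(matches("[a-z&&[^bc]]", "b"));   // subtraction: false
        System.out.println(matches("\\d\\d", "42"));        // two digits: true
        System.out.println(matches("\\w+\\s\\w+", "hello world")); // true
        System.out.println(matches("\\D", "7"));            // non-digit: false
    }
}
```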
Boundary Matchers
• The beginning of a line: ^
• The end of a line: $
• Word boundary: \b
• The beginning of the input: \A
• The end of the previous match: \G
• The end of the input: \z
• For more matchers see literature!
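A short sketch of how some of the boundary matchers restrict where a match may start. The class name BoundaryDemo, the helper starts and the sample input are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lecture material.
class BoundaryDemo {
    // Returns the start index of every match of regex in input.
    static List<Integer> starts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "dog dogma hotdog dog";
        System.out.println(starts("dog", input));        // [0, 4, 13, 17] every occurrence
        System.out.println(starts("\\bdog\\b", input));  // [0, 17] whole words only
        System.out.println(starts("^dog", input));       // [0] only at the beginning
        System.out.println(starts("dog$", input));       // [17] only at the end
        System.out.println(starts("\\Gdog ?", input));   // [0, 4] each match must continue the previous one
    }
}
```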
Interesting, since quantifiers in Java work slightly differently compared to Perl.
Quantifiers
Greedy   Reluctant   Possessive   Meaning
X?       X??         X?+          X, once or not at all
X*       X*?         X*+          X, zero or more times
X+       X+?         X++          X, one or more times
X{n}     X{n}?       X{n}+        X, exactly n times
More alternatives: X{n,} and X{n,m}
Greedy Quantifiers
Enter your regex: a?
Enter input string to search: aaaa
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 1 and ending at index 2.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "a" starting at index 3 and ending at index 4.
I found the text "" starting at index 4 and ending at index 4.

Enter your regex: a*
Enter input string to search: aaaa
I found the text "aaaa" starting at index 0 and ending at index 4.
I found the text "" starting at index 4 and ending at index 4.

Enter your regex: a+
Enter input string to search: aaaa
I found the text "aaaa" starting at index 0 and ending at index 4.

Multiple matches! Greedy! ? and * also match the empty string "".
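The zero-length matches can be reproduced without the interactive harness. The following is a sketch only; the class name GreedyDemo and the helper spans are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lecture material.
class GreedyDemo {
    // Lists each match as "text"[start,end), including zero-length matches.
    static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add("\"" + m.group() + "\"[" + m.start() + "," + m.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("a?", "aaaa")); // four "a" matches plus one empty match
        System.out.println(spans("a*", "aaaa")); // one long match plus one empty match
        System.out.println(spans("a+", "aaaa")); // one long match, no empty match
    }
}
```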
Greedy Quantifiers
Enter your regex: (cat){3}
Enter input string to search: catcatcatcatcatcat
I found the text "catcatcat" starting at index 0 and ending at index 9.
I found the text "catcatcat" starting at index 9 and ending at index 18.

Enter your regex: cat{3}
Enter input string to search: catcatcatcatcatcat
No match found.

Enter your regex: a{3,5}
Enter input string to search: aaaaaaaa
I found the text "aaaaa" starting at index 0 and ending at index 5.
I found the text "aaa" starting at index 5 and ending at index 8.

Greedy! Grouping strings for quantifiers with ( )
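Why cat{3} finds nothing here becomes clearer when the two patterns are tested on other inputs: without parentheses the quantifier applies only to the last character. A small sketch, with the class name GroupingDemo and the helper matches chosen for this example:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lecture material.
class GroupingDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("(cat){3}", "catcatcat")); // true: the group is repeated
        System.out.println(matches("cat{3}", "catcatcat"));   // false: only 't' is repeated
        System.out.println(matches("cat{3}", "cattt"));       // true: "ca" followed by three 't'
    }
}
```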
Reluctant and Possessive
Quantifiers
Enter your regex: .*foo // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.

Reluctant: tries to finish as early as possible.
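The same comparison as a sketch in code, looking only at the first match of each pattern. The class name QuantifierModes and the helper first are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lecture material.
class QuantifierModes {
    // Returns the first match of regex in input, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(first(".*foo", input));   // greedy: xfooxxxxxxfoo
        System.out.println(first(".*?foo", input));  // reluctant: xfoo
        System.out.println(first(".*+foo", input));  // possessive: null (no match)
    }
}
```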
Summary Quantifiers
• The greedy quantifier tries to match as much as it can, up to the end of the string. If the rest of the pattern then fails, it backs off one character at a time and tries again until a match is found or nothing is left to give back (= no match).
• The reluctant quantifier tries to match as little as possible, taking one more character at a time until a match is found or the end of the input string is reached (= no match).
• The possessive quantifier consumes as much as it can (for .* the entire string) once and, if the match does not succeed, it just stops without backing off. Fast performance!
Backreferences
Backreferences work approximately the same as in Perl, i.e. those parts of the regular expression that are placed in ( ) can be accessed with \1, \2 ...
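A minimal sketch of a backreference in use: finding a doubled word. The class name BackrefDemo, both helpers and the sample sentence are assumptions for this example. Inside the regex the group is referenced as \1, whereas in a replacement string Java uses $1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lecture material.
class BackrefDemo {
    // A word, whitespace, then the same word again (\1 refers to group 1).
    static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first doubled word, or null if there is none.
    static String firstRepeatedWord(String input) {
        Matcher m = DOUBLED.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Replaces each doubled word by a single copy ($1 in the replacement).
    static String removeRepeats(String input) {
        return DOUBLED.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "Paris in the the spring";
        System.out.println(firstRepeatedWord(input)); // the
        System.out.println(removeRepeats(input));     // Paris in the spring
    }
}
```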
Modifiers
Java has features similar to the modifiers in Perl. There are two ways to use them:
• Embedded flag expression (the flag is given inside the regular expression)
• Flags and methods from the Pattern class (extra code and function calls required)
More modifiers can be found in the Java Regexp tutorial.
Embedded Flag Expressions
Example: case insensitivity
Enter your regex: (?i)foo
Enter input string to search: FOOfooFoO
I found the text "FOO" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "FoO" starting at index 6 and ending at index 9.
Methods from the Pattern Class
Example: Case insensitivity
Pattern pattern = Pattern.compile(
console.readLine("%nEnter your regex: "),
Pattern.CASE_INSENSITIVE);
Enter your regex: dog
Enter input string to search: DoGDOg
I found the text "DoG" starting at index 0 and ending at index 3.
I found the text "DOg" starting at index 3 and ending at index 6.
Modify the code!
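The two ways of switching on case insensitivity can be compared side by side. This is a sketch only; the class name CaseFlagDemo and the helper count are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lecture material.
class CaseFlagDemo {
    // Counts the matches of pattern in input.
    static int count(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "FOOfooFoO";
        Pattern embedded = Pattern.compile("(?i)foo");                      // flag inside the regex
        Pattern flagged = Pattern.compile("foo", Pattern.CASE_INSENSITIVE); // flag passed to compile
        Pattern plain = Pattern.compile("foo");                             // no flag
        System.out.println(count(embedded, input)); // 3
        System.out.println(count(flagged, input));  // 3
        System.out.println(count(plain, input));    // 1
    }
}
```

The embedded form is convenient when the regex comes from the user, since no code has to change; the flag argument keeps the regex itself free of options.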
Other Modifiers and Flags
The Pattern and Matcher classes support features similar to those in Perl, e.g. split, several different substitution methods (called 'replacement' in Java), comments, line versus file mode, etc.
Please read the Java Regexp tutorial for more details!
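A few of these features in one sketch. The class name OtherFeaturesDemo, the helper names and the sample inputs are invented for this illustration; the calls themselves (Pattern.split, Matcher.replaceAll, Pattern.MULTILINE, Pattern.COMMENTS) belong to java.util.regex.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the lecture material.
class OtherFeaturesDemo {
    // split: break the input at every comma, ignoring surrounding spaces.
    static String[] splitOnCommas(String input) {
        return Pattern.compile("\\s*,\\s*").split(input);
    }

    // Replacement: reorder a yyyy-mm-dd date; $1..$3 refer to the groups.
    static String toDayMonthYear(String isoDate) {
        Matcher m = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})").matcher(isoDate);
        return m.replaceAll("$3/$2/$1");
    }

    // Line mode: with MULTILINE, ^ matches at the start of every line.
    static int linesStartingWith(String prefix, String text) {
        Matcher m = Pattern.compile("^" + Pattern.quote(prefix), Pattern.MULTILINE).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Comments: with COMMENTS, whitespace and #-comments in the regex are ignored.
    static boolean isNumber(String input) {
        return Pattern.compile("\\d+   # one or more digits", Pattern.COMMENTS)
                      .matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnCommas("a , b,c"))); // [a, b, c]
        System.out.println(toDayMonthYear("2010-09-22"));              // 22/09/2010
        System.out.println(linesStartingWith("b", "a\nb\nbc"));        // 2
        System.out.println(isNumber("42"));                            // true
    }
}
```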
Summary
• Java provides a package for regular expressions: java.util.regex
• The syntax and usage of regular expressions in Perl and Java are similar.
• There are minor differences in the regular expression engine, e.g. in how the quantifiers are implemented.
• Java and Perl provide similar features, e.g. classes and functions, and you will explore some differences in lab 5.