
Matchmaking with regular expressions

Use the power of regular expressions to ease text parsing and processing

By Benedict Chng, JavaWorld.com, 07/13/01

If you've programmed in Perl or any other language with built-in regular-expression
capabilities, then you probably know how much easier regular expressions make text
processing and pattern matching. If you're unfamiliar with the term, a regular expression
is simply a string of characters that defines a pattern used to search for a matching string.

Many languages, including Perl, PHP, Python, JavaScript, and JScript, now support
regular expressions for text processing, and some text editors use regular expressions for
powerful search-and-replace functionality. What about Java? At the time of this writing, a
Java Specification Request that includes a regular expression library for text processing
has been approved; you can expect to see it in a future version of the JDK.

But what if you need a regular expression library now? Luckily, you can download the
open source Jakarta ORO library from Apache.org. In this article, I'll first give you a
short primer on regular expressions, and then I'll show you how to use regular
expressions with the open source Jakarta-ORO API.

Regular expressions 101

Let's start simple. Suppose you want to search for a string with the word "cat" in it; your
regular expression would simply be "cat". Because the pattern can match anywhere inside a
string, "catalog" and "sophisticated" also match; if your search is case-insensitive, so
does "Catherine":

Regular expression: cat


Matches: cat, catalog, Catherine, sophisticated

The period notation

Imagine you are playing Scrabble and need a three-letter word starting with the letter "t"
and ending with the letter "n". Imagine also that you have an English dictionary and will
search through its entire contents for a match using a regular expression. To form such a
regular expression, you would use a wildcard notation -- the period (.) character. The
regular expression would then be "t.n" and would match "tan", "Ten", "tin", and "ton"; it
would also match "t#n", "tpn", and even "t n", as well as many other nonsensical words.
This is because the period character matches any single character, including the space
and the tab character. In Perl 5 syntax it does not match a line break unless the pattern
is compiled in single-line mode:

Regular expression: t.n


Matches: tan, Ten, tin, ton, t n, t#n, tpn, etc.

The bracket notation

To solve the problem of the period's indiscriminate matches, you can specify characters
you consider meaningful with the bracket ("[]") expression, so that only those characters
would match the regular expression. Thus, "t[aeio]n" would just match "tan", "Ten",
"tin", and "ton". "Toon" would not match because you can only match a single character
within the bracket notation:

Regular expression: t[aeio]n


Matches: tan, Ten, tin, ton
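To see the difference between the period and the bracket notation, the following sketch runs both patterns over a few sample words. It uses the JDK's own java.util.regex package rather than Jakarta-ORO so that it runs without extra libraries; the class name WildcardDemo and the helper matchesWord are made up for this illustration.

```java
import java.util.regex.Pattern;

class WildcardDemo {
    // Returns true if the whole word matches the regex, ignoring case.
    static boolean matchesWord(String regex, String word) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(word)
                      .matches();
    }

    public static void main(String[] args) {
        String[] words = {"tan", "Ten", "tin", "ton", "t#n", "t n", "toon"};
        for (String word : words) {
            System.out.println("\"" + word + "\""
                + "  t.n: " + matchesWord("t.n", word)
                + "  t[aeio]n: " + matchesWord("t[aeio]n", word));
        }
    }
}
```

Both patterns reject "toon", because each of them allows exactly one character between the "t" and the "n".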

The OR operator

If you want to match "toon" in addition to all the words matched in the previous section,
you can use the "|" notation, which is basically an OR operator. To match "toon", use the
regular expression "t(a|e|i|o|oo)n". You cannot use the bracket notation here because it
will only match a single character. Instead, use parentheses -- "()". You can also use
parentheses for groupings (more on that later):

Regular expression: t(a|e|i|o|oo)n


Matches: tan, Ten, tin, ton, toon
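The OR operator can be tried out with a short sketch like the one below. It again relies on java.util.regex instead of Jakarta-ORO, and the names AlternationDemo and matchesWord are illustrative only.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // Parentheses group the alternatives; "|" separates them.
    static final Pattern T_N =
        Pattern.compile("t(a|e|i|o|oo)n", Pattern.CASE_INSENSITIVE);

    // Returns true if the whole word matches one of the alternatives.
    static boolean matchesWord(String word) {
        return T_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"tan", "Ten", "ton", "toon", "tun"}) {
            System.out.println(word + ": " + matchesWord(word));
        }
    }
}
```

Only "tun" is rejected here, since "u" is not among the listed alternatives.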

The quantifier notations

Table 1 shows the quantifier notations used to determine how many times a given
notation to the immediate left of the quantifier notation should repeat itself:

Notation   Number of times
*          0 or more times
+          1 or more times
?          0 or 1 time
{n}        Exactly n times
{n,m}      n to m times

Table 1. Quantifier notations

Let's say you want to search for a social security number in a text file. The format for US
social security numbers is 999-99-9999. The regular expression you would use to match
this is shown in Figure 1. Inside brackets, the hyphen ("-") has special meaning: it
indicates a range, so "[0-9]" matches any digit from 0 to 9. To match the literal hyphens
in a social security number, you can therefore escape the "-" character with a backslash
("\"); outside brackets the escape is optional but harmless.

Figure 1. Matches: All social security numbers of the form 123-12-1234

If, in your search, you wish to make the hyphen optional -- if, say, you consider both 999-
99-9999 and 999999999 acceptable formats -- you can use the "?" quantifier notation.
Figure 2 shows that regular expression:

Figure 2. Matches: All social security numbers of the forms 123-12-1234 and
123121234
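Since the figures themselves are not reproduced here, the sketch below shows one way the two social security number patterns could be written, based only on the description above. It uses java.util.regex rather than Jakarta-ORO; the class and method names are hypothetical. Outside brackets a hyphen is an ordinary character in java.util.regex, so it is left unescaped.

```java
import java.util.regex.Pattern;

class SsnDemo {
    // Hyphens required: three digits, two digits, four digits.
    static final Pattern STRICT =
        Pattern.compile("[0-9]{3}-[0-9]{2}-[0-9]{4}");
    // "?" makes each hyphen optional (zero or one occurrence).
    static final Pattern LENIENT =
        Pattern.compile("[0-9]{3}-?[0-9]{2}-?[0-9]{4}");

    static boolean isStrict(String s) { return STRICT.matcher(s).matches(); }
    static boolean isLenient(String s) { return LENIENT.matcher(s).matches(); }

    public static void main(String[] args) {
        String[] samples = {"123-12-1234", "123121234", "123-121234", "12-123-1234"};
        for (String s : samples) {
            System.out.println(s + "  strict: " + isStrict(s)
                + "  lenient: " + isLenient(s));
        }
    }
}
```

Note that making each hyphen optional independently also accepts mixed forms such as 123-121234.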

Let's take a look at another example. One format for US car plate numbers consists of
four numeric characters followed by two letters. The regular expression first comprises
the numeric part, "[0-9]{4}", followed by the textual part, "[A-Z]{2}". Figure 3 shows
the complete regular expression:

Figure 3. Matches: Typical US car plate numbers, such as 8836KV
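Combining the two parts gives "[0-9]{4}[A-Z]{2}". As a rough illustration, the following sketch searches a sentence for the first plate-like substring using java.util.regex (not Jakarta-ORO); PlateDemo, firstPlate, and the sample sentence are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlateDemo {
    // Four digits followed by two uppercase letters.
    static final Pattern PLATE = Pattern.compile("[0-9]{4}[A-Z]{2}");

    // Returns the first plate-like substring, or null if there is none.
    static String firstPlate(String text) {
        Matcher m = PLATE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstPlate("Ticket issued to 8836KV on Main St."));
    }
}
```

The pattern is case-sensitive as written, so a lowercase "8836kv" is not reported.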

The NOT notation


The "^" notation is also called the NOT notation. If used as the first character inside
brackets, "^" negates the set, so the bracket matches any single character that is not
listed. For example, the expression in Figure 4 matches all words except those starting
with the letter X.

Figure 4. Matches: All words except those that start with the letter X
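Figure 4 is not reproduced here, so the pattern in this sketch is only a guess at an expression with the described behavior: a negated bracket for the first character, followed by any number of letters. It uses java.util.regex, and NotXDemo and accepts are hypothetical names.

```java
import java.util.regex.Pattern;

class NotXDemo {
    // First character: anything except "X" or "x"; then zero or more letters.
    static final Pattern NOT_X = Pattern.compile("[^Xx][A-Za-z]*");

    // Returns true if the whole word matches the pattern.
    static boolean accepts(String word) {
        return NOT_X.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"apple", "axe", "Xylophone", "xenon"}) {
            System.out.println(word + ": " + accepts(word));
        }
    }
}
```

Keep in mind that "[^Xx]" accepts any other character in first position, including digits and punctuation, not just letters.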

The parentheses and space notations

Say you're trying to extract the birth month from a person's birthdate. The typical
birthdate is in the following format: June 26, 1951. The regular expression to match the
string would be like the one in Figure 5:

Figure 5. Matches: All dates with the format of Month DD, YYYY

The new "\s" notation is the space notation and matches all blank spaces, including tabs.
If the string matches perfectly, how do you extract the month field? You simply put
parentheses around the month field, creating a group, and later retrieve the value using
the ORO API (discussed in a following section). The appropriate regular expression is in
Figure 6:

Figure 6. Matches: All dates with the format Month DD, YYYY, and extracts Month
field as Group 1
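As an illustration of grouping, the sketch below extracts the month from a date in the Month DD, YYYY format. The exact pattern in Figure 6 is not shown here, so this one is an assumption built from the description, and it uses java.util.regex rather than the ORO API; DateDemo and month are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateDemo {
    // Group 1 captures the month name; "\\s" matches the separating blanks.
    static final Pattern DATE =
        Pattern.compile("([A-Za-z]+)\\s[0-9]{1,2},\\s[0-9]{4}");

    // Returns the month if the whole string is a date, otherwise null.
    static String month(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(month("June 26, 1951"));
    }
}
```

Group 0 always refers to the entire match, so the first pair of parentheses is numbered 1.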

Other miscellaneous notations


To make life easier, some shorthand notations for commonly used regular expressions
have been created, as shown in Table 2:

Notation   Equivalent notation
\d         [0-9]
\D         [^0-9]
\w         [A-Za-z0-9_]
\W         [^A-Za-z0-9_]
\s         [ \t\n\r\f]
\S         [^ \t\n\r\f]

Table 2. Commonly used notations

To illustrate, we can use "\d" for all instances of "[0-9]" we used before, as was the case
with our social security number expressions. The revised regular expression is in Figure
7:

Figure 7. Matches: All social security numbers of the form 123-12-1234

Jakarta-ORO library

Many open source regular expression libraries are available for Java programmers, and
many support the Perl 5-compatible regular expression syntax. I use the Jakarta-ORO
regular expression library because it is one of the most comprehensive APIs available and
is fully compatible with Perl 5 regular expressions. It is also one of the most optimized
APIs around.

The Jakarta-ORO library was formerly known as OROMatcher and has been kindly
donated to the Jakarta Project by Daniel Savarese. You can download the package from a
link in the Resources section below.

The Jakarta-ORO objects

I'll start by briefly describing the objects you need to create and access in order to use this
library, and then I will show how you use the Jakarta-ORO API.

The PatternCompiler object

First, create an instance of the Perl5Compiler class and assign it to a variable of the
PatternCompiler interface type. Perl5Compiler is an implementation of the
PatternCompiler interface and lets you compile a regular expression string into a
Pattern object used for matching:

PatternCompiler compiler=new Perl5Compiler();

The Pattern object

To compile a regular expression into a Pattern object, call the compile() method of the
compiler object, passing in the regular expression. For example, you can compile the
regular expression "t[aeio]n" like so:

Pattern pattern=null;
try {
    pattern=compiler.compile("t[aeio]n");
} catch (MalformedPatternException e) {
    e.printStackTrace();
}

By default, the compiler creates a case-sensitive pattern, so that the above setup only
matches "tin", "tan", "ten", and "ton", but not "Tin" or "taN". To create a case-insensitive
pattern, you would call compile() with an additional mask argument:

pattern=compiler.compile("t[aeio]n",Perl5Compiler.CASE_INSENSITIVE_MASK);

Once you've created the Pattern object, you can use it for pattern matching through the
PatternMatcher interface.

The PatternMatcher object

The PatternMatcher object tests for a match based on the Pattern object and a string.
You instantiate a Perl5Matcher class and assign it to the PatternMatcher interface. The
Perl5Matcher class is an implementation of the PatternMatcher interface and matches
patterns based on the Perl 5 regular expression syntax:

PatternMatcher matcher=new Perl5Matcher();


You can obtain a match using the PatternMatcher object in one of several ways, with
the string to be matched against the regular expression passed in as the first parameter:

• boolean matches(String input, Pattern pattern): Used if the input string
and the regular expression should match exactly; in other words, the regular
expression should totally describe the string input
• boolean matchesPrefix(String input, Pattern pattern): Used if the
regular expression should match the beginning of the input string
• boolean contains(String input, Pattern pattern): Used if the regular
expression should match part of the input string (i.e., should be a substring)

You could also pass in a PatternMatcherInput object instead of a String object to the
above three method calls; if you did so, you could continue matching from the point at
which the last match was found in the string. This is useful when you have many
substrings that are likely to be matched by a given regular expression. The method
signatures with the PatternMatcherInput object instead of String are as follows:

• boolean matches(PatternMatcherInput input, Pattern pattern)
• boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
• boolean contains(PatternMatcherInput input, Pattern pattern)
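For comparison, java.util.regex offers a similar trio of matching modes on its Matcher class: matches(), lookingAt(), and find(). The sketch below is offered only as an analogy to the three ORO methods listed above, not as a description of how Jakarta-ORO is implemented; MatchModesDemo and its method names are invented.

```java
import java.util.regex.Pattern;

class MatchModesDemo {
    static final Pattern T_N = Pattern.compile("t[aeio]n");

    // Whole input must match (compare ORO's matches()).
    static boolean whole(String input) {
        return T_N.matcher(input).matches();
    }

    // Match must start at the beginning (compare matchesPrefix()).
    static boolean prefix(String input) {
        return T_N.matcher(input).lookingAt();
    }

    // Match may occur anywhere in the input (compare contains()).
    static boolean anywhere(String input) {
        return T_N.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"tin", "tinker", "satin", "stun"}) {
            System.out.println(s + "  whole: " + whole(s)
                + "  prefix: " + prefix(s) + "  anywhere: " + anywhere(s));
        }
    }
}
```

Calling find() repeatedly on the same Matcher resumes after the previous match, which plays a role similar to passing a PatternMatcherInput to contains().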

Scenarios for using the API

Now let's discuss some example uses of the Jakarta-ORO library.

Log file processing

Your job: analyze a Web server log file and determine how long each user spends on the
Website. An entry from a typical BEA WebLogic log file looks like this:

172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "GET /IsAlive.htm HTTP/1.0" 200 15

After analyzing this entry, you'll realize that you need to extract two things from the log
file: the IP address and a page's access time. You can use the grouping notation
(parentheses) to extract the IP address field and the timestamp field from the log entry.

Let's first discuss the IP address. It consists of 4 bytes, each with values between 0 and
255; each byte is separated from the others by a period. Thus, in each individual byte in
the IP address, you have at least one and at most three digits. A pattern built on the
digit count alone also accepts out-of-range values such as 999, which is usually
acceptable when reading well-formed log files. You can see the regular expression for
this field in Figure 8:

Figure 8. Matches: IP addresses that consist of 4 bytes, each with values between 0
and 255

You need to escape the period character because you literally want it to be there; you do
not want it read in terms of its special meaning in regular expression syntax, which I
explained earlier.

The log entry's timestamp part is surrounded by square brackets. You can extract
whatever is within these brackets by first searching for the opening square bracket
character ("[") and then matching every character that is not a closing square bracket
("]"), continuing until you reach the closing square bracket. Figure 9 shows the regular
expression for this:

Figure 9. Matches: At least one character until "]" is found

Now you combine these two regular expressions into a single expression with grouping
notation (parentheses) for extraction of your IP address and timestamp. Notice that
"\s-\s-\s" is added in the middle so that matching occurs, although you won't extract
that. You can see the complete regular expression in Figure 10.

Figure 10. Matches: The IP address and timestamp by combining two regular
expressions

Now that you've formulated this regular expression, you can begin writing Java code
using the regular expression library.

Using the Jakarta-ORO library

To begin using the Jakarta-ORO library, first create the regular expression string and the
sample string to parse:

String logEntry="172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
    +"\"GET /IsAlive.htm HTTP/1.0\" 200 15 ";
String regexp="([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})"
    +"\\s-\\s-\\s\\[([^\\]]+)\\]";

The regular expression used here is nearly identical to the one found in Figure 10, with
only one difference: in a Java string literal, you need to escape every backslash ("\").
Figure 10 is not written as Java source, so we need to double each backslash so as not to
cause a compilation error. Unfortunately, this process is error-prone, so do it
carefully. You can type in the regular expression first without escaping the
backslashes, and then visually scan the string from left to right and replace every
occurrence of the "\" character with "\\". To double check, print out the resulting string
to the console.

After initializing the strings, instantiate the PatternCompiler object and create a
Pattern object by using the PatternCompiler to compile the regular expression:

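PatternCompiler compiler=new Perl5Compiler();
// compile() throws the checked MalformedPatternException, so call it
// inside a try block or from a method that declares the exception.
Pattern pattern=compiler.compile(regexp);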

Now, create the PatternMatcher object and call the contains() method in the
PatternMatcher interface to see if you have a match:

PatternMatcher matcher=new Perl5Matcher();
if (matcher.contains(logEntry,pattern)) {
    MatchResult result=matcher.getMatch();
    System.out.println("IP: "+result.group(1));
    System.out.println("Timestamp: "+result.group(2));
}

Next, print out the matched groups using the MatchResult object returned from the
PatternMatcher interface. Since the logEntry string contains the pattern to be matched,
you could expect the following output:

IP: 172.26.155.241
Timestamp: 26/Feb/2001:10:56:03 -0500
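For readers without the Jakarta-ORO jar at hand, the same extraction can be sketched with java.util.regex, reusing the regular expression string from this section unchanged. This is an alternative shown for comparison, not the ORO API; LogEntryDemo and parse are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogEntryDemo {
    // Group 1: dotted IP address; group 2: text between "[" and "]".
    static final Pattern LOG = Pattern.compile(
        "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})"
        + "\\s-\\s-\\s\\[([^\\]]+)\\]");

    // Returns {ip, timestamp}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LOG.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
            + "\"GET /IsAlive.htm HTTP/1.0\" 200 15 ";
        String[] fields = parse(logEntry);
        System.out.println("IP: " + fields[0]);
        System.out.println("Timestamp: " + fields[1]);
    }
}
```

Returning null for a non-matching line lets the caller skip malformed entries instead of failing on them.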

HTML processing

Your next task is to churn through your company's HTML pages and perform an analysis
of all of a font tag's attributes. The typical font tag in your HTML looks like this:

<font face="Arial, Serif" size="+2" color="red">

Your program will print out the attributes for every font tag encountered in the following
format:

face=Arial, Serif
size=+2
color=red

In this case, I would suggest that you use two regular expressions. The first, shown in
Figure 11, extracts the attribute string face="Arial, Serif" size="+2" color="red" from
the font tag:

Figure 11. Matches: The all-attribute part of the font tag

The second regular expression, shown in Figure 12, breaks down each individual attribute
into a name-value pair:

Figure 12. Matches: Each individual attribute, broken down into a name-value pair

Figure 12 breaks the attributes into:

face    Arial, Serif
size    +2
color   red

Let's now discuss the code to achieve this. First, create the two regular expression strings
and compile them into a Pattern object using the Perl5Compiler. Use the
Perl5Compiler.CASE_INSENSITIVE_MASK option here when compiling the regular
expression for a case-insensitive match.

Next, create a Perl5Matcher object to perform matching:

String regexpForFontTag="<\\s*font\\s+([^>]*)\\s*>";
String regexpForFontAttrib="([a-z]+)\\s*=\\s*\"([^\"]+)\"";
PatternCompiler compiler=new Perl5Compiler();
Pattern patternForFontTag=compiler.compile(regexpForFontTag,
    Perl5Compiler.CASE_INSENSITIVE_MASK);
Pattern patternForFontAttrib=compiler.compile(regexpForFontAttrib,
    Perl5Compiler.CASE_INSENSITIVE_MASK);
PatternMatcher matcher=new Perl5Matcher();

Assume you have a variable called html of type String that represents a line in the
HTML file. If the content of the html string contains the font tag, the matcher will return
true, and you'll use the MatchResult object returned from the matcher object to get your
first group, which includes all of your font attributes:

if (matcher.contains(html,patternForFontTag)) {
    MatchResult result=matcher.getMatch();
    String attribs=result.group(1);
    PatternMatcherInput input=new PatternMatcherInput(attribs);
    while (matcher.contains(input,patternForFontAttrib)) {
        result=matcher.getMatch();
        System.out.println(result.group(1)+": "+result.group(2));
    }
}

Next, create a PatternMatcherInput object. As previously mentioned, this object lets
you continue matching from where the last match was found in the string; thus, it's
perfect for extracting the font tag's name-value pair. Create a PatternMatcherInput
object by passing in the string to be matched. Then, use the matcher instance to extract
each font attribute as it is encountered. This is done by repeatedly calling the contains()
method of the PatternMatcher object with the PatternMatcherInput object instead of
a string. Every iteration through the PatternMatcherInput object will advance a pointer
within it, so the next test will start where the previous one left off.

The output of the example is as follows:


face: Arial, Serif
size: +2
color: red
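The same two-step approach can be sketched with java.util.regex, where repeated calls to find() on one Matcher take the place of the PatternMatcherInput loop. The two regular expression strings are the ones used in this section; FontTagDemo and attributes are hypothetical names, and the code is an alternative for comparison rather than the ORO API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontTagDemo {
    // Group 1: everything between "<font" and ">".
    static final Pattern FONT_TAG = Pattern.compile(
        "<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    // Group 1: attribute name; group 2: value without the quotes.
    static final Pattern ATTRIB = Pattern.compile(
        "([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns "name: value" for each attribute of the first font tag.
    static List<String> attributes(String html) {
        List<String> out = new ArrayList<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attr = ATTRIB.matcher(tag.group(1));
            while (attr.find()) {   // each call resumes after the last match
                out.add(attr.group(1) + ": " + attr.group(2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        for (String line : attributes(html)) {
            System.out.println(line);
        }
    }
}
```

Like the original, this handles only double-quoted attribute values; unquoted or single-quoted values would need a broader pattern.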

More HTML processing

Let's continue with another HTML example. This time, imagine that your Web server has
moved from widgets.acme.com to newserver.acme.com. You'll need to change the
links on some of your Webpages from:

<a href="http://widgets.acme.com/interface.html#How_To_Buy">
<a href="http://widgets.acme.com/interface.html#How_To_Sell">
etc.

to

<a href="http://newserver.acme.com/interface.html#How_To_Buy">
<a href="http://newserver.acme.com/interface.html#How_To_Sell">
etc.

The regular expression to perform the search is shown in Figure 13.

Figure 13. Matches: The link "http://widgets.acme.com/interface.html#(any
anchor)"

If this regular expression is found, you can make your substitution for the link in Figure
13 with the following expression:

<a href="http://newserver.acme.com/interface.html#$1">

Notice that you use $1 after the # character. Perl regular expression syntax uses $1, $2,
and so forth to represent groups that have been matched and extracted. The substitution
expression above appends whatever text has been matched and extracted as Group 1 to
the new link.

Now, back to Java. As usual, you must create your testing strings, the necessary object for
compiling the regular expression into a Pattern object, and a PatternMatcher object:

String link="<a href=\"http://widgets.acme.com/interface.html#How_To_Trade\">";
String regexpForLink="<\\s*a\\s+href\\s*=\\s*\""
    +"http://widgets.acme.com/interface.html#([^\"]+)\">";
PatternCompiler compiler=new Perl5Compiler();
Pattern patternForLink=compiler.compile(regexpForLink,
    Perl5Compiler.CASE_INSENSITIVE_MASK);
PatternMatcher matcher=new Perl5Matcher();

Next, use the static method substitute() from the Util class in the
org.apache.oro.text.regex package (com.oroinc.text.regex in the older OROMatcher
releases) for performing a substitution, and print out the resulting string:

String result=Util.substitute(matcher,
    patternForLink,
    new Perl5Substitution(
        "<a href=\"http://newserver.acme.com/interface.html#$1\">"),
    link,
    Util.SUBSTITUTE_ALL);
System.out.println(result);

The syntax of the Util.substitute() method is as follows:

public static String substitute(PatternMatcher matcher,
                                Pattern pattern,
                                Substitution sub,
                                String input,
                                int numSubs)

The first two parameters for this call are the PatternMatcher and Pattern objects
created earlier. The input for the third parameter is a Substitution object that
determines how the substitution is to be performed. In this case, use the
Perl5Substitution object, which lets you use a Perl 5-style substitution. The fourth
parameter is the actual string on which you wish to perform the substitution, and the last
parameter lets you specify whether you wish to substitute on every occurrence of the
pattern found (Util.SUBSTITUTE_ALL) or only substitute a specified number of times.
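A comparable substitution can be sketched with java.util.regex, whose replaceAll() also accepts $1 in the replacement text to refer to Group 1. This is an alternative for readers who want to experiment without Jakarta-ORO; LinkRewriteDemo and rewrite are invented names, and the dots in the host name are escaped here so that they match only literal periods.

```java
import java.util.regex.Pattern;

class LinkRewriteDemo {
    // Group 1 captures the anchor name after "#".
    static final Pattern LINK = Pattern.compile(
        "<\\s*a\\s+href\\s*=\\s*\""
        + "http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\">",
        Pattern.CASE_INSENSITIVE);

    // Rewrites every matching link; "$1" re-inserts the captured anchor.
    static String rewrite(String html) {
        return LINK.matcher(html).replaceAll(
            "<a href=\"http://newserver.acme.com/interface.html#$1\">");
    }

    public static void main(String[] args) {
        String link =
            "<a href=\"http://widgets.acme.com/interface.html#How_To_Trade\">";
        System.out.println(rewrite(link));
    }
}
```

Input that contains no matching link is returned unchanged.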

Author Bio

Benedict Chng is a Sun-certified developer currently consulting in the Boston area. He
hails from sunny and tropical Singapore and has been working in the software
development field for close to four years. His current interests include writing
applications for Palm devices and sightseeing in the New England region.

Express yourself

In this article, I've shown you the powerful features of regular expressions. When used
appropriately, they can help a great deal in string extraction and text changes. I have also
shown how you can incorporate regular expressions into your Java application using the
open source Jakarta-ORO library. Now, it's up to you to decide whether the old string
manipulation approach (using StringTokenizers, charAt, or substring) or a regular
expression library, like Jakarta-ORO, works for you.
