Regular Expressions

Regular expressions are used either to check (validate) text against a pattern or to extract parts of a text.

Purpose: Checking
Input: Markus
Requirements: Is it a valid username? No special characters, digits are allowed, and it must not be empty.
Regular Expression: [a-zA-Z0-9]+

Purpose: Extracting
Input: Product=BigMac, Price=2,50
Requirements: Extract the product name and the price.
Regular Expression: Product=([^,]+), Price=([0-9]+),([0-9]+)

Usage in Java

00: import java.util.regex.Pattern;
01: import java.util.regex.Matcher;
02: 
03: public class SomeClass
04: {
05:	public static void main(String[] args)
06:	{
07:		String line = "Product=BigMac, Price=2,50";
08:
09:		Pattern p = Pattern.compile("Product=([^,]+), Price=([0-9]+),([0-9]+)"); // compile the pattern once and reuse it; a Pattern is immutable
10:	
11:		Matcher m = p.matcher(line);
12:
13:		if(m.find()) // use while(m.find()) if the pattern can occur more than once per line
14:		{
15:			String prodName = m.group(1); // capturing group number 1
16:			String priceStr = m.group(2) + "." + m.group(3);
17:			double price = Double.parseDouble(priceStr);
18:
19:			// use the extracted information...
20:			System.out.println(prodName);
21:			System.out.println(price);
22:		}
23:	}
24:}
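
For the checking case from the introduction, a minimal sketch could use matches() instead of find(): matches() succeeds only if the whole input fits the pattern, whereas find() also accepts a match somewhere inside the input. The class name UsernameCheck and the method isValidUsername are made up for illustration.

```java
import java.util.regex.Pattern;

class UsernameCheck
{
	// compiled once and reused; letters and digits only, at least one character
	private static final Pattern USERNAME = Pattern.compile("[a-zA-Z0-9]+");

	static boolean isValidUsername(String input)
	{
		// matches() requires the whole input to fit the pattern
		return USERNAME.matcher(input).matches();
	}

	public static void main(String[] args)
	{
		System.out.println(isValidUsername("Markus"));  // true
		System.out.println(isValidUsername("Markus!")); // false, ! is not in the class
		System.out.println(isValidUsername(""));        // false, + needs at least one character
	}
}
```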

Basic Patterns

This list is far from complete; it is meant as a quick reference.

Arbitrary text

Literal text in a regular expression matches itself: it is simply searched for in the input.

Notes on escaping

Any special character (listed later, e.g. [ or ]) must be escaped with a backslash \ to be matched literally.
Important: Java also uses \ for escaping inside string literals, so the backslash itself must be doubled. A properly escaped [ in a Java string looks like this: "\\[".

Sample

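Input Product[BigMac] Price[2,50]
Pattern Price\[[0-9]+,[0-9]+\]
Java Pattern p = Pattern.compile("Price\\[[0-9]+,[0-9]+\\]");

Sample Pattern Explained

Price\[ Matches the literal text Price[. The square bracket [ must be escaped because it is a special character.
[0-9]+ Matches 1 or more digits (the part of the price before the comma).
, Matches the comma inside the price.
[0-9]+ Matches 1 or more digits (the part of the price after the comma).
\] Matches the closing square bracket ].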
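
A hedged sketch of the escaping rules in Java. Because the price in the sample input contains a comma, the pattern here matches digits, a comma, and digits between the escaped brackets, and it adds a capturing group around the price. The class name EscapeSample and the method findPrice are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeSample
{
	// regex: Price\[([0-9]+,[0-9]+)\]   (each \ is doubled inside the Java string)
	private static final Pattern PRICE = Pattern.compile("Price\\[([0-9]+,[0-9]+)\\]");

	// returns the text between the brackets, or null if there is no match
	static String findPrice(String line)
	{
		Matcher m = PRICE.matcher(line);
		return m.find() ? m.group(1) : null;
	}

	public static void main(String[] args)
	{
		System.out.println(findPrice("Product[BigMac] Price[2,50]")); // 2,50
	}
}
```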

Any character

The dot . matches any single character (by default, any character except a line terminator).

Sample

Input <word conf="1.0" end="1234">That</word>
Pattern <word .* end="([0-9]+)"
Java Pattern p = Pattern.compile("<word .* end=\"([0-9]+)\"");

Sample Pattern Explained

<word Matches the literal text <word, followed by a space.
.* Matches 0 or more arbitrary characters. Note: .* is greedy, i.e. it takes as many characters as it can, so use it carefully, preferably only once per pattern.
end=" Matches the text end=". The double quotes must be escaped in the Java string, as they would otherwise close the string.
([0-9]+) Introduces a capturing group containing 1 or more digits.
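
The following sketch runs the sample pattern and illustrates why .* needs care: it is greedy, so when two <word> elements share a line it runs on to the last end attribute. The second input line, the class name DotSample and the method capturedEnd are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotSample
{
	private static final Pattern END = Pattern.compile("<word .* end=\"([0-9]+)\"");

	// returns the captured end value of the first match, or null if nothing matches
	static String capturedEnd(String line)
	{
		Matcher m = END.matcher(line);
		return m.find() ? m.group(1) : null;
	}

	public static void main(String[] args)
	{
		String one = "<word conf=\"1.0\" end=\"1234\">That</word>";
		// hypothetical line holding two elements
		String two = one + "<word conf=\"0.9\" end=\"5678\">is</word>";

		System.out.println(capturedEnd(one)); // 1234
		System.out.println(capturedEnd(two)); // 5678, the greedy .* ran past end="1234"
	}
}
```

If every element should be found on its own, a negated character class such as [^>]* in place of .* is one way to keep the match inside a single tag.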

Character classes

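A character class, written in square brackets, matches exactly one character out of the listed set. Ranges can be abbreviated: 0-9, a-z, A-Z. A ^ directly after the opening bracket negates the class, so [^<] matches any character except <.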

Sample

Input a=123 b=456
Pattern [ab]=([0-9]+)
Java Pattern p = Pattern.compile("[ab]=([0-9]+)");

Input <word>Your phone number is 123</word>
Pattern <word>([^<]+)</word>
Java Pattern p = Pattern.compile("<word>([^<]+)</word>");

Sample Pattern Explained - [ab]=([0-9]+)

[ab] A single a or a single b.
([0-9]+) Introduces a capturing group containing 1 or more digits.

Sample Pattern Explained - <word>([^<]+)</word>

([^<]+) Introduces a new capturing group, containing 1 or more characters except <.
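
Both sample patterns can be tried out with a short sketch like the one below. The class name CharClassSample and the methods values and wordText are hypothetical helpers, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassSample
{
	private static final Pattern ASSIGNMENT = Pattern.compile("[ab]=([0-9]+)");
	private static final Pattern WORD = Pattern.compile("<word>([^<]+)</word>");

	// collects the digits of every a=... or b=... found in the line
	static List<String> values(String line)
	{
		List<String> result = new ArrayList<String>();
		Matcher m = ASSIGNMENT.matcher(line);
		while (m.find())
		{
			result.add(m.group(1));
		}
		return result;
	}

	// returns the text between <word> and </word>, or null if there is no match
	static String wordText(String line)
	{
		Matcher m = WORD.matcher(line);
		return m.find() ? m.group(1) : null;
	}

	public static void main(String[] args)
	{
		System.out.println(values("a=123 b=456"));                               // [123, 456]
		System.out.println(wordText("<word>Your phone number is 123</word>"));   // Your phone number is 123
	}
}
```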

Quantifiers

We already used + and * to specify multiple occurrences of the element just before them. Definitions:
? 0 or 1 occurrence. The element before it is optional.
+ 1 or more occurrences. The element before it must occur at least once.
* 0 or more occurrences.
{N} The element before it must occur exactly N times.
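
As a quick check of these definitions, the sketch below tests one small pattern per quantifier against whole inputs. The patterns and inputs (colou?r, [0-9]{4} and so on) are made-up examples, and QuantifierSample and fits are hypothetical names.

```java
import java.util.regex.Pattern;

class QuantifierSample
{
	// true if the whole input fits the regex; Pattern.matches compiles the
	// pattern on every call, which is fine for a demo but not for a hot loop
	static boolean fits(String regex, String input)
	{
		return Pattern.matches(regex, input);
	}

	public static void main(String[] args)
	{
		System.out.println(fits("colou?r", "color"));   // true, the u is optional
		System.out.println(fits("colou?r", "colour"));  // true
		System.out.println(fits("[0-9]+", ""));         // false, + needs at least one digit
		System.out.println(fits("[0-9]*", ""));         // true, * also accepts zero digits
		System.out.println(fits("[0-9]{4}", "2005"));   // true, exactly four digits
		System.out.println(fits("[0-9]{4}", "205"));    // false
	}
}
```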

Capturing groups

By enclosing an expression in parentheses ( ), we introduce a capturing group. It marks the part of the text we are interested in.
See line 15 of the sample code at the top for how to retrieve a group.
The groups are numbered from 1; group 0 is the complete match.
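
A small sketch of the group numbering, reusing the pattern from the sample code at the top and looping with while(m.find()) over several matches in one line. The input line with two products, the class name GroupSample and the methods items and wholeMatch are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSample
{
	private static final Pattern ITEM =
		Pattern.compile("Product=([^,]+), Price=([0-9]+),([0-9]+)");

	// collects "name price" for every match in the line
	static List<String> items(String line)
	{
		List<String> result = new ArrayList<String>();
		Matcher m = ITEM.matcher(line);
		while (m.find())
		{
			// groups 1 to 3 are counted by their opening parenthesis, left to right
			result.add(m.group(1) + " " + m.group(2) + "." + m.group(3));
		}
		return result;
	}

	// group 0 is the complete text matched by the pattern
	static String wholeMatch(String line)
	{
		Matcher m = ITEM.matcher(line);
		return m.find() ? m.group(0) : null;
	}

	public static void main(String[] args)
	{
		String line = "Product=BigMac, Price=2,50; Product=Fries, Price=1,20";
		System.out.println(items(line));      // [BigMac 2.50, Fries 1.20]
		System.out.println(wholeMatch(line)); // Product=BigMac, Price=2,50
	}
}
```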

Non-capturing groups

Sometimes not only single characters are optional, but whole words or even expressions. To apply a quantifier to such an expression without changing the numbering of the capturing groups, non-capturing groups are helpful. A non-capturing group is written (?:...) instead of (...).
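
As a minimal sketch: an optional Size field (a made-up extension of the sample input) is wrapped in (?: ) and followed by ?, so the price still sits in groups 2 and 3 whether or not the field is present. The class name NonCapturingSample and its methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingSample
{
	// (?:, Size=[^,]+)? groups the optional field without creating a group number
	private static final Pattern ITEM =
		Pattern.compile("Product=([^,]+)(?:, Size=[^,]+)?, Price=([0-9]+),([0-9]+)");

	// returns the price with a dot as decimal separator, or null if there is no match
	static String price(String line)
	{
		Matcher m = ITEM.matcher(line);
		return m.find() ? m.group(2) + "." + m.group(3) : null;
	}

	// number of capturing groups in the pattern; the (?: ) group is not counted
	static int capturingGroups()
	{
		return ITEM.matcher("").groupCount();
	}

	public static void main(String[] args)
	{
		System.out.println(price("Product=BigMac, Price=2,50"));             // 2.50
		System.out.println(price("Product=BigMac, Size=Large, Price=2,50")); // 2.50
		System.out.println(capturingGroups());                               // 3
	}
}
```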


done with
gvim by eisber _at_ eisber.net - last update 30/4/2005