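Regular expressions are a way to describe a set of strings based on the characteristics shared by every string in the set.
The java.util.regex package consists mainly of three classes: Pattern, Matcher, and PatternSyntaxException.
A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you must first invoke one of its public static compile methods, which returns a Pattern object. These methods accept a regular expression as the first argument.
A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like Pattern, Matcher defines no public constructors; you obtain a Matcher by invoking the matcher method on a Pattern object.
A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression.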
package com.fortune.test;

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created with IntelliJ IDEA.
 * User: Alan
 * Date: 12-5-28
 * Time: 1:46 PM
 */
public class RegexTestHarness {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        // Repeatedly read a regex and an input string, then report every match.
        while (true) {
            System.out.printf("%nEnter your regex: ");
            Pattern pattern = Pattern.compile(scanner.nextLine());
            System.out.printf("Enter input string to search: ");
            Matcher matcher = pattern.matcher(scanner.nextLine());
            boolean found = false;
            while (matcher.find()) {
                System.out.printf(
                        "I found the text \"%s\" starting at index %d and ending at index %d.%n",
                        matcher.group(), matcher.start(), matcher.end()
                );
                found = true;
            }
            if (!found) {
                System.out.printf("No match found.%n");
            }
        }
    }
}
Enter your regex: foo
Enter input string to search: foo
I found the text "foo" starting at index 0 and ending at index 3.

Enter your regex: foo
Enter input string to search: foofoofoo
I found the text "foo" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "foo" starting at index 6 and ending at index 9.
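The harness passes whatever the user types straight to Pattern.compile, so a malformed expression would end the program with a PatternSyntaxException. The following is a minimal sketch of catching that exception instead; the class name RegexSyntaxCheck and the method isValid are hypothetical and not part of the original harness.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexSyntaxCheck {
    // Hypothetical helper: true if the regex compiles, false on a syntax error.
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() returns a short explanation of the syntax error.
            System.out.println("Invalid regex: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("foo"));  // true
        System.out.println(isValid("[bcr")); // reports the error, then prints false
    }
}
```

Because PatternSyntaxException is unchecked, the compiler does not force this try/catch; it is only worth adding where the regex comes from user input.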
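The API also supports a number of special characters that affect how a pattern is matched. Change the regular expression to cat. and enter the input string "cats"; the output is as follows: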
Enter your regex: cat.
Enter input string to search: cats
I found the text "cats" starting at index 0 and ending at index 4.
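The match succeeds even though the input string contains no dot (.). This is because the dot is a metacharacter, that is, a character that the matcher interprets with a special meaning. The example matches because the metacharacter . means "any character".
The metacharacters supported by the API are: ( [ { \ ^ - $ | } ] ) ? * + .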
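There are two ways to force a metacharacter to be treated as an ordinary character:
1. Precede the metacharacter with a backslash.
2. Enclose it between \Q (start of quote) and \E (end of quote). With this technique, \Q and \E can be placed anywhere in the expression.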
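Both quoting techniques can be tried against the earlier cat. example. This is a small sketch; the class QuoteMetacharDemo and its found helper are hypothetical names. Pattern.quote is a standard library method that is not mentioned in the text above; it wraps its argument so that it is matched literally. Note that each regex backslash is doubled because the expressions are written as Java string literals.

```java
import java.util.regex.Pattern;

class QuoteMetacharDemo {
    // Hypothetical helper: true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("cat.", "cats"));                // true: . matches any character
        System.out.println(found("cat\\.", "cats"));              // false: \. matches only a literal dot
        System.out.println(found("cat\\.", "cat."));              // true
        System.out.println(found("\\Qcat.\\E", "cats"));          // false: everything between \Q and \E is literal
        System.out.println(found("\\Qcat.\\E", "cat."));          // true
        System.out.println(found(Pattern.quote("cat."), "cat.")); // true
    }
}
```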
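[abc] | a, b, or c (simple class)
[^abc] | Any character except a, b, or c (negation)
[a-zA-Z] | a through z, or A through Z, inclusive (range)
[a-d[m-p]] | a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] | d, e, or f (intersection)
[a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z] (subtraction)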
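Note: the word "class" in "character class" does not refer to a .class file. In the context of regular expressions, a character class is a set of characters enclosed in square brackets; it specifies the characters, any one of which can match a single character of the input string.
5.1 Simple Classes
The most basic form of a character class is a set of characters placed inside square brackets. For example, the regular expression [bcr]at matches "bat", "cat", or "rat", because it defines a character class (accepting one of "b", "c", or "r") as its first character.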
Enter your regex: [bcr]at
Enter input string to search: bat
I found the text "bat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at
Enter input string to search: cat
I found the text "cat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at
Enter input string to search: rat
I found the text "rat" starting at index 0 and ending at index 3.
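5.1.1 Negation
To match any character except those listed, insert the ^ metacharacter at the beginning of the character class.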
Enter your regex: [^bcr]at
Enter input string to search: bat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: cat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: hat
I found the text "hat" starting at index 0 and ending at index 3.
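The simple and negated classes above can also be checked directly in code rather than through the interactive harness. This sketch uses Pattern.matches, which (unlike the find loop in the harness) succeeds only when the entire input matches; the class name SimpleClassDemo and the helper matchesWhole are hypothetical.

```java
import java.util.regex.Pattern;

class SimpleClassDemo {
    // Hypothetical helper: true only if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("[bcr]at", "bat"));  // true
        System.out.println(matchesWhole("[bcr]at", "hat"));  // false: h is not in the class
        System.out.println(matchesWhole("[^bcr]at", "hat")); // true: h is anything but b, c, or r
        System.out.println(matchesWhole("[^bcr]at", "cat")); // false: c is excluded
    }
}
```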
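5.1.2 Ranges
Sometimes you want to define a character class that covers a range of values, such as the letters "a" through "h" or the digits "1" through "5". To specify a range, insert the - metacharacter between the first and last characters to be matched, for example [1-5] or [a-h]. You can also place different ranges side by side within the class to widen the set of possible matches; for example, [a-zA-Z] matches any single character from a to z (lowercase) or from A to Z (uppercase).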
Enter your regex: [a-c]
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: [a-c]
Enter input string to search: b
I found the text "b" starting at index 0 and ending at index 1.

Enter your regex: [a-c]
Enter input string to search: c
I found the text "c" starting at index 0 and ending at index 1.

Enter your regex: [a-c]
Enter input string to search: d
No match found.

Enter your regex: foo[1-5]
Enter input string to search: foo1
I found the text "foo1" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]
Enter input string to search: foo5
I found the text "foo5" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]
Enter input string to search: foo6
No match found.

Enter your regex: foo[^1-5]
Enter input string to search: foo6
I found the text "foo6" starting at index 0 and ending at index 4.
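As an illustration of side-by-side ranges, the sketch below uses [a-zA-Z] with the same find loop as the harness to keep only the letters of a string. The class RangeDemo and the method keepLetters are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeDemo {
    // Two ranges side by side: any lowercase or uppercase ASCII letter.
    private static final Pattern LETTER = Pattern.compile("[a-zA-Z]");

    // Hypothetical helper: concatenates every character of input that is in a-z or A-Z.
    static String keepLetters(String input) {
        StringBuilder letters = new StringBuilder();
        Matcher matcher = LETTER.matcher(input);
        while (matcher.find()) {
            letters.append(matcher.group());
        }
        return letters.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepLetters("foo1-Bar5")); // fooBar
    }
}
```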
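5.1.3 Unions
A union combines two or more character classes by nesting one inside another, for example [0-4[6-8]], which matches 0 through 4 or 6 through 8.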
Enter your regex: [0-4[6-8]]
Enter input string to search: 0
I found the text "0" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]
Enter input string to search: 5
No match found.

Enter your regex: [0-4[6-8]]
Enter input string to search: 6
I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]
Enter input string to search: 8
I found the text "8" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]
Enter input string to search: 9
No match found.
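To see the whole set a union accepts in one go, the following sketch tries every digit against the class. It also compares the nested form with the flat form [0-46-8], which should describe the same set. UnionDemo and acceptedDigits are hypothetical names.

```java
import java.util.regex.Pattern;

class UnionDemo {
    // Hypothetical helper: the digits 0-9 accepted by a single-character regex.
    static String acceptedDigits(String regex) {
        Pattern pattern = Pattern.compile(regex);
        StringBuilder accepted = new StringBuilder();
        for (char c = '0'; c <= '9'; c++) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                accepted.append(c);
            }
        }
        return accepted.toString();
    }

    public static void main(String[] args) {
        System.out.println(acceptedDigits("[0-4[6-8]]")); // 01234678
        System.out.println(acceptedDigits("[0-46-8]"));   // 01234678
    }
}
```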
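5.1.4 Intersections
An intersection, written with &&, matches only the characters common to all of its nested classes, for example [0-9&&[345]], which matches 3, 4, or 5.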
Enter your regex: [0-9&&[345]]
Enter input string to search: 3
I found the text "3" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]
Enter input string to search: 4
I found the text "4" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]
Enter input string to search: 5
I found the text "5" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]
Enter input string to search: 2
No match found.

Enter your regex: [0-9&&[345]]
Enter input string to search: 6
No match found.
Enter your regex: [2-8&&[4-6]]
Enter input string to search: 3
No match found.

Enter your regex: [2-8&&[4-6]]
Enter input string to search: 4
I found the text "4" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]
Enter input string to search: 5
I found the text "5" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]
Enter input string to search: 6
I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]
Enter input string to search: 7
No match found.
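The two intersections above can be enumerated in code to confirm exactly which digits survive. This is a sketch only; IntersectionDemo and acceptedDigits are hypothetical names.

```java
import java.util.regex.Pattern;

class IntersectionDemo {
    // Hypothetical helper: the digits 0-9 accepted by a single-character regex.
    static String acceptedDigits(String regex) {
        Pattern pattern = Pattern.compile(regex);
        StringBuilder accepted = new StringBuilder();
        for (char c = '0'; c <= '9'; c++) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                accepted.append(c);
            }
        }
        return accepted.toString();
    }

    public static void main(String[] args) {
        System.out.println(acceptedDigits("[0-9&&[345]]")); // 345
        System.out.println(acceptedDigits("[2-8&&[4-6]]")); // 456: only digits in both ranges
    }
}
```

In the second case 3 lies in 2-8 but not in 4-6, so it is rejected; a digit must belong to both classes to match.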
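5.1.5 Subtraction
Finally, you can use subtraction to negate one or more nested character classes, for example [0-9&&[^345]]. This builds a single character class that matches every digit from 0 to 9 except 3, 4, and 5.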
Enter your regex: [0-9&&[^345]]
Enter input string to search: 2
I found the text "2" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 3
No match found.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 4
No match found.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 5
No match found.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 6
I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 9
I found the text "9" starting at index 0 and ending at index 1.
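Subtraction is an intersection with a negated class, so the result can always be written as a plain class too. The sketch below enumerates the digit example and checks that a letter-based subtraction, [a-z&&[^m-p]], accepts the same characters as the flat class [a-lq-z]. SubtractionDemo and accepted are hypothetical names.

```java
import java.util.regex.Pattern;

class SubtractionDemo {
    // Hypothetical helper: the characters in [from, to] accepted by a single-character regex.
    static String accepted(String regex, char from, char to) {
        Pattern pattern = Pattern.compile(regex);
        StringBuilder result = new StringBuilder();
        for (char c = from; c <= to; c++) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(accepted("[0-9&&[^345]]", '0', '9')); // 0126789
        // The subtraction and the equivalent flat class accept the same letters.
        String subtracted = accepted("[a-z&&[^m-p]]", 'a', 'z');
        String flat = accepted("[a-lq-z]", 'a', 'z');
        System.out.println(subtracted.equals(flat)); // true
    }
}
```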
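The Pattern API contains a number of useful predefined character classes, which offer convenient shorthands for commonly used regular expressions.
. | Any character (may or may not match line terminators)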
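Constructs that begin with a backslash (\) are called escaped constructs. Recall the earlier discussion of escaping, where we mentioned the backslash as well as \Q and \E for quoting. To use an escaped construct inside a Java string literal, you must put an extra backslash in front of the backslash so that the string compiles, for example: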
private final String REGEX="\\d";
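The doubled backslash exists only in the Java source; the string handed to the regex engine holds a single backslash followed by d. A short sketch that makes this visible (the class EscapeDemo and the method isSingleDigit are hypothetical names, and REGEX is declared static here so that main can use it):

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Written as "\\d" in source; the actual string is the two characters \ and d.
    static final String REGEX = "\\d";

    // Hypothetical helper: true if the input is exactly one digit.
    static boolean isSingleDigit(String input) {
        return Pattern.matches(REGEX, input);
    }

    public static void main(String[] args) {
        System.out.println(REGEX);              // \d
        System.out.println(REGEX.length());     // 2
        System.out.println(isSingleDigit("7")); // true
        System.out.println(isSingleDigit("a")); // false
    }
}
```

When a regex is typed into the test harness instead, no doubling is needed, because the text read by Scanner is not a Java string literal.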
Enter your regex: .
Enter input string to search: @
I found the text "@" starting at index 0 and ending at index 1.

Enter your regex: .
Enter input string to search: 1
I found the text "1" starting at index 0 and ending at index 1.

Enter your regex: .
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \d
Enter input string to search: 1
I found the text "1" starting at index 0 and ending at index 1.

Enter your regex: \d
Enter input string to search: a
No match found.

Enter your regex: \D
Enter input string to search: 1
No match found.

Enter your regex: \D
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \s
Enter input string to search:
No match found.

Enter your regex: \s
Enter input string to search: a
No match found.

Enter your regex: \s
Enter input string to search: " "
I found the text " " starting at index 1 and ending at index 2.

Enter your regex: \S
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \w
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \w
Enter input string to search: !
No match found.

Enter your regex: \W
Enter input string to search: !
I found the text "!" starting at index 0 and ending at index 1.
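\d matches a digit character
\s matches a whitespace character
\w matches a word character
\D matches a non-digit character
\S matches a non-whitespace character
\W matches a non-word character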
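Each predefined class is shorthand for an explicit character class. The sketch below compares the shorthands with the bracketed forms given in the Pattern documentation, checking every ASCII character; it assumes the default compile flags (without UNICODE_CHARACTER_CLASS), and the bracketed equivalents are taken from that documentation rather than from the text above. PredefinedClassDemo and sameOnAscii are hypothetical names.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Hypothetical helper: true if both single-character regexes accept
    // exactly the same characters in the ASCII range 0-127.
    static boolean sameOnAscii(String regexA, String regexB) {
        Pattern a = Pattern.compile(regexA);
        Pattern b = Pattern.compile(regexB);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameOnAscii("\\d", "[0-9]"));                // true
        System.out.println(sameOnAscii("\\D", "[^0-9]"));               // true
        System.out.println(sameOnAscii("\\s", "[ \\t\\n\\x0B\\f\\r]")); // true
        System.out.println(sameOnAscii("\\S", "[^\\s]"));               // true
        System.out.println(sameOnAscii("\\w", "[a-zA-Z_0-9]"));         // true
        System.out.println(sameOnAscii("\\W", "[^\\w]"));               // true
        System.out.println(sameOnAscii("\\d", "\\w"));                  // false: \w also accepts letters and _
    }
}
```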