A Practical Guide to Regular Expressions (Part 3): Boundary Matchers, Pattern Flags, and Matcher Methods

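This article is part of the column series "From Apprentice to Expert: A Java Advancement Journey" (《从小工到专家的 Java 进阶之旅》).

Hello, I'm Kanshan (看山).

Regular expressions are a powerful tool in Java programming and show up regularly in text validation, replacement, and similar logic. In this article we take a closer look at the Java regular expression API and at how to use regular expressions in the Java language.

Regular expressions come in many "flavors", such as grep, Perl, Python, PHP, and awk. As a result, a regular expression that works in one language may not work in another. Java's regular expression syntax is closest to Perl's.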

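8. Boundary Matchers

The Java regular expression API also supports boundary matching. When we care about where in the text a match occurs, we can use boundary matchers.

To match only when the required regex occurs at the beginning of the text, we use "^". The following test passes because "dog" is at the beginning of the text: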

int matches = runTest("^dog", "dogs are friendly");

assertTrue(matches > 0);

The next example finds no match, because "dog" is not at the beginning of the text:

int matches = runTest("^dog", "are dogs are friendly?");

assertFalse(matches > 0);

To match only when the required regex occurs at the end of the text, we use "$". In the following case a match is found:

int matches = runTest("dog$", "Man's best friend is a dog");

assertTrue(matches > 0);

The next example finds no match, because "dog" is not at the end of the text:

int matches = runTest("dog$", "is a dog man's best friend?");

assertFalse(matches > 0);
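The two-argument runTest helper used in these tests comes from an earlier part of this series and is not shown here. As a minimal sketch, assuming it simply counts how many times find() succeeds (the class name AnchorDemo is only for illustration), the four anchor tests above can be reproduced like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Assumed shape of the two-argument helper: count successful find() calls.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // "^" anchors the match to the start of the input.
        System.out.println(runTest("^dog", "dogs are friendly"));           // 1
        System.out.println(runTest("^dog", "are dogs are friendly?"));      // 0
        // "$" anchors the match to the end of the input.
        System.out.println(runTest("dog$", "Man's best friend is a dog"));  // 1
        System.out.println(runTest("dog$", "is a dog man's best friend?")); // 0
    }
}
```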

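\b (written "\\b" in a Java string literal) denotes a word boundary. A "word" here is a run of \w characters, that is, letters, digits, and the underscore; whitespace is not a word character. A word boundary is the position between a word character and a non-word character, or between a word character and the start or end of the input.

int matches = runTest("\\bdog\\b", "a dog is friendly");

assertEquals(1, matches);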

The empty string at the beginning of the input is also a word boundary:

int matches = runTest("\\bdog\\b", "dog is man's best friend");

assertEquals(1, matches);

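The following example finds no match, because the second "g" in "dogg" is a word character, so there is no word boundary right after "dog":

int matches = runTest("\\bdog\\b", "snoop dogg is a rapper");

assertEquals(0, matches);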

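Two consecutive word characters have no word boundary between them. We can, however, change the end of the regex to look for a non-word boundary, so that the test passes:

int matches = runTest("\\bdog\\B", "snoop dogg is a rapper");
assertEquals(1, matches);

Here we use \B, which denotes a non-word boundary, the exact opposite of \b.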
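To make the difference between \b and \B more concrete, here is a small sketch that records where each variant matches in one sample sentence. The class name WordBoundaryDemo, the helper starts, and the sample text are made up for illustration; the underscore case relies on "_" counting as a word character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Collect the start index of every match of regex in text.
    static List<Integer> starts(String regex, String text) {
        List<Integer> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "a dog, a dog_house, a hotdog";
        // Standalone word: boundary on both sides.
        System.out.println(starts("\\bdog\\b", text)); // [2]
        // Word start, but followed by another word character ('_').
        System.out.println(starts("\\bdog\\B", text)); // [9]
        // Preceded by a word character ('t'), but at a word end.
        System.out.println(starts("\\Bdog\\b", text)); // [25]
    }
}
```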

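9. Pattern Class Methods

The compile method of Pattern accepts a set of flags that affect how matching is performed.

Let's overload the runTest method in our test class so that it accepts a flags argument: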

public static int runTest(String regex, String text, int flags) {
    Pattern pattern = Pattern.compile(regex, flags);
    Matcher matcher = pattern.matcher(text);
    int matches = 0;
    while (matcher.find()) {
        matches++;
    }
    return matches;
}

Pattern provides constants for these flags. Let's go through them one by one.

Pattern.CANON_EQ

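When this flag is specified, two characters are considered to match if, and only if, their full canonical decompositions match. It is mainly used to match a precomposed character against its component characters.

Take the accented Unicode character "é". Its precomposed code point is U+00E9, and Unicode also provides separate code points for its components: "e" (U+0065) and the combining acute accent (U+0301). With CANON_EQ enabled, the precomposed character U+00E9 matches the two-character sequence U+0065 U+0301.

By default, the following example finds no match:

int matches = runTest("\u00E9", "\u0065\u0301");

assertEquals(0, matches);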

If we set the CANON_EQ flag, the test passes:

int matches = runTest("\u00E9", "\u0065\u0301", Pattern.CANON_EQ);

assertTrue(matches > 0);
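The sketch below puts the two cases side by side and also shows why the strings differ in the first place: one is a single code point, the other is two. As a related technique that the article itself does not use, it normalizes the text with java.text.Normalizer before comparing. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class CanonEqDemo {
    // e-acute as one precomposed code point.
    static final String PRECOMPOSED = "\u00E9";
    // Plain "e" followed by the combining acute accent.
    static final String DECOMPOSED = "\u0065\u0301";

    // Does the precomposed pattern find a match in the decomposed text?
    static boolean findsWith(int flags) {
        return Pattern.compile(PRECOMPOSED, flags).matcher(DECOMPOSED).find();
    }

    // Alternative approach: normalize the text to NFC, then compare directly.
    static boolean equalAfterNfc() {
        String normalized = Normalizer.normalize(DECOMPOSED, Normalizer.Form.NFC);
        return normalized.equals(PRECOMPOSED);
    }

    public static void main(String[] args) {
        System.out.println(PRECOMPOSED.length() + " vs " + DECOMPOSED.length()); // 1 vs 2
        System.out.println(findsWith(0));                // false
        System.out.println(findsWith(Pattern.CANON_EQ)); // true
        System.out.println(equalAfterNfc());             // true
    }
}
```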

Pattern.CASE_INSENSITIVE

This flag enables case-insensitive matching.

By default, matching is case-sensitive:

int matches = runTest("dog", "This is a Dog");

assertEquals(0, matches);

We can make matching case-insensitive with the CASE_INSENSITIVE flag:

int matches = runTest("dog", "This is a Dog", Pattern.CASE_INSENSITIVE);

assertTrue(matches > 0);

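We can also use the inline modifier (?i) to make the regex ignore case from that point on, and (?-i) to turn it off again. In the example below, only the "d" is matched case-insensitively: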

int matches = runTest("(?i)d(?-i)og", "This is a Dog");

assertTrue(matches > 0);
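One detail that goes slightly beyond the examples above: by default, CASE_INSENSITIVE only folds case for US-ASCII characters. For other letters it has to be combined with UNICODE_CASE, either through a bitwise OR of the flags or with the inline form (?iu). The sketch below illustrates this with a made-up word; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class CaseFlagsDemo {
    // True if regex, compiled with the given flags, is found anywhere in text.
    static boolean found(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String regex = "\u00E9cole"; // lower-case e-acute + "cole"
        String text = "\u00C9COLE";  // upper-case E-acute + "COLE"

        // ASCII-only folding: the accented letter does not match.
        System.out.println(found(regex, text, Pattern.CASE_INSENSITIVE)); // false
        // Flags are combined with bitwise OR.
        System.out.println(found(regex, text,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));        // true
        // Inline equivalent of the two flags.
        System.out.println(found("(?iu)" + regex, text, 0));              // true
    }
}
```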

Pattern.COMMENTS

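In COMMENTS mode, the Java API lets us add comments to a regular expression with "#". This helps document complex regular expressions and makes them easier for other programmers to understand.

With the COMMENTS flag, the matcher ignores whitespace and comments in the regular expression.

In the default mode, the following finds no match:

int matches = runTest("dog$  #check for word dog at end of text", "This is a dog");

assertEquals(0, matches);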

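This is because the matcher treats the whole pattern, including the spaces and the "#" text, as part of the regex to look for in the input. With the COMMENTS flag, whitespace in the pattern is ignored, and everything from "#" to the end of the line is treated as a comment and ignored: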

int matches = runTest("dog$  #check end of text", "This is a dog", Pattern.COMMENTS);

assertTrue(matches > 0);

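There is also an inline modifier, (?x), which enables this extended mode: whitespace in the regex (spaces, tabs, newlines, and so on) is ignored, and "#" starts a comment that runs to the end of the line.

For example: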

int matches = runTest("(?x)dog$  #check end of text", "This is a dog");

assertTrue(matches > 0);
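COMMENTS mode pays off most when a longer pattern is split across several lines. The sketch below is one possible layout for a simple date pattern; the pattern, the class name, and the sample input are invented for illustration and are not a complete date validator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentsDemo {
    // Each line holds one piece of the pattern plus a comment.
    // In COMMENTS mode the spaces and the "# ..." text are ignored.
    static final Pattern DATE = Pattern.compile(
            "(\\d{4})  # year\n"
          + "-         # separator\n"
          + "(\\d{2})  # month\n"
          + "-         # separator\n"
          + "(\\d{2})  # day",
            Pattern.COMMENTS);

    // Return {year, month, day}, or null if the input is not in that shape.
    static String[] parse(String input) {
        Matcher matcher = DATE.matcher(input);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] {matcher.group(1), matcher.group(2), matcher.group(3)};
    }

    public static void main(String[] args) {
        System.out.println(String.join("/", parse("2024-03-15"))); // 2024/03/15
        System.out.println(parse("2024 03 15") == null);           // true
    }
}
```

Because whitespace is ignored in this mode, a literal space in the pattern has to be written explicitly, for example as \\s or as an escaped space.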

Pattern.DOTALL

By default, the dot "." in a regular expression matches any character except a line terminator. With the DOTALL flag, it matches line terminators as well.

First, the default behavior:

Pattern pattern = Pattern.compile("(.*)");
Matcher matcher = pattern.matcher("this is a text" + System.getProperty("line.separator")
                + " continued on another line");
matcher.find();

assertEquals("this is a text", matcher.group(1));

As we can see, only the part before the line terminator is matched. In DOTALL mode, the entire text, including the line terminator, is matched:

Pattern pattern = Pattern.compile("(.*)", Pattern.DOTALL);
Matcher matcher = pattern.matcher("this is a text" + System.getProperty("line.separator")
                + " continued on another line");
matcher.find();
assertEquals("this is a text" + System.getProperty("line.separator")
                + " continued on another line", matcher.group(1));

We can also use the inline expression (?s) to turn on "single-line" mode, which behaves the same as the DOTALL flag.

Pattern pattern = Pattern.compile("(?s)(.*)");
Matcher matcher = pattern.matcher("this is a text" + System.getProperty("line.separator")
                + " continued on another line");
matcher.find();
assertEquals("this is a text" + System.getProperty("line.separator")
                + " continued on another line", matcher.group(1));

Pattern.LITERAL

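In this mode, the matcher gives no special meaning to metacharacters, escape sequences, or any other regex syntax.

Without this flag, the following regex matches any input string: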

int matches = runTest("(.*)", "text");

assertTrue(matches > 0);

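With the LITERAL flag, a match is found only if the input contains the pattern text literally. Here there is no match, because the matcher looks for the literal text (.*) instead of interpreting it:

int matches = runTest("(.*)", "text", Pattern.LITERAL);

assertEquals(0, matches);

If the input string in the example above were text(.*), matches would be 1.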
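A typical use of LITERAL is searching for text that happens to contain regex metacharacters. The sketch below uses the made-up input "1+1=2" and also shows Pattern.quote, which escapes a string so that it can be embedded in a larger, non-literal pattern. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class LiteralDemo {
    // True if regex, compiled with the given flags, is found anywhere in text.
    static boolean found(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "1+1=2";
        // As a regex, "1+" means "one or more 1s", so the text is not found.
        System.out.println(found("1+1=2", text, 0));                // false
        // LITERAL: every character in the pattern stands for itself.
        System.out.println(found("1+1=2", text, Pattern.LITERAL));  // true
        // Pattern.quote escapes just this piece of a larger pattern.
        System.out.println(found("^" + Pattern.quote("1+1") + "=\\d$", text, 0)); // true
    }
}
```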

Pattern.MULTILINE

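By default, the "^" and "$" metacharacters match only at the beginning and end of the entire input string ("$" also matches just before a final line terminator). Line terminators inside the input are not treated as line boundaries:

int matches = runTest("dog$",
        "This is a dog" + System.getProperty("line.separator") + "this is a fox");

assertEquals(0, matches);

This match fails because the matcher looks for "dog" at the end of the whole string, whereas here "dog" sits at the end of the first of two lines.

With the MULTILINE flag, "^" and "$" also match at the beginning and end of each line, so "$" matches just before the line terminator: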

int matches = runTest("dog$",
        "This is a dog" + System.getProperty("line.separator") + "this is a fox",
        Pattern.MULTILINE);

assertTrue(matches > 0);

We can achieve the same with the inline expression (?m):

int matches = runTest("(?m)dog$",
        "This is a dog" + System.getProperty("line.separator") + "this is a fox");

assertTrue(matches > 0);
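A common use of MULTILINE is applying one pattern to every line of a block of text. The sketch below counts the lines that consist of a single word; the sample text, the class name, and the count helper are assumptions made for this example, and "\n" is used as the line terminator to keep the input fixed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Count how many times regex, compiled with the given flags, is found in text.
    static int count(String regex, String text, int flags) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "dog\ncat\nfox";
        // Default: "^" and "$" refer to the whole input, which is not one word.
        System.out.println(count("^\\w+$", text, 0));                 // 0
        // MULTILINE: "^" and "$" apply to each line in turn.
        System.out.println(count("^\\w+$", text, Pattern.MULTILINE)); // 3
        // Inline form of the same flag.
        System.out.println(count("(?m)^\\w+$", text, 0));             // 3
    }
}
```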

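10. Matcher Class Methods

(1) Index Methods

Index methods return index values that tell us exactly where in the input string a match was found.

In the following test, we confirm the start and end indices of the match for "dog" in the input string. Note that end() returns the index just past the last matched character: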

Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("This dog is mine");
matcher.find();

assertEquals(5, matcher.start());
assertEquals(8, matcher.end());
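When a pattern occurs more than once, start() and end() always describe the most recent successful find(). As a small sketch (the spans helper and the class name are hypothetical), we can walk over all matches and record each position:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexDemo {
    // Record "start-end" for every match; end is exclusive.
    static List<String> spans(String regex, String text) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            spans.add(matcher.start() + "-" + matcher.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "dogs are domestic animals, dogs are friendly";
        System.out.println(spans("dog", text)); // [0-3, 27-30]
        // substring(start, end) yields exactly the matched text.
        System.out.println(text.substring(27, 30)); // dog
    }
}
```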

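(2) Lookup Methods

Lookup methods examine the input string and return a boolean indicating whether the pattern was found. The commonly used ones are matches and lookingAt.

Both matches and lookingAt try to match the input sequence against the pattern. The difference is that matches requires the entire input sequence to match, while lookingAt does not.

Both methods start at the beginning of the input string: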

Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("dogs are friendly");

assertTrue(matcher.lookingAt());
assertFalse(matcher.matches());

In the following case, matches returns true:

Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("dog");

assertTrue(matcher.matches());
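It can help to see matches and lookingAt next to find, which is what runTest uses and which searches anywhere in the input. The sketch below uses a fresh Matcher for each call so that the three results do not influence one another; the describe helper and the class name are made up for this comparison.

```java
import java.util.regex.Pattern;

class LookupDemo {
    static final Pattern DOG = Pattern.compile("dog");

    // Summarize the three lookup styles for one input.
    static String describe(String text) {
        boolean whole = DOG.matcher(text).matches();     // entire input must match
        boolean prefix = DOG.matcher(text).lookingAt();  // match must start at index 0
        boolean anywhere = DOG.matcher(text).find();     // match may start anywhere
        return "matches=" + whole + " lookingAt=" + prefix + " find=" + anywhere;
    }

    public static void main(String[] args) {
        System.out.println(describe("dog"));               // matches=true lookingAt=true find=true
        System.out.println(describe("dogs are friendly")); // matches=false lookingAt=true find=true
        System.out.println(describe("my dog"));            // matches=false lookingAt=false find=true
    }
}
```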

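(3) Replacement Methods

Replacement methods replace text in the input string. The commonly used ones are replaceFirst and replaceAll.

replaceFirst and replaceAll replace text that matches the given regular expression. As the names suggest, replaceFirst replaces the first match and replaceAll replaces every match: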

Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("dogs are domestic animals, dogs are friendly");
String newStr = matcher.replaceFirst("cat");

assertEquals("cats are domestic animals, dogs are friendly", newStr);

Replacing all matches:

Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("dogs are domestic animals, dogs are friendly");
String newStr = matcher.replaceAll("cat");

assertEquals("cats are domestic animals, cats are friendly", newStr);

The replaceAll method lets us replace every match with the same replacement text.

The replaceFirst and replaceAll methods of the String class use these Matcher methods under the hood.
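One thing worth knowing about both the Matcher and the String variants is that the replacement string is not plain text: "$1" refers to capture group 1, and a literal "$" has to be escaped, for example with Matcher.quoteReplacement. The sketch below shows both cases; the sample sentences and the class and method names are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    static final Pattern COUNTED_DOGS = Pattern.compile("(\\d+) dogs");

    // Keep the captured number and swap the animal, for every match.
    static String dogsToCats(String text) {
        return COUNTED_DOGS.matcher(text).replaceAll("$1 cats");
    }

    // Insert a replacement verbatim, even if it contains '$' or '\'.
    static String replaceWord(String text, String word, String replacement) {
        return Pattern.compile(word, Pattern.LITERAL)
                .matcher(text)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(dogsToCats("3 dogs and 12 dogs")); // 3 cats and 12 cats
        // The String method accepts the same regex and replacement syntax.
        System.out.println("3 dogs and 12 dogs".replaceFirst("(\\d+) dogs", "$1 cats")); // 3 cats and 12 dogs
        System.out.println(replaceWord("price: high", "price", "$5")); // $5: high
    }
}
```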

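Summary

Regular expressions are a powerful development tool that can greatly improve our productivity when used well. This article covered boundary matchers, Pattern flags, and Matcher methods.

The green hills stay and the rivers keep flowing. See you next time.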