Regular Expressions: From "Beginner" to "Beginner"

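1. Overview

A regular expression (Chinese: 正则表达式, also rendered as 正规表示式, 正规表示法, 正规表达式, 规则表达式, or 常规表示法; usually abbreviated in code as regex, regexp, or RE) is a concept from computer science. A regular expression is a single string that describes, and is used to match, a set of strings that conform to some syntactic rule. In many text editors, regular expressions are used to search for and replace text that matches a pattern.

The "Regular" in Regular Expression is usually translated into Chinese as 「正则」, 「正规」, or 「常规」. Here "regular" carries the sense of "rule" or "pattern", so a Regular Expression is "an expression that describes some rule".

This article introduces the basic syntax of regular expressions. All code is written in Python; environment: Python 2.7 + the re module. For the Python API for working with regular expressions, see: "How to use regular expressions in Python".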

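2. Syntax rules

2.1 Metacharacters

2.1.1 Rules

\s: matches whitespace, including spaces, \t, \n, and so on.
\d: matches a digit, 0-9.
\w: matches a letter, a digit, or an underscore.
\b: matches a word boundary, i.e. the position between a \w character and a non-\w character, including the start or end of the string when a word character sits there.
.: matches any character except a newline.
^: matches the start of the string.
$: matches the end of the string.
[]: matches any one of the characters listed inside the brackets; for example, [abc] matches a, b, or c.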
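
The rules above can be tried out with a short sketch like the one below. The sample string and variable names are made up for illustration; only re.findall, re.search, and re.match from the standard re module are used, and the code is written so that it runs under both Python 2.7 and Python 3.

```python
import re

# Hypothetical sample text: words, digits, a tab and a newline.
text = "cat 42\tcar car_7\ncart"

whitespace = re.findall(r"\s", text)        # space, tab, space, newline
digits = re.findall(r"\d", text)            # every single digit
words = re.findall(r"\w+", text)            # runs of letters, digits, underscores
whole_car = re.findall(r"\bcar\b", text)    # "car" only as a whole word
cat_or_car = re.findall(r"ca[tr]", text)    # "ca" followed by t or r
starts_with_cat = re.search(r"^cat", text) is not None
ends_with_cart = re.search(r"cart$", text) is not None
first_line = re.match(r".+", text).group()  # "." stops at the newline

print(digits)       # ['4', '2', '7']
print(words)        # ['cat', '42', 'car', 'car_7', 'cart']
print(whole_car)    # ['car']
print(cat_or_car)   # ['cat', 'car', 'car', 'car']
print(first_line == "cat 42\tcar car_7")  # True
```

Note that \bcar\b skips both car_7 and cart: the underscore and the letter t are \w characters, so there is no boundary after "car" in either word.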

2.1.2 Examples

Match "Hello" in "HelloWorld"

import re

s = "HelloWorld"
p = r"Hello"
rs = re.match(p, s)  # re.match only matches at the start of the string
print(rs.group())  # Hello

Match a mobile phone number

import re

s = "13977889988"
p = r"1\d\d\d\d\d\d\d\d\d\d"  # a 1 followed by ten digits
rs = re.match(p, s)
if rs is not None:
    print(rs.group())  # 13977889988
else:
    print("no matched")

Match two-digit numbers that do not start with 0

import re

s = "02 33 45 87 09"
p = r"[1-9]\d"  # first digit 1-9, second digit anything
print(re.findall(p, s))  # ['33', '45', '87']

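Match three-character strings that start with a lowercase letter

import re

s = "adf Bdc A45 e87 c09"
p = r"[a-z]\w\w"  # a lowercase letter followed by two word characters
print(re.findall(p, s))  # ['adf', 'e87', 'c09']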

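2.2 Quantifiers

Quantifiers specify how many times the preceding character or group may occur.

2.2.1 Rules

*: repeat zero or more times.
+: repeat one or more times.
?: repeat zero times or once.
{n}: repeat exactly n times.
{m,n}: repeat between m and n times (no space inside the braces).
{n,}: repeat n or more times.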
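
As a rough illustration of the quantifier rules, the sketch below checks a few made-up strings against made-up patterns. matches_fully is a hypothetical helper, not part of the re module; it anchors the pattern with \Z so that the whole string has to match, and it works under both Python 2.7 and Python 3.

```python
import re

def matches_fully(pattern, s):
    """Hypothetical helper: True if the whole string s matches pattern."""
    return re.match("(?:" + pattern + r")\Z", s) is not None

# ?  : the "u" may appear zero times or once
print(matches_fully(r"colou?r", "color"))    # True
print(matches_fully(r"colou?r", "colour"))   # True

# *  : zero or more "o";  + : at least one "o"
print(matches_fully(r"go*gle", "ggle"))      # True
print(matches_fully(r"go+gle", "ggle"))      # False

# {n}, {m,n}, {n,} : explicit repetition counts
print(matches_fully(r"go{3}gle", "gooogle"))                 # True
print(matches_fully(r"go{2,4}gle", "google"))                # True
print(matches_fully(r"go{2,4}gle", "g" + "o" * 5 + "gle"))   # False
print(matches_fully(r"go{3,}gle", "g" + "o" * 6 + "gle"))    # True
```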

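2.2.2 Examples

Earlier, matching a phone number required writing \d ten times. With quantifiers it can be written like this:

import re

s = "13977889988"
p = r"1\d{10}"  # a 1 followed by exactly ten digits
rs = re.match(p, s)
if rs is not None:
    print(rs.group())  # 13977889988
else:
    print("no matched")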

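Find all three-digit numbers in a string

import re

s = "100 88 9 112 9998 197 9876 77"
# \b on both sides keeps \d{3} from matching inside longer numbers such as 9998
p = r"\b\d{3}\b"
print(re.findall(p, s))  # ['100', '112', '197']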

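Find all words that are 4 to 5 characters long

import re

s = "abc hello defg world higkli"
# \b on both sides keeps the pattern from matching part of a longer word such as higkli
p = r"\b\w{4,5}\b"
print(re.findall(p, s))  # ['hello', 'defg', 'world']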

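2.3 Negation

2.3.1 Rules

\W: matches any character that is not a letter, a digit, or an underscore.
\D: matches any non-digit.
\S: matches any non-whitespace character.
\B: matches a position that is not a word boundary.
[^x]: matches any character except x.
[^abc]: matches any character except a, b, and c.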
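
A small sketch of the negated classes in use. The input strings are invented for illustration, and re.sub and re.split are standard re functions that this article does not otherwise cover; the code runs under both Python 2.7 and Python 3.

```python
import re

# \D: strip everything that is not a digit from a loosely formatted number
raw_phone = "166-7788 8877"
digits_only = re.sub(r"\D", "", raw_phone)
print(digits_only)  # 16677888877

# \W: split on runs of non-word characters
sentence = "hello, world! 123"
tokens = re.split(r"\W+", sentence)
print(tokens)  # ['hello', 'world', '123']

# [^...]: every character that is neither a vowel nor whitespace
consonants = re.findall(r"[^aeiou\s]", "a quick fox")
print(consonants)  # ['q', 'c', 'k', 'f', 'x']

# \B: "cat" only where it does NOT start a word (inside "concat")
inner = re.search(r"\Bcat", "cat concat")
print(inner.start())  # 7
```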

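2.3.2 Examples

Find all the non-whitespace parts of a string

import re

s = "hello 123 {world} world [456] "
p1 = r"\S+"  # one or more non-whitespace characters
# Braces and brackets are non-whitespace too, so they stay in the results
print(re.findall(p1, s))  # ['hello', '123', '{world}', 'world', '[456]']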

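Find all three-digit numbers that do not start with 0

import re

s = "012 3456 789 1011 999"
# [^0] alone would also match a space, so exclude non-digits as well:
# [^0\D] means "neither 0 nor a non-digit", i.e. a digit from 1 to 9
p = r"\b[^0\D]\d{2}\b"
print(re.findall(p, s))  # ['789', '999']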

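2.4 Groups

2.4.1 Defining groups

Compared with the material above, groups are the advanced part of regex syntax and are somewhat more complex. In my view they are also the essence of regular expressions: once you are comfortable with groups, regular expressions become very flexible.

To capture part of a match as a group, just wrap it in parentheses. For example, matching the tags in a piece of HTML:

import re

s = "<html><head>This is regex</head></html>"
p = r"<(\w+)><(\w+)>([\w\s]*)<(/\w+)><(/\w+)>"
rs = re.match(p, s)
if rs is not None:
    print(rs.groups())  # ('html', 'head', 'This is regex', '/head', '/html')
    print(rs.group(1))  # html
    print(rs.group(2))  # head
    print(rs.group(3))  # This is regex
    print(rs.group(4))  # /head
    print(rs.group(5))  # /html
else:
    print("no matched")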

Here the pattern <(\w+)><(\w+)>([\w\s]*)<(/\w+)><(/\w+)> is split into five groups, one for each pair of parentheses, numbered from left to right.
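
The same idea applies to any structured text. As a minimal sketch, assuming a made-up date string in YYYY-MM-DD form, groups can pull out the individual fields (runs under both Python 2.7 and Python 3):

```python
import re

# Hypothetical input: a date in YYYY-MM-DD form
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2024-03-15")

whole = m.group(0)             # group 0 is always the entire match
year, month, day = m.groups()  # groups 1..3, in left-to-right order

print(whole)       # 2024-03-15
print(m.groups())  # ('2024', '03', '15')
print(year)        # 2024
```

Groups always come back as strings; converting them with int() is left to the caller.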

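2.4.2 Backreferences

Suppose the string in the HTML example above is changed to <head><html>This is regex</head></html> and matched with the same regular expression. It still matches, giving ("head", "html", "This is regex", "/head", "/html"). In a web page, however, this markup is wrong, because the opening and closing tags are not properly paired.

How can this be fixed? This is where backreferences come in. A backreference refers to the string that an earlier group matched during the current match. The syntax is a backslash followed by the group number; for example, \1 refers to the first group, which in the example above is the string html. With backreferences, the HTML match can be rewritten as:

import re

s = "<html><head>This is regex</head></html>"
# \2 must repeat what group 2 matched, and \1 what group 1 matched
p = r"<(\w+)><(\w+)>([\w\s]*)<(/\2)><(/\1)>"
rs = re.match(p, s)
if rs is not None:
    print(rs.groups())  # ('html', 'head', 'This is regex', '/head', '/html')
else:
    print("no matched")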

Now, if the string is changed to <head><html>This is regex</head></html>, nothing matches and the program prints no matched.
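
Backreferences are also handy outside of HTML. The sketch below, using an invented sentence, finds accidentally doubled words with \1 and collapses them with re.sub (a standard re function not otherwise covered in this article). It runs under both Python 2.7 and Python 3.

```python
import re

# Hypothetical sentence with a few doubled words
text = "this is is a test test of the the parser"

# (\w+) captures a word; " \1" requires the very same word again
doubled = re.findall(r"\b(\w+) \1\b", text)
print(doubled)  # ['is', 'test', 'the']

# In the replacement string, \1 inserts what group 1 matched
cleaned = re.sub(r"\b(\w+) \1\b", r"\1", text)
print(cleaned)  # this is a test of the parser
```

The leading \b matters: without it, the "is" at the end of "this" followed by " is" would also count as a doubled word.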

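2.4.3 Naming groups

A group can be referenced in two ways: by its number, or by a name. The rules for names are:

(?P<name>...): gives the group the name "name".
(?P=name): refers to the string matched by the group called "name".

Matching the HTML tags using names: r"<(?P<html_tag>\w+)><(?P<head_tag>\w+)>([\w\s]*)</(?P=head_tag)></(?P=html_tag)>"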
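
A runnable sketch of the named-group pattern, applied to the correctly nested string and to the swapped one from section 2.4.2. The variable names good and bad are made up for illustration; the code runs under both Python 2.7 and Python 3.

```python
import re

pattern = (r"<(?P<html_tag>\w+)><(?P<head_tag>\w+)>([\w\s]*)"
           r"</(?P=head_tag)></(?P=html_tag)>")

good = "<html><head>This is regex</head></html>"
bad = "<head><html>This is regex</head></html>"

m = re.match(pattern, good)
print(m.group("html_tag"))  # html
print(m.group("head_tag"))  # head
print(m.group(3))           # This is regex (unnamed groups keep their number)

# Named groups are numbered as well, so both lookups agree
print(m.group(1) == m.group("html_tag"))  # True

# The closing tags are in the wrong order, so nothing matches
print(re.match(pattern, bad) is None)     # True
```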

2.5 Greedy and non-greedy matching

Suppose we have the following string:

MaQian HuNan 168-8877-7788

I want to extract the phone number from it, using the regular expression (.+)(\d*-\d*-\d*)

import re

s = "MaQian HuNan 168-8877-7788"
p = r"(.+)(\d*-\d*-\d*)"
rs = re.match(p, s)
if rs is None:
    print("no matched")
else:
    print(rs.group(1))
    print(rs.group(2))

The expected result is:

MaQian HuNan
168-8877-7788

When we actually run the program, however, the result is:

MaQian HuNan 168
-8877-7788

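That is not what we expected. Why? On closer inspection, the string 168 also falls within what .+ can match, so 168 ended up in the first group; and because \d* is allowed to match zero digits, the second group is satisfied with -8877-7788. This is regex greediness. Greedy means "as much as possible": as long as the overall match still succeeds, the current sub-pattern consumes as many characters as it can. Regular expressions are greedy by default. To turn greediness off, append a ? to the quantifier. The rules are:

*?: repeat zero or more times, but as few times as possible.
+?: repeat one or more times, but as few times as possible.
??: repeat zero times or once, but as few times as possible.

So writing the regex above as (.+?)(\d+-\d+-\d+) produces the expected output.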
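
The difference can be checked side by side with a short sketch. It reuses the sample string from this section; the second example with the <a><b> string is made up for illustration. It runs under both Python 2.7 and Python 3.

```python
import re

s = "MaQian HuNan 168-8877-7788"

# Greedy: .+ grabs as much as it can and gives back only what it must
greedy = re.match(r"(.+)(\d+-\d+-\d+)", s)
print(greedy.group(1))  # MaQian HuNan 16
print(greedy.group(2))  # 8-8877-7788

# Non-greedy: .+? stops as soon as the rest of the pattern can match
lazy = re.match(r"(.+?)(\d+-\d+-\d+)", s)
print(lazy.group(1).rstrip())  # MaQian HuNan
print(lazy.group(2))           # 168-8877-7788

# The same effect on a tag-like string
print(re.findall(r"<.+>", "<a><b>"))   # ['<a><b>']
print(re.findall(r"<.+?>", "<a><b>"))  # ['<a>', '<b>']
```

Even with \d+ in place of \d*, the greedy version still takes two of the three leading digits, so the ? after the quantifier is what fixes the result. The non-greedy first group keeps the trailing space before the number, hence the rstrip() call.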
