About Regular Expressions

Intro

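I recently ran into some string-matching problems at work and naturally reached for regular expressions. Long ago I read a few books and articles on the subject, but knowledge that never gets applied to a real problem gradually fades.

A regular expression (regex) is a concept from computer science: a single string, the pattern, that describes and matches a set of strings conforming to some syntactic rule.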

Repeaters

A repeater indicates how many times the previous character has to be repeated. When put just after a group, it indicates how many times the group has to be repeated.

* Zero or more times.
+ One or more times.
? Zero or one times.
{m,n} At least m and at most n times.

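Repeaters are greedy by default: each one matches as many repetitions as it can while still letting the rest of the pattern match. Appending ? to a repeater (*?, +?, ??, {m,n}?) makes it lazy, so it matches as few repetitions as it can.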
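The effect of greediness is easiest to see side by side. The following is a minimal sketch using Python's re module; the sample strings are made up for illustration. It also shows the lazy form (a repeater followed by ?), which matches as little as possible.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
greedy = re.search(r"<.*>", text).group(0)

# Lazy: .*? grabs as little as possible, so the match stops at the first ">".
lazy = re.search(r"<.*?>", text).group(0)

# {m,n} is bounded but still greedy: it takes n repetitions when they are available.
bounded = re.search(r"a{2,3}", "aaaaa").group(0)

print(greedy)   # <b>bold</b> and <i>italic</i>
print(lazy)     # <b>
print(bounded)  # aaa
```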

Wildcards

In software, a wildcard character is a single character, such as an asterisk *,
used to represent a number of characters or an empty string.
It is often used in file searches so the full name need not be typed.

. Matches anything, except the newline character.
\d Matches any digit. Equal to [0-9].
\D Matches any non-digit. Equal to [^0-9].
\s Matches any whitespace. Equal to [ \t\n\r\f\v].
\S Matches any non-whitespace. Equal to [^ \t\n\r\f\v].
\w Matches any word character (alphanumeric or underscore). Equal to [a-zA-Z0-9_] if ASCII.
\W Matches any non-word character. Equal to [^a-zA-Z0-9_] if ASCII.
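A short sketch of these shorthands in Python's re module may help; the sample string is invented for illustration. Note that \w also accepts the underscore, and that the re.DOTALL flag changes the behaviour of the dot.

```python
import re

sample = "Order_42 shipped\ton 2024-01-05"

# \d+ : runs of digits.
digits = re.findall(r"\d+", sample)

# \w+ : runs of word characters; the underscore counts as one.
words = re.findall(r"\w+", sample)

# \s+ : runs of whitespace (here a space and a tab), used as a separator.
pieces = re.split(r"\s+", sample)

# By default the dot skips the newline; with re.DOTALL it matches it too.
dot = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", re.DOTALL)

print(digits)   # ['42', '2024', '01', '05']
print(words)    # ['Order_42', 'shipped', 'on', '2024', '01', '05']
print(pieces)   # ['Order_42', 'shipped', 'on', '2024-01-05']
print(dot)      # ['a', 'b']
print(dot_all)  # ['a', '\n', 'b']
```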

Character classes

A character class defines a set of characters to be matched. The character class [abc] matches either char a, b or c. The characters in the class are OR-ed, not AND-ed. A shorthand notation would be [a-c].

[.] Loses special nature.
[\d] Keeps special nature.
[\D] Keeps special nature (matches any non-digit).
[\s] Keeps special nature.
[\S] Keeps special nature.
[\w] Keeps special nature.
[\W] Keeps special nature.
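The table above can be checked with a few lines of Python. This is only a sketch with made-up input strings: the dot becomes a literal inside a class, while a shorthand such as \d keeps its meaning and can be mixed with ranges.

```python
import re

# Inside a class the dot is a literal dot, not a wildcard.
literal_dot = re.findall(r"[.]", "a.b.c")

# \d keeps its special nature inside a class and combines with the range a-f.
hex_like = re.findall(r"[\da-f]+", "ff 10 zz 3c")

# A leading ^ negates the class: any character that is not a vowel.
non_vowels = re.findall(r"[^aeiou]", "regex")

print(literal_dot)  # ['.', '.']
print(hex_like)     # ['ff', '10', '3c']
print(non_vowels)   # ['r', 'g', 'x']
```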

Character groups

A character group forms one unit. The group (abc) matches the exact substring abc and can be backreferenced by its group number.

Please do not confuse character groups with classes! This overview might help:

  • abc : matches the exact substring “abc”.
  • (abc) : matches the exact substring “abc”, and can be backreferenced.
  • [abc] : matches character a, b or c.
>>> import re
>>> p = re.compile(r"(a(b)c)d")
>>> m = p.match("abcdefghijklmn")
>>> m.group(0)  # Group 0 is by default the whole match.
'abcd'
>>> m.group(1)  # The group belonging to the first ( encountered.
'abc'
>>> m.group(2)  # The group belonging to the second ( encountered.
'b'
>>> m.groups()  # Don't forget the s.
('abc', 'b')
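Backreferencing a group by its number can be sketched as follows. The doubled-word pattern and the sample sentence are illustrative assumptions, not taken from the session above: \1 inside the pattern matches whatever group 1 captured, and \1 in the replacement string of re.sub inserts it.

```python
import re

# \1 must match the same text that group 1 captured, so this finds a doubled word.
doubled = re.compile(r"\b(\w+) \1\b")

sentence = "this is is a test"
m = doubled.search(sentence)

# In the replacement, \1 stands for the captured word, collapsing the pair to one.
fixed = doubled.sub(r"\1", sentence)

print(m.group(0))  # is is
print(m.group(1))  # is
print(fixed)       # this is a test
```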

Zero-width assertions

Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions, just like the start-of-line and end-of-line anchors (^ and $) and the word-boundary anchor (\b). The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called “assertions”. They do not consume characters in the string, but only assert whether a match is possible or not. Lookaround lets you write regular expressions that would be impossible, or very long-winded, without it.

A zero-width assertion locates a position in the string, and what it matches is itself zero-width. Zero-width assertions come in two kinds, lookahead and lookbehind, which together are called lookaround.

Positive and Negative Lookahead

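Positive lookahead is written q(?=exp). It asserts that the text immediately after the given position matches exp, so q matches only where exp follows.

Negative lookahead is written q(?!exp). It asserts that the text immediately after the given position does not match exp.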
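As a rough illustration, take q as the literal letter q and exp as the literal letter u (both choices, and the sample string, are assumptions made for this sketch). The lookahead decides where q may match, but the u never becomes part of the match.

```python
import re

text = "quit Iraq qat"

# q(?=u): a q that is followed by u.
followed = [m.start() for m in re.finditer(r"q(?=u)", text)]

# q(?!u): a q that is not followed by u (including a q at the end of a word).
not_followed = [m.start() for m in re.finditer(r"q(?!u)", text)]

# The assertion is zero-width: only the q itself is returned.
matched = re.search(r"q(?=u)", text).group(0)

print(followed)      # [0]
print(not_followed)  # [8, 10]
print(matched)       # q
```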

Positive and Negative Lookbehind

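Positive lookbehind is written (?<=exp). It asserts that the text immediately before the given position matches exp.

Negative lookbehind is written (?<!exp). It asserts that the text immediately before the given position does not match exp.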
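A small sketch of both lookbehind forms, using an invented price string and a dollar sign as exp. One caveat worth knowing: Python's re module only accepts lookbehind patterns of fixed width, which a single \$ satisfies. The \b in the negative case is an addition of this sketch; without it the pattern would match the trailing digits of a dollar amount.

```python
import re

prices = "cost: $30, tax: 5, total: $35"

# (?<=\$)\d+ : digits preceded by a dollar sign; the sign itself is not consumed.
with_dollar = re.findall(r"(?<=\$)\d+", prices)

# (?<!\$)\b\d+ : a whole number that is not preceded by a dollar sign.
without_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(with_dollar)     # ['30', '35']
print(without_dollar)  # ['5']
```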

Reference

https://qscintilla.com/zero-width-assertions/

https://www.regular-expressions.info/lookaround.html
