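Unicode character properties
Since PHP 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are:
- \p{xx}: a character with the xx property
- \P{xx}: a character without the xx property
- \X: an extended Unicode sequence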
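The property names represented by xx above are limited to the Unicode general category properties.
Each character has exactly one such property, specified by a two-letter abbreviation.
For compatibility with Perl, negation can be specified by placing a circumflex (^) immediately after the opening curly bracket. For example, \p{^Lu} is the same as \P{Lu}.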
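The two negation forms can be compared with a minimal sketch. It assumes the "u" pattern modifier is used to select UTF-8 mode and PHP 7 or later for the "\u{...}" string escapes; the helper name matchesWhole is hypothetical and only serves the illustration.

```php
<?php
// Hypothetical helper: true if $char as a whole matches $pattern in UTF-8 mode.
function matchesWhole(string $pattern, string $char): bool
{
    return preg_match('/^' . $pattern . '$/u', $char) === 1;
}

$upper = "\u{00C9}"; // É, general category Lu
$lower = "\u{00E9}"; // é, general category Ll

$upperIsLu     = matchesWhole('\p{Lu}', $upper);   // has the Lu property
$lowerIsNotLu  = matchesWhole('\P{Lu}', $lower);   // lacks the Lu property
$lowerIsNotLu2 = matchesWhole('\p{^Lu}', $lower);  // Perl-style spelling of \P{Lu}
$upperIsNotLu  = matchesWhole('\p{^Lu}', $upper);  // negation rejects an Lu character

var_dump($upperIsLu, $lowerIsNotLu, $lowerIsNotLu2, $upperIsNotLu);
```

The last two checks show that \p{^Lu} accepts and rejects the same characters as \P{Lu}.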
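If only one letter is specified with \p or \P, it includes all the properties that start with that letter.
In this case the curly brackets in the escape sequence are optional, so \p{L} and \pL are equivalent.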
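As a small illustration of the single-letter form, the sketch below counts letters with both spellings. It assumes UTF-8 mode via the "u" modifier; the sample string is made up for the example.

```php
<?php
$text = "PHP 8 \u{00E9}t\u{00E9}"; // "PHP 8 été"

// \pL and \p{L} both mean "any letter", i.e. Ll, Lm, Lo, Lt or Lu.
$shortCount  = preg_match_all('/\pL/u', $text, $shortMatches);
$bracedCount = preg_match_all('/\p{L}/u', $text, $bracedMatches);

var_dump($shortCount === $bracedCount, $shortMatches === $bracedMatches);
```

Both patterns find the same six letters and skip the digit and the spaces.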
Supported Unicode properties
Property | Matches | Notes
C | Other |
Cc | Control |
Cf | Format |
Cn | Unassigned |
Co | Private use |
Cs | Surrogate |
L | Letter | Includes the following properties: Ll, Lm, Lo, Lt and Lu.
Ll | Lower case letter |
Lm | Modifier letter |
Lo | Other letter |
Lt | Title case letter |
Lu | Upper case letter |
M | Mark |
Mc | Spacing mark |
Me | Enclosing mark |
Mn | Non-spacing mark |
N | Number |
Nd | Decimal number |
Nl | Letter number |
No | Other number |
P | Punctuation |
Pc | Connector punctuation |
Pd | Dash punctuation |
Pe | Close punctuation |
Pf | Final punctuation |
Pi | Initial punctuation |
Po | Other punctuation |
Ps | Open punctuation |
S | Symbol |
Sc | Currency symbol |
Sk | Modifier symbol |
Sm | Mathematical symbol |
So | Other symbol |
Z | Separator |
Zl | Line separator |
Zp | Paragraph separator |
Zs | Space separator |
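To show how the single-letter groups in the table can be used, here is a minimal sketch that counts the characters of a string per top-level category. The function name countCategories and the sample string are hypothetical; the "u" modifier is assumed so that the input is read as UTF-8.

```php
<?php
// Hypothetical helper: count the characters of $text in a few top-level categories.
function countCategories(string $text): array
{
    $counts = [];
    foreach (['L', 'N', 'P', 'S', 'Z'] as $category) {
        // Builds patterns such as /\p{L}/u, one per category.
        $counts[$category] = preg_match_all('/\p{' . $category . '}/u', $text);
    }
    return $counts;
}

$counts = countCategories("Total: 42 \u{20AC}!"); // "Total: 42 €!"
print_r($counts);
```

Each count comes from a separate pass over the string, which keeps the sketch short at the cost of some repeated work.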
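Extended properties such as InMusicalSymbols are not supported by PCRE.
Specifying case-insensitive matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.
Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example, \p{Greek} matches a character in the Greek script, and \P{Han} matches a character that is not in the Han script.
Characters that are not part of an identified script are lumped together as Common. The current list of scripts is: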
Supported scripts
Arabic |
Armenian |
Avestan |
Balinese |
Bamum |
Batak |
Bengali |
Bopomofo |
Brahmi |
Braille |
Buginese |
Buhid |
Canadian_Aboriginal |
Carian |
Chakma |
Cham |
Cherokee |
Common |
Coptic |
Cuneiform |
Cypriot |
Cyrillic |
Deseret |
Devanagari |
Egyptian_Hieroglyphs |
Ethiopic |
Georgian |
Glagolitic |
Gothic |
Greek |
Gujarati |
Gurmukhi |
Han |
Hangul |
Hanunoo |
Hebrew |
Hiragana |
Imperial_Aramaic |
Inherited |
Inscriptional_Pahlavi |
Inscriptional_Parthian |
Javanese |
Kaithi |
Kannada |
Katakana |
Kayah_Li |
Kharoshthi |
Khmer |
Lao |
Latin |
Lepcha |
Limbu |
Linear_B |
Lisu |
Lycian |
Lydian |
Malayalam |
Mandaic |
Meetei_Mayek |
Meroitic_Cursive |
Meroitic_Hieroglyphs |
Miao |
Mongolian |
Myanmar |
New_Tai_Lue |
Nko |
Ogham |
Old_Italic |
Old_Persian |
Old_South_Arabian |
Old_Turkic |
Ol_Chiki |
Oriya |
Osmanya |
Phags_Pa |
Phoenician |
Rejang |
Runic |
Samaritan |
Saurashtra |
Sharada |
Shavian |
Sinhala |
Sora_Sompeng |
Sundanese |
Syloti_Nagri |
Syriac |
Tagalog |
Tagbanwa |
Tai_Le |
Tai_Tham |
Tai_Viet |
Takri |
Tamil |
Telugu |
Thaana |
Thai |
Tibetan |
Tifinagh |
Ugaritic |
Vai |
Yi |
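Script names from the list above are used in the same way as general category names. The following sketch, with a made-up sample string, extracts runs of characters per script; it assumes UTF-8 mode via the "u" modifier.

```php
<?php
// "PHP αβγ 漢字": Latin letters, Greek letters, and Han ideographs.
$text = "PHP \u{03B1}\u{03B2}\u{03B3} \u{6F22}\u{5B57}";

preg_match_all('/\p{Latin}+/u', $text, $latin);
preg_match_all('/\p{Greek}+/u', $text, $greek);
preg_match_all('/\p{Han}+/u', $text, $han);

// Spaces belong to no particular script, so they are classified as Common.
$commonCount = preg_match_all('/\p{Common}/u', $text);

print_r([$latin[0], $greek[0], $han[0], $commonCount]);
```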
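The \X escape matches a Unicode extended grapheme cluster.
An extended grapheme cluster is one or more Unicode characters that combine to express a single glyph.
In effect, \X can be thought of as the Unicode equivalent of . because it matches one composed character, regardless of how many individual characters are actually used to render it.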
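The difference between . and \X can be seen on a letter written with a combining accent. This is a minimal sketch that assumes UTF-8 mode via the "u" modifier and a PCRE version in which \X matches extended grapheme clusters.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single glyph.
$text = "e\u{0301}";

$codePoints = preg_match_all('/./u', $text);  // one match per code point
$clusters   = preg_match_all('/\X/u', $text); // one match per grapheme cluster

var_dump($codePoints, $clusters);
```

The string holds two code points but only one cluster, so \X is the better fit when counting or splitting user-visible characters.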
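In versions of PCRE older than 8.32 (which corresponds to PHP versions before 5.4.14 when using the bundled PCRE library), \X is equivalent to (?>\PM\pM*).
That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below).
Characters with the "mark" property are typically accents that affect the preceding character.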
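The older definition can still be written out by hand. The sketch below applies (?>\PM\pM*) to a made-up string in which one base letter carries two combining marks; the variable name $legacyX is hypothetical.

```php
<?php
// One non-mark character followed by any number of marks, as an atomic group.
$legacyX = '/(?>\PM\pM*)/u';

// "e" + acute accent, "a" + diaeresis + macron, then a plain "z".
$text = "e\u{0301}a\u{0308}\u{0304}z";

$count = preg_match_all($legacyX, $text, $matches);

print_r($matches[0]);
```

Each base letter is returned together with all of the marks that follow it, so the six code points yield three matches.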
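Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters.
That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.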
huhwatnouDONTspamPLEASE at hotmail dot com ¶9 years ago
To select UTF-8 mode for the additional escape sequences (\p{xx}, \P{xx}, and \X), use the "u" modifier (see http://php.net/manual/en/reference.pcre.pattern.modifiers.php).
I wondered why a German sharp S (ß) was marked as a control character by \p{Cc} and it took me a while to properly read the first sentence: "Since 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. " :-$ and then to find out how to do so.
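The effect described in this note can be reproduced with a short sketch. It writes ß as its two UTF-8 bytes so that the example does not depend on the encoding of the source file.

```php
<?php
$sharpS = "\xC3\x9F"; // UTF-8 encoding of U+00DF (ß)

// Without "u", each byte is treated as a character of its own. The byte 0x9F
// falls in the control range U+007F-U+009F, so \p{Cc} matches.
$byteMode = preg_match('/\p{Cc}/', $sharpS);

// With "u", the two bytes are decoded as one lower case letter.
$utf8Mode = preg_match('/\p{Cc}/u', $sharpS);
$asLetter = preg_match('/^\p{Ll}$/u', $sharpS);

var_dump($byteMode, $utf8Mode, $asLetter);
```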
Steve ¶1 year ago
Examples are always useful! See https://unicodeplus.com/category for more.
C Other
Cc Control (Unicode code points in the ranges U+0000-U+001F and U+007F-U+009F)
Cf Format (Soft hyphen (U+00AD), zero width space (U+200B), etc.)
Cn Unassigned (Any code point that is not in the Unicode table)
Co Private use
Cs Surrogate (Characters in the range U+D800 to U+DFFF, which are invalid in utf-8)
L Letter
Ll Lower case letter (a-z, µßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ and more)
Lm Modifier letter (Letter-like characters that are usually combined with others, but here they stand alone:
ʰʱʲʳʴʵʶʷʸʹʺʻʼʽʾʿˀˁˆˇˈˉˊˋˌˍˎˏːˑˠˡˢˣˤˬˮʹͺՙ and more)
Lo Other letter (ªºƻǀǁǂǃʔ and many more ideographs and letters from unicase alphabets)
Lt Title case letter (DžLjNjDzᾈᾉᾊᾋᾌᾍᾎᾏᾘᾙᾚᾛᾜᾝᾞᾟᾨᾩᾪᾫᾬᾭᾮᾯᾼῌῼ)
Lu Upper case letter (A-Z, ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ and more)
L& Ordinary letter (Any character that has the Lu, Ll, or Lt property)
M Mark
Mc Spacing mark (None in latin scripts)
Me Enclosing mark (Combining enclosing square (U+20DE) like in a⃞ , combining enclosing circle backslash (U+20E0) like in a⃠)
Mn Non-spacing mark (Combining diacritical marks U+0300-U+036f, like the accents on this letter a: áâãāa̅ăȧäảåa̋ǎa̍a̎ȁa̐ȃ)
N Number
Nd Decimal number (0123456789, ٠١٢٣٤٥٦٧٨٩ and digits in many other scripts.)
Nl Letter number (ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ and some more)
No Other number (⁰¹²³⁴⁵⁶⁷⁸⁹ ₀₁₂₃₄₅₆₇₈₉ ½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒ ①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳, etc.)
P Punctuation
Pc Connector punctuation (_ underscore (U+005F), ‿ undertie U+203F, ⁀ character tie (U+2040), etc.)
Pd Dash punctuation (- hyphen-minus (U+002D), ‐ hyphen (U+2010), ‑ non-breaking hyphen (U+2011), ‒ figure dash (U+2012),
– en dash (U+2013), — em dash (U+2014), ― horizontal bar (U+2015), etc.)
Pe Close punctuation (right parenthesis, bracket, or brace: `)` (U+0029), `]` (U+005D), `}` (U+007D), etc.)
Pf Final punctuation (right quotation marks: » (U+00BB), ’ (U+2019), ” (U+201D), etc.)
Pi Initial punctuation (left quotation marks: « (U+00AB), ‘ (U+2018), “ (U+201C), etc.)
Po Other punctuation (!"#%&'*,./:;?@\¡§¶·¿)
Ps Open punctuation (left parenthesis, bracket, or brace: `(` (U+0028), `[` (U+005B), `{` (U+007B), etc.)
S Symbol
Sc Currency symbol ($¢£¤¥, ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ₺ ₻ ₼ ₽ ₾ ₿ (U+20A0-U+20BF), etc.)
Sk Modifier symbol (Symbol-like characters that are usually combined with others, but here they stand alone:
^`¨¯´¸ and more)
Sm Mathematical symbol (+<=>|~¬±×÷϶ and many more)
So Other symbol (¦ broken bar (U+00A6), © copyright sign (U+00A9), ® registered sign (U+00AE), ° degree sign (U+00B0);
arrows, signs, emojis and many many more)
Z Separator
Zl Line separator (line separator (U+2028))
Zp Paragraph separator (paragraph separator (U+2029))
Zs Space separator (space, no-break space, en quad, em quad, en space, em space, figure space, thin space, hair space, etc.)
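A few of the examples in this list can be spot-checked with a minimal sketch. The helper name categoryOf is hypothetical, and the candidate list is limited to the categories being tested.

```php
<?php
// Hypothetical helper: return the first listed category that $char belongs to.
function categoryOf(string $char, array $categories): ?string
{
    foreach ($categories as $category) {
        if (preg_match('/^\p{' . $category . '}$/u', $char) === 1) {
            return $category;
        }
    }
    return null; // none of the listed categories matched
}

$candidates = ['Pc', 'Pd', 'Sc', 'Sk', 'Sm', 'So', 'Nd', 'No'];
// U+00A9 is the copyright sign, U+00BD is the fraction one half.
$samples = ['_', '-', '$', '^', '+', "\u{00A9}", '7', "\u{00BD}"];

$found = [];
foreach ($samples as $sample) {
    $found[] = categoryOf($sample, $candidates);
}

print_r($found);
```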
o_shes01 at uni-muenster dot de ¶14 years ago
For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter.
For example, there are three codepoints for the "LJ" digraph in Unicode:
(*) uppercase "LJ": U+01C7
(*) titlecase "Lj": U+01C8
(*) lowercase "lj": U+01C9
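The three code points above can be checked against their categories with a short sketch, assuming UTF-8 mode via the "u" modifier.

```php
<?php
// The three forms of the LJ digraph and the category each should have.
$forms = [
    'Lu' => "\u{01C7}", // uppercase LJ
    'Lt' => "\u{01C8}", // titlecase Lj
    'Ll' => "\u{01C9}", // lowercase lj
];

$results = [];
foreach ($forms as $category => $char) {
    $results[$category] = preg_match('/^\p{' . $category . '}$/u', $char);
}

// The titlecase form is neither upper case nor lower case.
$titleIsUpperOrLower = preg_match('/^[\p{Lu}\p{Ll}]$/u', $forms['Lt']);

print_r($results);
```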
suit at rebell dot at ¶14 years ago
These properties are usually only available if PCRE is compiled with "--enable-unicode-properties".
If you want to match any word but also provide a fallback, you can do something like this:
<?php
// @ suppresses the warning raised when \p{L} is not supported.
if (@preg_match_all('/\p{L}+/u', $str, $arr)) {
    // Unicode-aware matches are in $arr[0].
}
?>
php at lnx-bsp dot net ¶7 years ago
Not made clear in the top of page explanation, but these escaped character classes can be included within square brackets to make a broader character class. For example:
<?php preg_match('/[\p{N}\p{L}]+/u', $data); ?>
Will match any combination of letters and numbers.
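A slightly fuller sketch of the same idea follows. It uses the "u" modifier so that multibyte UTF-8 input is decoded before the class is applied; the sample string is made up.

```php
<?php
$data = "Caf\u{00E9} 2024, \u{6771}\u{4EAC}!"; // "Café 2024, 東京!"

// Property escapes combine inside square brackets like any other class member.
$count = preg_match_all('/[\p{N}\p{L}]+/u', $data, $words);

print_r($words[0]);
```

The comma, the spaces and the exclamation mark are in neither category, so they split the input into three runs.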