Unicode character properties
  
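  Since PHP 5.1.0, three additional escape sequences that match generic
  character types are available when UTF-8 mode is selected. They are:

  - \p{xx}: a character with the xx property
  - \P{xx}: a character without the xx property
  - \X: an extended Unicode sequence
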
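  The property names represented by xx above are limited to the Unicode
  general category properties. Each character has exactly one such property,
  specified by a two-letter abbreviation. For compatibility with Perl,
  negation can be specified by placing a circumflex (^) immediately after the
  opening brace. For example, \p{^Lu} is the same as \P{Lu}.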
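
  As a minimal sketch of the forms described above, the following calls preg_match() with the u modifier, which selects UTF-8 mode. The sample characters are illustrative only and the file is assumed to be saved as UTF-8.

```php
<?php
// \p{Lu}: a character with the "upper case letter" property.
$upper = preg_match('/^\p{Lu}$/u', 'É');          // 1

// \P{Lu}: a character without that property.
$notUpper = preg_match('/^\P{Lu}$/u', 'é');       // 1

// \p{^Lu} is the Perl-compatible spelling of \P{Lu}.
$negated = preg_match('/^\p{^Lu}$/u', 'é');       // 1
$negatedUpper = preg_match('/^\p{^Lu}$/u', 'É');  // 0
```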
  
  
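  If only one letter is specified with \p or \P, it includes all the
  properties that start with that letter. In this case the curly brackets in
  the escape sequence are optional, so \p{L} and \pL have the same effect.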
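
  The single-letter shorthand can be checked with a short sketch like the one below. The sample string and variable names are illustrative and not part of the manual.

```php
<?php
$text = 'Straße 42';

// Braced and unbraced single-letter forms: both mean "any letter".
preg_match_all('/\p{L}+/u', $text, $braced);
preg_match_all('/\pL+/u', $text, $unbraced);

// L covers Ll, Lm, Lo, Lt and Lu, so "S" (Lu) and "ß" (Ll) both match,
// while the space and the digits do not.
var_dump($braced[0] === $unbraced[0]); // bool(true)
var_dump($braced[0]);                  // one match: "Straße"
```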
  
  
  
   Supported Unicode properties

   | Property | Matches | Notes |
   | C | Other |  |
   | Cc | Control |  |
   | Cf | Format |  |
   | Cn | Unassigned |  |
   | Co | Private use |  |
   | Cs | Surrogate |  |
   | L | Letter | Includes the following properties: Ll, Lm, Lo, Lt and Lu. |
   | Ll | Lower case letter |  |
   | Lm | Modifier letter |  |
   | Lo | Other letter |  |
   | Lt | Title case letter |  |
   | Lu | Upper case letter |  |
   | M | Mark |  |
   | Mc | Spacing mark |  |
   | Me | Enclosing mark |  |
   | Mn | Non-spacing mark |  |
   | N | Number |  |
   | Nd | Decimal number |  |
   | Nl | Letter number |  |
   | No | Other number |  |
   | P | Punctuation |  |
   | Pc | Connector punctuation |  |
   | Pd | Dash punctuation |  |
   | Pe | Close punctuation |  |
   | Pf | Final punctuation |  |
   | Pi | Initial punctuation |  |
   | Po | Other punctuation |  |
   | Ps | Open punctuation |  |
   | S | Symbol |  |
   | Sc | Currency symbol |  |
   | Sk | Modifier symbol |  |
   | Sm | Mathematical symbol |  |
   | So | Other symbol | Includes emojis |
   | Z | Separator |  |
   | Zl | Line separator |  |
   | Zp | Paragraph separator |  |
   | Zs | Space separator |  |
  
  
  Extended properties such as InMusicalSymbols are not supported by PCRE.
  
  
  Specifying case-insensitive matching does not affect these escape
  sequences. For example, \p{Lu} always matches only upper case letters.
  
  
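  Sets of Unicode characters are defined as belonging to certain scripts. A
  character from one of these sets can be matched using a script name, for
  example \p{Greek} or, negated, \P{Han}.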
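
  Script names are used in the same way as general category names. The sketch below is illustrative only: the sample text and variable names are assumptions, and the file is assumed to be saved as UTF-8.

```php
<?php
$text = 'αβγ abc 漢字';

// \p{Greek}: runs of characters from the Greek script.
preg_match_all('/\p{Greek}+/u', $text, $greek);

// \p{Han}: runs of Han ideographs.
preg_match_all('/\p{Han}+/u', $text, $han);

// \P{Han}: the first character that is not Han.
$firstNonHan = preg_match('/\P{Han}/u', $text, $m) ? $m[0] : null;

var_dump($greek[0], $han[0], $firstNonHan);
```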
  
  
  
   Characters that are not part of an identified script are lumped together as Common. The current list of scripts is:
  
  
   Supported scripts

   | Arabic | Armenian | Avestan | Balinese | Bamum |
   | Batak | Bengali | Bopomofo | Brahmi | Braille |
   | Buginese | Buhid | Canadian_Aboriginal | Carian | Chakma |
   | Cham | Cherokee | Common | Coptic | Cuneiform |
   | Cypriot | Cyrillic | Deseret | Devanagari | Egyptian_Hieroglyphs |
   | Ethiopic | Georgian | Glagolitic | Gothic | Greek |
   | Gujarati | Gurmukhi | Han | Hangul | Hanunoo |
   | Hebrew | Hiragana | Imperial_Aramaic | Inherited | Inscriptional_Pahlavi |
   | Inscriptional_Parthian | Javanese | Kaithi | Kannada | Katakana |
   | Kayah_Li | Kharoshthi | Khmer | Lao | Latin |
   | Lepcha | Limbu | Linear_B | Lisu | Lycian |
   | Lydian | Malayalam | Mandaic | Meetei_Mayek | Meroitic_Cursive |
   | Meroitic_Hieroglyphs | Miao | Mongolian | Myanmar | New_Tai_Lue |
   | Nko | Ogham | Old_Italic | Old_Persian | Old_South_Arabian |
   | Old_Turkic | Ol_Chiki | Oriya | Osmanya | Phags_Pa |
   | Phoenician | Rejang | Runic | Samaritan | Saurashtra |
   | Sharada | Shavian | Sinhala | Sora_Sompeng | Sundanese |
   | Syloti_Nagri | Syriac | Tagalog | Tagbanwa | Tai_Le |
   | Tai_Tham | Tai_Viet | Takri | Tamil | Telugu |
   | Thaana | Thai | Tibetan | Tifinagh | Ugaritic |
   | Vai | Yi |  |  |  |
  
  
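   The \X escape matches a Unicode extended grapheme cluster. An extended
   grapheme cluster is one or more Unicode characters that combine to form a
   single glyph. In effect, \X can be thought of as the Unicode equivalent of
   the dot (.): it matches one composed character, regardless of how many
   individual characters are actually used to render it.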
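
   The difference between . and \X can be seen by counting matches on a string that contains a combining accent. This is a minimal sketch; it assumes PHP 7.0 or later for the "\u{...}" string escape, and the variable names are illustrative.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "a".
$text = "e\u{0301}a";

// "." matches one code point at a time: e, U+0301, a.
$codePoints = preg_match_all('/./u', $text);

// \X matches one grapheme cluster at a time: "e" plus its accent, then "a".
$clusters = preg_match_all('/\X/u', $text, $matches);

var_dump($codePoints); // int(3)
var_dump($clusters);   // int(2)
```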
  
  
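   In versions of PCRE older than 8.32 (which corresponds to PHP versions
   before 5.4.14 when using the bundled PCRE library), \X is equivalent to
   (?>\PM\pM*). That is, it matches a character without the "mark" property,
   followed by zero or more characters with the "mark" property, and treats
   the sequence as an atomic group (see below). Characters with the "mark"
   property are typically accents that affect the preceding character.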
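
   The older definition can still be written out by hand. The sketch below applies the pattern (?>\PM\pM*) directly; the sample string is an assumption chosen so that one base character carries a combining mark, and "\u{...}" requires PHP 7.0 or later.

```php
<?php
// "e" + U+0301 COMBINING ACUTE ACCENT (category Mn), then a plain "a".
$text = "e\u{0301}a";

// One non-mark character followed by any number of mark characters,
// wrapped in an atomic group so the marks are never given back.
preg_match_all('/(?>\PM\pM*)/u', $text, $m);

var_dump(count($m[0])); // int(2)
```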
  
  
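  Matching characters by Unicode property is not fast, because PCRE has to
  search a structure that contains data for over fifteen thousand
  characters. That is why the traditional escape sequences such as \d and
  \w do not use Unicode properties in PCRE.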
  
      
 
    
  
  huhwatnouDONTspamPLEASE at hotmail dot com ¶9 years ago
  
To select UTF-8 mode for the additional escape sequences (\p{xx}, \P{xx}, and \X), use the "u" modifier (see http://php.net/manual/en/reference.pcre.pattern.modifiers.php).
I wondered why a German sharp S (ß) was marked as a control character by \p{Cc} and it took me a while to properly read the first sentence: "Since 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. " :-$ and then to find out how to do so. 
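
The effect described in this note can be reproduced with a small sketch. It assumes PHP 7.0 or later for the "\u{...}" escape; the variable names are illustrative.

```php
<?php
$sharpS = "\u{00DF}"; // "ß", encoded in UTF-8 as the two bytes 0xC3 0x9F

// Without "u" the subject is examined byte by byte. Byte 0x9F lies in the
// control range U+007F-U+009F, so \p{Cc} finds a match.
$withoutU = preg_match('/\p{Cc}/', $sharpS);

// With "u" the two bytes are decoded as one character, U+00DF, which is a
// lower case letter (Ll) and not a control character.
$withU = preg_match('/\p{Cc}/u', $sharpS);

var_dump($withoutU, $withU);
```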
  
  
    
  
  Steve ¶2 years ago
  
Examples are always useful! See https://unicodeplus.com/category for more.
C    Other     
Cc   Control      (Unicode code points in the ranges U+0000-U+001F and U+007F-U+009F)
Cf   Format       (Soft hyphen (U+00AD), zero width space (U+200B), etc.)
Cn   Unassigned   (Any code point that is not in the Unicode table)
Co   Private use     
Cs   Surrogate    (Characters in the range U+D800 to U+DFFF, which are invalid in utf-8)
L    Letter
Ll   Lower case letter (a-z, µßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ and more)
Lm   Modifier letter   (Letter-like characters that are usually combined with others, but here they stand alone:
                        ʰʱʲʳʴʵʶʷʸʹʺʻʼʽʾʿˀˁˆˇˈˉˊˋˌˍˎˏːˑˠˡˢˣˤˬˮʹͺՙ and more)
Lo   Other letter      (ªºƻǀǁǂǃʔ and many more ideographs and letters from unicase alphabets)
Lt   Title case letter (DžLjNjDzᾈᾉᾊᾋᾌᾍᾎᾏᾘᾙᾚᾛᾜᾝᾞᾟᾨᾩᾪᾫᾬᾭᾮᾯᾼῌῼ)
Lu   Upper case letter (A-Z, ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ and more)
L&   Ordinary letter   (Any character that has the Lu, Ll, or Lt property)
M    Mark
Mc   Spacing mark      (None in latin scripts)
Me   Enclosing mark    (Combining enclosing square (U+20DE) like in a⃞ , combining enclosing circle backslash (U+20E0) like in a⃠)
Mn   Non-spacing mark  (Combining diacritical marks U+0300-U+036f, like the accents on this letter a: áâãāa̅ăȧäảåa̋ǎa̍a̎ȁa̐ȃ)
N    Number      
Nd   Decimal number (0123456789, ٠١٢٣٤٥٦٧٨٩ and digits in many other scripts.)
Nl   Letter number  (ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ and some more)
No   Other number   (⁰¹²³⁴⁵⁶⁷⁸⁹ ₀₁₂₃₄₅₆₇₈₉ ½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒ ①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳, etc.)
P    Punctuation      
Pc   Connector punctuation (_ underscore (U+005F), ‿ undertie U+203F, ⁀ character tie (U+2040), etc.)
Pd   Dash punctuation      (- hyphen-minus (U+002D), ‐ hyphen (U+2010), ‑ non-breaking hyphen (U+2011), ‒ figure dash (U+2012),
                            – en dash (U+2013), — em dash (U+2014), ― horizontal bar (U+2015), etc.)
Pe   Close punctuation     (right parenthesis, bracket, or brace: `)` (U+0029), `]` (U+005D), `}` (U+007D), etc.) 
Pf   Final punctuation     (right quotation marks: » (U+00BB), ’ (U+2019), ” (U+201D), etc.)
Pi   Initial punctuation   (left  quotation marks: « (U+00AB), ‘ (U+2018), “ (U+201C), etc.)
Po   Other punctuation     (!"#%&'*,./:;?@\¡§¶·¿)
Ps   Open punctuation      (left parenthesis, bracket, or brace: `(` (U+0028), `[` (U+005B), `{` (U+007B), etc.) 
S    Symbol      
Sc   Currency symbol     ($¢£¤¥, ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ₺ ₻ ₼ ₽ ₾ ₿ (U+20A0-U+20BF), etc.)
Sk   Modifier symbol     (Symbol-like characters that are usually combined with others, but here they stand alone:
                          ^`¨¯´¸ and more)
Sm   Mathematical symbol (+<=>|~¬±×÷϶ and many more)
So   Other symbol        (¦ broken bar (U+00A6), © copyright sign (U+00A9), ® registered sign (U+00AE), ° degree sign (U+00B0);
                          arrows, signs, emojis and many many more)
Z    Separator      
Zl   Line separator      (line separator (U+2028))
Zp   Paragraph separator (paragraph separator (U+2029))
Zs   Space separator     (space, no-break space, en quad, em quad, en space, em space, figure space, thin space, hair space, etc.) 
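
A few of the entries above can be checked from PHP with a sketch like the following. firstMatchingCategory() is a hypothetical helper written for this illustration, not a built-in function, and the candidate list is an arbitrary subset of the categories.

```php
<?php
// Hypothetical helper: returns the first listed general category whose
// \p{..} class matches the given single character, or null if none does.
function firstMatchingCategory(string $char, array $categories): ?string
{
    foreach ($categories as $category) {
        if (preg_match('/^\p{' . $category . '}$/u', $char) === 1) {
            return $category;
        }
    }
    return null;
}

$candidates = ['Lu', 'Ll', 'Nd', 'Pd', 'Sc', 'Sm', 'Zs'];

$euro  = firstMatchingCategory('€', $candidates); // Sc
$dash  = firstMatchingCategory('-', $candidates); // Pd
$plus  = firstMatchingCategory('+', $candidates); // Sm
$space = firstMatchingCategory(' ', $candidates); // Zs
$digit = firstMatchingCategory('7', $candidates); // Nd
```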
  
  
    
  
  o_shes01 at uni-muenster dot de ¶14 years ago
  
For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter. 
For example, there are three codepoints for the "LJ" digraph in Unicode: 
  (*) uppercase "LJ": U+01C7 
  (*) titlecase "Lj": U+01C8 
  (*) lowercase "lj": U+01C9
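
The three code points listed above can be tested against Lu, Lt and Ll as a quick check. This sketch assumes PHP 7.0 or later for the "\u{...}" escape; the variable names are illustrative.

```php
<?php
$upperLJ = "\u{01C7}"; // uppercase "LJ"
$titleLj = "\u{01C8}"; // titlecase "Lj"
$lowerLj = "\u{01C9}"; // lowercase "lj"

$isLu = preg_match('/^\p{Lu}$/u', $upperLJ);       // 1
$isLt = preg_match('/^\p{Lt}$/u', $titleLj);       // 1
$isLl = preg_match('/^\p{Ll}$/u', $lowerLj);       // 1

// The titlecase form is its own category: it is not matched by \p{Lu}.
$titleIsLu = preg_match('/^\p{Lu}$/u', $titleLj);  // 0
```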
   
  
    
  
  suit at rebell dot at ¶15 years ago
  
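these properties are usually only available if PCRE is compiled with "--enable-unicode-properties"
if you want to match any word but want to provide a fallback, you can do something like this:
<?php
// preg_match_all() returns false if the pattern cannot be compiled,
// for example when Unicode property support is missing.
if (@preg_match_all('/\p{L}+/u', $str, $arr) === false) {
    // the fallback goes here
}
?>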
   
  
    
  
  php at lnx-bsp dot net ¶8 years ago
  
Not made clear in the top of page explanation, but these escaped character classes can be included within square brackets to make a broader character class. For example:
<?php preg_match('/[\p{N}\p{L}]+/u', $data); ?>
Will match any combination of letters and numbers.
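
As a usage sketch of such a combined class, the following collects every run of letters and numbers from a sample string. The u modifier is used so that multi-byte characters are read as single characters; the sample text is an assumption and the file is assumed to be saved as UTF-8.

```php
<?php
$data = 'Größe 42, ٣ apples!';

// One or more characters that are either a number (N) or a letter (L).
preg_match_all('/[\p{N}\p{L}]+/u', $data, $runs);

var_dump($runs[0]); // "Größe", "42", "٣", "apples"
```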