Pattern Modifiers

The currently possible PCRE modifiers are listed below. The names in parentheses refer to the internal PCRE names for these modifiers. Spaces and newlines in modifiers are ignored; any other character causes an error.

i (PCRE_CASELESS)

If this modifier is set, letters in the pattern match both upper and lower case letters.
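
As a minimal sketch of the effect (the sample string and the variable names are illustrative assumptions, not part of the manual):

```php
<?php
// Without i the comparison is case-sensitive, so "php" does not match "PHP".
$caseSensitive = preg_match('/php/', 'I like PHP');

// With i, letters in the pattern match both cases.
$caseInsensitive = preg_match('/php/i', 'I like PHP');

var_dump($caseSensitive, $caseInsensitive); // int(0), int(1)
```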

m (PCRE_MULTILINE)

By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless the D modifier is set). This is the same as in Perl. When this modifier is set, the "start of line" and "end of line" constructs match immediately after or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in the subject string, or no occurrences of ^ or $ in the pattern, setting this modifier has no effect.
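
The difference can be sketched as follows; the three-line sample text and the variable names are assumptions made for illustration:

```php
<?php
$text = "first\nsecond\nthird";

// Without m, ^ and $ refer to the whole string, which is not a single word.
$withoutM = preg_match_all('/^\w+$/', $text, $a);

// With m, ^ and $ also match after and before each embedded newline.
$withM = preg_match_all('/^\w+$/m', $text, $b);

var_dump($withoutM, $withM); // int(0), int(3)
print_r($b[0]);              // first, second, third
```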

s (PCRE_DOTALL)

If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, regardless of the setting of this modifier.
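
A short sketch of the three cases described above, using an assumed two-line sample string:

```php
<?php
$text = "line one\nline two";

// The dot does not match the newline between "one" and "line" by default.
$withoutS = preg_match('/one.line/', $text);

// With s, the dot matches the newline as well.
$withS = preg_match('/one.line/s', $text);

// A negated class matches the newline whether or not s is set.
$negatedClass = preg_match('/one[^a]line/', $text);

var_dump($withoutS, $withS, $negatedClass); // int(0), int(1), int(1)
```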

x (PCRE_EXTENDED)

If this modifier is set, whitespace data characters in the pattern are ignored except when escaped or inside a character class. Characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. Whitespace may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.
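
For instance, a commented date pattern might look like the sketch below. The pattern, the subject string and the variable names are illustrative assumptions; note that the comments deliberately avoid the delimiter character.

```php
<?php
// Whitespace and "# ..." comments are ignored because of the x modifier.
$pattern = '/
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
/x';

$found = preg_match($pattern, 'Released on 2024-03-15.', $m);

var_dump($found); // int(1)
print_r($m);      // 2024-03-15, 2024, 03, 15
```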

A (PCRE_ANCHORED)

If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the string being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.
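
A minimal sketch comparing the modifier with an explicit ^ in the pattern (sample strings and variable names are assumptions):

```php
<?php
// With A, the digits must start at the beginning of the subject.
$atStart = preg_match('/\d+/A', '123abc', $m);
$notAtStart = preg_match('/\d+/A', 'abc123');

// The same restriction written inside the pattern itself.
$withCaret = preg_match('/^\d+/', 'abc123');

var_dump($atStart, $m[0], $notAtStart, $withCaret); // int(1), "123", int(0), int(0)
```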

D (PCRE_DOLLAR_ENDONLY)

If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if the m modifier is set. There is no equivalent to this modifier in Perl.
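
This matters, for example, when a pattern is used to validate input that may carry a trailing newline. A small sketch with an assumed subject string:

```php
<?php
$subject = "abc\n"; // note the trailing newline

// Without D, $ also matches just before a final newline.
$withoutD = preg_match('/^abc$/', $subject);

// With D, $ matches only at the very end of the subject.
$withD = preg_match('/^abc$/D', $subject);

var_dump($withoutD, $withD); // int(1), int(0)
```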

S

When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up matching. If this modifier is set, this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character. As of PHP 7.3.0 this flag has no effect.

U (PCRE_UNGREEDY)

This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. It is not compatible with Perl. It can also be set within the pattern with (?U); alternatively, an individual quantifier can be made non-greedy by following it with a question mark (e.g. .*?).
Note:
It is usually not possible to match more than pcre.backtrack_limit characters in ungreedy mode.
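
The inversion can be sketched as follows. The HTML snippet, the ~ delimiter and the variable names are assumptions chosen for illustration:

```php
<?php
$html = '<b>one</b> <b>two</b>';

// Default: .* is greedy and runs to the last closing tag.
preg_match('~<b>.*</b>~', $html, $greedy);

// With U, .* is lazy and stops at the first closing tag.
preg_match('~<b>.*</b>~U', $html, $lazyByModifier);

// Without U, the same effect for one quantifier via a trailing ?.
preg_match('~<b>.*?</b>~', $html, $lazyByQuestionMark);

// With U, a trailing ? makes the quantifier greedy again.
preg_match('~<b>.*?</b>~U', $html, $greedyAgain);

var_dump($greedy[0], $lazyByModifier[0], $lazyByQuestionMark[0], $greedyAgain[0]);
```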

X (PCRE_EXTRA)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. At present no other features are controlled by this modifier.

J (PCRE_INFO_JCHANGED)

The (?J) internal option setting changes the local PCRE_DUPNAMES option. It allows duplicate names for subpatterns. As of PHP 7.2.0, J is supported as a modifier as well.

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject causes the preg_* functions to match nothing; an invalid pattern triggers an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid.
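
Because an invalid subject simply matches nothing, it can help to check the error state afterwards. The sketch below uses the built-in preg_last_error() function and the PREG_BAD_UTF8_ERROR constant; the sample byte string and the variable names are assumptions:

```php
<?php
// \xc3 must be followed by a continuation byte; \x28 is not one.
$invalid = "na\xc3\x28ve";

$result = preg_match('/./u', $invalid);
$error = preg_last_error();

// The call fails instead of matching, and the error code names the cause.
var_dump($result === false);              // bool(true)
var_dump($error === PREG_BAD_UTF8_ERROR); // bool(true)
```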

n (PCRE_NO_AUTO_CAPTURE)

This modifier makes simple (xyz) groups non-capturing. Only named groups like (?<name>xyz) are capturing. This only affects which groups are capturing; it is still possible to use numbered subpattern references, and the matches array will still contain numbered results. Available as of PHP 8.2.0.
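
A hedged sketch of the resulting matches array, assuming PHP 8.2.0 or later (the version guard, the sample pattern and the variable names are illustrative):

```php
<?php
$m = [];

// Guarded because the n modifier is only available as of PHP 8.2.0.
if (PHP_VERSION_ID >= 80200) {
    // (\d{2}) is not captured; the named group becomes group number 1.
    preg_match('/(\d{2})-(?<year>\d{4})/n', '12-2024', $m);
    print_r($m); // keys: 0, "year", 1
}
```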

r (PCRE2_EXTRA_CASELESS_RESTRICT)

When u (PCRE_UTF8) and i (PCRE_CASELESS) are in effect, this modifier prevents matching across ASCII and non-ASCII characters. For example, preg_match('/\x{212A}/iu', "K") matches, because the Kelvin sign (U+212A) in the pattern is treated as a caseless equivalent of the ASCII letter K. When r is used (preg_match('/\x{212A}/iur', "K")), it does not match. Available as of PHP 8.4.0.

Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of:

1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above: "UTF-8 validity of the pattern is checked since PHP 4.3.5").
2. When the subject string contains invalid UTF-8 sequences or code points, the result is basically a "quiet death" for the preg_* functions: nothing is matched, and no warning indicates that the string is invalid UTF-8.
3. At the time of writing, PCRE regarded five and six octet UTF-8 character sequences as valid (both in patterns and in the subject string), although the manual text above now lists them as invalid. These sequences are not supported in Unicode (see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO", which can be found at http://www.tldp.org/ and other places).
4. For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five and six octet sequences), head to: http://hsivonen.iki.fi/php-utf8/

The following script should give you an idea of what works and what doesn't:

<?php
$examples = array(
    'Valid ASCII' => "a",
    'Valid 2 Octet Sequence' => "\xc3\xb1",
    'Invalid 2 Octet Sequence' => "\xc3\x28",
    'Invalid Sequence Identifier' => "\xa0\xa1",
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",
    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
    // Only the 4th octet is broken here, so the label matches the data.
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28",
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);

// Each example is used as the pattern itself.
echo "++Invalid UTF-8 in pattern\n";
foreach ( $examples as $name => $str ) {
    echo "$name\n";
    preg_match("/".$str."/u", 'Testing');
}

// Each example is used as the subject of a single match.
echo "++ preg_match() examples\n";
foreach ( $examples as $name => $str ) {
    preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
    echo "$name: ";
    if ( count($ar) == 0 ) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched {$ar[0]}\n";
    }
}

// Each example is split into UTF-8 characters.
echo "++ preg_match_all() examples\n";
foreach ( $examples as $name => $str ) {
    preg_match_all('/./u', $str, $ar);
    echo "$name: ";
    $num_utf8_chars = count($ar[0]);
    if ( $num_utf8_chars == 0 ) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched $num_utf8_chars character\n";
    }
}
?>

I spent a few days trying to understand how to create a pattern for Unicode characters using their hex codes. I finally made it after reading several manuals that gave no practical, PHP-valid examples, so here is one. Suppose we want to search for the Japanese-standard circled numbers 1-9 (Unicode code points 0x2460-0x2468). To match them by hex code, use the following call:

preg_match('/[\x{2460}-\x{2468}]/u', $str);

Here $str is the haystack string, \x{hex} denotes a character by its Unicode code point in hexadecimal, and /u makes PCRE treat the class as a class of Unicode characters. Hope it's useful.
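
A runnable version of that call might look like the sketch below. The sample subject and the variable names are assumptions, and the "\u{...}" string escape requires PHP 7.0 or later:

```php
<?php
// Subject containing CIRCLED DIGIT TWO (U+2461).
$str = "Step \u{2461}: mix";

// \x{...} in the pattern is interpreted by PCRE, so single quotes are fine.
$found = preg_match('/[\x{2460}-\x{2468}]/u', $str, $m);

var_dump($found);          // int(1)
var_dump(bin2hex($m[0]));  // string(6) "e291a1", the UTF-8 bytes of U+2461
```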

The description of the "u" flag is a bit misleading. It suggests that it is only required if the pattern contains UTF-8 characters, when in fact it is required if either the pattern or the subject contains UTF-8. Without it, I was having problems with preg_match_all returning invalid multibyte characters when given a UTF-8 subject string. It's fairly clear if you read the documentation for libpcre:

In order process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When either of these is the case, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of strings of 1-byte characters. [from http://www.pcre.org/pcre.txt]

The PCRE_INFO_JCHANGED modifier is apparently not accepted as a global option (after the closing delimiter) in PHP versions <= 5.4 (not checked in PHP 5.5), but it is allowed in PHP 5.6 (also not checked in PHP 7.x).

The following pattern doesn't work in PHP 5.4, but it works in PHP 5.6:

<?php
//test.php
preg_match_all('/(?<dup_name>\d{1,4})\-(?<dup_name>\d{1,2})/J', '1234-23', $matches);
var_dump($matches);
/*
output in PHP 5.4:
Warning: preg_match_all(): Unknown modifier 'J' in test.php on line 3
NULL
--------------
output PHP 5.6:
array(4) {
  [0]=>
  array(1) {
    [0]=>
    string(7) "1234-23"
  }
  ["dup_name"]=>
  array(1) {
    [0]=>
    string(2) "23"
  }
  [1]=>
  array(1) {
    [0]=>
    string(4) "1234"
  }
  [2]=>
  array(1) {
    [0]=>
    string(2) "23"
  }
}
*/
?>

To resolve this issue in PHP 5.4, one can use the (?J) pattern modifier, which indicates that the pattern (from that point forward) allows duplicate names for subpatterns.

Code which works in PHP 5.4:

<?php
preg_match_all('/(?J)(?<dup_name>\d{1,4})\-(?<dup_name>\d{1,2})/', '1234-23', $matches);
var_dump($matches);
/*
output in PHP 5.4 and in PHP 5.6 (the same as with /J):
array(4) {
  [0]=>
  array(1) {
    [0]=>
    string(7) "1234-23"
  }
  ["dup_name"]=>
  array(1) {
    [0]=>
    string(2) "23"
  }
  [1]=>
  array(1) {
    [0]=>
    string(4) "1234"
  }
  [2]=>
  array(1) {
    [0]=>
    string(2) "23"
  }
}
*/
?>

If the _subject_ contains UTF-8 sequences, the 'u' modifier should be set; otherwise a pattern such as /./ could match a UTF-8 sequence as two to four individual bytes. It is not a requirement, however, as you may have a need to break apart UTF-8 sequences into single bytes. Most of the time, though, if you're working with UTF-8 strings you should use the 'u' modifier. If the subject doesn't contain any multibyte UTF-8 sequences (i.e. characters in the range 0x00-0x7F only) but the pattern does, then as far as I can work out, setting the 'u' modifier has no effect on the result.
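
The byte-versus-character difference can be sketched with a single two-byte character (the sample string and the variable names are assumptions):

```php
<?php
$subject = "\xc3\xb1"; // "ñ": one character, two bytes in UTF-8

// Without u, the dot matches one byte at a time.
$byteCount = preg_match_all('/./', $subject);

// With u, the dot matches one whole UTF-8 character.
$charCount = preg_match_all('/./u', $subject);

var_dump($byteCount, $charCount); // int(2), int(1)
```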

A warning about the /i modifier and POSIX character classes: If you're using POSIX character classes in your regex that indicate case such as [:upper:] or [:lower:] in combination with the /i modifier, then in PHP < 7.3 the /i modifier will take precedence and effectively make both those character classes work as [:alpha:], but in PHP >= 7.3 the character classes overrule the /i modifier.

A hint for those of you who are trying to fight off (or at least work around) the problem of matching a pattern correctly at the end ($) of any line in multiline mode (/m).

<?php
// Various OS-es have various end line (a.k.a line break) chars:
// - Windows uses CR+LF (\r\n);
// - Linux LF (\n);
// - classic Mac OS CR (\r).
// And that's why single dollar meta assertion ($) sometimes fails with multiline modifier (/m) mode - possible bug in PHP 5.3.8 or just a "feature"(?).
$str="ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_"; // C 3 p 0 _
$pat1='/\w$/mi'; // This works excellent in JavaScript (Firefox 7.0.1+)
$pat2='/\w\r?$/mi';
$pat3='/\w\R?$/mi'; // Somehow disappointing according to php.net and pcre.org
$pat4='/\w\v?$/mi';
$pat5='/(*ANYCRLF)\w$/mi'; // Excellent but undocumented on php.net at the moment
$n=preg_match_all($pat1, $str, $m1);
$o=preg_match_all($pat2, $str, $m2);
$p=preg_match_all($pat3, $str, $m3);
$r=preg_match_all($pat4, $str, $m4);
$s=preg_match_all($pat5, $str, $m5);
echo $str."\n1 !!! $pat1 ($n): ".print_r($m1[0], true)
    ."\n2 !!! $pat2 ($o): ".print_r($m2[0], true)
    ."\n3 !!! $pat3 ($p): ".print_r($m3[0], true)
    ."\n4 !!! $pat4 ($r): ".print_r($m4[0], true)
    ."\n5 !!! $pat5 ($s): ".print_r($m5[0], true);
// Note the difference among the three very helpful escape sequences in $pat2 (\r), $pat3 (\R), $pat4 (\v) and altered newline option in $pat5 ((*ANYCRLF)) - for some applications at least.

/* The code above results in the following output:
ABC ABC

123 123
def def
nop nop
890 890
QRS QRS

~-_ ~-_
1 !!! /\w$/mi (3): Array ( [0] => C [1] => 0 [2] => _ )
2 !!! /\w\r?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
3 !!! /\w\R?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
4 !!! /\w\v?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
5 !!! /(*ANYCRLF)\w$/mi (7): Array ( [0] => C [1] => 3 [2] => f [3] => p [4] => 0 [5] => S [6] => _ )
*/
?>

Unfortunately, I haven't got any access to a server with the latest PHP version - my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.

An important addendum (with a new $pat3_2 that uses \R properly, its results and comments):

Note that there are nuances (sometimes difficult to grasp at first glance) in the meaning and application of escape sequences like \r, \R and \v. None of them is perfect in all situations, but they are quite useful nevertheless. Some official PCRE control options come in handy too. Unfortunately, neither (*ANYCRLF), (*ANY) nor (*CRLF) is documented here on php.net at the moment (although they seem to have been available for over 10 years and 5 months now), but they are described pretty well on Wikipedia ("Newline/linebreak options" at https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions) and on the official PCRE library site ("Newline convention" at http://www.pcre.org/original/doc/html/pcresyntax.html#SEC17). The functionality of \R appears somewhat disappointing (with the default configuration of the compile-time option) according to php.net as well as the official description ("Newline sequences" at https://www.pcre.org/original/doc/html/pcrepattern.html#newlineseq) when used improperly.

A hint for those of you who are trying to fight off (or at least work around) the problem of matching a pattern correctly at the end (or at the beginning) of any line, even without the multiline mode (/m) or meta-character assertions ($ or ^).

<?php
// Various OS-es have various end line (a.k.a line break) chars:
// - Windows uses CR+LF (\r\n);
// - Linux LF (\n);
// - classic Mac OS CR (\r).
// And that's why single dollar meta assertion ($) sometimes fails with multiline modifier (/m) mode - possible bug in PHP 5.3.8 or just a "feature"(?) of default configuration option for meta-character assertions (^ and $) at compile time of PCRE.
$str="ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_"; // C 3 p 0 _
$pat3='/\w\R?$/mi'; // Somehow disappointing according to php.net and pcre.org when used improperly
$pat3_2='/\w(?=\R)/i'; // Much better with allowed lookahead assertion (just to detect without capture) without multiline (/m) mode; note that with alternative for end of string ((?=\R|$)) it would grab all 7 elements as expected, but '/(*ANYCRLF)\w$/mi' is more straightforward in use anyway
$p=preg_match_all($pat3, $str, $m3);
$r=preg_match_all($pat3_2, $str, $m4);
echo $str."\n3 !!! $pat3 ($p): ".print_r($m3[0], true)
    ."\n3_2 !!! $pat3_2 ($r): ".print_r($m4[0], true);
// Note the difference between the two very helpful escape sequences in $pat3 and $pat3_2 (\R) - for some applications at least.

/* The code above results in the following output:
ABC ABC

123 123
def def
nop nop
890 890
QRS QRS

~-_ ~-_
3 !!! /\w\R?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
3_2 !!! /\w(?=\R)/i (6): Array ( [0] => C [1] => 3 [2] => f [3] => p [4] => 0 [5] => S )
*/
?>

Unfortunately, I haven't got any access to a server with the latest PHP version - my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.

In case you're wondering, what is the meaning of "S" modifier, this paragraph might be useful: When "S" modifier is set, PHP calls the pcre_study() function from the PCRE API before executing the regexp. Result from the function is passed directly to pcre_exec(). For more information about pcre_study() and "Studying the pattern" check the PCRE manual on http://www.pcre.org/pcre.txt PS: Note that function names "pcre_study" and "pcre_exec" used here refer to PCRE library functions written in C language and not to any PHP functions.

When adding comments with the /x modifier, don't use the pattern delimiter in the comments. It may not be ignored in the comments area. Example:

<?php
$target = 'some text';

if(preg_match('/
    e # Comments here
    /x',$target)) {
    print "Target 1 hit.\n";
}
if(preg_match('/
    e # /Comments here with slash
    /x',$target)) {
    print "Target 1 hit.\n";
}
?>

This prints "Target 1 hit." but then generates a PHP warning message for the second preg_match():

Warning: preg_match() [function.preg-match]: Unknown modifier 'C' in /ebarnard/x-modifier.php on line 11
