grapheme_extract

(PHP 5 >= 5.3.0, PHP 7, PHP 8, PECL intl >= 1.0.0)

grapheme_extract — Extracts a sequence of grapheme clusters from a UTF-8 string

Description

Procedural style

grapheme_extract(
    string $haystack,
    int $size,
    int $type = GRAPHEME_EXTR_COUNT,
    int $offset = 0,
    int &$next = null
): string|false

This function extracts a sequence of default grapheme clusters from UTF-8 text.

Parameters

haystack

The string to search.

size

The maximum number of items to return, in the units selected by type.

type

Defines the units that the size parameter refers to:

GRAPHEME_EXTR_COUNT (default): size is the number of grapheme clusters to extract.
GRAPHEME_EXTR_MAXBYTES: size is the maximum number of bytes to return.
GRAPHEME_EXTR_MAXCHARS: size is the maximum number of UTF-8 characters to return.
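As a rough illustration of how type changes the meaning of size, the sketch below runs one string through all three modes. The variable names and the chosen sizes are only illustrative, and the intl extension is assumed to be loaded.

```php
<?php

// Two grapheme clusters, four UTF-8 characters (code points), six bytes:
// "a" + COMBINING RING ABOVE, then "o" + COMBINING DIAERESIS.
$text = "a\xCC\x8A" . "o\xCC\x88";

// size counts grapheme clusters: the first cluster is returned.
$byCount = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT);

// size caps the bytes: only the first cluster (3 bytes) fits within 4 bytes.
$byBytes = grapheme_extract($text, 4, GRAPHEME_EXTR_MAXBYTES);

// size caps the characters: both clusters together would need 4 characters,
// so only the first one (2 characters) fits within 3.
$byChars = grapheme_extract($text, 3, GRAPHEME_EXTR_MAXCHARS);

echo urlencode($byCount), "\n"; // a%CC%8A
echo urlencode($byBytes), "\n"; // a%CC%8A
echo urlencode($byChars), "\n"; // a%CC%8A
```

In each mode the result ends on a cluster boundary: a cluster that does not fit entirely is left out rather than cut in half.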

offset

The starting position in haystack, in bytes. It must be zero or a positive value less than the length of haystack in bytes, or a negative value, which counts from the end of haystack. If offset does not point to the first byte of a UTF-8 character, the start position is moved forward to the next character boundary.
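A small sketch of the offset rules, assuming PHP 7.1 or later for the negative value and a loaded intl extension. The variable names are illustrative.

```php
<?php

// Bytes 0-2: "a" + COMBINING RING ABOVE; bytes 3-5: "o" + COMBINING DIAERESIS.
$text = "a\xCC\x8A" . "o\xCC\x88";

// A negative offset counts back from the end: -3 is byte 3 here.
$fromEnd = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, -3);

// Byte 2 is the second byte of the combining ring, so the start position
// moves forward to the next character boundary, which is also byte 3.
$fromMiddle = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, 2);

echo urlencode($fromEnd), "\n";    // o%CC%88
echo urlencode($fromMiddle), "\n"; // o%CC%88
```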

next

Reference to a variable that receives the next valid starting position. When the function returns, this may be a position past the end of the string.
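One way to use next is to feed it back in as the following offset, which walks the string one cluster at a time. This is a minimal sketch rather than a canonical pattern: $text, $clusters and the loop shape are illustrative choices, and the intl extension is assumed to be loaded.

```php
<?php

$text = "a\xCC\x8A" . "o\xCC\x88" . "!";
$length = strlen($text);

$clusters = [];
$next = 0;

while ($next < $length) {
    $offset = $next;
    $cluster = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, $offset, $next);

    // Stop on failure, or if $next did not advance, to avoid looping forever.
    if ($cluster === false || $next <= $offset) {
        break;
    }

    $clusters[] = $cluster;
}

echo count($clusters), " clusters, next = $next\n"; // 3 clusters, next = 7
```

After the last cluster, $next equals the byte length of the string, which is why the loop condition compares it against strlen() instead of extracting again.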

Return Values

A string that starts at offset and ends on a valid grapheme cluster boundary, conforming to the given size and type, or false on failure.
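Because a successful call can return a string that PHP treats as falsy, a strict comparison against false is the safer check. The sketch below uses an illustrative one-character input and assumes the intl extension is loaded.

```php
<?php

// A valid result that happens to be the string "0".
$result = grapheme_extract("0", 1);

var_dump($result === false); // bool(false): the call succeeded
var_dump(empty($result));    // bool(true): empty() would misread "0" as a failure
```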

Changelog

Version	Description
7.1.0	Support for negative `offset` values has been added.

Examples

Example #1 grapheme_extract() example

<?php

$char_a_ring_nfd = "a\xCC\x8A";  // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"
$char_o_diaeresis_nfd = "o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS' (U+00F6) normalization form "D"

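// Offset 2 falls inside the combining ring (bytes 1-2), so extraction
// starts at the next character boundary, byte 3.
print urlencode(grapheme_extract($char_a_ring_nfd . $char_o_diaeresis_nfd, 1, GRAPHEME_EXTR_COUNT, 2));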

?>

The above example will output:

o%CC%88

See Also

grapheme_substr() - Return part of a string
» Unicode Text Segmentation: Grapheme Cluster Boundaries


User Contributed Notes 3 notes

AJH

14 years ago

Here's how to use grapheme_extract() to loop across a UTF-8 string character by character.

<?php

$str = "سabcक’…";
// If the previous line didn't come through, the string contained:
// U+0633, U+0061, U+0062, U+0063, U+0915, U+2019, U+2026

$n = 0;

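// The extraction runs in the loop's third expression, so $c is still ''
// on the first pass.
for (    $start = 0, $next = 0, $maxbytes = strlen($str), $c = '';
        $start < $maxbytes;
        $c = grapheme_extract($str, 1, GRAPHEME_EXTR_MAXCHARS, ($start = $next), $next)
    )
{
    // Skip the first pass and failed extractions. The comparison is strict
    // because empty() would also skip a legitimate "0" character.
    if ($c === '' || $c === false)
        continue;
    echo "This UTF-8 character is " . strlen($c) . " bytes long and its first byte is " . ord($c[0]) . "\n";
    $n++;
}
echo "$n UTF-8 characters in a string of $maxbytes bytes!\n";
// Should print: 7 UTF-8 characters in a string of 14 bytes!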
?>

Philo

2 years ago

The other comments on this page were helpful for me.
However, when checking the value returned by grapheme_extract(), prefer a strict comparison over empty($value): the function can legitimately return a string such as "0", which empty() treats as empty.

yevgen dot grytsay at gmail dot com

5 years ago

Looping through grapheme clusters:

<?php

// Example taken from Rust documentation: https://doc.rust-lang.org/book/ch08-02-strings.html#bytes-and-scalar-values-and-grapheme-clusters-oh-my
$str = "नमस्ते";
// Alternatively:
//$str = pack('C*', ...[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]);
$next = 0;
$maxbytes = strlen($str);

var_dump($str);

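while ($next < $maxbytes) {
    $char = grapheme_extract($str, 1, GRAPHEME_EXTR_COUNT, $next, $next);
    // Stop on failure or an empty result: $next would not advance in that
    // case, and a plain "continue" would loop forever.
    if ($char === false || $char === '') {
        break;
    }
    echo "{$char} - This utf8 character is " . strlen($char) . ' bytes long', PHP_EOL;
}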

//string(18) "नमस्ते"
//न - This utf8 character is 3 bytes long
//म - This utf8 character is 3 bytes long
//स् - This utf8 character is 6 bytes long
//ते - This utf8 character is 6 bytes long
?>
