mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)

mb_detect_encoding - Detect character encoding

Description

mb_detect_encoding(string $string, array|string|null $encodings = null, bool $strict = false): string|false

Detects the most likely character encoding for the string string from an ordered list of candidates.

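Automatic detection of the intended character encoding can never be entirely reliable; without additional information, it is similar to decoding an encrypted string without the key. It is always preferable to use an indication of the character encoding that is stored or transmitted with the data, such as a "Content-Type" HTTP header.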
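
As a rough illustration of preferring a declared charset over guessing, the sketch below reads the charset parameter from a Content-Type header value and only falls back to detection when no usable declaration is present. The helper names charset_from_content_type() and resolve_encoding() are hypothetical and not part of mbstring; the sketch assumes PHP 8.0 or later (union return types, ValueError).

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value such as "text/html; charset=ISO-8859-1".
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: trust the declared charset when the body is valid in
// it; otherwise fall back to strict detection over the given candidates.
function resolve_encoding(string $body, string $contentType, array $candidates): string|false
{
    $declared = charset_from_content_type($contentType);
    if ($declared !== null) {
        try {
            if (mb_check_encoding($body, $declared)) {
                return $declared;
            }
        } catch (\ValueError $e) {
            // The declared name is not an encoding mbstring knows about.
        }
    }
    return mb_detect_encoding($body, $candidates, true);
}

var_dump(resolve_encoding("\xE1\xE9", 'text/html; charset=ISO-8859-1', ['UTF-8']));
```

Detection is only the last resort here: a declared charset that validates is used as-is, even when other encodings would also accept the same bytes.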

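This function is most useful with multibyte encodings, where not all sequences of bytes form a valid string. If the input string contains such a sequence, that encoding is rejected and the next encoding is checked.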
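
To see which encodings would accept a given byte sequence, one option is to test each candidate with mb_check_encoding(). This is only a sketch of the idea of rejecting invalid sequences, not a description of how mb_detect_encoding() is implemented; valid_encodings() is a hypothetical helper and the arrow function assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper: list which candidate encodings accept the given bytes.
function valid_encodings(string $bytes, array $candidates): array
{
    return array_values(array_filter(
        $candidates,
        fn (string $encoding): bool => mb_check_encoding($bytes, $encoding)
    ));
}

// 'áéóú' encoded in ISO-8859-1. The bytes are above 0x7F, so they are not
// ASCII, and 0xE1 is not followed by continuation bytes, so they are not
// valid UTF-8 either.
$bytes = "\xE1\xE9\xF3\xFA";
print_r(valid_encodings($bytes, ['ASCII', 'UTF-8', 'ISO-8859-1']));
```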

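Parameters

string

The string being inspected.

encodings

A list of character encodings to try, in order. The list may be specified as an array of strings, or as a single string separated by commas.

If encodings is omitted or null, the current detect_order (set with the mbstring.detect_order configuration option or the mb_detect_order() function) is used.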
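
A minimal sketch of relying on the current detect_order instead of passing encodings explicitly. The sample text and the chosen order are assumptions made for illustration only.

```php
<?php
// Remember the current order so it can be restored afterwards.
$previous = mb_detect_order();

// Set the detection order; a comma-separated string such as "ASCII, UTF-8"
// is accepted as well.
mb_detect_order(['ASCII', 'UTF-8']);

// With $encodings omitted, the order set above is used.
$detected = mb_detect_encoding('plain text');
var_dump($detected);

// Restore the original order.
mb_detect_order($previous);
```

Saving and restoring the previous order keeps the change local, which matters when other code in the same request depends on the configured default.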

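strict

Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding is returned; if strict is set to true, false is returned.

The default value for strict can be set with the mbstring.strict_detection configuration option.

Return Values

The detected character encoding, or false if the string is not valid in any of the listed encodings.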
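
Because the function can return false, the result is best compared with === before it is passed on. The sketch below shows one way to do that when normalising input to UTF-8; to_utf8() is a hypothetical helper, not an mbstring function.

```php
<?php
// Hypothetical helper: convert $input to UTF-8, or return null when it is
// not valid in any of the candidate encodings.
function to_utf8(string $input, array $candidates): ?string
{
    // Strict mode, so an undetectable string yields false instead of a guess.
    $encoding = mb_detect_encoding($input, $candidates, true);
    if ($encoding === false) {
        return null;
    }
    return mb_convert_encoding($input, 'UTF-8', $encoding);
}

// 0xE1 is 'á' in ISO-8859-1 and is not valid UTF-8 on its own.
var_dump(bin2hex((string) to_utf8("\xE1", ['UTF-8', 'ISO-8859-1'])));
```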

Changelog

Version  Description
8.2.0    mb_detect_encoding() will no longer return the following non-text encodings: "Base64", "QPrint", "UUencode", "HTML entities", "7 bit" and "8 bit".

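Examples

Example #1 mb_detect_encoding() example

<?php
// Placeholder input, added here only so the example runs on its own
$str = "example text";

// Detect the character encoding using the current detect_order
echo mb_detect_encoding($str);

// "auto" is expanded according to mbstring.language
echo mb_detect_encoding($str, "auto");

// Specify the "encodings" parameter as a comma-separated list
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

// Specify the "encodings" parameter as an array
$encodings = [
    "ASCII",
    "JIS",
    "EUC-JP"
];
echo mb_detect_encoding($str, $encodings);
?>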

Example #2 Effect of the strict parameter

<?php
// 'áéóú' encoded in ISO-8859-1
$str = "\xE1\xE9\xF3\xFA";

// The string is not valid ASCII or UTF-8, but UTF-8 is considered a closer match
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));

// If a valid encoding is found, the strict parameter does not change the result
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
?>

The above example will output:

string(5) "UTF-8"
bool(false)
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"

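In some cases, the same sequence of bytes may form a valid string in multiple character encodings, and it is impossible to know which interpretation was intended. For instance, among many others, the byte sequence "\xC4\xA2" could be: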

  • "Ä¢" (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS followed by U+00A2 CENT SIGN) encoded in any of ISO-8859-1, ISO-8859-15, or Windows-1252
  • "ФЂ" (U+0424 CYRILLIC CAPITAL LETTER EF followed by U+0402 CYRILLIC CAPITAL LETTER DJE) encoded in ISO-8859-5
  • "Ģ" (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA) encoded in UTF-8
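
The ambiguity can be made visible by decoding the same two bytes under each encoding and comparing the results. This is a small sketch; interpretations() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: decode the same bytes under each encoding and
// return the results re-encoded as UTF-8, keyed by source encoding.
function interpretations(string $bytes, array $encodings): array
{
    $result = [];
    foreach ($encodings as $encoding) {
        $result[$encoding] = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }
    return $result;
}

$decoded = interpretations("\xC4\xA2", ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']);
foreach ($decoded as $encoding => $utf8) {
    // Print the encoding name, the decoded text, and its UTF-8 bytes in hex.
    printf("%-10s %s (%s)\n", $encoding, $utf8, bin2hex($utf8));
}
```

All three conversions succeed, which is why validity alone cannot tell the candidates apart.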

Example #3 Effect of order when multiple encodings match

<?php
$str = "\xC4\xA2";

// The string is valid in all three encodings, so the first one listed is returned
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5']));
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));
?>

The above example will output:

string(5) "UTF-8"
string(10) "ISO-8859-1"
string(10) "ISO-8859-5"

User Contributed Notes (19 notes)

Gerg Tisza, 14 years ago (score 83)

If you use mb_detect_encoding to check whether a string is valid UTF-8, use strict mode; it is pretty worthless otherwise.

<?php
    $str = "\xE1\xE9\xF3\xFA"; // 'áéóú' encoded in ISO-8859-1
    mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
    mb_detect_encoding($str, 'UTF-8', true); // false
?>

mta59066 at gmail dot com, 2 years ago (score 24)

The documentation is no longer correct for PHP 8.1: mb_detect_encoding no longer honors the order of encodings. The example outputs given in the documentation are also no longer correct for PHP 8.1. This is somewhat explained here: https://github.com/php/php-src/issues/8279

I understand the previous ambiguity in these functions, but in my opinion 8.1 should have deprecated mb_detect_encoding and mb_detect_order and come up with different functions. It now tries to find the encoding that will use the least amount of space regardless of the order, and I am not sure who needs that.

Below is an example function that does what mb_detect_encoding was doing prior to the 8.1 change.

<?php
function mb_detect_encoding_in_order(string $string, array $encodings): string|false
{
    // Return the first listed encoding in which the string is valid
    foreach ($encodings as $enc) {
        if (mb_check_encoding($string, $enc)) {
            return $enc;
        }
    }
    return false;
}
?>

geompse at gmail dot com, 2 years ago (score 5)

Major undocumented breaking change since 8.1.7: https://3v4l.org/BLjZ3

Make sure to replace mb_detect_encoding with a loop of calls to mb_check_encoding.

Chrigu, 20 years ago (score 21)

If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:

mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.

chris AT w3style.co DOT uk, 19 years ago (score 19)

Based upon the snippet below that uses preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters so that I could switch the character encoding from ISO-8859-1 to UTF-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php
function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )+%xs', $string);
}
?>

nat3738 at gmail dot com, 16 years ago (score 19)

A simple way to detect the UTF-8/16/32 encoding of a file by its BOM (does not work with a string or file without a BOM).

<?php
// Unicode BOM is U+FEFF, but after encoding it will look like this.
define('UTF32_BIG_ENDIAN_BOM'   , chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define('UTF16_BIG_ENDIAN_BOM'   , chr(0xFE) . chr(0xFF));
define('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define('UTF8_BOM'               , chr(0xEF) . chr(0xBB) . chr(0xBF));

function detect_utf_encoding($filename) {
    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 4); // four bytes are needed for the UTF-32 BOMs

    // UTF-32 must be checked before UTF-16: the UTF-32LE BOM starts with the UTF-16LE BOM
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
    return null; // no BOM found
}
?>

dennis at nikolaenko dot ru, 16 years ago (score 10)

Beware of a bug in detecting Russian encodings: http://bugs.php.net/bug.php?id=38138

rl at itfigures dot nl, 17 years ago (score 5)

I used Chris's function "detectUTF8" to detect the need for conversion from UTF-8 to ISO-8859-1, which works fine. I did have a problem with the subsequent iconv conversion.

The problem is that the iconv conversion to ISO-8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although it is common practice to use \x80 as the euro sign in the 8859-1 charset (strictly speaking, that is the Windows-1252 code point). I could not use ISO-8859-15 since that mangled some other characters, so I added two str_replace calls:

if(detectUTF8($str)){
  $str=str_replace("\xE2\x82\xAC","&euro;",$str);
  $str=iconv("UTF-8","ISO-8859-1//TRANSLIT",$str);
  $str=str_replace("&euro;","\x80",$str);
}

If HTML output is needed, the last line is not necessary (and is even unwanted).

php-note-2005 at ryandesign dot com, 20 years ago (score 5)

Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {

    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);

} // function is_utf8
?>

hmdker at gmail dot com, 16 years ago (score 5)

A function to detect UTF-8; it may be useful when mb_detect_encoding is not available.

<?php
function is_utf8($str) {
    $c=0; $b=0;
    $bits=0;
    $len=strlen($str);
    for($i=0; $i<$len; $i++){
        $c=ord($str[$i]);
        // >= 128 (not > 128) so that a stray 0x80 continuation byte is rejected
        if($c >= 128){
            if(($c >= 254)) return false;
            elseif($c >= 252) $bits=6;
            elseif($c >= 248) $bits=5;
            elseif($c >= 240) $bits=4;
            elseif($c >= 224) $bits=3;
            elseif($c >= 192) $bits=2;
            else return false;
            if(($i+$bits) > $len) return false;
            // Every following byte of the sequence must be in 0x80-0xBF
            while($bits > 1){
                $i++;
                $b=ord($str[$i]);
                if($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>

eyecatchup at gmail dot com, 12 years ago (score 4)

Just a note: instead of using the often recommended (rather complex) regular expression by the W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
  if (preg_match("//u", $string)) {
      // $string is valid UTF-8
  }
?>

garbage at iglou dot eu, 8 years ago (score 2)

To detect UTF-8, you can use:

if (preg_match('!!u', $str)) { echo 'utf-8'; }

- Norihiori

d_maksimov, 3 years ago (score -2)

This was helpful for my exec(...) call, which returned either cp866 or cp1251:

try {
    $line = iconv('CP866', 'CP1251', $line);
} catch(Exception $e) {
}
return iconv('CP1251', 'UTF-8', $line);

emoebel at web dot de, 11 years ago (score 0)

If the function "mb_detect_encoding" does not exist, try:

<?php
if ( !function_exists('mb_detect_encoding') ) {

function mb_detect_encoding ($string, $enc=null, $ret=null) {

    static $enclist = array(
        'UTF-8', 'ASCII',
        'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5',
        'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10',
        'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16',
        'Windows-1251', 'Windows-1252', 'Windows-1254',
    );

    $result = false;

    foreach ($enclist as $item) {
        // A round trip through iconv leaves the string unchanged only if it is valid in $item
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string)) {
            if ($ret === NULL) { $result = $item; } else { $result = true; }
            break;
        }
    }

    return $result;
}

}
?>

Example usage of mb_detect_encoding():

<?php
function str_to_utf8 ($str) {

    if (mb_detect_encoding($str, 'UTF-8', true) === false) {
        $str = utf8_encode($str);
    }

    return $str;
}

$txtstr = str_to_utf8($txtstr);
?>

maarten, 20 years ago (score 0)

Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8.

To verify UTF-8, use the following:

//
//    utf8 encoding validation developed based on Wikipedia entry at:
//    http://en.wikipedia.org/wiki/UTF-8
//
//    Implemented as a recursive descent parser based on a simple state machine
//    copyright 2005 Maarten Meijer
//
//    This cries out for a C-implementation to be included in PHP core
//
    function valid_1byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0x80) == 0x00;
    }

    function valid_2byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xE0) == 0xC0;
    }

    function valid_3byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF0) == 0xE0;
    }

    function valid_4byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF8) == 0xF0;
    }

    function valid_nextbyte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xC0) == 0x80;
    }

    function valid_utf8($string) {
        $len = strlen($string);
        $i = 0;
        while( $i < $len ) {
            $char = ord(substr($string, $i++, 1));
            if(valid_1byte($char)) {    // continue
                continue;
            } else if(valid_2byte($char)) { // check 1 byte
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_3byte($char)) { // check 2 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_4byte($char)) { // check 3 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } // goto next char
        }
        return true; // done
    }

For a drawing of the state machine see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png

bmrkbyet at web dot de, 12 years ago (score -1)

a) If the function mb_detect_encoding is not available (mb_detect_encoding via iconv):

<?php
if(!function_exists('mb_detect_encoding')) {
    function mb_detect_encoding($string, $enc=null) {

        static $list = array('utf-8', 'iso-8859-1', 'windows-1251');

        foreach ($list as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                if ($enc == $item) { return true; } else { return $item; }
            }
        }
        return null;
    }
}
?>

b) If the function mb_convert_encoding is not available (mb_convert_encoding via iconv):

<?php
if(!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($string, $target_encoding, $source_encoding) {
        $string = iconv($source_encoding, $target_encoding, $string);
        return $string;
    }
}
?>

telemach, 20 years ago (score -1)

Beware: even if you need to distinguish between UTF-8 and ISO-8859-1 and you use the following detection order (as Chrigu suggests),

mb_detect_encoding('accentuée', 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué', 'UTF-8, ISO-8859-1')

returns UTF-8.

Bottom line: an ending 'é' (and probably other accented characters) misleads mb_detect_encoding.

recentUser at example dot com, 7 years ago (score -1)

In my environment (PHP 7.1.12), "mb_detect_encoding()" doesn't work where "mb_detect_order()" is not set appropriately.

To enable "mb_detect_encoding()" to work in such a case, simply put "mb_detect_order('...')" before "mb_detect_encoding()" in your script file.

Both "ini_set('mbstring.language', '...');" and "ini_set('mbstring.detect_order', '...');" DON'T work in script files for this purpose, whereas setting them in the php.ini file may work.

lotushzy at gmail dot com, 7 years ago (score -3)

About the function mb_detect_encoding: the page http://php.net/manual/zh/function.mb-detect-encoding.php shows an example like this:

mb_detect_encoding('áéóú', 'UTF-8', true); // false

But now the result is not false. Can you give me the reason? Thanks!