Multibyte String Functions

References

Multibyte character encoding schemes and their related issues are quite complicated and are beyond the scope of this documentation. Please refer to the following URLs and other resources for more background on these topics than is given here.

Table of Contents

  • mb_check_encoding — Check if strings are valid for the specified encoding
  • mb_chr — Return a character by its Unicode code point value
  • mb_convert_case — Perform case conversion on a string
  • mb_convert_encoding — Convert a string from one character encoding to another
  • mb_convert_kana — Convert one "kana" into another ("zen-kaku", "han-kaku" and more)
  • mb_convert_variables — Convert the character encoding of variables
  • mb_decode_mimeheader — Decode a MIME header
  • mb_decode_numericentity — Decode HTML numeric string references to characters
  • mb_detect_encoding — Detect character encoding
  • mb_detect_order — Set/Get the character encoding detection order
  • mb_encode_mimeheader — Encode a string for a MIME header
  • mb_encode_numericentity — Encode characters to HTML numeric string references
  • mb_encoding_aliases — Get the aliases of a known encoding type
  • mb_ereg — Regular expression match with multibyte support
  • mb_ereg_match — Regular expression match for multibyte strings
  • mb_ereg_replace — Replace parts of a string using regular expressions, with multibyte support
  • mb_ereg_replace_callback — Perform a regular expression search and replace with multibyte support using a callback
  • mb_ereg_search — Multibyte regular expression search
  • mb_ereg_search_getpos — Return the start position for the next regular expression match
  • mb_ereg_search_getregs — Retrieve the result of the last multibyte regular expression match
  • mb_ereg_search_init — Set up the string and regular expression for a multibyte regular expression search
  • mb_ereg_search_pos — Return the position and length of the part of the string matched by the regular expression
  • mb_ereg_search_regs — Return the part of the string matched by a multibyte regular expression
  • mb_ereg_search_setpos — Set the start point for the next regular expression search
  • mb_eregi — Case-insensitive regular expression match with multibyte support
  • mb_eregi_replace — Case-insensitive regular expression replace with multibyte support
  • mb_get_info — Get the internal settings of the mbstring extension
  • mb_http_input — Detect the HTTP input character encoding
  • mb_http_output — Set/Get the HTTP output character encoding
  • mb_internal_encoding — Set/Get the internal character encoding
  • mb_language — Set/Get the current language
  • mb_lcfirst — Make the first letter of a string lowercase
  • mb_list_encodings — Return an array of all supported encodings
  • mb_ltrim — Strip whitespace (or other characters) from the beginning of a string
  • mb_ord — Get the Unicode code point of a character
  • mb_output_handler — Callback function that converts the character encoding of the output buffer
  • mb_parse_str — Parse HTTP GET/POST/COOKIE data and set global variables
  • mb_preferred_mime_name — Get the MIME charset name
  • mb_regex_encoding — Set/Get the character encoding for multibyte regular expressions
  • mb_regex_set_options — Set/Get the options for the multibyte regular expression functions
  • mb_rtrim — Strip whitespace (or other characters) from the end of a string
  • mb_scrub — Replace ill-formed byte sequences with the substitute character
  • mb_send_mail — Send encoded mail
  • mb_split — Split a string into an array using a multibyte regular expression
  • mb_str_pad — Pad a multibyte string to a certain length with another multibyte string
  • mb_str_split — Given a multibyte string, return an array of its characters
  • mb_strcut — Get part of a string
  • mb_strimwidth — Get a truncated string
  • mb_stripos — Find the position of the first occurrence of a string within another, case-insensitive
  • mb_stristr — Find the first occurrence of a string within another, case-insensitive
  • mb_strlen — Get the length of a string
  • mb_strpos — Find the position of the first occurrence of a string in a string
  • mb_strrchr — Find the last occurrence of a character of a string within another
  • mb_strrichr — Find the last occurrence of a character of a string within another, case-insensitive
  • mb_strripos — Find the position of the last occurrence of a string within another, case-insensitive
  • mb_strrpos — Find the position of the last occurrence of a string in a string
  • mb_strstr — Find the first occurrence of a string within another
  • mb_strtolower — Make all characters lowercase
  • mb_strtoupper — Make all characters uppercase
  • mb_strwidth — Return the width of a string
  • mb_substitute_character — Set/Get the substitution character
  • mb_substr — Get part of a string
  • mb_substr_count — Count the number of substring occurrences
  • mb_trim — Strip whitespace (or other characters) from the beginning and end of a string
  • mb_ucfirst — Make the first letter of a string uppercase

User Contributed Notes 17 notes

up
66
deceze at gmail dot com
13 years ago
Please note that all the discussion about mb_str_replace in the comments is pretty pointless. str_replace works just fine with multibyte strings:

<?php

$string  = '漢字はユニコード';
$needle  = 'は';
$replace = 'Foo';

echo str_replace($needle, $replace, $string);
// outputs: 漢字Fooユニコード

?>

The usual problem is that the string is evaluated as binary string, meaning PHP is not aware of encodings at all. Problems arise if you are getting a value "from outside" somewhere (database, POST request) and the encoding of the needle and the haystack is not the same. That typically means the source code is not saved in the same encoding as you are receiving "from outside". Therefore the binary representations don't match and nothing happens.
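
A minimal sketch of the mismatch described above, assuming a UTF-8 haystack and a needle that arrives as ISO-8859-1 (both encodings are chosen here purely for illustration). The byte sequences differ, so str_replace() finds nothing until the needle is converted with mb_convert_encoding():

```php
<?php
// Haystack as UTF-8 bytes: "café" (é = C3 A9).
$haystack = "caf\xC3\xA9";

// Needle "é" as it would arrive in ISO-8859-1 (a single byte, E9).
$needleLatin1 = "\xE9";

// Byte-wise comparison fails: E9 never occurs in the UTF-8 haystack.
$unchanged = str_replace($needleLatin1, 'e', $haystack);

// Convert the needle to the haystack's encoding first, then replace.
$needleUtf8 = mb_convert_encoding($needleLatin1, 'UTF-8', 'ISO-8859-1');
$replaced   = str_replace($needleUtf8, 'e', $haystack);

var_dump($unchanged === $haystack); // bool(true)
var_dump($replaced);                // string(4) "cafe"
```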
up
20
Eugene Murai
20 years ago
PHP can input and output Unicode, but in a slightly different way from what Microsoft means: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with a BOM (FF FE = chr(255).chr(254)), whereas PHP's "UTF-16" means big-endian with BOM. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put a BOM in front of a UTF-16LE string.

Example:

$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');
up
16
mdoocy at u dot washington dot edu
18 years ago
Note that some of the multi-byte functions run in O(n) time, rather than constant time as is the case for their single-byte equivalents. This includes any functionality requiring access at a specific index, since random access is not possible in a string whose number of bytes will not necessarily match the number of characters. Affected functions include: mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
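
To illustrate the cost, here is a small sketch (both function names are made up for this example). Indexing with mb_substr() inside a loop may rescan the string on every call, so with a variable-width encoding such as UTF-8 the loop can approach quadratic time; splitting once with mb_str_split() (PHP 7.4+) walks the string a single time.

```php
<?php
// Hypothetical helper: collects characters by index. Each mb_substr() call
// may have to scan from the start of the string to locate character $i.
function chars_by_index(string $s, string $encoding = 'UTF-8'): array
{
    $out = [];
    $len = mb_strlen($s, $encoding);
    for ($i = 0; $i < $len; $i++) {
        $out[] = mb_substr($s, $i, 1, $encoding);
    }
    return $out;
}

// Hypothetical helper: a single pass over the string.
function chars_single_pass(string $s, string $encoding = 'UTF-8'): array
{
    return mb_str_split($s, 1, $encoding);
}

$text = '漢字はユニコード';
var_dump(chars_by_index($text) === chars_single_pass($text)); // bool(true)
var_dump(strlen($text), mb_strlen($text, 'UTF-8'));            // int(24), int(8)
```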
up
13
mitgath at gmail dot com
16 years ago
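According to
http://bugs.php.net/bug.php?id=21317
here's the missing function:

<?php
// Skip the definition on PHP 8.3+, where mb_str_pad() is built in.
if (!function_exists('mb_str_pad')) {
    function mb_str_pad($input, $pad_length, $pad_string = ' ', $pad_style = STR_PAD_RIGHT, $encoding = "UTF-8") {
        // str_pad() counts bytes, so widen the target length by the difference
        // between the byte length and the character length of $input.
        return str_pad($input,
            strlen($input) - mb_strlen($input, $encoding) + $pad_length, $pad_string, $pad_style);
    }
}
?>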
up
5
mattr at telebody dot com
11 years ago
A brief note on Daniel Rhodes' mb_punctuation_trim().
The regular expression modifier u does not mean ungreedy, rather it means the pattern is in UTF-8 encoding. Instead the U modifier should be used to get ungreedy behavior. (I have not otherwise tested his code.)
See http://php.net/manual/en/reference.pcre.pattern.modifiers.php
up
12
Anonymous
12 years ago
Yet another single-line mb_trim() function

<?php
function mb_trim($string, $trim_chars = '\s'){
    return preg_replace('/^['.$trim_chars.']*(?U)(.*)['.$trim_chars.']*$/u', '\\1',$string);
}
$string = '           "some text."      ';
echo mb_trim($string, '\s".');
//some text
?>
up
9
roydukkey at roydukkey dot com
16 years ago
This would be one way to create a multibyte substr_replace function

<?php
function mb_substr_replace($output, $replace, $posOpen, $posClose) {
    return mb_substr($output, 0, $posOpen).$replace.mb_substr($output, $posClose+1);
}
?>
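
The helper above takes an opening and a closing position. As a hedged variation, the sketch below uses a character offset and a character length instead, closer in spirit to substr_replace(). The name mb_substr_replace_len and its parameters are hypothetical, and negative offsets or lengths are not handled.

```php
<?php
// Hypothetical variant: replace $length characters starting at character
// $offset. Assumes non-negative $offset and $length.
function mb_substr_replace_len(string $str, string $replace, int $offset, int $length, string $encoding = 'UTF-8'): string
{
    $head = mb_substr($str, 0, $offset, $encoding);
    $tail = mb_substr($str, $offset + $length, null, $encoding);
    return $head . $replace . $tail;
}

echo mb_substr_replace_len('漢字はユニコード', 'Foo', 2, 1), "\n";
// prints: 漢字Fooユニコード
```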
up
5
treilor at gmail dot com
11 years ago
A small note for those who will follow rawsrc at gmail dot com's advice: mb_split uses regular expressions, in which case it may make sense to use built-in function mb_ereg_replace.
up
6
Ben XO
17 years ago
PHP5 has no mb_trim(), so here's one I made. It works just like trim(), but with the added bonus of PCRE character classes (including, of course, all the useful Unicode ones such as \pZ).

Unlike other approaches that I've seen to this problem, I wanted to emulate the full functionality of trim() - in particular, the ability to customise the character list.

<?php
    /**
     * Trim characters from either (or both) ends of a string in a way that is
     * multibyte-friendly.
     *
     * Mostly, this behaves exactly like trim() would: for example supplying 'abc' as
     * the charlist will trim all 'a', 'b' and 'c' chars from the string, with, of
     * course, the added bonus that you can put unicode characters in the charlist.
     *
     * We are using a PCRE character-class to do the trimming in a unicode-aware
     * way, so we must escape ^, \, - and ] which have special meanings here.
     * As you would expect, a single \ in the charlist is interpreted as
     * "trim backslashes" (and duly escaped into a double-\ ). Under most circumstances
     * you can ignore this detail.
     *
     * As a bonus, however, we also allow PCRE special character-classes (such as '\s')
     * because they can be extremely useful when dealing with UCS. '\pZ', for example,
     * matches every 'separator' character defined in Unicode, including non-breaking
     * and zero-width spaces.
     *
     * It doesn't make sense to have two or more of the same character in a character
     * class, therefore we interpret a double \ in the character list to mean a
     * single \ in the regex, allowing you to safely mix normal characters with PCRE
     * special classes.
     *
     * *Be careful* when using this bonus feature, as PHP also interprets backslashes
     * as escape characters before they are even seen by the regex. Therefore, to
     * specify '\\s' in the regex (which will be converted to the special character
     * class '\s' for trimming), you will usually have to put *4* backslashes in the
     * PHP code - as you can see from the default value of $charlist.
     *
     * @param string 
     * @param charlist list of characters to remove from the ends of this string.
     * @param boolean trim the left?
     * @param boolean trim the right?
     * @return String
     */
    function mb_trim($string, $charlist='\\\\s', $ltrim=true, $rtrim=true)
    {
        $both_ends = $ltrim && $rtrim;

        $char_class_inner = preg_replace(
            array( '/[\^\-\]\\\]/S', '/\\\{4}/S' ),
            array( '\\\\\\0', '\\' ),
            $charlist
        );

        $work_horse = '[' . $char_class_inner . ']+';
        $ltrim && $left_pattern = '^' . $work_horse;
        $rtrim && $right_pattern = $work_horse . '$';

        if($both_ends)
        {
            $pattern_middle = $left_pattern . '|' . $right_pattern;
        }
        elseif($ltrim)
        {
            $pattern_middle = $left_pattern;
        }
        else
        {
            $pattern_middle = $right_pattern;
        }

        return preg_replace("/$pattern_middle/usSD", '', $string);
    }
?>
up
4
Hayley Watson
7 years ago
SOME multibyte encodings can safely be used in str_replace() and the like, others cannot. It's not enough to ensure that all the strings involved use the same encoding: obviously they have to, but it's not enough. It has to be the right sort of encoding.

UTF-8 is one of the safe ones, because it was designed to be unambiguous about where each encoded character begins and ends in the string of bytes that makes up the encoded text. Some encodings are not safe: the last bytes of one character in a text followed by the first bytes of the next character may together make a valid character. str_replace() knows nothing about "characters", "character encodings" or "encoded text". It only knows about the string of bytes. To str_replace(), two adjacent characters with two-byte encodings just look like a sequence of four bytes, and it's not going to know it shouldn't try to match the middle two bytes.

While real-world examples can be found of str_replace() mangling text, it can be illustrated by using the HTML-ENTITIES encoding. It's not one of the safe ones. All of the strings being passed to str_replace() are valid HTML-ENTITIES-encoded text so the "all inputs use the same encoding" rule is satisfied.

The text is "x<y". It is represented by the byte string [78 26 6c 74 3b 79]. Note that the text has three characters, but the string has six bytes.

<?php

$string = 'x&lt;y';
mb_internal_encoding('HTML-ENTITIES');

echo "Text length: ", mb_strlen($string), "\tString length: ", strlen($string), " ... ", $string, "\n";
// Three characters, six bytes; the text reads "x<y".

$newstring = str_replace('l', 'g', $string);
echo "Text length: ", mb_strlen($newstring), "\tString length: ", strlen($newstring), " ... ", $newstring, "\n";
// Three characters, six bytes, but now the text reads "x>y"; the wrong characters have changed.

$newstring = str_replace(';', ':', $string);
echo "Text length: ", mb_strlen($newstring), "\tString length: ", strlen($newstring), " ... ", $newstring, "\n";
// Now even the length of the text is wrong and the text is trashed.

?>

Even though neither 'l' nor ';' appear in the text "x<y", str_replace() still found and changed bytes. In one case, it changed the text to "x>y" and in the other it broke the encoding completely.

One more reason to use UTF-8 if you can, I guess.
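
As one possible real-world illustration (Shift_JIS is my choice here and is not mentioned in the note above), the following sketch shows a byte-wise replacement of an ASCII character corrupting a two-byte character, while the same call on the UTF-8 form leaves the text alone:

```php
<?php
// "ソ" (KATAKANA LETTER SO) is encoded in Shift_JIS as the bytes 83 5C,
// and 5C is also the ASCII backslash.
$utf8 = 'ソ';
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// Replacing backslashes byte-wise hits the second byte of the character.
$sjisMangled = str_replace('\\', '/', $sjis);

// The same operation on the UTF-8 form is harmless: every byte of a
// multibyte UTF-8 character is >= 0x80, so none can equal an ASCII byte.
$utf8Result = str_replace('\\', '/', $utf8);

var_dump(bin2hex($sjis));        // string(4) "835c"
var_dump(bin2hex($sjisMangled)); // string(4) "832f"
var_dump($utf8Result === $utf8); // bool(true)
```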
up
6
php at kamiware dot org
9 years ago
str_replace is NOT multibyte safe.

This Ukrainian word gives a bug when used in the next code: відео

$rubishcharacters='[#|\[{}\]´`≠,;.:-\\_<>=*+"\'?()!§$&%';
$searchstring='відео';

$result = str_replace(str_split($rubishcharacters), ' ', $searchstring);
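
A hedged reading of why this breaks: str_split() cuts the character list into single bytes, so the multibyte characters in the list (´, ≠, §) become stray bytes that can match inside Cyrillic letters. The sketch below, assuming the source file and both strings are UTF-8, compares that with mb_str_split() (PHP 7.4+), which keeps each needle a whole character. The variable names are adapted from the note.

```php
<?php
$rubbishCharacters = '[#|\[{}\]´`≠,;.:-\\_<>=*+"\'?()!§$&%';
$searchString      = 'відео';

// str_split() yields single bytes, so "´" (C2 B4) contributes a lone B4
// byte, which also occurs inside "д" (D0 B4).
$byByte = str_replace(str_split($rubbishCharacters), ' ', $searchString);

// mb_str_split() keeps every needle a complete UTF-8 character.
$byChar = str_replace(mb_str_split($rubbishCharacters, 1, 'UTF-8'), ' ', $searchString);

var_dump($byByte === $searchString); // bool(false): bytes were overwritten
var_dump($byChar === $searchString); // bool(true): word left intact
```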
up
4
Daniel Rhodes
12 years ago
Here's a cheap and cheeky function to remove leading and trailing *punctuation* (or more specifically "non-word characters") from a UTF-8 string in whatever language. (At least it works well enough for Japanese and English.)

/**
 * Trim singlebyte and multibyte punctuation from the start and end of a string
 * 
 * @author Daniel Rhodes
 * @note we want the first non-word grabbing to be greedy but then
 * @note we want the dot-star grabbing (before the last non-word grabbing)
 * @note to be ungreedy
 * 
 * @param string $string input string in UTF-8
 * @return string as $string but with leading and trailing punctuation removed
 */
function mb_punctuation_trim($string)
{
    preg_match('/^[^\w]{0,}(.*?)[^\w]{0,}$/iu', $string, $matches); //case-'i'nsensitive and 'u'ngreedy
    
    if(count($matches) < 2)
    {
        //some strange error so just return the original input
        return $string;
    }
    
    return $matches[1];
}

Hope you like it!
up
2
abidul dot rmdn at gmail dot com
6 years ago
Having to migrate to MB functions can be a bit of a pain if you have a big project. It took us a while at my company, but then we made a small script and explained it in a short blog post.
https://link.medium.com/25w1LronCX

which makes it really easy to migrate to mb_ functions.
up
2
peter kehl
19 years ago
UTF-16LE solution for CSV for Excel by Eugene Murai works well:
$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');

However, Excel on Mac OS X then doesn't identify columns properly and it puts each whole row in its own cell. To fix that, use the TAB ("\t") character as the CSV delimiter rather than a comma or colon.

You may also want to use HTTP encoding header, such as
header( "Content-type: application/vnd.ms-excel; charset=UTF-16LE" );
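
Putting the pieces from this note together, a minimal sketch might look like the following. The function name rows_to_utf16le_tsv is hypothetical, and cells are assumed to contain no tabs or line breaks, so quoting is left out.

```php
<?php
// Hypothetical helper: turns rows of UTF-8 cells into a tab-delimited,
// BOM-prefixed UTF-16LE string.
function rows_to_utf16le_tsv(array $rows): string
{
    $lines = [];
    foreach ($rows as $row) {
        $lines[] = implode("\t", $row);
    }
    $utf8 = implode("\r\n", $lines) . "\r\n";

    // FF FE is the little-endian byte order mark.
    return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

$data = rows_to_utf16le_tsv([['名前', 'city'], ['漢字', 'Kyiv']]);

// In a web response this would be sent roughly as:
// header('Content-Type: application/vnd.ms-excel; charset=UTF-16LE');
// echo $data;
```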
up
3
rawsrc at gmail dot com
14 years ago
Hi,

For those who are looking for mb_str_replace, here's a simple function :
<?php
function mb_str_replace($needle, $replacement, $haystack) {
   return implode($replacement, mb_split($needle, $haystack));
}
?>
I haven't found a simpler way to proceed :-)
up
1
pdezwart .at. snocap
19 years ago
If you are trying to emulate the UnicodeEncoding.Unicode.GetBytes() function in .NET, the encoding you want to use is: UCS-2LE
up
0
johannesponader at dontspamme dot googlemail dot co
15 years ago
Please note that when migrating code to handle UTF-8 encoding, not only the functions mentioned here are useful, but also the function htmlentities() has to be changed to htmlentities($var, ENT_COMPAT, "UTF-8") or similar. I didn't scan the manual for it, but there could be some more functions that need adjustments like this.
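
A short sketch of the difference, assuming the input bytes are UTF-8: passing the real encoding yields one entity for "é", while declaring a single-byte encoding makes htmlentities() treat each byte as its own character.

```php
<?php
$utf8 = "caf\xC3\xA9"; // "café" encoded as UTF-8

// Telling htmlentities() the real encoding yields one entity for "é".
$ok = htmlentities($utf8, ENT_COMPAT, 'UTF-8');

// Declaring ISO-8859-1 makes it read C3 and A9 as two separate characters.
$wrong = htmlentities($utf8, ENT_COMPAT, 'ISO-8859-1');

var_dump($ok);    // string(11) "caf&eacute;"
var_dump($wrong); // string(17) "caf&Atilde;&copy;"
```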