The Will Will Web

記載著 Will 在網路世界的學習心得與技術分享

MySQL 4.1/5 如何將現有 latin1 中文資料匯入成 UTF-8 資料

字集字碼的問題真的是很麻煩的,沒有好的觀念實在很難解決這麼多字集字碼的問題,尤其在轉換資料的時候,最是痛苦,很多人解決不了也是妥協,但常看到別人的網站上出現一堆 \ ? 等字元,站長本身又無法解決時,想必也是很難過吧。以下這篇文章是非常精華的一篇關於如何將 MySQL 4.0 完整且正確的升級至 MySQL 4.1/5.0 的說明。 至於文章中的名人「許功蓋」相信大家都知道是誰吧!^_^

MySQL 關於 Character Sets and Collations 的詳細定義在此:

    http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html

概要說明一下:

mysql> set names utf8; set character set utf8;
mysql> SELECT name FROM `cms_content` WHERE `name` LIKE CONVERT( _big5 '%許功蓋%' USING utf8 );

CONVERT:

    宣告 '%歐白%'  為 BIG5 字集,並轉成 UTF8 字碼,用於 SQL 查詢

真相:

    只要用 casting 功能就可以下出正確的 SQL Command

    e.g. 

        mysql> set names utf8;set character set utf8;
        mysql> SELECT name FROM `cms_content` WHERE `name` LIKE _big5 '%許功蓋%';

        mysql> set names big5;set character set big5;
        mysql> SELECT name FROM `cms_content` WHERE `name` LIKE _big5 '%許功蓋%';

        只要能指定「字串」的字集,MySQL 就能辨認字串的比對!  用  _big5 'XXXXX'
        這兩段的結果都可以正確查到資料!

情況(一):轉移 MySQL 4.0 之前的資料

    1. 先 CREATE DATABASE 指定字集為 big5

        CREATE DATABASE `dbname1` DEFAULT CHARACTER SET UTF8;

    2. 選擇資料庫

        USE `dbname1`;

    3. 再宣告接下來要輸入 SQL 所使用的字集 ( Big5 )

        SET NAMES 'big5';
        SET CHARACTER SET big5;

    4. 再來,就新增資料

        CREATE TABLE or INSERT INTO

情況(二):轉移網站 PHP 程式

    在資料庫連線後,立即執行兩行 SQL 指令:

    $conn->Execute("SET NAMES 'big5'");   
    $conn->Execute("SET CHARACTER SET big5");

情況(三):開發新的 PHP 網站

    1. 資料庫、表格與欄位皆宣告為 UTF8

    2. 訊息在處理的時候,程式要先宣告使用的字集: ( e.g. Big5 或 UTF-8 )

        $conn->Execute("SET NAMES 'big5'");   
        $conn->Execute("SET CHARACTER SET big5");

    3. 最好在輸入資料庫的時候,完全以 utf8 存入資料庫

        ※ 潛在問題:

        1. 若輸入字集為 Big5,但 Browser 會將「非Big5」的字集變成 Unicode Entities,要如何正確存入 MySQL 5

            - 策略:先將 BIG5 文字轉成 UTF-8,再將 Unicode Entities 轉成 UTF-8,就大功告成啦!我真是天才~  ^____^
            - 解答:

      $tmp = '我是Will!  :: ク チ コ ミ 付 き 全 国 う ま い ラ';
      $str = iconv("BIG5", "UTF-8", $tmp);
      $str = unescape($str);     # unescape 定義在下方

      $conn->Execute("SET NAMES 'UTF8'");
      $conn->Execute("SET CHARACTER SET UTF8");

      $conn->Execute("INSERT INTO test VALUES ('[ ".date("Y-m-d H:i:s")." ] ".$str."');");

        2. 若輸出的字集是 Big5,原本的 UTF-8 用 iconv 可能無法完全轉回 Big5,要如何將無法轉換的部分轉成 Entities 呢?

            - 策略:先取出所有 UTF-8 的單字,一個一個用 iconv 轉換,轉換不成功的再轉成 Entities
            - 解答:

        $s = ddc_readfile("/home/demo/utf8.txt");

        function utf8_to_big5($str)
        {
            #$to_charset = 'BIG5';
            #$to_charset = 'GB2312';
            #$to_charset = 'UTF-8';

            $old_mb_internal_encoding = mb_internal_encoding();

            mb_internal_encoding('UTF-8');

            $len = mb_strlen($str);

            $buf = '';

            for($i=$len ; $i > 0 ; $i--)
            {
                $idx = $len - $i;

                $char = mb_substr($str, $idx, 1);

                $char_b5 = iconv("UTF-8", $to_charset, $char);

                if($char_b5 == '') {
                    $char = utf8ToUnicodeEntities($char);
                }
                else {
                    $char = $char_b5;
                }

                $buf .= $char;
            }

            mb_internal_encoding($old_mb_internal_encoding);

            return $buf;
        }

        # header("Content-Type: text/html; charset=UTF-8");
        $n = utf8_to_big5($s);

        echo($n);

        ※ 若應用程式是以 Big5 做為傳遞的依據,可以用以下方式確保輸入的字集都是 utf8

        # 先判斷該字串是否為 UTF-8 字串,若不是則用 iconv 從 BIG5 轉換至 UTF-8
        if (iconv('UTF-8', 'UTF-8', $xml) != $xml) {
            $xml = iconv("BIG5", "UTF-8", $xml);
        }

        # 再將所有 Browser 自動轉換的 Entities 轉成 Unicode 字串
        $xml = preg_replace_callback("/(&#[0-9]{1,5}+;)/", "preg_utf8Encode", $xml);
           
        function preg_utf8Encode($matches)
        {
           #dp($matches[1]);
           return utf8Encode($matches[1]);
        }

        function utf8Encode ($source)
        {
            $utf8Str = '';
            $entityArray = explode ("&#", $source);
            $size = count ($entityArray);
            for ($i = 0; $i < $size; $i++)
            {
                $subStr = $entityArray[$i];
                $nonEntity = strstr ($subStr, ';');
                if ($nonEntity !== false)
                {
                  /* Add by Will Huang - BEGIN */
                  if(substr ($subStr, 0, 1) == 'x') {
                    $unicode = intval(hexdec(substr ($subStr, 1, (strpos ($subStr, ';')+1))));
                  }
                  else
                  /* Add by Will Huang - END */
                    $unicode = intval (substr ($subStr, 0, (strpos ($subStr, ';') + 1)));

                    // determine how many chars are needed to reprsent this unicode char
                    if ($unicode < 128) {
                    $utf8Substring = chr ($unicode);
                    }
                    else if ($unicode >= 128 && $unicode < 2048) {
                        $binVal = str_pad (decbin ($unicode), 11, "0", STR_PAD_LEFT);
                        $binPart1 = substr ($binVal, 0, 5);
                        $binPart2 = substr ($binVal, 5);

                        $char1 = chr (192 + bindec ($binPart1));
                        $char2 = chr (128 + bindec ($binPart2));
                        $utf8Substring = $char1 . $char2;
                    }
                    else if ($unicode >= 2048 && $unicode < 65536) {
                        $binVal = str_pad (decbin ($unicode), 16, "0", STR_PAD_LEFT);
                        $binPart1 = substr ($binVal, 0, 4);
                        $binPart2 = substr ($binVal, 4, 6);
                        $binPart3 = substr ($binVal, 10);

                        $char1 = chr (224 + bindec ($binPart1));
                        $char2 = chr (128 + bindec ($binPart2));
                        $char3 = chr (128 + bindec ($binPart3));
                        $utf8Substring = $char1 . $char2 . $char3;
                    }
                    else {
                        $binVal = str_pad (decbin ($unicode), 21, "0", STR_PAD_LEFT);
                        $binPart1 = substr ($binVal, 0, 3);
                        $binPart2 = substr ($binVal, 3, 6);
                        $binPart3 = substr ($binVal, 9, 6);
                        $binPart4 = substr ($binVal, 15);

                        $char1 = chr (240 + bindec ($binPart1));
                        $char2 = chr (128 + bindec ($binPart2));
                        $char3 = chr (128 + bindec ($binPart3));
                        $char4 = chr (128 + bindec ($binPart4));
                        $utf8Substring = $char1 . $char2 . $char3 . $char4;
                    }

                    if (strlen ($nonEntity) > 1) {
                        $nonEntity = substr ($nonEntity, 1); // chop the first char (';')
                    }
                    else  {
                        $nonEntity = '';
                    }

                    $utf8Str .= $utf8Substring . $nonEntity;
                }
                else {
                    $utf8Str .= $subStr;
                }
            }

            return $utf8Str;
        }

UTF-8 Notes:

http://php.nctu.edu.tw/manual/en/function.iconv.php

<?php
//script from http://zizi.kxup.com/
//javascript unesape
function unescape($str) {
  $str = rawurldecode($str);
  preg_match_all("/(?:%u.{4})|&#x.{4};|&#\d+;|.+/U",$str,$r);
  $ar = $r[0];
print_r($ar);
  foreach($ar as $k=>$v) {
   if(substr($v,0,2) == "%u")
     $ar[$k] = iconv("UCS-2","UTF-8",pack("H4",substr($v,-4)));
   elseif(substr($v,0,3) == "&#x")
     $ar[$k] = iconv("UCS-2","UTF-8",pack("H4",substr($v,3,-1)));
   elseif(substr($v,0,2) == "&#") {
echo substr($v,2,-1)."<br>";
     $ar[$k] = iconv("UCS-2","UTF-8",pack("n",substr($v,2,-1)));
   }
  }
  return join("",$ar);
}
?>

 


**********************************************
使用 ADOdb for PHP4 如何順利將 PHP4 + MySQL 4.0 升級至 PHP4 + MySQL 4.1/5.0

※ 必要條件:要安裝 iconv、一定要是輸入 BIG5 的字

※ 開啟 adobe/adodb.inc.php 檔案,並修改以下兩個 Function !!!

    function b2u_walk (&$item, $key) {
        if(!is_array($item)) {
            $item = iconv("BIG5", "UTF-8//IGNORE", $item);
        }
    }

    function GetUpdateSQL(&$rs, $arrFields,$forceUpdate=false,$magicq=false)
    {
        array_walk($arrFields, 'b2u_walk');
       
        global $ADODB_INCLUDED_LIB;
        if (empty($ADODB_INCLUDED_LIB)) include_once(ADODB_DIR.'/adodb-lib.inc.php');
       
        $tmp = _adodb_getupdatesql($this,$rs,$arrFields,$forceUpdate,$magicq);
        $tmp = iconv("UTF-8", "BIG5//IGNORE", $tmp);
        return $tmp;
    }

    function GetInsertSQL(&$rs, $arrFields,$magicq=false)
    {
        array_walk($arrFields, 'b2u_walk');
       
        global $ADODB_INCLUDED_LIB;
        if (empty($ADODB_INCLUDED_LIB)) include_once(ADODB_DIR.'/adodb-lib.inc.php');
       
        $tmp = _adodb_getinsertsql($this,$rs,$arrFields,$magicq);
        $tmp = iconv("UTF-8", "BIG5//IGNORE", $tmp);
        return $tmp;
    }