Convert That DB

From Zoelife4U Wiki
Jump to: navigation, search

Contents

Synopsis

MySQL versions newer than 4.1 use UTF-8 to encode special characters, and prior versions used latin1. This behavior is not appropriate for languages with special characters, Hungarian, Turkish, Swedish, to name a few are examples of languages that either have extra characters to latin1 or altogether different letters. Some people move webhosts from a local host in their country who have the correct charset available by default to a host in the US, who do not. This article is for these situations. For this guide we will be using the latin2/iso-8859-16 charset for our example. Be sure to modify the examples to suit your actual filenames, databasenames, and required charset.

Basic Steps

The following guide will explain how to import your website and database with nonstandard latin characters into your account. SSH Access will be required to accomplish these tasks.

1. We need to export the current database, be sure to export it using the correct charset listed below:

d=`date +%Y%m%d-%H%M`;mysqldump -p -u`whoami` --add-drop-table name_ofdatabase > $HOME/dbname.$d.sql

Next, we need to convert the charset in the actual SQL file, to do this, we run (You did make a backup first, right?):

 
cp -v dbdumpfile.sql dbdumpfile.sql.backup  # Yea, I don't trust you, back it up ;)
sed -i 's/latin1/latin2/g' dbdumpfile.sql  # Now convert the thing

Be sure to substitute your desired character set that is listed in the table at the bottom of this article.

Once we've got our converted SQL file with the correct charset, we import it:

mysql -p -u`whoami` name_ofdatabase < dbdumpfile.sql

OK, so we've got a database all set with our charset and collation set correctly, now, we need to find the one PHP file that will be included by all the others, this can be a config file, the db declarations file, the header file, just so long as it's used in all the other pages, we add the following just below the very first <? or <?php:

header("Content-type: text/html; charset=iso-8859-16\n\n");

Where iso-8859-16 can be any number of charsets as described in the Popular Charset's for the HTTP Header section of this article That's that, now we must go into our PHP script connecting to the database and add the following directly after the mysql_select_db() function:

if(!mysql_query("SET NAMES 'latin2'")) {
    die("Unable to set character set for database: ".mysql_error());
}

Once these changes have been made, the site is set and will display the proper character encoding, provided it is installed and available in the browser.

The Charset Codes

Character Sets Available on MySQL

mysql> show charset;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 | 
| dec8     | DEC West European           | dec8_swedish_ci     |      1 | 
| cp850    | DOS West European           | cp850_general_ci    |      1 | 
| hp8      | HP West European            | hp8_english_ci      |      1 | 
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 | 
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 | 
| latin2   |  iso-8859-2 Central European | latin2_general_ci   |      1 | 
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 | 
| ascii    | US ASCII                    | ascii_general_ci    |      1 | 
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 | 
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 | 
| hebrew   |  iso-8859-8 Hebrew           | hebrew_general_ci   |      1 | 
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 | 
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 | 
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 | 
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 | 
| greek    |  iso-8859-7 Greek            | greek_general_ci    |      1 | 
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 | 
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 | 
| latin5   |  iso-8859-9 Turkish          | latin5_turkish_ci   |      1 | 
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 | 
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 | 
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 | 
| cp866    | DOS Russian                 | cp866_general_ci    |      1 | 
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 | 
| macce    | Mac Central European        | macce_general_ci    |      1 | 
| macroman | Mac West European           | macroman_general_ci |      1 | 
| cp852    | DOS Central European        | cp852_general_ci    |      1 | 
| latin7   |  iso-8859-13 Baltic          | latin7_general_ci   |      1 | 
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 | 
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 | 
| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 | 
| binary   | Binary pseudo charset       | binary              |      1 | 
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |      1 | 
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 | 
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 | 
+----------+-----------------------------+---------------------+--------+
36 rows in set (0.00 sec)

Popular Charset's for the HTTP Header

Example: header("Content-type: text/html; charset=iso-8859-16\n\n");

* ISO/IEC 646| iso-646 
** ASCII 
* EBCDIC 
** Code page 930|CP930  
*ISO/IEC 8859| iso-8859 :
**ISO/IEC 8859-1| iso-8859-1  Western Europe
**ISO/IEC 8859-2| iso-8859-2  Western and Central Europe
**ISO/IEC 8859-3| iso-8859-3  Western Europe and South European ( Turkish, Maltese plus Esperanto )
**ISO/IEC 8859-4| iso-8859-4  Western Europe and Baltic countries ( Lithuania, Estonia and Lapp )
**ISO/IEC 8859-5| iso-8859-5  Cyrillic alphabet
**ISO/IEC 8859-6| iso-8859-6  Arabic
**ISO/IEC 8859-7| iso-8859-7  Greek
**ISO/IEC 8859-8| iso-8859-8  Hebrew
**ISO/IEC 8859-9| iso-8859-9  Western Europe with amended Turkish character set
**ISO/IEC 8859-10| iso-8859-10  Western Europe with rationalised character set for Nordic languages, including complete Icelandic set.
**ISO/IEC 8859-11| iso-8859-11  Thai
**ISO/IEC 8859-13| iso-8859-13  Baltic languages plus Polish
**ISO/IEC 8859-14| iso-8859-14  Celtic languages ( Irish Gaelic, Scottish, Welsh )
**ISO/IEC 8859-15| iso-8859-15  Added the Euro sign and other rationalisations to  iso-8859-1
**ISO/IEC 8859-16| iso-8859-16  Central European languages ( Polish, Czech, Slovenian, Slovak, Hungarian, Albanian, Romanian, German, Italian )
* Code page 437|CP437 ,  Code page 737|CP737 ,  Code page 850|CP850 ,  Code page 852|CP852 ,  Code page 855|CP855 ,  Code page 857|CP857 , Code page 858|CP858 ,  Code page 860|CP860 ,  Code page 861|CP861 ,  Code page 863|CP863 ,  Code page 865|CP865 ,  Code page 866|CP866 , Code page 869|CP869 
* Code page#Windows .28ANSI.29 code pages|MS-Windows character sets :
** Windows-1250  for Central European languages that use Latin script, (Polish, Czech, Slovak, Hungarian, Slovene, Serbian, Croatian, Romanian and Albanian)
** Windows-1251  for Cyrillic alphabets
** Windows-1252  for Western languages
** Windows-1253  for Greek
** Windows-1254  for Turkish
** Windows-1255  for Hebrew
** Windows-1256  for Arabic
** Windows-1257  for Baltic languages
** Windows-1258  for Vietnamese
* Mac OS Roman 
* KOI8-R ,  KOI8-U ,  KOI7 
* MIK Code page|MIK 
* Cork encoding|Cork or T1 
* ISCII 
* TSCII 
* VISCII 
* JIS X 0208  is a widely deployed standard for Japanese character encoding that has several encoding forms.
** Shift_JIS  (Microsoft  Code page 932  is a dialect of Shift_JIS)
** EUC-JP 
** ISO-2022-JP 
* JIS X 0213  is an extended version of JIS X 0208.
** Shift_JIS-2004 
** EUC-JIS-2004 
** ISO-2022-JP-2004 
* Chinese  Guobiao code|Guobiao 
** GB2312 
** GBK  (Microsoft  Code page 936 )
** GB18030 
* Taiwan  Big5  (a more famous variant is Microsoft  Code page 950 )
* Hong Kong  HKSCS 
* KS X 1001  is a Korean double-byte character encoding standard
** Extended Unix Code#EUC-KR|EUC-KR 
** ISO-2022-KR 
* Unicode  (and subsets thereof, such as the 16-bit 'Basic Multilingual Plane'). See  UTF-8 
* ANSEL  or  ISO/IEC 6937
Personal tools
Online Users
Zoelife4U: