Mar 10 2008
Migrating a multilingual Wordpress database
Migrating UTF-8 encoded Wordpress database can be a real headache using conventional methods where characters appear garbled. Here’s a quick and dirty solution that though crude, it just works:
(This article applies to UNIX-like systems running MySQL > v4.1 only!)
1. Dump the database:
Using a command shell on your server, enter the following command:
mysqldump [WordPress DB name] -u [DB User] –password=[DB Password] –default-character-set=latin1 | bzip2 -c > wp.sql.bz2
The command dumps all the content of your database into SQL scripts. The 2nd part of the command pipes the mysqldump output to a BZip2 compressor and saves the dump file as wp.sql.bz2.
The key ingredient is exporting as latin1 character set. As far as I know, Wordpress automatically converts UTF-8 characters into ANSI before saving it into the database. MySQL since version 4.1 dumps everything as UTF-8, so the UTF-8 characters gets doubly encoded, giving rise to the curse of the garbled character.
2. Download dump file and open with Notepad:
After unzipping the dump file, open it with Notepad or any other text editor that can convert character sets. Without modifying the dump file content, save it as a new file with ANSI encoding called ANSI-New.sql.
Some sites may have huge database dumps that will take Notepad years to process. Alternatively, you could use Notepad++, an open source replacement for Notepad.
Supposedly, you could also use iconv (on your server if you have it installed) but half way through transcoding, iconv failed on me.
3. Upload new dump file and import back to database:
You may want to compress the database using zip and tar gzip and upload it to your destination server. Decompress and pipe it to mysql:
mysql -u [New DB User] -p [New DB password] –default_character_set utf8 < ANSI-New.sql
The above command makes mysql execute the commands inside ANSI-New.sql. Again the key ingredient to success is letting mysql import using UTF-8 instead of latin1. The will cause the characters to be encoded only once, instead of twice.
After verifying that all non-English characters are still intact, you may call it a day!