Wednesday, April 14, 2010

11:16 PM

Running an internationalization / localization [i18n / L10n] friendly website can be tricky, and sometimes downright maddening for those who haven’t yet delved into the world of Unicode. Letting your users post in the language and characters of their choice is crucial for any modern website.

Here are a few things I have very painfully learned over the last 5 or so years on this topic … specifically with PHP and MySQL.
There are hundreds of character sets covering most of the languages on Earth, usually one per script or region [Latin, Cyrillic, Greek, Arabic, Korean, Chinese etc...]. One encoding that covers all of them is UTF-8, which can represent every character in Unicode. So how can you put UTF-8 to practical use? Easy … here’s how I’ve done it:

Headers! Get your headers!
The most important place to declare UTF-8 is the charset in your outgoing HTTP Content-Type header. This tells the browser that your HTML contains multi-byte characters and that you’d like it to display them as such [and not as the default ISO-8859-1].
To do this, put this at the very top of your PHP scripts [with the headers and before any HTML is echoed]:

<?php
header("Content-Type: text/html; charset=utf-8");
?>
And this in your HTML <head> section:
<?php
echo "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n";
?>
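
To see why the charset declaration matters, here is a small sketch of what a browser effectively does when it reads UTF-8 bytes as ISO-8859-1. The variable names are my own, and it assumes the mbstring extension is loaded:

```php
<?php
// "café" encoded as UTF-8: the é is stored as the two bytes C3 A9.
$utf8 = "caf\xC3\xA9";

// Simulate a client that assumes ISO-8859-1: every byte is treated as
// its own character, then re-encoded as UTF-8 for display.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo $utf8, "\n";     // café
echo $misread, "\n";  // cafÃ©
```

That "Ã©" is the classic sign that UTF-8 content went out without a matching charset declaration.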

MySQL / UTF-8 love
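The second most important thing is to make sure your database is also UTF-8 friendly. Be sure to set the collation of all your tables and [char / text] columns to utf8_unicode_ci. A utf8_* collation implies the utf8 character set, so this tells MySQL to store this data as UTF-8. Note that MySQL’s utf8 holds at most three bytes per character; characters outside the Basic Multilingual Plane need utf8mb4, available from MySQL 5.5.3.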
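
If your tables already exist with another collation, one way to switch them is MySQL’s ALTER TABLE ... CONVERT TO CHARACTER SET. The sketch below only builds the statements; the helper name and the table names are made up for illustration:

```php
<?php
// Hypothetical helper: builds the statement that converts one table to UTF-8.
// CONVERT TO CHARACTER SET changes the table default and its char / text columns.
function buildUtf8ConversionSql($table)
{
    // Double any backtick so the table name stays a single quoted identifier.
    $quoted = '`' . str_replace('`', '``', $table) . '`';
    return "ALTER TABLE $quoted CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci";
}

// Example table names; replace them with your own.
foreach (array('users', 'posts') as $table) {
    echo buildUtf8ConversionSql($table), ";\n";
}
```

Back up first: the conversion transcodes existing data, so rows that already hold UTF-8 bytes in a latin1 column can end up double-encoded.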

Once you’ve done that, you’ll need to tell PHP to connect to the MySQL daemon over a UTF-8 connection [otherwise the default is latin1 ... and your data will be stored in MySQL as such -- no good!]. Run this right after you connect to MySQL:
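<?php
// SET NAMES sets the client, connection and results character sets in one go.
// A following SET CHARACTER SET would reset the connection character set to the
// database default, so it is left out here.
mysql_query("SET NAMES 'utf8'");
?>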

Multibyte fun
Last, take advantage of PHP’s Multibyte String Functions! Oftentimes this is as easy as prefixing your string functions [strlen, substr, strtoupper ...] with mb_. But before you start using these functions you’ll need to tell PHP which character set to use [once again!], because older PHP versions default to ISO-8859-1:

<?php
mb_internal_encoding("UTF-8");
?>
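
Here is a quick sketch of the difference between the byte-based functions and their mb_ counterparts. The sample word and variable names are my own, and it assumes the mbstring extension is loaded:

```php
<?php
mb_internal_encoding('UTF-8');

// "naïve" encoded as UTF-8: five characters, but the ï takes two bytes.
$word = "na\xC3\xAFve";

$byteCount  = strlen($word);           // counts bytes: 6
$charCount  = mb_strlen($word);        // counts characters: 5

$brokenCut  = substr($word, 0, 3);     // cuts the ï in half: "na" plus a stray byte
$cleanCut   = mb_substr($word, 0, 3);  // "naï"

$upper      = mb_strtoupper($word);    // "NAÏVE"
```

The byte-based functions are fine for pure ASCII, but anything that counts, cuts or changes the case of user text is safer with the mb_ versions.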

Forms
One often neglected step is ensuring that the data the server receives is UTF-8 encoded. One way to try to do this with HTML forms is to include the accept-charset attribute in your form tag. I say “try” because it’s just a suggestion to the client submitting the form. Be aware that some clients, especially older browsers, may not pay much attention to the attribute. [Thanks to Alejandro for the heads up :-)]

<form action="/action" method="post" accept-charset="utf-8">
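
Since accept-charset is only a hint, it can help to check what actually arrives on the server. This is a minimal sketch using mb_check_encoding; the helper name and the 'comment' field are assumptions for illustration:

```php
<?php
// Hypothetical helper: returns the value if it is valid UTF-8, otherwise null.
function utf8OrNull($value)
{
    return (is_string($value) && mb_check_encoding($value, 'UTF-8')) ? $value : null;
}

// 'comment' is an assumed form field name.
$raw = isset($_POST['comment']) ? $_POST['comment'] : '';
$comment = utf8OrNull($raw);

if ($comment === null) {
    // Not valid UTF-8: reject the input or ask the user to resubmit.
    $comment = '';
}
```

Rejecting bad input is the simplest policy; guessing the real encoding and converting it is possible but much less reliable.
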
If you’ve gotten this far you should see some dramatic improvements to your web site’s accessibility and usability, drawing in users from around the world.
NOTE: This is a work in progress and I fully welcome any new ideas for this cocktail of methods. If you have anything to add, PLEASE DO SO.
