Posted on June 6th, 2009
Almost every web developer has run into the problem of character sets and character encoding. Joel On Software has the most succinct post on the topic of Unicode.
Here’s the problem. Your web page has certain characters that cannot be displayed properly: typographer’s quotes (“curly quotes” rather than the straight foot ' and inch " marks), ‘e acute’ (as in the word résumé), the copyright symbol (©), the registered symbol (®), and so on, usually copied from a program like Microsoft Word. Instead of those characters, your webpage renders the dreaded “black diamond question mark” symbol: �
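To make the symptom concrete, here is a minimal Python sketch of one common way the � symbol comes about. It assumes the text was saved in a single-byte encoding such as Windows-1252 and is then decoded as UTF-8. The sample string and variable names are illustrative, and this is not a diagnosis of any particular page.

```python
# Text as an editor might save it in Windows-1252 (an assumption for this sketch).
original = "résumé ©"
cp1252_bytes = original.encode("cp1252")

# A UTF-8 decoder cannot interpret the lone bytes 0xE9 (é) and 0xA9 (©),
# so each one is replaced with U+FFFD, the replacement character.
as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were actually written in round-trips cleanly.
as_cp1252 = cp1252_bytes.decode("cp1252")

# ascii() shows the replacement characters as \ufffd escapes.
print(ascii(as_utf8))
print(ascii(as_cp1252))
```

The bytes never change in this sketch. Only the decoder's assumption about them does, which is why the same content can look fine under one declared charset and broken under another.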
Since the earliest days of the web, we’ve been using HTML entities to create these characters. HTML entities are escape sequences that represent special characters in your web page markup. For example, the syntax
&copy;
renders as © in a webpage. I realize I can simply use these escape codes to get special characters to display correctly, but why? What if I have hundreds of pages of content with curly quotes in them and I just want to be able to render a page without using HTML entities?
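For reference, the entity approach can be sketched in a few lines of Python using the standard library's entity table. The helper name `to_entities` is hypothetical, and the sketch only handles non-ASCII characters; it does not escape markup characters such as & or <.

```python
from html import unescape
from html.entities import codepoint2name


def to_entities(text):
    """Replace every non-ASCII character with an HTML entity (hypothetical helper)."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)  # plain ASCII passes through unchanged
        elif code in codepoint2name:
            parts.append("&" + codepoint2name[code] + ";")  # named entity
        else:
            parts.append("&#" + str(code) + ";")  # numeric fallback
    return "".join(parts)


escaped = to_entities("© 2009 résumé")
restored = unescape(escaped)  # html.unescape reverses the substitution
```

Because the escaped output is pure ASCII, it survives any charset mismatch, which is why entities work as a workaround even though they do not address the underlying encoding question.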
When I develop websites, I run WAMPServer, which uses PHP 5, MySQL 5, and Apache 2 on Windows XP. I’ve been confused by this topic off and on for over 2 years now. And I’m not the only one.
I’ve tried troubleshooting the character encoding and serving problem from the top down, starting with the web server software and working down the line.
I have edited my Apache httpd.conf file to set the default charset to UTF-8.
I have edited my php.ini file with
default_charset = "utf-8"
I have also made sure that MySQL is using UTF-8. This includes the MySQL database itself, the MySQL connection, the MySQL table, and the MySQL field where my data is stored.
I have even gone into Firefox and set it to accept UTF-8 and receive UTF-8.
Still, I get unrenderable characters. WHY!?
I’m using Firebug to display the HTML Headers, and I’ve verified this is not a bug in Firefox. I’m seeing the dastardly � character whether I use Firefox 2, Firefox 3, Opera, Safari, or Chrome.
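The header inspection can also be scripted. The following Python sketch, with hypothetical helper names, pulls the charset out of a Content-Type header value and checks whether a given body would decode cleanly under it. It assumes you already have the header string and the raw body bytes in hand.

```python
from email.message import Message


def declared_charset(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def body_matches_header(content_type, body):
    """Return True if the body bytes decode cleanly under the declared charset."""
    charset = declared_charset(content_type)
    if charset is None:
        return False  # nothing declared, so nothing to verify against
    try:
        body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Invalid byte sequence, or a charset name Python does not know.
        return False
    return True
```

One caveat: a check like this can only catch mismatches when the declared charset is strict. ISO-8859-1 assigns a character to every byte, so a body always "matches" it, even when the result is the wrong characters.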
I’m sure there’s a character encoding guru out there somewhere who can tell me what I’m missing. I know, I know, I can just switch to iso-8859-1 (Latin-1) anywhere along the chain of encoding, and everything will be fine. And indeed, this is true. Yet it seems almost unfathomable that I’ve checked every possible setting related to the character set of the page I am trying to serve, and still get � everywhere.
Still, I thought the whole idea behind the move to UTF-8 was to prevent me from having to worry about all this stuff. I’d love to just happily store pages, create pages and serve pages in UTF-8 so all my characters look like they’re supposed to and I don’t have to escape them at all. Isn’t that the point?
I’m not convinced that I fixed the issue, but I have found a workaround. I decided to turn off the charset handling in both httpd.conf and php.ini, and added…
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
…to my page template. It works, but I still want to know why it works, or more accurately, why declaring everything UTF-8 doesn’t.
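One possible explanation, which nothing in this post confirms, is that the bytes being served were never UTF-8 in the first place, so declaring them as iso-8859-1 simply matches what they already are. Under that assumption, a small Python sketch like the following could test a chunk of content and re-encode it. The function name `to_utf8` and the choice of cp1252 as the fallback are assumptions made for illustration.

```python
def to_utf8(data, fallback="cp1252"):
    """Return data as UTF-8 bytes, re-encoding from a fallback encoding if needed."""
    try:
        data.decode("utf-8")  # strict decode: raises on any invalid sequence
        return data           # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # Assumed legacy encoding. cp1252 defines the curly quotes that
        # strict ISO-8859-1 leaves as control codes.
        return data.decode(fallback).encode("utf-8")


legacy = "“résumé”".encode("cp1252")   # simulated content pasted from Word
converted = to_utf8(legacy)
```

If content converted this way renders correctly with every layer set to UTF-8, that would point to the stored bytes, rather than the server settings, as the source of the problem.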
Update Oct. 2011
I’ve just discovered yet another issue that is not easy to figure out. It turns out that you can get the dreaded question-mark-in-diamond characters even in UTF-8 encoded files, if the file is written with a BOM (byte order mark). We had a PHP application including several files, one of which was encoded with a BOM. The special characters such as ç and õ were showing up fine on one part of the page, and as � on other parts of the page. We removed the BOM from that include file with Notepad++ on Windows and everything was fine again.
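Hunting for a stray BOM by hand gets tedious with many include files, so here is a minimal Python sketch for detecting and stripping one. The helper names are hypothetical, and the functions work on raw bytes, so a file would need to be read in binary mode first.

```python
import codecs


def has_utf8_bom(data):
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "ç and õ".encode("utf-8")
cleaned = strip_utf8_bom(with_bom)
```

Running `has_utf8_bom` over each include file's bytes would flag the affected ones without having to open every file in an editor.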