Special characters show up as a question mark inside of a black diamond
Posted on June 6th, 2009 · 57 comments
Almost every web developer has run into the problem of character sets and character encoding. Joel On Software has the most succinct post on the topic of Unicode.
Here’s the problem. Your web page contains characters that cannot be displayed properly: typographer’s quotes (“curly quotes” rather than the foot ' and inch " marks), ‘e acute’ (as in the word résumé), the copyright symbol (©), the registered symbol (®), and so on, usually copied from a program like Microsoft Word. Instead of these characters, your webpage renders the dreaded “black diamond question mark” symbol: �
Since the earliest days of the web, we’ve been using HTML entities to create these characters. HTML entities are escape sequences that represent special characters in your web page markup. For example, an entity such as &copy; renders as © in a webpage. I realize I can simply use these escape codes to get special characters to display correctly, but why? What if I have hundreds of pages of content with curly quotes in them and I just want to be able to render a page without using HTML entities?
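To make the escaping idea concrete, here is a minimal Python sketch (Python is my choice for illustration; the post itself uses PHP). It converts every non-ASCII character to a numeric character reference and then decodes the references again. The helper name `to_entities` and the sample string are hypothetical.

```python
import html

def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    # The 'xmlcharrefreplace' error handler emits &#NNN; for anything ASCII cannot hold.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Hypothetical sample containing the kinds of characters the post describes.
sample = "© 2009 “résumé”"
escaped = to_entities(sample)

print(escaped)                 # pure-ASCII markup, safe under any declared charset
print(html.unescape(escaped))  # decodes the references back to the original text
```

Escaped markup survives any charset declaration because it is plain ASCII, which is exactly why entities work as a fallback, and also why they hide the underlying encoding mismatch rather than fix it.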
When I develop websites, I run WAMPServer, which uses PHP 5, MySQL 5, and Apache 2 on Windows XP. I’ve been confused by this topic off and on for over 2 years now. And I’m not the only one.
I’ve tried troubleshooting the character encoding problem from the top down, starting with the web server software and working down the line.
I have edited my Apache httpd.conf file to set the default character set to UTF-8.
I have edited my php.ini file with
default_charset = "utf-8"
I have also made sure that MySQL is using UTF-8. This includes the MySQL database itself…
…the MySQL connection, the MySQL table, and the MySQL field where my data is stored.
As you can see here, I even have gone into Firefox and set it to accept UTF-8 and receive UTF-8.
Still, I get unrenderable characters. WHY!?
I’m using Firebug to display the HTML Headers, and I’ve verified this is not a bug in Firefox. I’m seeing the dastardly � character whether I use Firefox 2, Firefox 3, Opera, Safari, or Chrome.
I’m sure there’s a character encoding guru out there somewhere who can tell me what I’m missing. I know, I know, I can just turn on iso-8859-1 (Latin-1, which browsers treat as Windows-1252) anywhere along the chain of encoding, and everything will be fine. And indeed, this is true. It seems almost unfathomable that I’ve checked every possible setting related to the character set of the content type of the page I am trying to serve, and still get � everywhere.
Still, I thought the whole idea behind the move to UTF-8 was to prevent me from having to worry about all this stuff. I’d love to just happily store pages, create pages and serve pages in UTF-8 so all my characters look like they’re supposed to and I don’t have to escape them at all. Isn’t that the point?
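One plausible cause, which nothing in this post confirms, is that the page content itself was saved as Windows-1252 bytes while every layer declared UTF-8. The Python sketch below illustrates that assumption with a hypothetical sample string. It uses `cp1252` rather than `latin-1` because browsers map the `iso-8859-1` label to Windows-1252, whereas Python's `latin-1` codec is strict ISO-8859-1.

```python
# Hypothetical sample: text pasted from a word processor and saved as Windows-1252.
text = "“résumé” ©"
raw = text.encode("cp1252")  # single-byte encoding, e.g. 0x93 for the opening curly quote

# Declared as UTF-8: the non-ASCII bytes are not valid UTF-8 sequences,
# so the decoder substitutes U+FFFD, the black-diamond question mark.
as_utf8 = raw.decode("utf-8", errors="replace")

# Declared as Windows-1252: the same bytes map back to the intended characters.
as_cp1252 = raw.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

If this is what is happening, declaring UTF-8 is only half the job: the stored bytes also have to be re-encoded as UTF-8.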
I’m not convinced that I fixed the issue, but I have found a workaround. I decided to turn off the charset handling in both httpd.conf and php.ini, and added…
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
…to my page template. It works, but I still want to know why it works, or more accurately, why declaring everything UTF-8 doesn’t.
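Part of why the meta tag only took effect after the server settings were turned off is precedence: a charset in the HTTP Content-Type header wins over one declared in a meta tag. The Python sketch below is a simplified model of that rule, not how any browser is actually implemented; it ignores BOM sniffing and other heuristics, and the function names are hypothetical.

```python
import re
from email.message import Message
from typing import Optional

def header_charset(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when absent

def meta_charset(html_bytes: bytes) -> Optional[str]:
    """Crude scan of the first 1024 bytes for a charset declared in a meta tag."""
    match = re.search(rb"charset=[\"']?([A-Za-z0-9_-]+)", html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None

def declared_charset(content_type: Optional[str], html_bytes: bytes) -> Optional[str]:
    """Simplified precedence: the HTTP header beats the meta tag."""
    return header_charset(content_type) or meta_charset(html_bytes)

page = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
print(declared_charset("text/html; charset=utf-8", page))  # header wins
print(declared_charset("text/html", page))                 # falls back to the meta tag
```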
Update Oct. 2011
I’ve just discovered yet another issue that is not easy to figure out. It turns out that you can get the dreaded question-mark-in-diamond characters even in UTF-8 encoded files, if the file is written with a BOM (byte order mark). We had a PHP application including several files, one of which was encoded with a BOM. Special characters such as ç and õ were showing up fine on one part of the page, and as � on other parts of the page. We removed the BOM from that include file with Notepad++ on Windows and everything was fine again.
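For anyone who prefers a script to an editor, here is a minimal Python sketch of detecting and stripping a UTF-8 BOM. It works on bytes in memory; the helper name `strip_utf8_bom` and the sample content are hypothetical, and applying it to real files is left to the reader.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical file content: a BOM followed by UTF-8 encoded text.
with_bom = codecs.BOM_UTF8 + "ç õ".encode("utf-8")
clean = strip_utf8_bom(with_bom)

print(with_bom.startswith(codecs.BOM_UTF8))  # the BOM is the three bytes EF BB BF
print(clean.decode("utf-8"))
```

Python's `utf-8-sig` codec does the same thing at decode time, so `with_bom.decode("utf-8-sig")` is an alternative when only the text is needed.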
Javier Mosquera info worked for me too! Thanks Javier!
It might not get all of the black diamonds with the white question mark inside, but it did for my needs.
Great Forum, I’m bookmarking this page for reference just in case the problem reappears in a different form…
Comment out the AddDefaultCharset line in your Apache configuration file, because it overrides the Content-Type specified in the html files.
Just wanted to say THANKS for posting your fix with the iso-8859-1 char set! That worked beautifully for us. We’ve seen a lot of developers having this problem and this is the only solution we’ve seen that addresses the issue. THANKS AGAIN!
Thanks to Dave Burton! Commenting out AddDefaultCharset fixed my problem.
Hi, great post!
I am having the same problem using a java application.
Without the Apache default charset configuration, the page is shown without the unicode replacement characters, but an ajax request returns with them. My guess is the browser chooses the right encoding.
In order to solve the ajax response problem I updated the Apache httpd.conf to: ‘AddDefaultCharset utf-8’, but this causes the whole page (ajax response & hard coded html) to show those question-mark diamonds.
My db is utf-8 too.
I am facing the same problem on one of my press release websites. I think this is because of the CMS I am using, so I removed the CMS and it’s working fine now.