Ramblings on technology with a dash of social commentary
  • Special characters show up as a question mark inside of a black diamond

    Posted on June 6th, 2009 phpguru 43 comments

    Almost every web developer has run into the problem of character sets and character encoding. Joel On Software has the most succinct post on the topic of Unicode.

    Here’s the problem. Your web page has certain characters that cannot be displayed properly. Instead of typographer’s quotes (“curly quotes” instead of foot ' and inch " marks), ‘e acute’ (as in the word résumé), the copyright symbol (©), registered symbol (®), etc., usually copied from a program like Microsoft Word, your webpage renders with the dreaded “black diamond question mark” symbol: �

    Since the earliest days of the web, we’ve been using HTML Entities to create these characters. HTML Entities are escape sequences to represent special characters in your web page markup. For example, the syntax

    &copy;

    renders as ©, in a webpage. I realize I can simply use these escape codes to get special characters to display correctly, but why? What if I have hundreds of pages of content with curly quotes in them and I just want to be able to render a page without using HTML entities?

    When I develop websites, I run WAMPServer, which uses PHP 5, MySQL 5, and Apache 2 on Windows XP. I’ve been confused by this topic off and on for over 2 years now. And I’m not the only one.

    I’ve tried trouble-shooting the character encoding and serving problem from the top down, starting with the web server software on down the line.

    I have edited my Apache httpd.conf file with

    AddDefaultCharset UTF-8

    I have edited my php.ini file with

    default_charset = "utf-8"

    [Screenshot: httpd.conf and php.ini set to UTF-8]

    … and restarted Apache.

    I have also made sure that MySQL is using UTF-8. This includes both the MySQL Database itself… 

    [Screenshot: MySQL database and connection set to UTF-8]

    …the MySQL connection, the MySQL table, and the MySQL field where my data is stored.

    [Screenshot: MySQL field using utf8_general_ci]

    As you can see here, I even have gone into Firefox and set it to accept UTF-8 and receive UTF-8. 

    [Screenshot: Firefox default character encoding setting]

    Still, I get unrenderable characters. WHY!?

    [Screenshot: question mark in diamond rendered in Firefox with UTF-8]

    I’m using Firebug to display the HTML Headers, and I’ve verified this is not a bug in Firefox. I’m seeing the dastardly � character whether I use Firefox 2, Firefox 3, Opera, Safari, or Chrome.

    I’m sure there’s a character encoding guru out there somewhere that can tell me what I’m missing. I know, I know, I can just turn on ISO-8859-1 (Latin-1) anywhere along the chain of encoding, and everything will be fine. And indeed, this is true. It seems almost unfathomable that I’ve checked every possible setting related to the character set of the content type of the page I am trying to serve, and still get � everywhere.

    Still, I thought the whole idea behind the move to UTF-8 was to prevent me from having to worry about all this stuff. I’d love to just happily store pages, create pages and serve pages in UTF-8 so all my characters look like they’re supposed to and I don’t have to escape them at all. Isn’t that the point?

    I’m not convinced that I fixed the issue, but I have found a workaround. I decided to turn off the charset handling in both httpd.conf and php.ini, and added…

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

    …to my page template. It works, but I still want to know why it works, or more accurately, why declaring everything UTF-8 doesn’t.

    Update Oct. 2011

    I’ve just discovered yet another issue that is not easy to figure out. It turns out that you can get the dreaded question-mark-in-diamond characters even in UTF-8 encoded files, if the file is written with a BOM (byte order mark). We had a PHP application including several files, one of which was encoded with a BOM. Special characters such as ç and õ were showing up fine on one part of the page, and as � on other parts of the page. We removed the BOM from that include file with Notepad++ on Windows and everything was fine again.
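
    A UTF-8 BOM is the three bytes EF BB BF at the very start of a file. The following is a minimal sketch for spotting and stripping it from PHP rather than by hand in an editor. The helper names and the `includes` directory are assumptions made for illustration, not part of the application described above.

```php
<?php
// UTF-8 byte order mark: the three bytes EF BB BF at the start of a file.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true when the string starts with a UTF-8 BOM.
function has_utf8_bom(string $contents): bool
{
    return strncmp($contents, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: return the string without a leading BOM.
function strip_utf8_bom(string $contents): string
{
    return has_utf8_bom($contents) ? (string) substr($contents, 3) : $contents;
}

// Usage sketch: report include files that carry a BOM.
// The directory name is a placeholder for your own include path.
foreach (glob(__DIR__ . '/includes/*.php') ?: [] as $path) {
    if (has_utf8_bom((string) file_get_contents($path))) {
        echo "BOM found in $path\n";
    }
}
```

    Reporting first and stripping second is deliberate: it shows which files are affected before anything is rewritten.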

     

    43 responses to “Special characters show up as a question mark inside of a black diamond”

    • I was just looking for the same thing and found an answer – it’s happening because your text has been written to the database in ISO-8859-1 format, so you just need to convert the data from ISO-8859-1 to UTF-8 before outputting it. E.g.
      $text = "some iso-8859-1 string from database";
      $text = utf8_encode($text);
      echo $text;

      hope that helps :)
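
      Note that utf8_encode() is deprecated as of PHP 8.2, and calling it on text that is already UTF-8 double-encodes it. A minimal sketch of a guarded alternative follows, assuming the mbstring extension is loaded. The helper name to_utf8() and its default source encoding are assumptions, and the validity check is only a heuristic, since a few ISO-8859-1 byte sequences also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not
// already valid UTF-8, so running it twice does not double-encode.
function to_utf8(string $text, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (plain ASCII also passes)
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

$latin1 = "r\xE9sum\xE9";    // "résumé" as ISO-8859-1 bytes
$utf8   = to_utf8($latin1);  // each 0xE9 becomes the two bytes C3 A9
echo bin2hex($utf8), "\n";   // hex dump of the converted bytes
```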

    • UTF-8 can only encode code points in the range U+0000 to U+10FFFF.

      Any byte sequence that does not decode to a code point in this range is treated as invalid, and the UTF-8 decoder inserts the replacement character, �, in its place.

      see:
      1) http://en.wikipedia.org/wiki/UTF-8
      2) http://en.wikipedia.org/wiki/Replacement_character

      The above two links explain this nicely.

    • I understand that. What makes no sense is when Apache, PHP, and my browser are all calling the content UTF-8 at each step in the process, and still, the special character remains. There aren’t any ways to identify where it is breaking! The whole point of declaring your files UTF-8, your database fields UTF-8, your Apache & PHP config in UTF-8… is so you can put special characters in the site code and content and not have to escape it. But when something doesn’t work, there’s no way to find out what to fix.
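
      One way to find where the chain breaks is to look at the raw bytes at each step (as read from the database, after any string functions, just before output). This is a minimal sketch, assuming the mbstring extension; describe_bytes() is a hypothetical helper, not a built-in.

```php
<?php
// Hypothetical diagnostic: report whether a string is valid UTF-8
// and show its bytes in hex.
function describe_bytes(string $value): string
{
    $valid = mb_check_encoding($value, 'UTF-8') ? 'valid UTF-8' : 'NOT valid UTF-8';
    return sprintf('%s [%s]', $valid, implode(' ', str_split(bin2hex($value), 2)));
}

echo describe_bytes("S\xC3\xA9guin"), "\n"; // é stored as UTF-8 (C3 A9)
echo describe_bytes("S\xE9guin"), "\n";     // é stored as ISO-8859-1 (E9)
```

      A lone E9 byte in the dump means the data is still ISO-8859-1, no matter what the headers and column definitions declare.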

    • Make sure you set the client connection to UTF-8 as well (e.g. in PHP):

      mysqli_set_charset($dbc, 'utf8');

    • Hi All,

      I followed the discussion, and I have a valid UTF-8 character in the database (Séguin, Daniel). php.ini and httpd.conf do not have any default character set defined.
      The current character set in the page is

      However, in Chrome and Firefox the data is shown as S�guin, and in IE as Sguin, Daniel.

      Is there something I am missing, given that I still see the �?

      Any workaround, please?
      Vikram.

    • phpguru, are you by chance performing any type of character manipulation using a PHP function? For example, strtolower() and ucwords() are not designed for operation on multi-byte characters, and if one uses such functions on multi-byte characters, the result will contain � wherever a multi-byte character would otherwise appear.

      Most string manipulation functions have a multi-byte version. In the case of the examples cited above, one should instead use

      mb_convert_case($str, MB_CASE_TITLE, 'UTF-8')

      and

      mb_strtolower($str, 'UTF-8')

      respectively.
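
      The difference between the byte-oriented and multi-byte functions can be sketched as follows, assuming the mbstring extension is loaded. The sample word is written as explicit UTF-8 bytes so the result does not depend on how the script file itself is saved.

```php
<?php
$word = "R\xC3\x89SUM\xC3\x89"; // "RÉSUMÉ" written as explicit UTF-8 bytes

// strlen() counts bytes; mb_strlen() counts characters.
$byteCount = strlen($word);              // 8 bytes
$charCount = mb_strlen($word, 'UTF-8');  // 6 characters

// mb_strtolower() lowercases the accented letters as whole characters.
$lower = mb_strtolower($word, 'UTF-8');

// substr() can cut a two-byte character in half, leaving invalid UTF-8,
// which a browser then renders as the replacement character.
$brokenCut = substr($word, 0, 2);             // "R" plus a lone 0xC3 byte
$safeCut   = mb_substr($word, 0, 2, 'UTF-8'); // "R" plus the full É

var_dump(mb_check_encoding($brokenCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($safeCut, 'UTF-8'));   // bool(true)
```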

      You did not provide code samples from your MySQL queries, and as such, I assume that you have already done the following before executing the MySQL queries that return UTF-8 characters in the result set:

      mysql_query("SET NAMES 'UTF8'");
      mysql_query("SET CHARACTER SET 'UTF8'");

      While I have not had to employ this particular measure myself, others have stated that

      mysql_set_charset('utf8');

      may also be necessary in some cases.

    • We’re having the same problem on our website. Almost all of our products have the “degree” character in them (tool website). The problem is that we are seeing the black diamond everywhere, on all 5,000+ pages. All of our UTF-8 settings are correct. The ‘?’ actually appears in the database, but if I use htmlentities($value) when inserting, Magento doesn’t decode it on the other side.

      any thoughts??

      thanks

    • You may want to look at the CAST( ) and CONVERT( ) MySQL functions, and/or the iconv library for PHP. Sounds to me like you have non-UTF-8 character data stored in a UTF-8 database table. In my case, I had to write a command-line PHP script that went through all the fields in my database and explicitly converted them to UTF-8 from PHP. You may be able to UPDATE table SET field = CONVERT(field USING utf8) (note that MySQL spells the character set name utf8, not UTF-8).

    • I see this is an old discussion, but…

      This can be caused by something else that has nothing to do with encoding, casts, conversions, etc. I ran across it while developing an error handling class. Part of the process was to create a backtrace. During development, I was printing it to the screen (using ‘pre’ tags) and to a log. The black diamond was there whenever an object’s private properties were being enumerated. This did not happen with public properties. In a browser it would look like this: ObjectNamepropName. Printed to the log, it looked like this: NULObjectNameNULpropName.

      Come to find out the diamonds were actually the ASCII NUL character (written "\0" in a PHP double-quoted string, 0x00 in hex). I used this to clear it out:


      /* Replace ASCII NULL with empty string */
      $var = str_replace("\0",'',$var);

      Hope this helps someone. Took me several hours and half a pack of smokes to figure it out.
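
      One place these NUL bytes come from is PHP’s internal naming of non-public properties, which becomes visible when an object is cast to an array. The sketch below uses a hypothetical class named Demo to show it. Whether a particular backtrace or dump routine goes through this path is an assumption.

```php
<?php
class Demo
{
    private $secret = 1;
    protected $guarded = 2;
    public $open = 3;
}

// Casting an object to an array exposes PHP's internal property names:
// private   => "\0ClassName\0property"
// protected => "\0*\0property"
// public    => "property"
$keys = array_keys((array) new Demo());

foreach ($keys as $key) {
    // Print NUL bytes as a visible \0 instead of emitting them raw.
    echo str_replace("\0", '\0', $key), "\n";
}
```

      This matches the log output described above, where the class name and property name are each preceded by NUL.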

    • Is there a solution for when this happens with a simple character such as a ‘space’, the one you get when you press the spacebar on your keyboard? I’m asking because I’ve recently begun to see random spaces in text I entered in my MySpace blog using the built in text editor being replaced with that question mark inside a diamond symbol. I didn’t use any escape sequences or any special characters just the regular space from the spacebar, additionally it’s not happening to every space I insert, only in a few locations where I used 2 spaces next to each other as proper typing etiquette dictates you should use when starting a new sentence, even then it’s not happening to all of those instances, only a couple in random places in the text I type, but only where 2 spaces have been placed. Anyone have any idea how to fix this, I have tried editing it removing the offending symbol but every time I submit the changes it comes back. Thanks for any help you can provide.

    • Actually I just discovered the answer to this question a week ago. In HTML, there’s no way to put “two spaces” because whitespace is coalesced. What your WYSIWYG editor is doing is inserting a non-breaking-space character, that is then not displaying correctly given your document’s character set. So, what you’re seeing is quite likely a non-breaking-space character before or after an actual space character.

      I think some of the latest versions of HTML editors (Tiny MCE / FCK Editor) as well as Word (as usual) are the culprit. I discovered by accident that typing shift-space enters a non-breaking-space character. If you have a lazy pinky on the shift key when pressing the space bar, you can accidentally be injecting lots of non-breaking-space characters into your text documents without even realizing it until something like this happens.

      If you have access to the database, check to see if you can find CHAR(160) anywhere. In MySQL you do…

      SELECT content FROM table WHERE content LIKE CONCAT('%', CHAR(160) ,'%')

      If you get more than zero rows returned, then this is exactly what the problem is. To fix it, you can try something like:

      UPDATE table SET content = REPLACE(content, CHAR(160), '&nbsp;') WHERE content LIKE CONCAT('%', CHAR(160) ,'%')

      CHAR(160) is the non-breaking space character, not the html entity reference, `&nbsp;` , but the actual non-breaking space character. Note, of course, that you’ll have to change the SQL above to match your table and column names in your actual database. (Use at your own risk, always back up, etc. etc.)

      If you can’t access your database, switch to the code view of your HTML WYSIWYG editor, and delete the two spaces (one is probably a non-breaking space or the evil question-mark-in-diamond character). Instead put &nbsp; and then a space, or “  ” (you have to view source to see this last bit – WordPress doesn’t render the comment correctly). Ampersand-n-b-s-p-semicolon.

      Hope that helps…

    • The code from phpguru worked for me – no more diamonds!! I basically had a bunch of html code in a bunch of database fields that was “diamond infested” – phpguru’s template code above worked. Thanks again phpguru.

    • Hey there. I had the EXACT same problem with those black diamonds.

      I read a tutorial (in spanish, sadly for you), but it all comes to one simple line you have to add JUST after you selected your data base.

      This is the line:
      mysql_query("SET NAMES 'utf8'");

      Example:

      Do not decode nor encode the text. Do not do that, ok?

      What else……??? Oh! you also have to set your charset as UTF-8 like this:

      And also this…. you’ll have to set all your data base stuff to the charset:
      utf8_unicode_ci

      Having done all that, you should be OK.

      The line of code…

      … you’ll have to use it on ALL the pages involving the text you want to decode, that means it has to be on the “edit recordset page” (if there is one).

      OK… that’s the solution I found. Hope you can use it same way I did. Cheers!

    • You can also have this issue if you have copy and pasted text directly from MS Word or another text editor.

      I had the same issue and had to delete and re-enter the suspect characters like ” , ” and ‘ manually.

    • Yes, Microsoft Word is notorious for injecting curly quotes and actual non-breaking-space characters into CMS systems. CK Editor has “Paste from Word” for that reason, but most users don’t realize the importance of using it. I always paste Word content into TextMate (Mac) or UltraEdit (Windows) to strip all formatting before using in any web database.

    • I’ve had to deal with UTF-8 from a slightly different perspective, but it may help the situation here.

      UTF-8 is not the same as “allow all characters to go through.”

      UTF-8 actually encodes most of the Unicode character set into multi-byte characters. The kicker here is that it means many bytes are not valid characters, unless they’re preceded by the proper prefix byte.

      If memory serves, the Wikipedia article has a table of the UTF-8 character set and how it’s encoded. Any textual description of UTF-8 is hard to understand … and I’ve been a software engineer for over 30 years!

      ISO-8859-1, a.k.a. ISO Latin-1, doesn’t perform this extra encoding: every character is a single byte, so any 8-bit value goes through.
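
      The multi-byte encoding described above can be seen directly by encoding a few code points and dumping the bytes. This is a minimal sketch, assuming PHP 7.2 or later with the mbstring extension (for mb_chr()); the three sample code points are chosen only for illustration.

```php
<?php
// Encode a few code points as UTF-8 and show the resulting bytes.
$samples = [
    'U+0041 (A)'          => 0x41,   // one byte
    'U+00E9 (e acute)'    => 0xE9,   // two bytes
    'U+20AC (euro sign)'  => 0x20AC, // three bytes
];

foreach ($samples as $label => $codePoint) {
    printf("%s => %s\n", $label, bin2hex(mb_chr($codePoint, 'UTF-8')));
}

// A lone 0xE9 byte (e acute in ISO-8859-1) is a UTF-8 lead byte with no
// continuation bytes after it, so a UTF-8 decoder rejects it.
$loneByteIsValid = mb_check_encoding("\xE9", 'UTF-8');
var_dump($loneByteIsValid); // bool(false)
```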

    • Joe,
      Your point is well taken. UTF-8 is an encoding, just like any other. I think the thing that confused me and many other people the most is the fact that you can have a UTF-8 database column, a UTF-8 database table, a UTF-8 mysql connection, a UTF-8 Apache server, a UTF-8 PHP file and specify UTF-8 for your HTML document encoding… and still get question mark in diamond characters. Just because you specify UTF-8 all the way down the chain doesn’t automatically change characters in the wrong encoding to the right one (UTF-8). In my case, I had to update database content explicitly inside of a repeat loop in PHP. Once I checked all of the above settings and ran my conversion script – all my problems went away. My script simply selected one row at a time, converted the text and HTML content to UTF-8 and then updated that row.

    • Well, it looks like I’m not the only one having that issue, but mine is now slightly different.

      I’m migrating server from FreeBSD to CentOS, so copied the DB, obviously files are exactly the same and magic happens, diamonds everywhere.

      The database is set to latin1_swedish_ci, my HTML document is set to iso-8859-1, obviously this can be an issue (I think), but still how can I explain it works fine on the other server? the DB and files are exactly the same, only thing that has changed is the server.

      Any ideas?

    • This is a very good thread and more useful than the Joel On Software page (which is also informative, but not as practical IMHO).

      I’ve not yet found my answer — I’m copy/pasting “curly quotes” and other “special typographics” from EditPadPro, SlickEdit, Word and other apps and pasting directly into SlickEdit (which displays the spec chars properly), then saving to html file served by apache (using adddefaultcharset utf-8), but still getting the diamond question box…

      It appears a function is necessary since the “curly quotes” pasted from Word arrive as Windows-1252 bytes, which ISO-8859-1 does not define as printable characters and which are not valid UTF-8 until they are converted, so THANK YOU (AGAIN) MICROSOFT. Let’s be sure to correctly identify the culprit, and, once again, it’s Micro$oft.

      http://shiflett.org/blog/2005/oct/convert-smart-quotes-with-php
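
      An alternative to swapping curly quotes for straight ones is to convert the Windows-1252 bytes to UTF-8, which keeps the original characters. A minimal sketch, assuming the mbstring extension and assuming the pasted text really is Windows-1252 (PHP cannot know this for certain):

```php
<?php
// Bytes 0x93 and 0x94 are the curly double quotes in Windows-1252.
$fromWord = "\x93Hello\x94";

// Convert to UTF-8 instead of replacing with straight quotes.
// The quotes become U+201C and U+201D (E2 80 9C and E2 80 9D).
$utf8 = mb_convert_encoding($fromWord, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8), "\n";
```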

    • Bill, thanks for the link.

      Changing certain characters to different ones which are similar to the original ones is more of a hack/workaround. For example, converting typographer’s quotes (a.k.a. “curly quotes”) to their straight quote equivalents (foot mark or inch mark) doesn’t help someone trying to accurately display their résumé or describe their feet in German (Füße).

      One of the great benefits of using the UTF-8 character set on your web documents is to be able to support all sorts of foreign language characters without having to convert “upper register” characters (above ASCII 127) into other, lower register (ASCII 127 or below) characters, or especially collections of characters (a.k.a. entity references) such as &quot;. In the former case, you’re changing the content to something different. In the latter case, you’re adding extraneous bytes to your database unnecessarily and making the content less future-proof.

      The PHP script you linked us to is handy, though, for cases when your database column is set to ASCII or Latin1 and you don’t want to change it. In those cases, you could convert incoming “bad” characters to an approximate “good” character that will work correctly when stored to the database.

      With that said, I’ve found I rarely need to use a PHP script for these types of conversions, except to pre-process incoming form $_POST data. For example, right in MySQL you can do:

      UPDATE content SET body=CONVERT(body USING utf8);

      That usually does the trick for me.

    • Hi, i found this page, i am having this issue on my computer. Certain forums I go on has this character for me and no one else.

      In laymans terms what do i need to do??

    • Faye-

      It sounds to me like you may have an unusual font assigned as the default font on your computer or browser. For example, I had a strange font issue a few years ago when I installed the software for an HP printer. The installer installed a Postscript font named `Helvetica` over the default, TrueType font, and pages looked rather strange afterwards. I ended up uninstalling the font.

      I would try to change your default web font to a different one and see if the problem still happens on the same characters. See How to change the default font in Firefox or if you use Internet Explorer, Change the default font in IE. Good luck!

    • Loragnor, your solution worked miracles,

      well done!

    • @Loragnor, thanks, that solution really worked for me.

    • As this is high on Google results for the subject, thought I’d feed back that adding the $mysqli->set_charset('utf8'); line worked for me. To get to the point where that was the only remaining problem, I had already added the following into my standard all-pages header code:

      @setlocale(LC_ALL, 'en_GB');
      @define('CHARSET', 'utf-8');
      header('Content-type: text/html; charset=utf-8');

      And thus ends 20 minutes of frustrated searching! Thanks for initiating the discussion.

      Al

    • I bought an osCommerce template and the helpdesk didn’t give me an answer; they only sent me to this forum.
      So maybe someone here can tell me how I can change the font, or what I should do to get the webshop to show the characters æ, å, ø.
      Is there an easy way to do that?

    • I am having this same issue with a clients website @ http://5loaves2fishes.net The strange thing is I am using (at the clients request) a third party javascript. The CMS is Drupal 6 and the script is placed in the body of its own content type. So the content is not actually a node. I suppose the way to correct this would be to create a parser to store the content in the database where the proper utf-8 characters would be stored. Does anybody know of a way around this issue? In a simple way? The site has already consumed too much of my time.

    • I too came across the dreaded question mark in a black diamond, and found my cause!…

      I couldn’t fathom why to begin with, as I was simply trying to store a description as a PHP variable to then be echo’d by PHP within a page. It turned out I was using a “closing apostrophe” (not supported) rather than a “regular one”.

      The cause to this “closing apostrophe” was simple, it had appeared because I was copying and pasting content from MSWord into my PHP files, rather than typing it.

      Try going through your text and retyping the apostrophes and other special characters.

      :)

    • Hey, I am having this problem on my blog right now. Cannot figure this out. Talked to Hostgator, but they told me I would need to go through all posts manually to fix this LOL (hundreds of posts and thousands of comments)

      Can you please help me with a solution?

      My blog url is under my name. All I see is diamonds all over my content at this time.

    • Hi,
      Even though this is an old thread, I hope somebody can help me.
      I don’t use PHP but ASP.NET and MySQL. I have read the whole thread and made all the changes regarding encodings, but I still see only question marks on my site. I enter Arabic letters.
      I have set all my database stuff to the charset:
      utf8_unicode_ci (except the MySQL server itself, because I don’t have access to the web host’s server).
      I have tried all the above combinations, but still only question marks are saved in the database table.
      thank you for your help

    • It could be that it just looks like it is not working. Let me explain. I was doing some testing with a colleague who uses Windows 7. I don’t know if you use SQLYog but he does. We were trying to get UTF-8 characters displaying correctly in any MySQL client — even the command line — and it wouldn’t work. I could see valid UTF-8 characters from the Mac and from terminal SSH’d into the MySQL server running on Ubuntu. It turns out that SQLYog is not capable of rendering characters correctly even though they are stored correctly in the database. They claim it is a bug in the MySQL command line client on Windows. In fact, even PuTTY by default is using ISO-8859-1 on Windows – so even if you use PuTTY to SSH into a linux server that has UTF-8 characters in it, you won’t always see them correctly in your client. You need to verify that the client you are using to view the data in the database supports and is configured to display UTF-8.

    • You’re missing a step:

      Your forms must have accept-charset="utf-8" or the browser will convert on POST.

    • Interesting that you point this out, oldman. This is actually not a requirement, according to my research. However, what I have found is that you can accept utf-8 encoded values in a form that is served under iso-8859-1 or other encoding.

    • See the 2011 edit to this article too, guys! Remove BOM – Save files as UTF-8 WITHOUT Byte-order-mark also. Cheers!

    • I ran into this problem while scraping product information from a supplier website and pasting it to the update form on our windows xp application that uses a mssql database. No obvious problems doing that.

      However, I subsequently retrieve this data and display on our website (LAMP) and there the little blighters are.

      Fixed this for now by doing the following to the content before I display it.

      $bad = array("’","—","é","”"); // I hope these show up in the post!!!!
      $good = array("'","-","&#233;","\"");
      $m_long = str_replace($bad,$good,$row['Description']);

      Where did I get those bad characters? Why I copied and pasted them into my code from the original supplier website.

      Now I know that this is a bandaid. But it serves my purpose for now by editing the most frequent offenders!

    • Thanks for the code snippet, Neil. Yeah, that’s a quick and dirty hack alright, one that I often resorted to back in the day before I fully understood the real issue happening. Here’s the HTML entities reference I use at Wikipedia, when needed. Replacing bogus characters with entity references is fine if you want to just get something displayed quickly without fixing the real problem, but in reality you should try to avoid doing that if you can.

      Sounds to me what happened in your case is a common problem – you took some UTF-8 content from the web and stored it in your database (probably in a VARBINARY in MS SQL), but you’re serving your pages with ISO-8859-1, aka Latin 1, not UTF-8.

      The whole point of UTF-8 encoding is to represent characters as they actually are, so that you don’t need entity references in your markup. é is just é in the source code, not &#233;!

    • Thank you for all this help. I had two pages which persisted with question marks and black diamonds, the words of a hymn were being used in a novel in quotes and with apostrophes and whenever I used Mozilla Firefox as browser, the dreaded black diamonds appeared.
      As it was only two pages and as I haven’t got a clue about programming, I tried mmck’s useful tip about retyping them in… and YES !!! they disappeared and the right symbols reappeared. As my website is pretty DIY anyway, I am especially grateful as it made it look really terrible. Thank you everyone and especially mmck.

    • If you are working with c# and find these characters popping up in your controls… See this link.

    • Javier Mosquera your solution worked like magic

      thanks a million and one

    • Javier Mosquera, exactly as alex noted – everything worked like magic. Been struggling with this for hours. Thanks very much!

    • Can someone please help me figure this out? On my computer at work, Facebook shows up as symbols only. No letters whatsoever. I think it’s some kind of encoding thing, but Facebook is the only page that does this. I wouldn’t even be worried, but since it’s at work I’m afraid it’s a virus and I don’t want to get in trouble. Yesterday my tool bar completely disappeared, so I was hitting F1-F12 randomly and changing everything I could on the tool bar, thinking it would fix it (I figured out how to make the tool bar reappear), but sometime later I noticed Facebook looked like this: ����0�gs΄��;�. The whole page looks like that, and at the top of the tab it has the whole http://www.facebook.com instead of just facebook. This website seems a little in depth for my problem, but I’ve been looking for a couple of days for how to fix this, and this is the only one I’ve come across that has a spot to ask questions and isn’t from 2007. If anyone AT ALL could help I would REALLY appreciate it. Thanks!

    • I found it really easy to correct this issue once I learned the character code was 160. For me, I was accessing content from a MySQL database that was saved by WordPress’ TinyMCE editor.

      A PHP line of code like the following takes care of the issue. I wanted to correct it within PHP and not via a MySQL query:

      $clean_content = str_replace(chr(160), " ", $bad_content);
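
      The chr(160) replacement fits content stored as ISO-8859-1, where the non-breaking space is the single byte 0xA0. If the content is UTF-8, the same character is the two bytes C2 A0, and replacing only the second byte leaves a stray C2 behind. A minimal sketch of the difference, assuming the mbstring extension for the validity check:

```php
<?php
// A non-breaking space (U+00A0) followed by a normal space, as UTF-8.
$utf8Text = "two\xC2\xA0 spaces";

// chr(160) matches only the 0xA0 byte and leaves a stray 0xC2 behind,
// so the result is no longer valid UTF-8.
$byteLevel = str_replace(chr(160), ' ', $utf8Text);

// Matching the full two-byte sequence keeps the string valid.
$charLevel = str_replace("\xC2\xA0", ' ', $utf8Text);

var_dump(mb_check_encoding($byteLevel, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charLevel, 'UTF-8')); // bool(true)
```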

    • Hollie, you have a virus. You should run a virus scan and stop surfing around Facebook at work.

