Issues with Non-ASCII Chars in URLs
on Monday, 11 January 2010, 23:35
This is a very quick post to warn developers about gotchas with using non-ASCII characters in URLs (especially when making AJAX calls).
I discovered this odd behaviour quite unexpectedly, and I believe this can help others.
When you type a URL into the browser's address bar, the browser should do the following for every non-ASCII character:
a) convert it into UTF-8
b) convert it into the %HH code.
So, for Cyrillic letters, say, your server should receive 6 characters in the URL for every non-ASCII character typed into the address bar: each letter is two bytes in UTF-8, and each byte becomes a three-character %HH code. This works with all modern versions of browsers, with one big exception: the characters in the query string (i.e., after the question mark) will not be encoded in that way. In my case they arrived in some single-byte encoding (I didn't have time to investigate which one).
Note that this happens only for URLs typed into the address bar; when you follow a link from a page whose content-type explicitly sets UTF-8, the query string is properly encoded into UTF-8 and then into %HH form. This bug affects IE8 and FF3 (please check and comment for other browsers).
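To make the two steps (UTF-8, then %HH) concrete, here is a minimal JavaScript sketch of what a correctly behaving browser is expected to produce. The helper name percentEncodeChar is made up for this illustration, and the sketch assumes an environment where TextEncoder is available (modern browsers, recent Node.js).

```javascript
// Hypothetical helper, for illustration only:
// a) convert the character to UTF-8 bytes, b) write each byte as %HH.
function percentEncodeChar(ch) {
  const bytes = new TextEncoder().encode(ch); // step a: UTF-8 bytes
  return Array.from(bytes)
    .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")) // step b
    .join("");
}

// The Cyrillic letter "ж" (U+0436) is two bytes in UTF-8: 0xD0 0xB6.
const encodedLetter = percentEncodeChar("ж");

// The built-in encodeURIComponent performs the same two steps.
const builtIn = encodeURIComponent("ж");

console.log(encodedLetter, encodedLetter.length, builtIn);
```

If the query string your server receives for a typed-in URL does not look like this, you are probably seeing the behaviour described above.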
As a solution for one live site, I had to update the URL rewriting so that the string is carried in the path rather than after the question mark.
This causes browsers to treat someotherstring as part of the URL path, not as part of the query string.
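As a rough illustration of the difference between the two URL shapes, the JavaScript sketch below builds both forms. The function names and the /topic prefix are assumptions made for this example (the prefix is borrowed from the comment at the end of this post); this is not the actual rewrite rule from the live site.

```javascript
// Hypothetical URL builders; "/topic" is only an example prefix.

// Query-string form: the part after "?" is what browsers mis-encoded
// when the URL was typed into the address bar.
function topicUrlWithQuery(term) {
  return "/topic?" + term;
}

// Path form: the term is part of the path, which browsers encode
// as UTF-8 and then as %HH.
function topicUrlInPath(term) {
  return "/topic-" + term;
}

// When generating links yourself, encoding explicitly avoids relying
// on the browser at all.
function topicLink(term) {
  return "/topic-" + encodeURIComponent(term);
}

console.log(topicUrlWithQuery("тест"));
console.log(topicUrlInPath("тест"));
console.log(topicLink("тест"));
```

The server-side rewriting then has to map the path form back to whatever parameter the script expects.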
Another bug manifests in IE8 only. If you set a JS variable to some UTF-8 string and then use that string to construct a URL for an AJAX call, the string will not be encoded into UTF-8 and then into %HH form. Instead, IE will use some other, non-Unicode encoding.
A quick solution is to inject the JS variable as an already URL-encoded string instead of the raw UTF-8 string.
Note that in this article I am assuming that your PHP script source files are in UTF-8 and the content served is UTF-8 (including the response headers, which explicitly say so).
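On the client side, the quick solution for the IE8 problem could look roughly like the JavaScript sketch below. The /search endpoint, the q parameter and the variable names are all hypothetical. The second function shows a different technique from the one described in this post: encoding the raw string in script with encodeURIComponent, which always uses UTF-8.

```javascript
// Hypothetical value as the server would inject it into the page:
// already percent-encoded ("тест" as UTF-8, then %HH), so pure ASCII.
const encodedTerm = "%D1%82%D0%B5%D1%81%D1%82";

// Builds the AJAX URL from the pre-encoded value. The URL contains only
// ASCII, so the browser has nothing left to re-encode.
function buildSearchUrl(encoded) {
  return "/search?q=" + encoded;
}

// Alternative (not the approach from this post): keep the raw string in
// the variable and encode it in script just before building the URL.
function buildSearchUrlFromRaw(raw) {
  return "/search?q=" + encodeURIComponent(raw);
}

console.log(buildSearchUrl(encodedTerm));
console.log(buildSearchUrlFromRaw("тест"));
```

Either way, the string handed to the AJAX call is plain ASCII by the time the browser sees it.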
on Tuesday, 12 January 2010, 02:14
I tested this on my installs of FF3.5 and 3.6b5. The Russian characters come back as UTF-8 for both /topic?russian and /topic-russian.
My browser language and system locale are US English, and as there is no Russian in ISO-8859-1, they are being encoded as UTF-8 properly.