onPHP5.com

Issues with Non-ASCII Chars in URLs


By dennisp on Monday, 11 January 2010, 23:35
Published under: ajax   article   unicode   url

This is a very quick post to warn developers about some gotchas when using non-ASCII characters in URLs (especially when making AJAX calls).


I discovered this odd behaviour quite unexpectedly, and I believe this can help others.

When you type a URL into the browser's address bar, the browser should do the following for every non-ASCII character:

a) encode it as UTF-8 bytes
b) convert each byte into its %HH code.

So for, say, Cyrillic letters, your server should receive 6 characters in the URL for every non-ASCII character typed into the address bar (two UTF-8 bytes, each sent as a three-character %HH code). This works in all modern browsers, with one big exception: characters in the query string (i.e., after the question mark) are not encoded that way. In my case they arrived in some single-byte encoding (I didn't have time to investigate which one).
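To make the two steps concrete, here is a minimal JavaScript sketch. The helper name percentEncodeNonAscii is made up for this illustration; it leaves ASCII characters untouched and is not a full URL encoder. In real code the built-in encodeURIComponent performs the same UTF-8 and %HH conversion.

```javascript
// Hypothetical helper, for illustration only: applies steps (a) and (b)
// to every non-ASCII character and leaves ASCII characters as they are.
function percentEncodeNonAscii(text) {
  const encoder = new TextEncoder(); // TextEncoder always produces UTF-8
  let out = '';
  for (const ch of text) {
    if (ch.codePointAt(0) < 0x80) {
      out += ch; // ASCII: copied unchanged (reserved characters are NOT escaped here)
      continue;
    }
    // Step (a): character -> UTF-8 bytes; step (b): each byte -> %HH
    for (const byte of encoder.encode(ch)) {
      out += '%' + byte.toString(16).toUpperCase().padStart(2, '0');
    }
  }
  return out;
}

const encoded = percentEncodeNonAscii('я'); // Cyrillic letter, 2 bytes in UTF-8
console.log(encoded, encoded.length);
console.log(encoded === encodeURIComponent('я')); // compare with the built-in
```

A single Cyrillic letter becomes two %HH groups, which is where the 6 characters per letter come from.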

Note that this happens for typed-in URLs only; when you follow a link from a page whose Content-Type explicitly declares UTF-8, the query string is properly encoded into UTF-8 and then into %HH form. This bug affects IE8 and FF3 (please check other browsers and comment).
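If you want to check which form a request arrived in, one rough approach is to test whether the raw, still percent-encoded value decodes as UTF-8. The sketch below is JavaScript (for example on a Node.js server) and only an assumption about how you might diagnose this; the function name isUtf8PercentEncoded is hypothetical. It relies on decodeURIComponent throwing a URIError when the bytes are not valid UTF-8.

```javascript
// Hypothetical diagnostic helper: returns true when the percent-encoded
// value decodes cleanly as UTF-8, false when it does not.
function isUtf8PercentEncoded(rawValue) {
  try {
    decodeURIComponent(rawValue);
    return true;
  } catch (err) {
    if (err instanceof URIError) {
      return false; // bytes are not valid UTF-8 (or the escape is malformed)
    }
    throw err; // anything else is unexpected
  }
}

console.log(isUtf8PercentEncoded('%D1%8F')); // "я" as UTF-8
console.log(isUtf8PercentEncoded('%FF'));    // "я" as it would look in Windows-1251
```

Windows-1251 is used here only as an example of a single-byte encoding; the encoding the browsers actually used was not investigated. The check is a heuristic: a few single-byte sequences can also happen to be valid UTF-8.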

As a solution for one live site, I had to update URL rewriting from:

http://www.example.com/topic?someutfstring

to

http://www.example.com/topic-someutfstring

This causes browsers to treat someutfstring as part of the URL path, not as part of the query string.
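When generating links in the new form, the non-ASCII part can be encoded explicitly so that it never depends on browser behaviour. A minimal JavaScript sketch follows; the function name topicUrl and the /topic- prefix are assumptions taken from the example URLs above, not a fixed scheme.

```javascript
// Hypothetical helper: builds a link where the topic is part of the path
// (after "topic-") instead of the query string.
function topicUrl(baseUrl, topic) {
  // encodeURIComponent converts the topic to UTF-8 and then to %HH form
  return baseUrl + '/topic-' + encodeURIComponent(topic);
}

const link = topicUrl('http://www.example.com', 'я');
console.log(link);
console.log(new URL(link).search === ''); // no query string is involved
```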

Another bug manifests in IE8 only. If you set a JS variable to a UTF-8 string and then use that variable to construct a URL for an AJAX call, the string will not be encoded into UTF-8 and then into %HH form. Instead, IE will use some other, non-Unicode encoding.

A quick solution is to inject the JS variable as a URL-encoded string, e.g. instead of:

var myVar='someutfstring';

use:

var myVar='<?= urlencode('someutfstring'); ?>';
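A different approach, not the one used above, is to encode on the client with JavaScript's built-in encodeURIComponent when the AJAX URL is assembled. The sketch below assumes the raw (not yet encoded) string is available in the script; the function name buildAjaxUrl and the /search endpoint are hypothetical.

```javascript
// Hypothetical helper: builds a query string from a plain object,
// encoding every key and value as UTF-8 and then as %HH.
function buildAjaxUrl(endpoint, params) {
  const pairs = Object.keys(params).map(
    (key) => encodeURIComponent(key) + '=' + encodeURIComponent(params[key])
  );
  return pairs.length > 0 ? endpoint + '?' + pairs.join('&') : endpoint;
}

console.log(buildAjaxUrl('/search', { q: 'я' }));
```

Do not pass a value that PHP's urlencode() has already encoded through this helper, or it will be encoded twice. Also note that urlencode() writes a space as +, while encodeURIComponent writes it as %20.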

Note that in this article I am assuming that your PHP source files are saved in UTF-8 and that the content is served as UTF-8 (including response headers that explicitly say so).

Comments

#1  By Anonymous on Tuesday, 12 January 2010, 02:14
After testing this on my installs of FF3.5 and 3.6b5, the Russian characters come back as UTF-8 for both /topic?russian and /topic-russian.

My browser language and system locale are US English, and as there is no Russian in ISO-8859-1, they are being encoded as UTF-8 properly.


© 2017 onPHP5.com