onPHP5.com

i18n with PHP5: Pitfalls


By dennisp on Wednesday, 14 March 2007, 17:00
Published under: article   php5   unicode

PHP5 inherited PHP4's localization support, which is far from perfect. This article pinpoints the most common localization problems in PHP and gives some tips on working around them.


setLocale()


PHP5 inherited all the problems of its predecessor, PHP4, when it comes to i18n. The main function, setLocale(), works differently on different platforms. On Linux/Unix (the platforms mainly used for production sites), you have to specify the locale as a string consisting of the lowercase language code, an underscore and the uppercase country code, for example uk_UA. You can also force all the localization functions to return UTF-8 strings by appending '.UTF-8' to the locale name.

On Windows, which a very large number of developers use for development, you have to specify the locale name quite differently: the language name, an underscore and the country name. The country name is not always the official one, and Windows is not consistent here; for example, the US is 'United States' (rather than 'United States of America'), but the Czech Republic is 'Czech Republic'. You also have to specify the code page for that country, e.g. Ukrainian_Ukraine.1251.

Obviously this is not very cross-platform, but there is a small hack around it. setLocale() accepts either a single locale name or an array of locale names, and it uses the first one that the system accepts. On Windows it is possible to call setLocale() with the language code only, i.e. setLocale(LC_ALL, 'cz'). The function returns the name of the locale that has been set, so on Windows we get Czech_Czech Republic.1250, where 1250 is the code page of all values returned by locale-aware functions.

This fact allows us to create a cross-platform function:
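<?php
/**
 * Set locale in a platform-independent way
 * @param  string $locale  the locale name ('en_US', 'uk_UA', 'fr_FR' etc)
 * @return string  the encoding name used by locale-aware functions
 * @throws Exception  if the locale could not be set
 */
function setLocaleCP($locale) {
  // Split 'uk_UA' into language and country; only the language is needed below
  $parts = explode('_', $locale);
  $lang = $parts[0];

  // Try the UTF-8 locale first (Linux/Unix), then the bare language code (Windows)
  $locales = array($locale . '.UTF-8', $lang);
  $result = setlocale(LC_ALL, $locales);

  if (!$result) {
    throw new Exception("Unknown locale name $locale");
  }

  // See if we have successfully set it to UTF-8
  if (stripos($result, 'UTF-8') !== false) {
    $encoding = 'UTF-8';
  } elseif (preg_match('~\.(\d+)$~', $result, $m)) {
    // Windows reports the code page as a numeric suffix, e.g. '.1250'
    $encoding = 'CP' . $m[1];
  } else {
    throw new Exception("Cannot determine the encoding of locale $result");
  }

  return $encoding;
}
?>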


This function returns the encoding that locale-aware functions will use. You have to ensure that the parameters you pass to them are in that encoding, and convert their return values from that encoding (this can be done with the iconv extension, which supports a fixed list of encodings).
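The conversion step could be sketched as follows. The helper names fromLocaleEncoding() and toLocaleEncoding() are hypothetical, the application is assumed to work in UTF-8 internally, and the encoding argument is assumed to be a name that iconv accepts, such as the value returned by setLocaleCP().

```php
<?php
// Hypothetical helpers, not part of PHP. $encoding is assumed to be a name
// iconv understands, e.g. 'UTF-8' or 'CP1251'.

// Convert a value returned by a locale-aware function to UTF-8
function fromLocaleEncoding($str, $encoding) {
  if ($encoding === 'UTF-8') {
    return $str;
  }
  $out = iconv($encoding, 'UTF-8', $str);
  if ($out === false) {
    throw new Exception("Cannot convert from $encoding");
  }
  return $out;
}

// Convert a UTF-8 string before passing it to a locale-aware function
function toLocaleEncoding($str, $encoding) {
  if ($encoding === 'UTF-8') {
    return $str;
  }
  $out = iconv('UTF-8', $encoding, $str);
  if ($out === false) {
    throw new Exception("Cannot convert to $encoding");
  }
  return $out;
}

// 'Привет' as CP1251 bytes, as a function might return it under a Windows locale
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";
echo fromLocaleEncoding($cp1251, 'CP1251'), "\n";
```

Keeping both directions in small wrappers means the rest of the code never has to know which encoding the current platform ended up with.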

However, some locales on Linux/Unix use language codes that are not compatible with Windows. For example, 'uk', 'ru' or 'cz' work on both platforms, while for Greek you have to use 'gr' on Windows vs. 'el' on Linux/Unix. This function can easily be modified to handle these cases (by having a hardcoded list of locales that need language code substitution).
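One possible shape for that substitution list is sketched below. The function name localeCandidates() is hypothetical, and only the 'el' => 'gr' pair comes from this article; any other entries would have to be verified on a real Windows machine.

```php
<?php
// Hypothetical helper: builds the array of locale names to pass to setlocale().
// $windowsLangMap maps Linux/Unix language codes to the codes Windows expects.
function localeCandidates($locale, array $windowsLangMap = array('el' => 'gr')) {
  $parts = explode('_', $locale);
  $lang = $parts[0];

  // Substitute the language code only when the map has an entry for it
  $winLang = isset($windowsLangMap[$lang]) ? $windowsLangMap[$lang] : $lang;

  // First candidate targets Linux/Unix, second one targets Windows
  return array($locale . '.UTF-8', $winLang);
}

print_r(localeCandidates('el_GR'));
```

The returned array can replace the $locales array built inside setLocaleCP().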

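Also, this workaround does not set the country on Windows, so currency-related functions will still behave incorrectly.

Besides, setLocale() is not thread-safe: the locale is maintained per process, not per thread. On a multithreaded web server a locale change made by one request can affect other requests served by the same process, so your application may behave differently under load.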

strColl()


This function compares two strings according to the collation rules of the current locale. However, its behavior is not consistent across platforms. For example, the Ukrainian letter 'ь' is the last in the alphabet, but on most Linux/Unix systems it is not sorted last. Windows behaves correctly here. (I assume this happens for very few languages. If this assumption is correct, strColl() can be wrapped in a PHP function that fixes it.)
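Such a wrapper could, for instance, bypass strColl() entirely and compare strings against an explicit alphabet. The sketch below assumes UTF-8 input and lowercase letters only. The name makeAlphabetComparator() is hypothetical, and the alphabet string in the usage lines simply places 'ь' last as described above; pass a different order if your rules differ.

```php
<?php
// Hypothetical helper: returns a comparison callback for usort() that orders
// strings by the position of each character in $alphabet (a UTF-8 string).
// Characters missing from the alphabet sort after all known ones.
function makeAlphabetComparator($alphabet) {
  $rank = array_flip(preg_split('//u', $alphabet, -1, PREG_SPLIT_NO_EMPTY));

  return function ($a, $b) use ($rank) {
    $ca = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $cb = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    $n = min(count($ca), count($cb));

    for ($i = 0; $i < $n; $i++) {
      $ra = isset($rank[$ca[$i]]) ? $rank[$ca[$i]] : PHP_INT_MAX;
      $rb = isset($rank[$cb[$i]]) ? $rank[$cb[$i]] : PHP_INT_MAX;
      if ($ra !== $rb) {
        return $ra < $rb ? -1 : 1;
      }
    }
    // Common prefix is identical: the shorter string sorts first
    return count($ca) - count($cb);
  };
}

$cmp = makeAlphabetComparator('абвгґдеєжзиіїйклмнопрстуфхцчшщюяь');
$letters = array('ь', 'я', 'а');
usort($letters, $cmp);
echo implode(' ', $letters), "\n"; // prints "а я ь"
```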

Date and time formatting


strftime() and date() do not support the genitive month names used in Slavic languages, so, for example, strftime("%A %d %B %Y") returns grammatically incorrect results for these languages. Also, the case of the first letter of month names is wrong for some of them. In addition, some languages (German, for example) place a dot after the day number, which marks it as an ordinal. There is no way to work around these problems within strftime() and date() themselves.
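Outside of strftime() and date(), the genitive forms can be supplied by hand. The following is only a sketch: formatDateGenitive() is a hypothetical name, the month table is hardcoded for Ukrainian, and gmdate() is used so that the result does not depend on the server time zone.

```php
<?php
// Hypothetical helper: formats a timestamp as "day month year" using a
// caller-supplied table of genitive month names indexed 1..12.
function formatDateGenitive($timestamp, array $genitiveMonths) {
  $day = gmdate('j', $timestamp);          // day of month without leading zero
  $month = (int) gmdate('n', $timestamp);  // month number, 1..12
  $year = gmdate('Y', $timestamp);
  return $day . ' ' . $genitiveMonths[$month] . ' ' . $year;
}

// Ukrainian month names in the genitive case
$ukMonths = array(
  1 => 'січня', 'лютого', 'березня', 'квітня', 'травня', 'червня',
  'липня', 'серпня', 'вересня', 'жовтня', 'листопада', 'грудня',
);

echo formatDateGenitive(gmmktime(0, 0, 0, 3, 14, 2007), $ukMonths), "\n"; // 14 березня 2007
```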

Localized country, language and currency names; number formatting


PHP does not provide any means to get translated country, language and currency names. It also has problems formatting numbers, and these problems are not easily overcome. The obvious solution at this point is to keep a database of all locale data and access it directly from PHP code. Zend Framework already uses CLDR for this purpose, and PHP6 will use it as well. However, Zend Framework does not offer a way to collate strings, while PHP6 will have this functionality.
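As a rough illustration of the "locale data accessed directly from PHP" idea, number formatting can be driven by a small lookup table instead of the current locale. The table below is hand-written for illustration and is not taken from CLDR; formatNumberFor() is a hypothetical name.

```php
<?php
// Illustrative values only; a real table would be generated from locale data
// such as CLDR rather than typed in by hand.
$numberFormats = array(
  'en_US' => array('decimal' => '.', 'group' => ','),
  'de_DE' => array('decimal' => ',', 'group' => '.'),
);

// Hypothetical helper: formats a number with the separators listed for $locale
function formatNumberFor($locale, $number, $decimals, array $formats) {
  if (!isset($formats[$locale])) {
    throw new Exception("No number format data for $locale");
  }
  $f = $formats[$locale];
  return number_format($number, $decimals, $f['decimal'], $f['group']);
}

echo formatNumberFor('de_DE', 1234567.891, 2, $numberFormats), "\n"; // 1.234.567,89
```

Because the table is passed in explicitly, the result does not depend on setLocale() and is therefore unaffected by its thread-safety problem.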

Regular expressions


This is another problem, though it is not widely discussed. PCRE and other regular expression extensions are not locale-aware. This most notably affects the \w class, which does not match Cyrillic letters. A workaround would be a preprocessor for the regex string that replaces \w and friends with character ranges before the PCRE functions are called.
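A naive version of such a preprocessor is sketched below. Instead of literal character ranges it substitutes PCRE's Unicode property escapes and appends the u modifier, which assumes the subject strings are valid UTF-8. The name expandWordClass() is hypothetical, and the sketch assumes that \w and \W appear outside of character classes, that backslashes before them are not themselves escaped, and that the pattern ends with its closing delimiter or modifiers.

```php
<?php
// Hypothetical preprocessor: rewrites \w and \W into Unicode-aware classes
// and switches the pattern to UTF-8 mode by appending the 'u' modifier.
function expandWordClass($pattern) {
  $expanded = str_replace(
    array('\\w', '\\W'),
    array('[\\p{L}\\p{N}_]', '[^\\p{L}\\p{N}_]'),
    $pattern
  );
  return $expanded . 'u';
}

$pattern = expandWordClass('~^\w+$~');
echo $pattern, "\n";                          // ~^[\p{L}\p{N}_]+$~u
var_dump(preg_match($pattern, 'Привет'));     // int(1)
```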

Localized strings


PHP is unaware of the encoding of strings. It treats them as streams of bytes, while a displayable character in certain encodings can occupy several bytes. That is why string offsets do not work for such encodings. This problem is solved in PHP6. In PHP5, a class that handles UTF-16 strings can emulate this behavior (by implementing the ArrayAccess interface and storing the string in UTF-16 internally), or the whole application can be developed with a Unicode encoding in mind and use iconv functions throughout the code.
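A minimal sketch of such a class follows. Utf16String is a hypothetical name, the wrapper is read-only, and it only handles characters of the Basic Multilingual Plane, where every character takes exactly two bytes in UTF-16; surrogate pairs are not supported.

```php
<?php
// Hypothetical read-only wrapper: accepts UTF-8, stores UTF-16BE internally,
// and returns single characters as UTF-8 through the [] operator.
class Utf16String implements ArrayAccess {
  private $data; // raw UTF-16BE bytes

  public function __construct($utf8) {
    $data = iconv('UTF-8', 'UTF-16BE', $utf8);
    if ($data === false) {
      throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $this->data = $data;
  }

  // Number of characters (BMP only: two bytes per character)
  public function length() {
    return (int) (strlen($this->data) / 2);
  }

  // The attribute lines silence a PHP 8.1+ deprecation notice; older
  // versions parse them as ordinary comments.
  #[\ReturnTypeWillChange]
  public function offsetExists($offset) {
    return is_int($offset) && $offset >= 0 && $offset < $this->length();
  }

  #[\ReturnTypeWillChange]
  public function offsetGet($offset) {
    if (!$this->offsetExists($offset)) {
      throw new OutOfRangeException("No character at offset $offset");
    }
    return iconv('UTF-16BE', 'UTF-8', substr($this->data, $offset * 2, 2));
  }

  #[\ReturnTypeWillChange]
  public function offsetSet($offset, $value) {
    throw new LogicException('Utf16String is read-only in this sketch');
  }

  #[\ReturnTypeWillChange]
  public function offsetUnset($offset) {
    throw new LogicException('Utf16String is read-only in this sketch');
  }
}

$s = new Utf16String('Привет');
echo $s->length(), ' ', $s[0], ' ', $s[5], "\n"; // prints "6 П т"
```

For comparison, strlen('Привет') on the UTF-8 bytes returns 12 and a plain string offset yields half of a character.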

Conclusions


It's clear that most of these problems will be solved in PHP6, whose first release is expected by the end of 2007. However, given the number of internal functions that have to be upgraded, we may experience lots of errors. Also, given the slow adoption of PHP5, the predictions about PHP6 adoption are not very promising. Still, some decent i18n functionality can be emulated even now.

Comments

#1  By Anonymous on Friday, 16 March 2007, 13:50
Besides, setLocale() is not thread-safe which means your application may behave differently under load on many modern web servers.

Indeed it does and that's why it is basically useless, if you want to have different localization for users.

The same problem exists for date_default_timezone_set():
Note that there may be some unexpected side-effects that result from using either set_default_timezone() or the putenv("TZ=...") workalike for earlier PHP versions. ANY date formatted and output either by PHP or its apache host process will be unconditionally expressed in that timezone.


#2  By Álvaro G. Vicario on Wednesday, 01 August 2007, 08:05
You don't even need a high load to get inconsistent results with setlocale() and family. I've found dates being displayed in different languages *in the same page* just in my development box.

Furthermore, strftime() is not only OS-dependent; it also seems to change its behaviour between releases. On my old box (Windows XP), the %e modifier was ignored; on my new box (Windows Vista), the whole string is ignored if it contains %e.

It's crazy. I hope PHP 6 eventually fixes all this mess.


#3  By Anonymous on Friday, 05 August 2011, 12:38
You can find an example how to match and validate Cyrillic and other utf8 characters in this post:

http://itworkarounds.blogspot.com/2011/08/validating-cyrillic-utf8-alphanumeric.html


#4  By Gildas on Thursday, 13 October 2011, 04:50
You really saved my skin with this information. Thanks!

