onPHP5.com

PHP5: Articles, News, Tutorials, Interviews, Software and more
  
Featured Article:
Learning PHP Data Objects
 
 
Mon, 11 Dec 2017
 Home   About   Contribute   Contact Us   Polls 
Top Tags
ajax article codeigniter conference dom namespace news onphp5 oop php5 poll prado security solar sqlite symfony unicode zend core zend framework zend platform
More tags »

Not logged in
Login | Register

den_hotmail@fbzz

Sorting Non-English Strings with MySQL and PHP (Part 1)

« Zend Platform 3.0 Beta for Windows Released Prado 3.1.0 Alpha Released »

By dennisp on Monday, 15 January 2007, 22:57
Published under: article   mysql   unicode
Views: 21880, comments: 4

When it comes to storing textual data in a database and then sorting them alphabetically, many developers face a common problem - the database won't sort strings in their language. This short series is an advice that may suit almost all needs.


Introduction


First thing to note is that the approach discussed in this part applies best to languages that use Cyrillic script, mostly Ukrainian and Russian. This is because they have simple collation rules and fairly identical alphabets. However, some points raised here are a good example to follow for all developers that use MySQL. In next parts we will see how to deal with this in other languages.

So, let's assume you know what a character set is, and even heard the word "collation". By default, MySQL will use latin1 for character set of all your newly created tables. The default collation will be latin1_swedish_ci.

The problem with these defaults is that they cannot be used to sort non-English strings. Our solution is to use Unicode, or, more specifically, the UTF-8 character set. Why, you may ask! Because this character set has a collation that will be capable of properly sorting Cyrillic strings.

To enforce UTF-8 to all your tables you have to specify it when creating them:


mysql> CREATE TABLE words(word varchar(50)) CHARACTER SET utf8 COLLATE utf8_unicode_ci;


Alternatively, you can make the whole database use UTF-8:


mysql> CREATE DATABASE words CHARACTER SET utf8 COLLATE utf8_unicode_ci;


So doing, you won't have to specify the charset and collation for every table in the database.

Connection Encoding


Another important point is instructing MySQL about correct connection encoding. Connection encoding is the character set you are sending text to MySQL and receiving text from MySQL. The connection encoding does not have to be the same as the character set used in the CREATE TABLE statement - in such case MySQL will take care of converting the strings from/to your connection encoding.

In our next example we will use mysql command line client to insert some data into our table. On Windows machines this client "talks" cp866 encoding, so we must instruct MySQL that the data should be converted from/to cp866:


mysql> SET NAMES cp866;


Please note that cp866 is the encoding on Windows machines; the encoding on Linux/Unix will be dependant on your locale.

Inserting Data and Sorting the result


Now let's insert some data - we will simply use letters of the Ukrainian alphabet. We will insert them in a random order so that later we can see that MySQL really sorts the data properly:


mysql> INSERT INTO words VALUES("г"), ("б"), ("м"), ("в"), ("є"), ("а"), ("ж"), ("ь"), ("у"), ("щ");
Query OK, 10 rows affected (0.02 sec)
Records: 10 Duplicates: 0 Warnings: 0

mysql> select * from words order by word;
+------+
| word |
+------+
| а |
| б |
| в |
| г |
| є |
| ж |
| м |
| у |
| щ |
| ь |
+------+
10 rows in set (0.00 sec)


We can see that now the letters are in their alphabetical order.

Example in PHP


Now let's get to PHP side of things. When establishing the connection, be sure to issue the

<?php
...
mysql_query('SET NAMES utf8');
// Or, if using PDO:
$conn = new PDO($connStr);
$conn->query('SET NAMES utf8');
...
?>


Of course, you may substitute the utf8 character set with, say cp1251, but this will slow MySQL a bit as every query will have to be converted twice - the query itself and the result set. Also, you have to instruct the browser that you will be sending the page in the utf8 encoding:

<?php
...
header('Content-Type: text/html; charset=utf-8');
// Or for cp1251:
header('Content-Type: text/html; charset=windows-1251');
...
?>


Please note that here we use utf-8 (note the dash - the charset names are not very standartized in terms of dashes usage). Same about different names for cp1251 - browsers know it as windows-1251.

Having ensured that we are talking to MySQL and to the browser the correct character set, we can produce the rest of page:

<?php
...
$q mysql_query('SELECT word FROM words ORDER BY word');
while(
$r mysql_fetch_row($q)) {
  echo 
"$r[0]<br/>";
}
// PDO way:
$stmt $conn->query('SELECT word FROM words ORDER BY word');
while(
$r $stmt->fetchColumn()) {
  echo 
"$r<br/>";
}
...
?>


Handling User's Input


Of course, in the real world application the data will arrive from users filling forms on the pages of your site. And, these data must be in the encoding you are talking to MySQL. In general, browsers will send the data back to your server in the same charset as the page itself (in our example, utf-8). You can also enforce that with the accept-charset attribute of the <form> tag:


<form action="..." accept-charset="utf-8">
...
</form>


(note, however, this may break the validity of your HTML - this attribute is defined in HTML 4.01).

If, however, the data arrive in a different charset, then it is your responsibility to convert them into the connection encoding (for example, by using mbstring or iconv PHP extensions).

Conclusion


This part just showed you the basics of maintaining the same charset for database and scripts. So doing you will never have problems with non-English sites, and sticking to de-facto standard, Unicode, will help you develop multi-language sites.
In the following parts we will look at issues related to the utf8_unicode_ci collation and see how to sort non-Cyrillic but non-English languages too.

Related articles

Issues with Non-ASCII Chars in URLs
i18n with PHP5: Pitfalls
SimpleXML, DOM and Encodings
Some SEO Tips You Would Not Like to Miss
Exceptions in __autoload()
Advocating Namespaces
Learning PHP Data Objects
Clickable, Obfuscated Email Addresses

Comments

#1  By buh on Tuesday, 16 January 2007, 07:29
Great article, great start! I'd like to believe this site will become very successful :)


#2  By razonklnbd at yahoo dot com on Sunday, 11 February 2007, 04:02
what about befoe start mysql query like following...

mysql_connect( "localhost" ,"root"," root");
mysql_select_ db("dbname" );
mysql_query( 'SET CHARACTER SET utf8');
mysql_query( "SET SESSION collation_connectio n ='utf8_general_ ci'");


#3  By dennisp (editor) on Sunday, 11 February 2007, 08:26
In reply to #2:
SET NAMES is a shorthand for these


#4  By Abdulsalam Al-Majidy -Yemen on Monday, 16 February 2009, 07:38
Thanks for you very much . I want alot about php language.

Post your comment

Your name:

Comment:

Protection code:
 

Note: Comments to this article are premoderated. They won't be immediately published.
Only comments that are related to this article will be published.


© 2017 onPHP5.com