onPHP5.com

PHP5: Articles, News, Tutorials, Interviews, Software and more
  
SimpleXML, DOM and Encodings


By dennisp on Sunday, 18 November 2007, 22:13
Published under: article   dom   php5   simplexml   unicode

This is a short article that might be of interest to those who work with XML and HTML documents in different encodings using PHP5's SimpleXML and DOM extensions.


While working on one of my pet projects today, I needed a way to get all the <link> tags of a webpage in order to discover its RSS/Atom feeds as well as its shortcut icon (favicon). The feeds would later be processed and some metadata gathered.
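The task can be sketched with the DOM extension roughly as follows. The function name discoverLinks() and the shape of the returned array are made up for illustration, and the code assumes a current PHP release and an HTML string that has already been fetched.

```php
<?php
// Hypothetical helper: collect feed and favicon <link> tags from an HTML string.
function discoverLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found     = ['feeds' => [], 'icon' => null];
    $feedTypes = ['application/rss+xml', 'application/atom+xml'];

    foreach ($doc->getElementsByTagName('link') as $link) {
        $rel  = strtolower($link->getAttribute('rel'));
        $type = strtolower($link->getAttribute('type'));
        $href = $link->getAttribute('href');

        // rel may hold several space-separated tokens, e.g. "shortcut icon".
        $relTokens = preg_split('/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY);

        if (in_array('alternate', $relTokens, true) && in_array($type, $feedTypes, true)) {
            $found['feeds'][] = $href;
        } elseif (in_array('icon', $relTokens, true)) {
            $found['icon'] = $href;
        }
    }

    return $found;
}
```

The href values are returned as written in the page, so relative URLs still have to be resolved against the page address.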

Since RSS (both flavors) and Atom feeds are XML documents, the natural solution is to use the SimpleXML extension to parse them. As the docs show, this extension does not let us learn what encoding the source document had, because SimpleXML automatically converts text nodes and attribute values into UTF-8.
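A small illustration of that conversion, using a made-up ISO-8859-1 feed fragment (the document content and variable names are only for demonstration):

```php
<?php
// An ISO-8859-1 document: "\xE9" is the single-byte Latin-1 form of "é".
$latin1Xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
           . "<rss><channel><title>Caf\xE9</title></channel></rss>";

$feed  = simplexml_load_string($latin1Xml);
$title = (string) $feed->channel->title;

// The value comes back as UTF-8, so "é" is now the two bytes C3 A9.
echo bin2hex($title), "\n"; // prints 436166c3a9
```

Nothing in the SimpleXMLElement object itself says that the source was ISO-8859-1.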

XML documents are UTF-8 by default (well, Unicode: a parser must be able to detect which flavor of UTF is in use by a heuristic algorithm, and any other encoding has to be named in the XML declaration), so it's not a big problem for SimpleXML to detect the correct encoding. The same applies to the DOM extension when it is used with XML files. DOM also returns text nodes and attribute values as UTF-8, but unlike SimpleXML it keeps the declared encoding in the DOMDocument::encoding property. Since we can learn what the encoding is for XML files, this is a minor issue.
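Learning the encoding of an XML file with DOM might look like the sketch below. The helper name declaredXmlEncoding() is hypothetical; it simply reads the encoding property of DOMDocument, which holds whatever the XML declaration specified.

```php
<?php
// Hypothetical helper: return the encoding named in the XML declaration,
// or null when the document does not declare one.
function declaredXmlEncoding(string $xml): ?string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new InvalidArgumentException('The input is not well-formed XML.');
    }
    return $doc->encoding;
}

$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>Caf\xE9</title>";
echo declaredXmlEncoding($xml), "\n"; // prints ISO-8859-1
```

A document without an encoding declaration yields null, in which case the XML defaults apply.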

Problems begin to pop up when you use DOM to parse HTML files. Since HTML files (note that XHTML files are in fact XML) do not have an XML declaration, DOMDocument relies on the <meta> tags to discover the content type and encoding of the document. If there is no corresponding <meta> tag, DOM has no idea what the encoding is. For a document fetched over HTTP, the only way left to determine the encoding is to parse the headers returned by the server where the source HTML document lives. DOMDocument::loadHTMLFile() called with a URL does not use the Content-Type header to detect the encoding, so you will have to use cURL (or sockets) to make the HTTP request yourself and then call DOMDocument::loadHTML().
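One possible way to use the header value once you have it is sketched below. Both helper names are hypothetical, and the approach is my own assumption rather than anything DOM prescribes. It needs the mbstring extension: the markup is converted to UTF-8 and every non-ASCII character is turned into a numeric entity, so the parser's own guess about the encoding no longer matters.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: load HTML whose encoding is known from outside the document.
function loadHtmlAs(string $html, string $charset): DOMDocument
{
    // Convert to UTF-8, then replace each non-ASCII character with a numeric
    // entity; the result is plain ASCII and parses the same in any encoding.
    $utf8  = mb_convert_encoding($html, 'UTF-8', $charset);
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// A Latin-1 page with no <meta> tag, as it might arrive over HTTP.
$body    = "<html><head><title>Caf\xE9</title></head><body></body></html>";
$charset = charsetFromContentType('text/html; charset=ISO-8859-1');
$doc     = loadHtmlAs($body, $charset ?? 'ISO-8859-1');

echo bin2hex($doc->getElementsByTagName('title')->item(0)->textContent), "\n";
```

With cURL, the header value passed to charsetFromContentType() can be read with curl_getinfo($ch, CURLINFO_CONTENT_TYPE) after the request completes. The ISO-8859-1 fallback used when the header names no charset is an assumption of this sketch, not a rule.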

You can check whether DOM detected the encoding by looking at the DOMDocument::actualEncoding property. It will be set if you load an XML document with the encoding explicitly declared, or an HTML document with the corresponding <meta> tag present. If you have to use DOM to parse local HTML documents that do not specify the encoding in a <meta> tag, DOM offers no way to determine the encoding of the file.

The problem described here suggests that the existing way of communicating the encoding of HTML, or even plain text, documents over HTTP is not perfect. XML documents, like other document types such as PDF, RTF, or Office documents, do not rely on the charset parameter of the Content-Type HTTP response header, so they can be transferred between servers without any external metadata. HTML documents, however, do rely on that external metadata, which makes offline or local processing difficult. In conclusion, HTML authors should always declare the encoding in a <meta> tag or switch to XHTML.



© 2008 onPHP5.com