PHP5: Articles, News, Tutorials, Interviews, Software and more

SimpleXML, DOM and Encodings


By dennisp on Sunday, 18 November 2007, 20:03
Published under: article   dom   php5   simplexml   unicode

This is a short article that might be of interest to those who work with XML and HTML documents in different encodings using PHP5's SimpleXML and DOM extensions.

While working on one of my pet projects today, I needed a way to get all the <link> tags of a webpage in order to discover its RSS/Atom feeds as well as its shortcut icon (favicon). The feeds would later be processed and some metadata gathered from them.
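The discovery step described above could be sketched roughly as follows. The function name discoverLinks() and the shape of the returned array are hypothetical and not taken from the original project; the sketch only assumes that feeds are advertised with rel="alternate" plus an RSS or Atom MIME type, and that the favicon uses a rel value containing "icon".

```php
<?php
// Hypothetical helper: collect feed URLs and the favicon from <link> tags.
function discoverLinks($html)
{
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feedTypes = array('application/rss+xml', 'application/atom+xml');
    $found = array('feeds' => array(), 'icon' => null);

    foreach ($doc->getElementsByTagName('link') as $link) {
        $rel  = strtolower(trim($link->getAttribute('rel')));
        $type = strtolower(trim($link->getAttribute('type')));
        $href = $link->getAttribute('href');

        if ($rel === 'alternate' && in_array($type, $feedTypes, true)) {
            $found['feeds'][] = $href;
        } elseif (in_array('icon', preg_split('/\s+/', $rel), true)) {
            // Matches both rel="icon" and rel="shortcut icon".
            $found['icon'] = $href;
        }
    }

    return $found;
}
```

The href values are returned exactly as written in the page, so relative URLs still have to be resolved against the page address before fetching.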

Since RSS (both flavors) and Atom feeds are XML documents, the natural solution is to employ the SimpleXML extension to parse them. As we can see from the docs, this extension does not tell us what encoding the source document had. This is because SimpleXML automatically converts text nodes and attribute values to UTF-8.
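A small sketch of that conversion, using a made-up one-item feed declared as ISO-8859-1 (the document content and variable names are illustrative only):

```php
<?php
// An ISO-8859-1 document: "\xE9" is the single-byte Latin-1 form of "é".
$latin1Xml = '<?xml version="1.0" encoding="ISO-8859-1"?>'
           . "<rss><channel><title>Caf\xE9</title></channel></rss>";

$feed  = simplexml_load_string($latin1Xml);
$title = (string) $feed->channel->title;

// The value comes back as UTF-8: "é" is now the two bytes C3 A9.
echo bin2hex($title), "\n"; // 436166c3a9

// If the original bytes are needed, convert back explicitly. The source
// encoding must be known from elsewhere; here it is hard-coded (assumption).
$original = iconv('UTF-8', 'ISO-8859-1', $title);
```

Converting back with iconv() only works when the target encoding can represent every character in the string, which is one more reason to keep everything in UTF-8 after parsing.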

XML documents are UTF-8 by default (more precisely, Unicode: a parser must be able to tell UTF-8 from UTF-16 by inspecting the first bytes of the document, and any other encoding has to be named in the XML declaration), so detecting the correct encoding is not a big problem for SimpleXML. The same applies to the DOM extension when it is used with XML files. DOM also hands text nodes and attribute values back as UTF-8, but unlike SimpleXML it lets us learn what the encoding of the source document was, so for XML files this is a minor issue.
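For XML input, the encoding can be read back from the document object. The following is a minimal sketch with an invented ISO-8859-1 document; it uses the DOMDocument encoding property and prints the text content as hex so the bytes are visible:

```php
<?php
// "\xE9" is the single-byte Latin-1 form of "é".
$latin1Xml = '<?xml version="1.0" encoding="ISO-8859-1"?>'
           . "<entry><title>Caf\xE9</title></entry>";

$doc = new DOMDocument();
$doc->loadXML($latin1Xml);

// The encoding named in the XML declaration is available on the document.
$declared = $doc->encoding;
echo $declared, "\n"; // ISO-8859-1

// Text read through the DOM API is UTF-8 (bytes C3 A9 for "é").
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo bin2hex($title), "\n"; // 436166c3a9

// Serializing the document writes it out in the declared encoding again.
$serialized = $doc->saveXML();
```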

Problems start to appear when you use DOM to parse HTML files. Since HTML files (note that XHTML files are in fact XML) do not have an XML declaration, the resulting DOMDocument relies on the <meta> tags to discover the content type and encoding of the document. If there is no corresponding <meta> tag, DOM has no idea what the encoding is. In that case the only way to determine the encoding is to parse the HTTP headers returned by the server where the source HTML document lives. DOMDocument::loadHTMLFile() called with a URL does not use the Content-Type header to detect the encoding, so you will have to use cURL (or sockets) to make the HTTP request yourself, read the charset from the header, and then call DOMDocument::loadHTML().
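One way the cURL route might look is sketched below. Both function names are hypothetical. The sketch also rests on an assumption that is not part of the article: that prepending a <meta> tag carrying the header's charset is enough to make loadHTML() pick that encoding up. This is a commonly used workaround rather than documented behavior, so verify it against your PHP and libxml2 versions.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value such as "text/html; charset=ISO-8859-1".
function charsetFromContentType($headerValue)
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:\-]+)/i', $headerValue, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: fetch a page with cURL and load it into DOM using
// the charset announced in the HTTP response header, if there is one.
function loadRemoteHtml($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    $contentType = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    if ($html === false) {
        return null; // request failed
    }

    $charset = charsetFromContentType($contentType);
    if ($charset !== null) {
        // Assumption: a leading <meta> tag tells the HTML parser which
        // encoding to use when the document itself does not say.
        $html = '<meta http-equiv="Content-Type" content="text/html; charset='
              . $charset . '">' . $html;
    }

    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}
```

Keeping the header parsing in its own function means it can be checked without any network access, and reused if the request is made over sockets instead of cURL.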

You can check whether DOM detected the encoding by looking at the DOMDocument::actualEncoding property. It will be set if you load an XML document with the encoding explicitly declared, or if you load an HTML document with the corresponding <meta> tag present. If you have to use DOM to parse local HTML documents that do not specify the encoding in a <meta> tag, there is no way to determine the encoding of the file.
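A rough way to run that check on HTML strings is sketched here. The helper names are hypothetical, and the sketch reads the encoding property, which the PHP manual describes actualEncoding as a read-only equivalent of. The exact values reported for HTML input depend on the underlying libxml2 version, so the sketch only dumps them instead of promising a result.

```php
<?php
// Hypothetical helper: the encoding DOM recorded for a document, or null.
function detectedEncoding(DOMDocument $doc)
{
    $encoding = $doc->encoding;
    return ($encoding === null || $encoding === '') ? null : $encoding;
}

// Hypothetical helper: parse an HTML string and report what was detected.
function htmlEncoding($html)
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return detectedEncoding($doc);
}

$withMeta = '<html><head>'
          . '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
          . '<title>t</title></head><body><p>x</p></body></html>';
$withoutMeta = '<html><head><title>t</title></head><body><p>x</p></body></html>';

var_dump(htmlEncoding($withMeta));    // expected to name the charset from the tag
var_dump(htmlEncoding($withoutMeta)); // nothing in the document to detect
```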

The problem described above suggests that the existing way of communicating the encoding of HTML, or even plain text, documents over HTTP is not perfect. XML documents do not rely on the charset parameter of the Content-Type HTTP response header (just like other document types such as PDF, RTF, or Office documents), so they can be transferred between servers without any external metadata. HTML documents, however, do rely on that external metadata, which makes offline or local processing of them difficult. In conclusion, HTML authors should always declare the encoding in a <meta> tag or switch to XHTML.


© 2022 onPHP5.com