This is a short article that may be of interest to those who work with XML and HTML documents in different encodings using PHP 5's SimpleXML and DOM extensions.
While working on one of my pet projects today, I needed a way to get all the <link> tags of a webpage in order to discover its RSS/Atom feeds as well as its shortcut icon (favicon). The feeds would later be processed and some metadata gathered from them.
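A minimal sketch of that discovery step, assuming the usual conventions: feeds are advertised with rel="alternate" plus an RSS or Atom MIME type, and the favicon with a rel value containing "icon". The function name discoverLinks() and the sample markup are hypothetical, not taken from the original project.

```php
<?php
// Collect feed URLs and the favicon URL advertised by <link> tags.
// discoverLinks() is an illustrative helper, not part of PHP.
function discoverLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often sloppy; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = ['feeds' => [], 'icon' => null];
    $feedTypes = ['application/rss+xml', 'application/atom+xml'];

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//link[@rel and @href]') as $link) {
        // rel may hold several space-separated tokens, e.g. "shortcut icon".
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        $type = strtolower(trim($link->getAttribute('type')));
        $href = $link->getAttribute('href');

        if (in_array('alternate', $rels, true) && in_array($type, $feedTypes, true)) {
            $found['feeds'][] = $href;
        } elseif (in_array('icon', $rels, true) && $found['icon'] === null) {
            $found['icon'] = $href;
        }
    }
    return $found;
}

$page = '<html><head>'
      . '<link rel="alternate" type="application/rss+xml" href="/feed.rss">'
      . '<link rel="shortcut icon" href="/favicon.ico">'
      . '</head><body></body></html>';

$links = discoverLinks($page);
print_r($links);
```

The href values come back exactly as written in the page, so relative URLs still have to be resolved against the page address before they are fetched.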
Since RSS (both flavors) and Atom feeds are XML documents, the natural solution is to employ the SimpleXML extension to parse them. As we can see from the docs, this extension does not allow us to learn what encoding the source document had. This is because SimpleXML automatically converts text nodes and attribute values into UTF-8.
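To make the conversion concrete, here is a small sketch. The feed fragment and its ISO-8859-1 declaration are made up for illustration.

```php
<?php
// A feed fragment declared as ISO-8859-1; "\xE9" is "é" in that encoding.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
     . "<rss><channel><title>Caf\xE9</title></channel></rss>";

$feed  = simplexml_load_string($xml);
$title = (string) $feed->channel->title;

// SimpleXML hands the text back as UTF-8: "é" is now the two bytes C3 A9.
echo bin2hex($title), "\n"; // prints 436166c3a9
```

Nothing on the SimpleXMLElement object tells us that the source was ISO-8859-1; only the converted UTF-8 string is available.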
XML documents are UTF-8 by default (more precisely, every parser must support UTF-8 and UTF-16 and be able to tell them apart from the first bytes of the document; any other encoding has to be named in the XML declaration), so it's not a big problem for SimpleXML to detect the correct encoding. The same applies to the DOM extension when used with XML files. DOM also hands text nodes and attribute values back as UTF-8, but unlike SimpleXML it remembers the encoding of the source document. Since we can learn what that encoding was, converting values back to it is a minor issue.
Problems begin to appear when you use DOM to parse HTML files. Since HTML files (note that XHTML files are in fact XML) do not have the XML declaration, the resulting DOMDocument relies on the <meta> tags to discover the content type and encoding of the document. If there is no corresponding <meta> tag, DOM has no idea what the encoding is. In that case the only way to determine the encoding is to parse the HTTP headers returned by the server where the source HTML document lives. DOMDocument::loadHTMLFile() called with a URL does not use the Content-Type header to detect the encoding, so you will have to use cURL (or sockets) to make the HTTP request yourself and then call DOMDocument::loadHTML().
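The cURL route could be sketched roughly as follows. The helper names (fetchHtml(), charsetFromContentType(), loadHtmlWithCharset()) are hypothetical, and prepending a <meta> tag is just one common workaround for passing the encoding to loadHTML(), not the only option; how reliably the parser honors it may depend on the libxml2 version in use.

```php
<?php
// Fetch a page and return [body, Content-Type header value or null].
// Requires the cURL extension; the function is only defined here, not called.
function fetchHtml(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return [$body, is_string($contentType) ? $contentType : null];
}

// Extract the charset parameter from a Content-Type value, if present.
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Parse HTML, passing the charset learned from the HTTP header (if any)
// to the parser by prepending a <meta> tag (an assumed workaround).
function loadHtmlWithCharset(string $html, ?string $charset): DOMDocument
{
    if ($charset !== null) {
        $html = '<meta http-equiv="Content-Type" content="text/html; charset='
              . htmlspecialchars($charset, ENT_QUOTES) . '">' . $html;
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Offline demonstration with a header value as a server might send it.
$charset = charsetFromContentType('text/html; charset=windows-1251');
echo $charset, "\n"; // prints windows-1251
```

With a live page the pieces would be combined as `[$body, $type] = fetchHtml($url); $doc = loadHtmlWithCharset($body, charsetFromContentType($type));`, where $url is whatever address is being inspected.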
You can check whether DOM detected the encoding by looking at the DOMDocument::actualEncoding property. It will be set if you load an XML document with the encoding explicitly declared, or if you load an HTML document with the corresponding <meta> tag present. If you have to use DOM for parsing local HTML documents that do not specify the encoding in a <meta> tag, there is no way to determine the encoding of the file.
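A small sketch of that check follows. It reads the encoding property, which the PHP manual describes as the equivalent of the deprecated actualEncoding; detectedEncoding() and the sample documents are made up for illustration. No expected output is shown for the HTML cases, since the reported value depends on the parser.

```php
<?php
// Return the encoding DOM recorded for a document, or null if it found none.
// detectedEncoding() is an illustrative helper, not part of PHP.
function detectedEncoding(DOMDocument $doc): ?string
{
    return $doc->encoding ?: null;
}

// XML with an explicit declaration.
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML('<?xml version="1.0" encoding="ISO-8859-1"?><root/>');

// HTML that names its encoding in a <meta> tag.
$htmlWithMeta = new DOMDocument();
$htmlWithMeta->loadHTML(
    '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
    . '<title>t</title></head><body><p>x</p></body></html>'
);

// HTML that says nothing about its encoding.
$htmlWithoutMeta = new DOMDocument();
$htmlWithoutMeta->loadHTML(
    '<html><head><title>t</title></head><body><p>x</p></body></html>'
);

$documents = [
    'XML, declared'      => $xmlDoc,
    'HTML, with meta'    => $htmlWithMeta,
    'HTML, without meta' => $htmlWithoutMeta,
];
foreach ($documents as $label => $doc) {
    printf("%-20s %s\n", $label, detectedEncoding($doc) ?? '(not detected)');
}
```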
This problem suggests that the existing way of communicating the encoding of HTML, or even plain text, documents over HTTP is not perfect. XML documents do not rely on the charset part of the Content-Type HTTP response header (just like other document types such as PDF, RTF, or Office documents), so they can be transferred between servers without any external metadata. HTML documents, however, do rely on that external metadata, which makes offline or local processing of them difficult. In conclusion, HTML authors should always declare the encoding in a <meta> tag or switch to XHTML.