Microsoft Web
Saturday, March 31st, 2007
Here’s a snippet from Microsoft’s current corporate home page:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en" dir="ltr">
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Microsoft Corporation</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="SearchTitle" content="Microsoft.com">
<meta name="SearchDescription" content="Microsoft.com Homepage">
Can you spot the problem?
Update: James Booker is the first in with an answer: there are two Content-Type meta tags above, and they specify different character sets.
Specifying the content type in a meta tag is a bit of a hack: the browser has to scan the first part of the document for a content-type declaration, then reinterpret the page using the character set it names. A character set of "utf-16" makes no sense in this scenario, because the browser does its sniffing by reading the HTML as ASCII. If the page really were UTF-16, the scan would fail: the bytes for the string "Content-Type" in UTF-16 are not the same as its bytes in UTF-8, since UTF-16 spends two bytes on every ASCII character.
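A quick check in Python shows the difference. This is only a minimal sketch: it assumes Python 3 and uses the explicit little-endian codec `utf-16-le` so that no byte order mark is prepended, and the variable names are purely illustrative.

```python
# The same text encoded two ways.
text = "Content-Type"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

print(utf8_bytes)   # b'Content-Type'
print(utf16_bytes)  # b'C\x00o\x00n\x00t\x00e\x00n\x00t\x00-\x00T\x00y\x00p\x00e\x00'

# A scanner that reads the raw bytes as ASCII finds the declaration
# in a UTF-8 page but not in a UTF-16 one.
meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-16">'
found_in_utf8 = b"Content-Type" in meta.encode("utf-8")
found_in_utf16 = b"Content-Type" in meta.encode("utf-16-le")
print(found_in_utf8, found_in_utf16)  # True False
```

Every ASCII character in the UTF-16 output is followed by a zero byte, so a byte-for-byte search for the ASCII string never matches.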
Thankfully, the HTML spec foresaw this problem:
The META declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element.
So there are actually three problems with the above HTML: there are two content-type declarations in meta tags, one of them is bogus, and the correct one isn't as early as possible in the head element. These problems, thankfully, are mitigated by the presence of an HTTP header that specifies the correct character set, and by the incredible amount of effort browser vendors have put into making their code accepting of mistakes such as these.
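Conflicting declarations like these are easy to check for mechanically. The following is a rough sketch using Python's standard `html.parser` module; the class name `ContentTypeMetaFinder` and the shortened `snippet` string are made up for illustration, and this is not how any browser actually does its sniffing.

```python
from html.parser import HTMLParser


class ContentTypeMetaFinder(HTMLParser):
    """Record the charset named by each Content-Type meta tag, in order."""

    def __init__(self):
        super().__init__()
        self.charsets = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META> matches too.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "content-type":
            return
        content = attr.get("content") or ""
        # Pull the value out of "text/html; charset=utf-8".
        for part in content.split(";"):
            name, _, value = part.strip().partition("=")
            if name.lower() == "charset":
                self.charsets.append(value.strip().lower())


# Shortened version of the markup quoted above (hypothetical test input).
snippet = """
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Microsoft Corporation</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
"""

finder = ContentTypeMetaFinder()
finder.feed(snippet)
print(finder.charsets)                # ['utf-16', 'utf-8']
print(len(set(finder.charsets)) > 1)  # True: the declarations conflict
```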