PRB: XML Parser Cannot Parse UTF-7 Documents

Symptoms

When attempting to load an XML file saved as UTF-7 (a transfer encoding format for Unicode), the XML parser in Internet Explorer generates the following error message:
Invalid at the top level of the document.
The same error also occurs when using the MSXML parser from server-side or client-side script.

Cause

Versions of the MSXML parser prior to MSXML 2.6 do not support UTF-7.

Resolution

To resolve this problem, save your XML documents as UTF-8, the preferred transfer encoding format for Unicode.

MSXML 2.6 or later supports UTF-7 encoding.
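
The re-saving step described above can also be scripted. The following is a minimal Python sketch, not part of the original article, that decodes UTF-7 bytes and re-encodes the same text as UTF-8. The function name utf7_to_utf8 and the sample bytes are assumptions made for illustration.

```python
def utf7_to_utf8(data: bytes) -> bytes:
    """Decode UTF-7 bytes and re-encode the same text as UTF-8.

    Hypothetical helper for illustration; it is not part of MSXML.
    """
    text = data.decode("utf-7")
    return text.encode("utf-8")


# Made-up sample: the UTF-7 form of <MyTag>café</MyTag>.
sample = b"+ADw-MyTag+AD4-caf+AOk-+ADw-/MyTag+AD4-"
converted = utf7_to_utf8(sample)
```

To convert a file, read it in binary mode, pass the bytes through the helper, and write the result back in binary mode. If the document carries an encoding declaration that names UTF-7, that declaration would also need to be updated by hand.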

Status

This behavior is by design.

More Information

Although Unicode is a uniform character set representing nearly all the world's languages, there are many byte representations, or transformation formats, that a Unicode file can use. The most popular format is UTF-8, which represents Unicode characters as a sequence of one to four 8-bit bytes. UTF-7 is a 7-bit transformation format defined to allow Unicode text to pass through mail gateways that assume ASCII and strip out the high bit of text messages.
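
As a small illustration of the byte representations described above, the following Python sketch (an addition, with sample characters chosen arbitrarily) counts the UTF-8 bytes needed for a few characters and checks that the UTF-7 form of non-ASCII text stays within 7 bits.

```python
# Sample characters chosen for illustration only.
samples = {
    "A": "A",               # U+0041, ASCII letter
    "e-acute": "\u00e9",    # U+00E9
    "euro": "\u20ac",       # U+20AC
    "emoji": "\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
}

# Number of bytes each character occupies in UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
# utf8_lengths == {"A": 1, "e-acute": 2, "euro": 3, "emoji": 4}

# UTF-7 output never sets the high bit, so every byte is below 0x80.
utf7_bytes = "\u00e9\u20ac".encode("utf-7")
seven_bit_only = all(b < 0x80 for b in utf7_bytes)
```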

According to the XML 1.0 standard, Section 4.3.3, a valid XML file is required to be one of the following:

  • A Unicode file in UTF-8 format.
  • A Unicode file in UTF-16 format.
  • A file in some other character encoding (for example, ASCII) that has as its very first bytes the XML declaration, beginning with <?xml and naming the encoding.

UTF-7 does not use the Byte Order Mark. Also, UTF-7 converts the special XML character < to +ADw-, so the UTF-7 encoded XML document begins with + rather than <. Because this does not comply with the XML standard, MSXML refuses to load such files.
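
The effect of this rule can be sketched in Python. The check below is a deliberate simplification and is not how MSXML detects encodings; the function name starts_like_xml is hypothetical. It accepts input that opens with a Byte Order Mark or with <, and it shows that a UTF-7 document of the kind described above satisfies neither condition.

```python
# Byte order marks for UTF-8, UTF-16 little-endian, and UTF-16 big-endian.
BOMS = (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff")


def starts_like_xml(data: bytes) -> bool:
    """Hypothetical pre-check: True if the bytes open with a BOM or '<'."""
    if data.startswith(BOMS):
        return True
    # Leading whitespace is tolerated here for simplicity.
    return data.lstrip(b" \t\r\n").startswith(b"<")


plain = b'<?xml version="1.0"?><MyTag/>'
utf7 = b"+ADw-?xml version+AD0AIg-1.0+ACI-?+AD4-"

plain_ok = starts_like_xml(plain)  # True: first byte is '<'
utf7_ok = starts_like_xml(utf7)    # False: first byte is '+'
```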

Many text editors and word processors allow you to save Unicode text files, known as encoded text in Microsoft Word, in many different transfer encodings, including UTF-7. So if you save a document in Word as "encoded text UTF-7," MSXML will refuse to load it for the above reasons.

Steps to Reproduce Behavior

  1. Create a simple XML file in Word 2000:
    <?xml version="1.0"?>

    <MyTag>
    <EmbeddedTag name1="value"/>
    </MyTag>
  2. Save the file as encoded text. When Word asks you if you wish to lose formatting, click Yes. Word will then prompt you for an encoding format to use. Select UTF-7, and then save the document as TestUTF7.xml.
  3. Load TestUTF7.xml in Internet Explorer 5. You will receive the following error message:
    Invalid at the top level of the document. Line 1, Position 1

    +ADw-?xml version+AD0AIg-1.0+ACI-?+AD4-.
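
The text echoed in the error message is the UTF-7 form of the XML declaration from step 1. As a quick check (an addition to the original article, using Python's built-in utf-7 codec), decoding those bytes recovers the declaration:

```python
# The bytes echoed in the error message, without the trailing period.
echoed = b"+ADw-?xml version+AD0AIg-1.0+ACI-?+AD4-"

# Python's standard "utf-7" codec turns them back into readable text.
decoded = echoed.decode("utf-7")
# decoded == '<?xml version="1.0"?>'
```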

References

For the latest Microsoft Global Software Development

See http://www.unicode.org/ for the latest Unicode Standard.
For more information about developing Web-based solutions for Microsoft Internet Explorer, visit the following Microsoft Web sites:

(c) Microsoft Corporation 2000, All Rights Reserved. Contributions by Jay Andrew Allen, Microsoft Corporation.

Properties

Article ID: 251134 - Last Review: August 20, 2008 - Revision: 1
