Article ID: 940521
After you install the security update that is described in security bulletin MS07-040 for the Microsoft .NET Framework 2.0, the behavior of the UTF8Encoding class, the UnicodeEncoding class, and the UTF32Encoding class changes to comply with the Unicode 5.0 requirements for Unicode encodings. Invalid bytes are not removed. Instead, the invalid bytes are replaced by the Unicode character U+FFFD. U+FFFD is the Unicode replacement character. There is no change for a valid Unicode string.
Note By default, the .NET Framework 2.0 for Windows Vista includes this behavior.
The change in how invalid data is handled occurred after the release of the .NET Framework 2.0. The Unicode 5.0 specification is stricter than earlier versions of the standard. Security bulletin MS07-040 is a cumulative update for the .NET Framework, and it includes an update to comply with the Unicode 5.0 specification.
The new default behavior is equivalent to setting the replacement fallbacks to U+FFFD instead of to the empty string. To restore the old behavior, a program can create a UTF8Encoding object whose replacement fallbacks are set to empty strings. When the encoding uses an EncoderReplacementFallback("") object and a DecoderReplacementFallback("") object, the fallbacks remove the invalid data. However, this configuration also reintroduces the security issue that the new default behavior corrects.
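The difference between the two fallback settings can be sketched as follows. This is a minimal illustration, not code from the security update: the class and method names (FallbackSketch, CreateLegacyUtf8, and so on) are hypothetical, and the sketch assumes that the encoding is obtained through the Encoding.GetEncoding overload that accepts an encoder fallback and a decoder fallback.

```csharp
using System;
using System.Text;

// Hypothetical helper class for illustration only.
public static class FallbackSketch
{
    // Returns a UTF-8 encoding whose fallbacks replace invalid data
    // with the empty string, which mimics the old "drop" behavior.
    // Note: this also brings back the security issue described above.
    public static Encoding CreateLegacyUtf8()
    {
        return Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));
    }

    // Decodes with the default fallbacks of UTF8Encoding.
    public static string DecodeWithDefault(byte[] bytes)
    {
        return new UTF8Encoding().GetString(bytes);
    }

    // Decodes with the empty-string fallbacks.
    public static string DecodeWithLegacy(byte[] bytes)
    {
        return CreateLegacyUtf8().GetString(bytes);
    }
}
```

For the byte array { 0x41, 0xFF, 0x42 }, where 0xFF is never valid in UTF-8, DecodeWithDefault should return "A", U+FFFD, "B" on a system that has the update, and DecodeWithLegacy should return "AB".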
Earlier versions of the .NET Framework 2.0 followed Unicode 4.1, which was the latest Unicode standard available at the time. The Unicode 4.1 specification disallowed the passing of invalid UTF code points, so any invalid data that was encountered was dropped. This behavior was considered to have minimal effect on existing programs.
However, silently dropping invalid bytes can let hostile data through. When the invalid characters are removed, a text string that should have been rejected can become a valid string. Unicode 5.0 therefore requires that invalid bytes not be removed, and invalid bytes are now replaced with U+FFFD, the Unicode replacement character. The Unicode replacement character is displayed as a diamond that contains a question mark.
Before this change, invalid characters in the middle of text strings were silently removed. For example, the string "Ad\xD800min\xDC00istrator" would change to "Administrator" because U+D800 and U+DC00 are surrogate code points, which are invalid when they are unpaired. This could cause a security problem for some programs, for example when a check for the name "Administrator" runs before the conversion. After you install the security update that is described in security bulletin MS07-040, this string becomes "Ad\xFFFDmin\xFFFDistrator". This string is decoded to "Ad�min�istrator", where � is the Unicode replacement character.
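The "Administrator" example can be reproduced with a short round trip through UTF-8. The following is a sketch under stated assumptions: the SurrogateSketch class and its members are hypothetical names, and the empty-string fallbacks are used only to imitate the behavior from before the update.

```csharp
using System;
using System.Text;

// Hypothetical helper class for illustration only.
public static class SurrogateSketch
{
    // "Administrator" with an unpaired high surrogate and an
    // unpaired low surrogate inserted in the middle.
    public const string Input = "Ad\uD800min\uDC00istrator";

    // Encodes the text to bytes and decodes the bytes again.
    public static string RoundTrip(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    // Default fallbacks: each unpaired surrogate becomes U+FFFD.
    public static string WithDefaultFallback()
    {
        return RoundTrip(new UTF8Encoding(), Input);
    }

    // Empty-string fallbacks: each unpaired surrogate is dropped,
    // which imitates the behavior from before the update.
    public static string WithEmptyFallback()
    {
        Encoding legacy = Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));
        return RoundTrip(legacy, Input);
    }
}
```

With the update installed, WithDefaultFallback should return the string with U+FFFD in place of each surrogate, so a comparison with "Administrator" fails. WithEmptyFallback should return exactly "Administrator", which is the result that made the old behavior risky.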
A noteworthy scenario that also produces the Unicode replacement character occurs when a UTF-16 byte array that consists of a single trailing NULL byte is passed through the decoder to be converted. In the past, the trailing NULL byte was removed. However, now the NULL byte is converted to the Unicode replacement character.
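The trailing-byte scenario can be checked with a few lines of code. This is a minimal sketch, and the TrailingByteSketch class and its method names are hypothetical. It assumes the default UnicodeEncoding constructor, which creates a little-endian UTF-16 encoding that does not throw an exception on invalid bytes.

```csharp
using System;
using System.Text;

// Hypothetical helper class for illustration only.
public static class TrailingByteSketch
{
    // A UTF-16 code unit needs two bytes, so a single byte
    // is an incomplete (invalid) sequence.
    public static string DecodeSingleNullByte()
    {
        byte[] oneByte = new byte[] { 0x00 };
        return new UnicodeEncoding().GetString(oneByte);
    }

    // A valid "A" (0x41 0x00) followed by one stray NULL byte.
    public static string DecodeOddLengthArray()
    {
        byte[] bytes = new byte[] { 0x41, 0x00, 0x00 };
        return new UnicodeEncoding().GetString(bytes);
    }
}
```

With the update installed, DecodeSingleNullByte should return a one-character string that contains U+FFFD, and DecodeOddLengthArray should return "A" followed by U+FFFD. Code that pads or truncates UTF-16 buffers may want to check for an odd byte count before decoding.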
The following are examples of invalid characters and invalid bytes that cause the Unicode replacement character to be added:
Article ID: 940521 - Last Review: December 3, 2007 - Revision: 2.2