The behavior of the UTF8Encoding class, the UnicodeEncoding class, and the UTF32Encoding class changes after you install the security update for the .NET Framework 2.0 that is described in security bulletin MS07-040
This article has been archived. It is offered "as is" and will no longer be updated.
After you install the security update that is described in security bulletin MS07-040 for the Microsoft .NET Framework 2.0, the behavior of the UTF8Encoding class, the UnicodeEncoding class, and the UTF32Encoding class changes to comply with the Unicode 5.0 requirements for Unicode encodings. Invalid bytes are no longer removed. Instead, the invalid bytes are replaced by U+FFFD, the Unicode replacement character. The behavior for valid Unicode strings does not change.
Note: By default, the .NET Framework 2.0 for Windows Vista includes this behavior.
The change in how invalid data is handled was made after the release of the .NET Framework 2.0, because the Unicode 5.0 specification is stricter than earlier versions of the standard. Security bulletin MS07-040 is a cumulative update for the .NET Framework, and it includes an update to comply with the Unicode 5.0 specification.
The new default behavior is equivalent to setting the replacement fallbacks to U+FFFD instead of to the empty string. If a program needs the old behavior, it can create UTF8Encoding with the replacement fallbacks set to empty strings, that is, with EncoderReplacementFallback("") and DecoderReplacementFallback(""). These fallbacks remove the invalid data. However, they also bring back the security issue that the new default behavior corrects.
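The difference between the two fallback policies can be sketched in Python. This is only an analogy and not the .NET API: it assumes that Python's errors="ignore" handler behaves like a replacement fallback of the empty string, and that errors="replace" behaves like a replacement fallback of U+FFFD. The helper names are hypothetical.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character


def decode_dropping_invalid(data: bytes) -> str:
    """Old-style policy (assumed analog of an empty-string fallback)."""
    return data.decode("utf-8", errors="ignore")


def decode_replacing_invalid(data: bytes) -> str:
    """New-style policy (assumed analog of a U+FFFD fallback)."""
    return data.decode("utf-8", errors="replace")


# The byte 0xFF can never appear in well-formed UTF-8.
sample = b"ab\xffcd"
dropped = decode_dropping_invalid(sample)    # invalid byte silently removed
replaced = decode_replacing_invalid(sample)  # invalid byte stays visible as U+FFFD
```

With the dropping policy, a reader of the result cannot tell that the input was ever malformed. With the replacing policy, the malformed position remains marked.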
Earlier versions of the .NET Framework 2.0 followed Unicode 4.1, which was the latest Unicode standard available at the time. The specifications for Unicode 4.1 disallowed the passing of invalid UTF code points, so any invalid data that was encountered was dropped. This behavior was considered to have minimal effect on existing programs.
However, ignoring invalid bytes can let hostile data through: when invalid characters are removed, an invalid text string can become a valid one. Unicode 5.0 therefore requires that invalid bytes not be removed. Invalid bytes are now replaced with U+FFFD, the Unicode replacement character. This character is typically displayed as a diamond that contains a question mark.
Before this change, invalid characters in the middle of text strings were silently removed. For example, the string "Ad\xD800min\xDC00istrator" would change to "Administrator" because U+D800 and U+DC00 are unpaired surrogates and are therefore invalid. This could cause a security problem for some programs, for example when a string passes a check for "Administrator" and only becomes that name after conversion. After you install the security update that is described in security bulletin MS07-040, this string becomes "Ad\xFFFDmin\xFFFDistrator". It is displayed as "Ad�min�istrator", where � is the Unicode replacement character.
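The "Administrator" scenario can be reproduced with a short Python sketch. As an assumption for illustration, it uses Python's UTF-16 decoder with errors="ignore" as a stand-in for the old .NET behavior and errors="replace" for the new one; the byte-level filter and the name BLOCKED_NAME are hypothetical.

```python
def utf16le(text: str) -> bytes:
    """Encode well-formed text as little-endian UTF-16 bytes."""
    return text.encode("utf-16-le")


# "Ad" + lone high surrogate U+D800 + "min" + lone low surrogate U+DC00 + "istrator"
raw = (
    utf16le("Ad") + b"\x00\xd8"
    + utf16le("min") + b"\x00\xdc"
    + utf16le("istrator")
)

BLOCKED_NAME = "Administrator"  # hypothetical name that a filter wants to reject

# A byte-level filter that runs before decoding does not see the blocked name.
passes_raw_filter = utf16le(BLOCKED_NAME) not in raw

# Dropping the invalid code units reassembles the blocked name.
lossy = raw.decode("utf-16-le", errors="ignore")

# Replacing them with U+FFFD keeps the tampering visible.
safe = raw.decode("utf-16-le", errors="replace")
```

In this sketch the lossy result equals the blocked name even though the raw input passed the filter, while the replaced result still differs from it.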
Another noteworthy case that produces the Unicode replacement character is a UTF-16 byte array that ends in a single trailing NULL byte. Previously, the decoder removed the trailing NULL byte. Now, the decoder converts the NULL byte to the Unicode replacement character.
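A minimal Python sketch of the trailing-byte case follows. It again assumes that Python's "ignore" and "replace" error handlers are reasonable stand-ins for the old and new .NET behavior; the variable names are illustrative only.

```python
# Two bytes for "A" in little-endian UTF-16, followed by one stray NULL byte.
data = "A".encode("utf-16-le") + b"\x00"

# Old-style policy: the incomplete code unit at the end is dropped.
old_style = data.decode("utf-16-le", errors="ignore")

# New-style policy: the incomplete code unit becomes U+FFFD.
new_style = data.decode("utf-16-le", errors="replace")
```

A program that compares decoded lengths, or that passes a buffer one byte longer than intended, may therefore see one extra character after the update.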
The following are examples of invalid characters and invalid bytes that cause the Unicode replacement character to be added:
Invalid characters in Unicode strings when encoding to UTF-8, UTF-16, or UTF-32 include the following:
Individual high surrogates without a following low surrogate
Low surrogates without a high surrogate
Surrogate pairs that are positioned out of order
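The three surrogate cases above can be sketched as a single scan over UTF-16 code units. This is an illustrative Python sketch and not the .NET implementation; the function names are hypothetical.

```python
from typing import List

REPLACEMENT = 0xFFFD  # U+FFFD, the Unicode replacement character


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def replace_unpaired_surrogates(units: List[int]) -> List[int]:
    """Return a copy of `units` with every unpaired surrogate replaced by U+FFFD."""
    out: List[int] = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if is_high_surrogate(unit) and has_next and is_low_surrogate(units[i + 1]):
            # Well-formed pair: keep both code units.
            out.extend(units[i:i + 2])
            i += 2
        elif is_high_surrogate(unit) or is_low_surrogate(unit):
            # Lone high surrogate, lone low surrogate, or member of an out-of-order pair.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(unit)
            i += 1
    return out
```

For example, replace_unpaired_surrogates([0xDC00, 0xD800]) yields two replacement characters, because a low surrogate followed by a high surrogate does not form a pair.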
Invalid bytes when decoding UTF-8 include the following:
Trailing bytes without a lead byte
Malformed sequences that have an incorrect number of trailing bytes for the lead byte
Characters that are encoded by using more bytes than are necessary to represent the character (non-shortest form)
Sequences that decode to invalid code points, such as individual surrogates
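Concrete byte sequences for each of the UTF-8 cases above are sketched below in Python. The sample bytes are chosen for illustration, and Python's strict UTF-8 decoder is used as an assumed stand-in for a decoder that follows the stricter rules. The exact number of replacement characters that a decoder emits for each sequence can differ between implementations, so the sketch does not rely on it.

```python
ILL_FORMED_UTF8 = {
    "trailing byte without a lead byte": b"\x80",
    "lead byte with too few trailing bytes": b"\xe2\x82",
    "non-shortest form of '/'": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
}


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# For each case: (accepted by strict decoder?, result when invalid input is replaced)
results = {
    label: (is_well_formed_utf8(data), data.decode("utf-8", errors="replace"))
    for label, data in ILL_FORMED_UTF8.items()
}
```

Every entry in results is rejected by the strict decoder and contains at least one U+FFFD when decoded with replacement.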
Invalid bytes when decoding UTF-16 include the following:
An odd number of bytes in the buffer. This causes the last lone byte to fall back.
Individual high surrogates without a following low surrogate.
Low surrogates without a high surrogate.
Surrogate pairs that are positioned out of order.
Invalid bytes when decoding UTF-32 include the following:
Values that are outside the Unicode character range. For example, these values may be greater than U+10FFFF.
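The UTF-32 range check can be illustrated with a short Python sketch. It assumes Python's UTF-32 decoder with errors="replace" as a stand-in for the updated .NET behavior; the helper name is hypothetical.

```python
import struct

MAX_CODE_POINT = 0x10FFFF  # highest code point defined by Unicode


def decode_utf32le_unit(value: int) -> str:
    """Decode one 32-bit value as little-endian UTF-32, replacing invalid input."""
    return struct.pack("<I", value).decode("utf-32-le", errors="replace")


in_range = decode_utf32le_unit(MAX_CODE_POINT)          # last valid code point
out_of_range = decode_utf32le_unit(MAX_CODE_POINT + 1)  # first value past the range
```

The first call returns the character U+10FFFF unchanged, and the second returns the replacement character.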