Description of storing UTF-8 data in SQL Server


Some applications (especially those that are Web based) must deal with Unicode data that is encoded with the UTF-8 encoding method. SQL Server 7.0 and SQL Server 2000 use a different Unicode encoding (UCS-2) and do not recognize UTF-8 as valid character data. This article discusses some options for dealing with this situation.

More Information

Unicode data can be encoded in many different ways. UCS-2 and UTF-8 are two common ways to store bit patterns that represent Unicode characters. Microsoft Windows NT, SQL Server, Java, COM, and the SQL Server ODBC driver and OLEDB provider all internally represent Unicode data as UCS-2.

The options for using SQL Server 7.0 or SQL Server 2000 as a backend server for an application that sends and receives Unicode data that is encoded as UTF-8 include:

  1. If your application uses Active Server Pages (ASP) and you are using Internet Information Server (IIS) 5.0 and Microsoft Windows 2000, you can add "<% Session.Codepage=65001 %>" to your server-side ASP script. This instructs IIS to convert all dynamically generated strings (example: Response.Write) from UCS-2 to UTF-8 automatically before sending them to the client.

    If you do not want to enable sessions, you can alternatively use the server-side directive "<%@ CodePage=65001 %>".

    Any UTF-8 data sent from the client to the server via GET or POST is also converted to UCS-2 automatically. The Session.Codepage property is the recommended method to handle UTF-8 data within a web application. This Codepage setting is not available on IIS 4.0 and Windows NT 4.0. For additional information, see the following article in the Microsoft Knowledge Base:
    254313 Error Message: Active Server Pages Error 'ASP 0203' Invalid Code
  2. Translate to and from UCS-2 or UTF-8 as appropriate within the application. Sample code for this type of conversion is located at the Unicode Consortium's site:

    A high-level description of the algorithm to convert UCS-2 to UTF-8 can be found in the Internet Request For Comments document RFC2279.

    On Windows NT or Windows 2000, you may use the Win32 functions MultiByteToWideChar and WideCharToMultiByte to convert UTF-8 to and from UCS-2 by passing the constant CP_UTF8 (65001) as the first parameter to the functions.
  3. Modify the application to use UCS-2 instead of UTF-8 encoding.
  4. Store the actual UTF-8 data on the server using the BINARY/VARBINARY/IMAGE columns. Storing UTF-8 data on SQL Server means that you can not use SQL Server to sort or find ranges of these values as if the data were valid character data. The types of operations on columns containing UTF-8 data that would not return expected results include "ORDER BY", greater-than ">" and less-than "<" comparisons, and the built-in SQL Server string manipulation functions such as SUBSTRING().

    However, equality comparisons, will work as long as the strings being compared are equivalent at a byte level. Note that if you store UTF-8 data in SQL Server you should not use character columns (CHAR/NCHAR/VARCHAR and so forth). UTF-8 is not valid character data to SQL Server, and by storing non-character data in character columns you risk encountering problems such as the issues discussed in the following Microsoft Knowledge Base articles:
    155723 INF: SQL Server Truncation of a DBCS String
    234748 PRB: SQL Server ODBC Driver Converts Language Events to Unicode
    If you are considering this option, keep in mind that if you ever need to access UTF-8 data stored inside SQL Server from any application other than a Web browser (for example, from a non-Web-based ODBC application) you will need to do a conversion from UTF-8 to UCS-2 within this application as ODBC, OLEDB, COM, Win32 API calls, VB, and C runtime string manipulation functions do not work with UTF-8 data. This moves the burden of translation to a different application.
  5. If your requirements do not include the need to store data from a combination of languages that cannot be satisfied by a single code page, you may not need to use Unicode.
Unicode support was introduced to SQL Server starting with SQL Server 7.0. Since SQL Server 6.5 does not support the storing of Unicode data, the only options for SQL Server 6.5 are outlined in step 4 and step 5.