For this month's column, I am going to discuss globalization issues in Active Server Pages (ASP) and ASP.NET: the issues that we face in ASP, how things have changed in ASP.NET 1.x, and what's up with ASP.NET 2.0 on the globalization front.
Note: If you come across a term you don't understand, see the Glossary section at the bottom of this column.
Globalization issues in ASP
Before ASP.NET, there was no structured support for developing applications for global users. During the early days of ASP, developers such as myself found only scattered support for globalization in operating systems, browsers, ASP, and back-end systems, and these components seldom worked together automatically. Fortunately, we did understand concepts such as character sets, code pages, browser languages, and fonts, which we could leverage to develop applications for global users.
It would be too difficult to separate into categories all of the globalization issues that those of us in ASP.NET have seen. Instead, I'll list a series of concepts that relate to a variety of those issues.
Character sets and codepages
We all know that the characters on our computer screen are just a series of bytes. The byte series can be created and interpreted in any number of ways. If the interpretation uses an encoding that is different from the encoding that the byte array was created with, the result displays as garbage. Character sets (charsets) are encoding names that are usually used by browsers. A codepage, which is more applicable to server-side conversions, is a conversion table that specifies how characters are encoded.
Browsers encode form post data according to the current character set. If the current character set is "windows-1256", the bytes transmitted to the server are also encoded as "windows-1256".
When an ASP page is being interpreted, the Form and QueryString collections are not built until they are referenced in code. When they are built, the string data is converted to Unicode according to the current codepage. (By default, both ASP and ASP.NET process content in Unicode format.) It is very important that you set the correct codepage before referencing the collections; otherwise, the Unicode representation in memory won't be correct.
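To see why the codepage must match the charset that the browser used, here is a minimal C# console sketch. It is not ASP code: the class and method names are hypothetical, and it uses UTF-8 and iso-8859-1 (rather than windows-1256) only because both are available in every .NET runtime without extra setup. It encodes a string the way a browser would and then decodes the bytes with the right and the wrong encoding.

```csharp
using System;
using System.Text;

public static class CodepageMismatchDemo
{
    // Encodes text with one encoding (the "browser" side) and decodes
    // the resulting bytes with another (the "server" side).
    public static string Reinterpret(string text, Encoding writer, Encoding reader)
    {
        byte[] wire = writer.GetBytes(text);
        return reader.GetString(wire);
    }

    public static void Main()
    {
        Encoding utf8 = Encoding.UTF8;
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "caf\u00E9"; // "cafe" with an accented e (U+00E9)

        // Same encoding on both sides: the text survives the round trip.
        Console.WriteLine(Reinterpret(original, utf8, utf8) == original); // True

        // Mismatch: the two UTF-8 bytes of U+00E9 (C3 A9) are read as two
        // separate single-byte characters, U+00C3 and U+00A9.
        string garbled = Reinterpret(original, utf8, latin1);
        Console.WriteLine(garbled == "caf\u00C3\u00A9"); // True
        Console.WriteLine(garbled.Length);               // 5
    }
}
```

The same effect occurs on the server when the codepage in force does not match the charset that the form data was posted in.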
To set a codepage, use Session.Codepage or Response.Codepage. The Response.Codepage is only available in Microsoft Internet Information Services (IIS) 5.1 or later versions. For information about the integer values (which correspond to the character set) that we would set these properties to, visit the following Microsoft Web site:
Session.Codepage = 1256
Accept languages
If an ASP developer wants to know which languages a user has set in the browser, the developer can read the Request.ServerVariables("HTTP_ACCEPT_LANGUAGE") variable. It lists the languages that the user would like to read the response in (such as English, German, or Hindi), in the user's order of preference. In ASP.NET, similar information is available as an array in the Request.UserLanguages property.
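As a rough illustration of what that header contains, the following C# sketch splits an Accept-Language value into language tags. The class and method names are hypothetical, and the sketch assumes that the browser lists languages in descending order of preference; it strips the optional ";q=" weights rather than sorting by them.

```csharp
using System;
using System.Collections.Generic;

public static class AcceptLanguageDemo
{
    // Returns the language tags in header order, with any ";q=..." weight removed.
    public static List<string> ParseTags(string header)
    {
        List<string> tags = new List<string>();
        if (string.IsNullOrEmpty(header))
        {
            return tags; // no header sent: nothing to report
        }
        foreach (string part in header.Split(','))
        {
            string tag = part.Split(';')[0].Trim();
            if (tag.Length > 0)
            {
                tags.Add(tag);
            }
        }
        return tags;
    }

    public static void Main()
    {
        List<string> tags = ParseTags("de-AT, de;q=0.8, en;q=0.5");
        Console.WriteLine(string.Join(" | ", tags.ToArray())); // de-AT | de | en
    }
}
```

In ASP.NET, the entries of Request.UserLanguages can carry the same ";q=" suffix, so similar trimming may be needed before passing an entry to CultureInfo.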
For more information about how to use this information in ASP code, click the following article number to view the article in the Microsoft Knowledge Base:
Displaying multi-byte character sets in Internet Explorer
The only encoding that can display several multi-byte character sets on the same page is a Unicode encoding (UTF-8). With UTF-8, we can display Cyrillic, Hindi, and Japanese all on the same page. If we do not use UTF-8, we can show only one of these languages at a time. To set the charset that the browser uses, set the Response.CharSet property.
Static multi-byte characters on a page
To display multi-byte characters that are stored directly in the page, we must first save the page with a specific encoding. UTF-8 is the best choice, but a specific codepage (one that matches the characters) works as well.
Saving an ASP file with Microsoft Visual InterDev doesn't help here, because Visual InterDev can save only as ANSI English or as Unicode, and ASP does not support pages that are saved as Unicode.
In Microsoft Visual Studio .NET, you can save a file in any encoding. There are two ways to do this. The default way is to save the file by using the current codepage for the user. An additional way to save a file with an encoding is as follows:
On the File menu, click Save File As. In the Save File As dialog box, click the drop-down arrow on the Save button. When you click the arrow, the options are Save and Save with Encoding. When you click Save with Encoding, the Advanced Save Options dialog box appears where you can select the type of encoding that you want to apply from a list of the codepages that are installed on the computer.
Note: This changes the encoding for the save operation, but is for one time only. The next save will be set back to the default.
To change the default codepage, click Advanced Save Options on the File menu. In the Advanced Save Options dialog box, you can set the default encoding for save operations to the codepage of your choice.
These methods are related to how the file is saved on disk. However, to control the output for ASP, as already discussed, we need to set the Session.CodePage and the Response.CharSet properties. With IIS 5.1 and later versions, we can also use the Response.CodePage property.
Default CODEPAGE on server
The default locale and the default codepage for the page depend on the registry settings for the .DEFAULT user. We can find the international key at registry hive HKEY_USERS\.DEFAULT\Control Panel\International. We can also change the behavior of the locale that is chosen by IIS. For more information, see the "IIS 5.0" section in the following Knowledge Base article:
Example: The default locale has the date format set to 11.1.2004, while the logged-on user (with the same locale set) has the date format 11/1/2004. Because ASP reads the .DEFAULT profile, the 11.1.2004 setting takes effect for ASP.
For ASP.NET, this can vary. In some installations, the ASPNET user has its own profile that shows up under HKEY_USERS when it is loaded. In others, it uses the .DEFAULT profile. We can also use the codepage attribute in the <%@ %> declaration. This should be used when the file is saved with a different encoding than the default, such as codepage 932 (Japanese).
Codepage issues versus font conversion issues: which is which?
At times, you may see a question mark (?) character or a box where a character is supposed to appear.
Codepage conversion issues
When a character is replaced by a question mark (?), a codepage conversion issue has occurred. The question mark is the default replacement character for codepage conversion: the operating system does not know how to convert the character value, so it substitutes a question mark. This could mean that the character has an invalid value for the codepage or that the codepage that is needed for the conversion is not installed.
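The substitution is easy to reproduce outside of a Web page. The following C# console sketch (hypothetical class and method names) converts text to the 7-bit ASCII encoding and back. ASCII is used here only as a stand-in for any codepage that lacks the characters in question.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Converts text to ASCII bytes and back. Characters that ASCII
    // cannot represent are replaced with '?' during the conversion.
    public static string RoundTripAscii(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine(RoundTripAscii("caf\u00E9"));    // caf?
        Console.WriteLine(RoundTripAscii("\u65E5\u672C")); // ?? (two Japanese characters)
    }
}
```

Once the question mark has been written, the original character is gone; the conversion is lossy, so the fix has to happen before the conversion, not after it.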
Font conversion issues
When a character is replaced by a box, this is an indication that a font conversion issue has occurred. This occurs on the client side when the client does not have the correct font installed to display this character correctly. For example, when a character is from the Japanese charset, and the client does not have the Japanese fonts installed, the Japanese character is displayed as a box.
Next, I'll talk about how things changed in ASP.NET 1.x, and how those changes affect globalization issues in the context of ASP.NET.
Globalization issues in ASP.NET 1.x
With ASP.NET, three great things were introduced:
- The <globalization> tag in the web.config file
The <globalization> tag takes us away from the incoherent concepts of codepages and charsets and lets us control most of the variants within ASP.NET.
- The System.Globalization namespace
The System.Globalization namespace gives us programmatic control over globalization.
- Greatly improved resource files
We no longer deal with resource files the way that we did in ASP. Now, the resource files are XML files when we design and develop them, and they exist as assemblies at runtime.
Important settings in the <globalization> tag are as follows:
| Attribute | Description |
| requestEncoding | Specifies the assumed encoding of each incoming request. |
| responseEncoding | Specifies the content encoding of responses. |
| fileEncoding | Specifies the default encoding for .aspx, .asmx, and .asax file parsing. Unicode and UTF-8 files saved with the byte order mark prefix (with signature) will be automatically recognized, regardless of the value of fileEncoding. |
| culture | Specifies the default culture for processing incoming Web requests (applicable on methods of classes from the System.Globalization namespace). |
| uiCulture | Specifies the default culture for processing locale-dependent resource searches (satellite assemblies). |
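For orientation, a <globalization> section in web.config might look like the following fragment. The attribute names are the real ones; the values are only an example (an Austrian German site that serves UTF-8) and should be replaced with whatever fits the application.

```xml
<configuration>
  <system.web>
    <!-- Example values only: UTF-8 on the wire and on disk,
         Austrian German formatting, German resources. -->
    <globalization
      requestEncoding="utf-8"
      responseEncoding="utf-8"
      fileEncoding="utf-8"
      culture="de-AT"
      uiCulture="de" />
  </system.web>
</configuration>
```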
For requestEncoding, the runtime will read the request and interpret it according to the setting in this section. This is a setting that can cause problems, however. The table below shows the bit layout of a valid UTF-8 byte sequence.
If the character value falls in the 7-bit ASCII range, the byte value is not modified. If the value is above 127, it must follow the rules below. The leading bits of the first byte show how many bytes are in the sequence. Each byte after the first must start with the bits 10.
UTF-8 byte layout:
| Bytes | Value bits | Byte layout |
| 1 | 7 | 0vvvvvvv |
| 2 | 11 | 110vvvvv 10vvvvvv |
| 3 | 16 | 1110vvvv 10vvvvvv 10vvvvvv |
| 4 | 21 | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv |
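The layout can be checked against real characters. This C# console sketch (hypothetical class and method names) prints the UTF-8 bytes of a few sample characters in binary. The 1-byte and 2-byte forms follow the same pattern as the longer ones: a single byte starts with 0, and a 2-byte sequence starts with 110 followed by one continuation byte.

```csharp
using System;
using System.Text;

public static class Utf8LayoutDemo
{
    // Returns the UTF-8 bytes of text as space-separated 8-bit binary strings.
    public static string ToBits(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        string[] parts = new string[bytes.Length];
        for (int i = 0; i < bytes.Length; i++)
        {
            parts[i] = Convert.ToString(bytes[i], 2).PadLeft(8, '0');
        }
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        Console.WriteLine(ToBits("A"));            // 01000001
        Console.WriteLine(ToBits("\u00E9"));       // 11000011 10101001
        Console.WriteLine(ToBits("\u20AC"));       // 11100010 10000010 10101100
        Console.WriteLine(ToBits("\uD83D\uDE00")); // 11110000 10011111 10011000 10000000
    }
}
```

The samples are U+0041, U+00E9, U+20AC (the euro sign), and U+1F600 (written as a surrogate pair in the C# string). A request body whose bytes do not follow these patterns is not valid UTF-8, which is how a wrong requestEncoding setting shows up.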
Runtime encoding changes
In the Application_BeginRequest event, we can modify the value of requestEncoding and have it take effect before the request is processed. For the response, the Page_PreRender event is the last chance to modify the encoding of the output. Also note that Response.Write puts characters into the response buffer as soon as we call it, so be sure to have the right encoding set before using Response.Write.
Original data is non-Unicode: how can we still make Internet Explorer interpret multi-byte charsets?
We can also make ASP.NET behave like ASP if we need to. To do this, we set responseEncoding and requestEncoding to windows-1252 (a more complete encoding than iso-8859-1) and use the Response.Charset property to display the text correctly. This works because windows-1252 is a single-byte encoding scheme and does not modify any bytes that are added to the buffer. Thus, double-byte characters are sent as a series of single bytes. We can then tell Internet Explorer how to interpret the bytes by using the Response.Charset property. This scenario may be necessary if the original data is not stored as Unicode or UTF-8, such as a return value from a COM object, or if the data is stored in Microsoft SQL Server in a non-N column (such as varchar).
SQL Server and ASP.NET globalization issues
Unicode data input to SQL Server
The best way to store data in SQL Server is as Unicode. Whenever we use INSERT, UPDATE, and so on, if there is any chance that the data contains Unicode characters, we need to add an N prefix before the string value. This tells the database that the value is Unicode. The ADO objects are a good example: they do this automatically if we use the Recordset object to add new records.
The following is an example:
INSERT INTO MusicAlbum (Album_ID, [Year], Name, Artist_ID, Company_ID) VALUES (12345, 2005, N'Abida', 4653, 403)
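' Illustration only: concatenating TextBox1.Text invites SQL injection; prefer an NVarChar parameter in real code.
Dim t As String = "INSERT INTO MusicAlbum(Album_ID, [Year], Name, Artist_ID, Company_ID) VALUES (12345, 2005, N'" & TextBox1.Text & "', 4653, 403)"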
Date/Time input to SQL Server
We usually know the culture and locale in which date/time values are interpreted within our ASP.NET application. However, when we push date/time data to external sources and pull it back, we run the risk of misinterpreting the date/time formats, because we cannot always guarantee that the culture and locale of the external source are the same as in our application. In SQL Server, this can be solved by using the 'current language' attribute in the connection string of the connection to the SQL database. We can specify the language in the connection string that matches the culture in our application. This protects us from misinterpretation, because SQL Server accepts and sends date/time data in accordance with that setting.
System.Globalization namespace
This namespace is the core of globalization and localization in the .NET Framework. The main class used in this namespace is the CultureInfo class. It holds culture-specific information, such as the date/time format, number formats, comparison information, and text information. For more information about the CultureInfo class, visit the following MSDN Web site:
Neutral cultures vs. specific cultures
A neutral culture is associated with a language but not with a specific country or region. A specific culture is associated with both a language and a specific country or region.
An example: "de" (neutral culture) is for the German language, but "de-AT" (specific culture) is for the German language as it is spoken in Austria. In the .NET Framework 1.x, neutral cultures cannot be used for formatting.
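The distinction can be inspected in code. The following C# console sketch (hypothetical class name) checks the IsNeutralCulture property and uses CultureInfo.CreateSpecificCulture to map a neutral culture to a default specific one. Which specific culture comes back for a given neutral culture is decided by the framework, so the sketch does not assume a particular name.

```csharp
using System;
using System.Globalization;

public static class NeutralCultureDemo
{
    public static void Main()
    {
        CultureInfo neutral = new CultureInfo("de");     // language only
        CultureInfo specific = new CultureInfo("de-AT"); // language and region

        Console.WriteLine(neutral.IsNeutralCulture);  // True
        Console.WriteLine(specific.IsNeutralCulture); // False

        // Map the neutral culture to a specific culture that is safe to
        // use for formatting dates and numbers.
        CultureInfo mapped = CultureInfo.CreateSpecificCulture("de");
        Console.WriteLine(mapped.IsNeutralCulture);   // False
    }
}
```

This is why code that receives a bare language code from the browser often passes it through CreateSpecificCulture before assigning it to CurrentCulture.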
Current thread and culture awareness of .NET Framework classes
All classes and methods in the .NET Framework library whose output we would expect to be culture-dependent have two built-in behaviors:
- They let us pass a culture as an optional argument so that the output is based on that culture.
- If the culture is omitted (as it usually is), they read the Thread.CurrentThread.CurrentCulture property and work according to that.
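' Requires Imports System.Globalization and Imports System.Threading.
' Make the current thread format and parse values as Austrian German.
Dim ci As CultureInfo
ci = New CultureInfo("de-AT")
Thread.CurrentThread.CurrentCulture = ci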
The framework ensures that the CurrentCulture property is always initialized, in the following order of precedence:
- Whatever it is set to programmatically.
- If it is not explicitly set by the programmer, the value is taken from the configuration files (<globalization> tag).
- If the value is missing there, it is the culture that the Web server is running under. This is usually the culture that corresponds to the language of the operating system.
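Both behaviors, the explicit culture argument and the fallback to the thread's CurrentCulture, can be sketched in a few lines of C#. The class name is hypothetical, and de-DE and en-US are used as sample cultures because their number formats are well known; any installed specific culture works the same way.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class CurrentCultureDemo
{
    public static void Main()
    {
        double amount = 1234.5;

        // 1. The culture is passed explicitly as an argument.
        Console.WriteLine(amount.ToString("N1", new CultureInfo("en-US"))); // 1,234.5
        Console.WriteLine(amount.ToString("N1", new CultureInfo("de-DE"))); // 1.234,5

        // 2. No culture is passed: the method reads the current thread's culture.
        Thread.CurrentThread.CurrentCulture = new CultureInfo("de-DE");
        Console.WriteLine(amount.ToString("N1"));                           // 1.234,5
    }
}
```

In a Web application, the second form means that one assignment early in the request changes the output of every culture-aware call that follows on that thread.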
Resource files
All .resx files, .resources files, and files that have the Build Action attribute set to Embedded Resource that are added to an ASP.NET project in Visual Studio .NET are automatically compiled and embedded within the application assembly as part of its manifest. This can also be done manually by using the Resource File Generator (Resgen.exe) utility from a Visual Studio .NET command prompt. For more information, visit the following MSDN Web site:
Satellite assemblies
Satellite assemblies can be used in an ASP.NET project when the following are true:
- All the user-interface elements in all .aspx files have id and runat=server attributes.
- We create a separate .resx file for each culture that we want our application to support.
- We decide on a common first name for all these files, for example 'Strings'.
- We name the separate .resx files by using the convention commonfirstname.languagecode-regioncode.resx (for example: Strings.de-AT.resx, Strings.en-GB.resx).
- We have a resource file named commonfirstname.resx (Strings.resx) that contains all the strings as we want them displayed in the default case.
- We write code to detect the user's culture and set the Thread.CurrentThread.CurrentUICulture property to match it.
- We write code to load the resources by using the ResourceManager class.
- We write code to extract strings from the loaded object and assign them to user-interface elements.
Difference between CurrentCulture and CurrentUICulture
While the methods of classes in the System.Globalization namespace depend on the Thread.CurrentThread.CurrentCulture property to give their output, the ResourceManager class that loads the resource assembly depends on the Thread.CurrentThread.CurrentUICulture property to load the appropriate satellite assembly. The following is an example of C# code:
using System.Globalization;
using System.Resources;
using System.Threading;

// Member of the Web Form class.
protected ResourceManager gStrings = new ResourceManager("MyGlobalizationTestProjectName.strings", typeof(MyTestWebFormName).Assembly);

private void Page_Load(object sender, System.EventArgs e)
{
    // Get the user's preferred language. UserLanguages is a string array
    // (null if the browser sent no Accept-Language header), so take the first entry.
    string sLang = Request.UserLanguages[0];
    // Set the thread's culture for formatting, comparisons, etc.
    Thread.CurrentThread.CurrentCulture = CultureInfo.CreateSpecificCulture(sLang);
    // Set the thread's UICulture to load resources
    // from satellite assembly.
    Thread.CurrentThread.CurrentUICulture = new CultureInfo(sLang);
    // Get strings from resource file and assign to UI elements.
    head1.InnerHtml = gStrings.GetString("satellite.head1");
    p1.InnerHtml = gStrings.GetString("satellite.p1");
    sp1.InnerHtml = gStrings.GetString("satellite.sp1");
    sp2.InnerHtml = gStrings.GetString("satellite.sp2");
    butOK.Text = gStrings.GetString("satellite.butOK");
    butCancel.Value = gStrings.GetString("satellite.butCancel");
}
Order in which ASP.NET selects satellite assemblies
When you have set the thread's CurrentUICulture, ASP.NET automatically selects the resources that match, in the following order:
- If a satellite assembly is found with a matching culture, the resources from that assembly are used.
- If a satellite assembly is found with a neutral culture that matches the CurrentUICulture, resources from that assembly are used.
- If a match is not found for the CurrentUICulture, the fallback resources stored in the executable assembly are used.
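The lookup order above mirrors the parent chain of the CultureInfo class. The following C# console sketch (hypothetical class and method names) walks that chain to list the cultures that would be tried for a given UI culture. It only illustrates the order; it does not load any assemblies, and the final entry is a label for the fallback resources rather than a culture name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ResourceFallbackDemo
{
    // Lists the cultures that are tried for a UI culture, most specific first,
    // ending with a label for the fallback resources in the main assembly.
    public static List<string> FallbackChain(string cultureName)
    {
        List<string> chain = new List<string>();
        CultureInfo culture = new CultureInfo(cultureName);
        // The invariant culture has an empty name and ends the parent chain.
        while (culture.Name.Length > 0)
        {
            chain.Add(culture.Name);
            culture = culture.Parent;
        }
        chain.Add("(fallback resources)");
        return chain;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" -> ", FallbackChain("de-AT").ToArray()));
        // de-AT -> de -> (fallback resources)
    }
}
```

For the file names used earlier, this corresponds to Strings.de-AT.resx, then a Strings.de.resx if one exists, then Strings.resx.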
Manually creating satellite assemblies
The approach described so far relies on Visual Studio .NET to create the satellite assemblies. However, Visual Studio .NET doesn't strong-name satellite assemblies by default. If you want to change options such as this, you need to create the satellite assemblies manually. For more information, visit the following MSDN Web site:
What's up with ASP.NET 2.0 on the globalization front?
ASP.NET 2.0 is not yet in widespread use, so the kinds of globalization issues that we will see with it are still some distance ahead. However, it is worth taking a brief look at the direction in which the globalization methodology for Web applications is headed.
Globalization support in ASP.NET 2.0 has undergone a radical change and Web developers have been given the ability to make the localization of Web applications as easy as it is for Windows-based applications. The following is a list of features that are the foundation of globalization methodology in ASP.NET 2.0:
Strongly-typed resources: At the core of the .NET Framework 2.0 release is support for strongly-typed resources, which provides developers with IntelliSense and simplifies the code required to access resources at runtime.
Managed Resource Editor: Visual Studio 2005 includes a new resource editor with better support for creating and managing resource entries, including strings, images, external files, and other complex types.
Resource generation for Web Forms: Windows Forms developers have already enjoyed the benefits of automatic internationalization. Visual Studio 2005 now supports rapid internationalization by automatically generating resources for Web Forms, user controls, and master pages.
Improved runtime support: ResourceManager instances are managed by the runtime and are readily accessible to server code through more accessible programming interfaces.
Localization expressions: Modern declarative expressions for Web pages support mapping resource entries to control properties, HTML properties, or static content regions. These expressions are also extensible, providing additional ways to control the process of attaching localized content to HTML output.
Automatic culture selection: Culture selection for each Web request can be linked automatically to browser preferences.
Resource provider model: A new resource provider model allows developers to host resources in alternate data sources such as flat files and database tables, while the programming model for accessing those resources remains consistent.
For more information about globalization methodology in ASP.NET 2.0, visit the following MSDN Web site:
Conclusion
That's all for now about globalization issues in ASP and ASP.NET. I hope this article will help a few customers troubleshoot their globalization issues in ASP and ASP.NET before they opt to contact Microsoft Support. I will end with the following thought:
"Wherever and whenever you are developing, think about the millions of people you can empower around the world. Make your solutions World-Ready! Microsoft tools and technologies make internationalization easier."
We will catch up again next month with another interesting topic.
Thank you for your time.
Glossary
ANSI: Stands for American National Standards Institute. In this context, it represents a specific codepage for a specific language/character set. Most often refers to the English codepage (windows-1252).
ASCII: A 1-byte (7-bit) encoding scheme. Only the characters in the range 0-127 are standardized. The range 128-255 consists of extensions to ASCII that are not part of the standard. An example of this is the difference between the upper range of the OEM ASCII chart and the VB ASCII chart.
CharSet: A setting used mostly by Internet Explorer and other browsers that tells the browser how to interpret the character data. Example: Response.CharSet = "iso-8859-1".
Codepage: A conversion table that specifies how characters are encoded (usually used for servers).
Globalization: The process of designing and creating an application so that the unique requirements of a culture, region, or nation and its linguistic needs can be met. In other words, designing an application so that it can be localized later is globalization.
Locale/Culture: Language- and region-specific formats and preferences, including date and calendar formats, time formats, currency formats, casing, sorting and string comparison, address formats, phone number formats, paper sizes, units of measure, writing direction, and so on.
LocaleID (LCID): A DWORD value that specifies the language identifier and sorting ID. It can be used to specify the region whose formats (for example, date/time) should be applied.
Localizability: The ability of an application to present content for the requested language/locale.
Localization: The process of translating a user interface into specific languages and/or locales.
Multibyte character set: A character set in which the characters are composed of two or more bytes, such as Japanese. UTF-8 also falls under this category. (Unicode technically is in this category, but in Windows, it has its own category.)
Unicode: A 2-byte encoding scheme. Windows uses Unicode internally. Any APIs specifically for Unicode are signified by a "W" on the end of the function name. Also known as wide char; cannot be directly used from web applications.
UTF-8: A character encoding in which a character is represented by 1 to 4 bytes (the original design allowed up to 6). Characters that Windows stores as a single 2-byte Unicode value take 1 to 3 bytes. Not supported under NT4 for web applications.
For more information, click the following article number to view the article in the Microsoft Knowledge Base:
Wide character set: An alias for Unicode. Also known as DBCS (double byte character set), UCS-2, UTF-16.
Article ID: 893663 - Last Review: 19 Jul 2012 - Revision: 1