Article ID: 2252955 - Last Review: July 7, 2010 - Revision: 2.0

Considerations when using the SQL Server Full-Text search engine for the Japanese language

System TipThis article applies to a different operating system than the one you are using. Article content that may not be relevant to you is disabled.
Expand all | Collapse all

INTRODUCTION

This article describes the considerations that apply when you use the SQL Server Full-Text Search engine for the Japanese language.

MORE INFORMATION

In the Japanese language, a phrase may consist of two or more words without spaces between those words. In Microsoft SQL Server, when you use the SQL Server Full-Text search engine to perform a prefix search for a Japanese phrase, the Full-Text search engine does not consider the phrase to be a prefix term. Instead, the Full-Text search engine considers the phrase to be word terms. This is because a word is defined as a string of characters without spaces or punctuation. Additionally, the search engine works only in the prefix-matching mode. The search engine does not work in the suffix-matching mode.

For example, you create a table and insert some Japanese phrases by running the following statements in SQL Server:
CREATE TABLE test(c1 int PRIMARY KEY,c2 nvarchar(255))

INSERT test VALUES(1,N'添付テスト')
INSERT test VALUES(2,N'Fw: テスト')
INSERT test VALUES(3,N'KK-Information:テスト')
INSERT test VALUES(4,N'[Q] ポリシーテスト')
INSERT test VALUES(5,N'KK-Information:タイトルフィルタテスト2')
INSERT test VALUES(6,N'テスト')
INSERT test VALUES(7,N'フィルタテスト3')
INSERT test VALUES(8,N'テストフィルタ1')
INSERT test VALUES(9,N'RE: テストメール')
INSERT test VALUES(10,N'テストメール')
INSERT test VALUES(11,N'White Listテスト')
INSERT test VALUES(12,N'フィルタリングテスト')

CREATE FULLTEXT CATALOG test AS DEFAULT;
GO

CREATE FULLTEXT INDEX ON test(c2) KEY INDEX PK__<IndexName>;
GO


Then, you run the following three queries:
Query 1
SELECT * FROM test WHERE CONTAINS(c2, N'テスト')
The result of Query 1 is as follows:
c1	c2
---    ----------------------------------------------
2	Fw: テスト
3	KK-Information:テスト
6	テスト
Query 2
SELECT * FROM test WHERE CONTAINS(c2, '"テスト*"')
The result of Query 2 is as follows:
c1	c2
---    ----------------------------------------------
2           Fw: テスト
3           KK-Information:テスト
6           テスト
8           テストフィルタ1
9           RE: テストメール
10          テストメール
Query 3
SELECT * FROM test WHERE CONTAINS(c2, '"*テスト*"')
The result of Query 3 is as follows:
c1	c2
---    ----------------------------------------------
2           Fw: テスト
3           KK-Information:テスト
6           テスト
8           テストフィルタ1
9           RE: テストメール
10          テストメール













From the results of the queries, you can find that the result of Query 2 is the same as the result of Query 3 because the Full-Text query does not work in the suffix-matching mode. Additionally, “テスト” is a token that differs from “ポリシーテスト” or from “White Listテスト” in the matchings. 

To tokenize phrases, a word breaker for the language family must be used. Work breakers use spaces and other signs to recognize phases. Therefore, some phases cannot be recognized by the word breaker and cannot be searched by using Full-Text engine in the Japanese language. For more information about word breakers, see the “Word Breakers and Stemmers” topic in the “Reference” section.

The best practice to use the Full-Text search engine in the Japanese language is to test the phases to see whether the phrases are affected by the limitation. If a phase consists of words without spaces, you cannot use the Full-Text functionality to search the phrase. Instead, you can use the LIKE keyword together with wildcard characters. However, the performance of the LIKE operation is lower than the performance of the Full-Text searching. You must consider the performance effect for your application.

The following are some sample queries of the LIKE keyword to search for phrases.

Query 4
SELECT * FROM test WHERE c2 like 'テスト%'
The result is as follows:
c1	c2
---    ----------------------------------------------
6           テスト
8           テストフィルタ1
10          テストメール
Query 5
SELECT * FROM test WHERE c2 like '%テスト%'
The result is as follows:
c1	c2
---    ----------------------------------------------
1           添付テスト
2           Fw: テスト
3           KK-Information:テスト
4           [Q] ポリシーテスト
5           KK-Information:タイトルフィルタテスト2
6           テスト
7           フィルタテスト3
8           テストフィルタ1
9           RE: テストメール
10          テストメール
11          White Listテスト
12          フィルタリングテスト






 
Note
If you use the Full-Text search engine in SQL Server 2008 or later versions, you can find more information about the content of a full-text index by using the following query:
SELECT * FROM sys.dm_fts_index_keywords(db_id('test'), object_id('test'))
GO
The result is as follows:
keyword                                               display_term             column_id   document_count
----------------------------------------------------- ------------------------ ----------- --------------------

0x00660077                                           fw                       2           1
0x0069006E0066006F0072006D006100740069006F006E       information              2           2
0x006B006B                                           kk                       2           2
0x006C00690073007430C630B930C8                       listテスト               2           1
0x00770068006900740065                               white                    2           1
0x30BF30A430C830EB30D530A330EB30BF30C630B930C80032   タイトルフィルタテスト2  2           1
0x30C630B930C8                                       テスト                   2           3
0x30C630B930C830D530A330EB30BF0031                   テストフィルタ1          2           1
0x30C630B930C830E130FC30EB                           テストメール             2           2
0x30D530A330EB30BF30C630B930C80033                   フィルタテスト3          2           1
0x30D530A330EB30BF30EA30F330B030C630B930C8           フィルタリングテスト     2           1
0x30DD30EA30B730FC30C630B930C8                       ポリシーテスト           2           1
0x6DFB4ED830C630B930C8                               添付テスト               2           1
0xFF                                                 END OF FILE              2           12

(14 row(s) affected)
In the sample result, only 3 rows contain the word “テスト.” The Full-Text search engine treats the word “テスト” as a different token from the word ”テストメール.”  

REFERENCES

For more information about the SQL Server Full-Text search engine, visit the following Microsoft websites:

APPLIES TO
  • Microsoft SQL Server 2008 Standard
  • Microsoft SQL Server 2008 Developer
  • Microsoft SQL Server 2008 Enterprise
  • Microsoft SQL Server 2008 Workgroup
  • Microsoft SQL Server 2005 Standard Edition
  • Microsoft SQL Server 2005 Developer Edition
  • Microsoft SQL Server 2005 Enterprise Edition
  • Microsoft SQL Server 2005 Enterprise X64 Edition
  • Microsoft SQL Server 2008 R2 Standard
  • Microsoft SQL Server 2008 R2 Developer
  • Microsoft SQL Server 2008 R2 Enterprise
  • Microsoft SQL Server 2008 R2 Workgroup
Keywords: 
kbsurveynew kbinfo kbexpertiseadvanced KB2252955