Considerations when using the SQL Server Full-Text search engine for the Japanese language

Applies to: SQL Server 2008 StandardSQL Server 2008 DeveloperSQL Server 2008 Enterprise

INTRODUCTION


This article describes the considerations that apply when you use the SQL Server Full-Text Search engine for the Japanese language.

More Information


In the Japanese language, a phrase may consist of two or more words without spaces between those words. In Microsoft SQL Server, when you use the SQL Server Full-Text search engine to perform a prefix search for a Japanese phrase, the Full-Text search engine does not consider the phrase to be a prefix term. Instead, the Full-Text search engine considers the phrase to be word terms. This is because a word is defined as a string of characters without spaces or punctuation. Additionally, the search engine works only in the prefix-matching mode. The search engine does not work in the suffix-matching mode.

For example, you create a table and insert some Japanese phrases by running the following statements in SQL Server:
CREATE TABLE test(c1 int PRIMARY KEY,c2 nvarchar(255))

INSERT test VALUES(1,N'添付テスト')
INSERT test VALUES(2,N'Fw: テスト')
INSERT test VALUES(3,N'KK-Information:テスト')
INSERT test VALUES(4,N'[Q] ポリシーテスト')
INSERT test VALUES(5,N'KK-Information:タイトルフィルタテスト2')
INSERT test VALUES(6,N'テスト')
INSERT test VALUES(7,N'フィルタテスト3')
INSERT test VALUES(8,N'テストフィルタ1')
INSERT test VALUES(9,N'RE: テストメール')
INSERT test VALUES(10,N'テストメール')
INSERT test VALUES(11,N'White Listテスト')
INSERT test VALUES(12,N'フィルタリングテスト')

CREATE FULLTEXT CATALOG test AS DEFAULT;
GO

CREATE FULLTEXT INDEX ON test(c2) KEY INDEX PK__<IndexName>;
GO


Then, you run the following three queries:
Query 1
SELECT * FROM test WHERE CONTAINS(c2, N'テスト')
The result of Query 1 is as follows:
c1c2
--- ----------------------------------------------
2Fw: テスト
3KK-Information:テスト
6テスト
Query 2
SELECT * FROM test WHERE CONTAINS(c2, '"テスト*"')
The result of Query 2 is as follows:
c1c2
--- ----------------------------------------------
2 Fw: テスト
3 KK-Information:テスト
6 テスト
8 テストフィルタ1
9 RE: テストメール
10 テストメール
Query 3
SELECT * FROM test WHERE CONTAINS(c2, '"*テスト*"')
The result of Query 3 is as follows:
c1c2
--- ----------------------------------------------
2 Fw: テスト
3 KK-Information:テスト
6 テスト
8 テストフィルタ1
9 RE: テストメール
10 テストメール













From the results of the queries, you can find that the result of Query 2 is the same as the result of Query 3 because the Full-Text query does not work in the suffix-matching mode. Additionally, “テスト” is a token that differs from “ポリシーテスト” or from “White Listテスト” in the matchings. 

To tokenize phrases, a word breaker for the language family must be used. Work breakers use spaces and other signs to recognize phases. Therefore, some phases cannot be recognized by the word breaker and cannot be searched by using Full-Text engine in the Japanese language. For more information about word breakers, see the “Word Breakers and Stemmers” topic in the “Reference” section.

The best practice to use the Full-Text search engine in the Japanese language is to test the phases to see whether the phrases are affected by the limitation. If a phase consists of words without spaces, you cannot use the Full-Text functionality to search the phrase. Instead, you can use the LIKE keyword together with wildcard characters. However, the performance of the LIKE operation is lower than the performance of the Full-Text searching. You must consider the performance effect for your application.

The following are some sample queries of the LIKE keyword to search for phrases.

Query 4
SELECT * FROM test WHERE c2 like 'テスト%'
The result is as follows:
c1c2
--- ----------------------------------------------
6 テスト
8 テストフィルタ1
10 テストメール
Query 5
SELECT * FROM test WHERE c2 like '%テスト%'
The result is as follows:
c1c2
--- ----------------------------------------------
1 添付テスト
2 Fw: テスト
3 KK-Information:テスト
4 [Q] ポリシーテスト
5 KK-Information:タイトルフィルタテスト2
6 テスト
7 フィルタテスト3
8 テストフィルタ1
9 RE: テストメール
10 テストメール
11 White Listテスト
12 フィルタリングテスト






 
Note
If you use the Full-Text search engine in SQL Server 2008 or later versions, you can find more information about the content of a full-text index by using the following query:
SELECT * FROM sys.dm_fts_index_keywords(db_id('test'), object_id('test'))
GO
The result is as follows:
keyword                                               display_term             column_id   document_count
----------------------------------------------------- ------------------------ ----------- --------------------

0x00660077 fw 2 1
0x0069006E0066006F0072006D006100740069006F006E information 2 2
0x006B006B kk 2 2
0x006C00690073007430C630B930C8 listテスト 2 1
0x00770068006900740065 white 2 1
0x30BF30A430C830EB30D530A330EB30BF30C630B930C80032 タイトルフィルタテスト2 2 1
0x30C630B930C8 テスト 2 3
0x30C630B930C830D530A330EB30BF0031 テストフィルタ1 2 1
0x30C630B930C830E130FC30EB テストメール 2 2
0x30D530A330EB30BF30C630B930C80033 フィルタテスト3 2 1
0x30D530A330EB30BF30EA30F330B030C630B930C8 フィルタリングテスト 2 1
0x30DD30EA30B730FC30C630B930C8 ポリシーテスト 2 1
0x6DFB4ED830C630B930C8 添付テスト 2 1
0xFF END OF FILE 2 12

(14 row(s) affected)
In the sample result, only 3 rows contain the word “テスト.” The Full-Text search engine treats the word “テスト” as a different token from the word ”テストメール.”