How to utilize the spelltuner to add or boost terms for spellchecking
Microsoft FAST Search Server 2010 for SharePoint uses an automatic Spell Check Tuner (spelltuner) to adapt spellcheck results to indexed content. The only source used for spelltuning is the content that is fed through the document processing pipeline. The spelltuning process keeps a running tally of the terms fed through that pipeline. During spelltuning, this frequency data is used to add terms to the dictionary or to modify the internal spellcheck weight values assigned to terms that already exist in the dictionary. These internal weight values are used to calculate how relevant a term is during searching.
This approach has a limitation when a very important term either does not appear in the index at all or appears only infrequently. In such cases, the term is either not added to the spellcheck dictionary or receives a poor weighting, and it is not spellchecked as expected.
This article explains how to add a term to the spellcheck dictionary and how to boost the relevancy of a term in the spellcheck dictionary.
Before changing the term frequency list, first inspect its current contents to get an idea of how frequently terms show up in the index and what constitutes a 'rare' term.
The existing frequency list can be found in %FASTSEARCH%\data\spelltuner\en_frequencies_txt.gz. This is a gzipped file, which can be extracted with a number of third-party tools. When fully extracted, the .gz file contains a directory structure that leads to a .part file indicating the language (for example, en_frequencies_txt.gz.part). The frequency list has the format <term> <frequency>, one term per line.
Once the frequency list has been extracted from the .gz file, examine the frequencies of the other terms. When trying to add or boost a term, one would want to give the term a high frequency value, relative to the other terms in the list.
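As a rough aid for picking a frequency, the extracted list can be summarized with a short script. The following is a minimal sketch for a standalone Python 3 installation (not the cobra interpreter). The function names and the sample lines are hypothetical and are not real index data. It assumes each line ends with the term followed by its frequency, and it tolerates an optional leading language code.

```python
def parse_frequencies(lines):
    """Return (term, frequency) pairs from frequency-list lines.

    The last whitespace-separated field is read as the frequency and the
    field before it as the term, so a leading language code (an assumption)
    is ignored if present.
    """
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[-1].isdigit():
            continue  # skip blank or malformed lines
        entries.append((fields[-2], int(fields[-1])))
    return entries


def summarize(entries, top=10):
    """Return the highest-frequency entries and the (upper) median frequency."""
    ranked = sorted(entries, key=lambda e: e[1], reverse=True)
    freqs = sorted(freq for _, freq in entries)
    median = freqs[len(freqs) // 2] if freqs else 0
    return ranked[:top], median


# Illustrative sample only; in practice, pass the lines of the extracted .part file.
sample_lines = ["search 5000", "sharepoint 1200", "index 40", "en myterm 3", "", "garbage line"]
top_terms, median = summarize(parse_frequencies(sample_lines), top=2)
print("median frequency:", median)
for term, freq in top_terms:
    print(term, freq)
```

A term whose frequency sits far below the median is a likely candidate for boosting. The highest values in the list give a sense of what a "high" frequency looks like in a given index.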
Once the desired frequency is determined, use the following steps to modify a term in the dictionary:
1. The new entry must be written to a file with the same structure as the frequency list's .gz file. Do NOT alter the current list. Instead, make a copy of it first. Using PowerShell, run:
PS C:\FASTSearch\bin> Copy-Item "C:\FASTSearch\data\spelltuner\en_frequencies_txt.gz" "C:\FASTSearch\data\spelltuner\en_frequencies_txt2.gz"
2. Using PowerShell, run the Python interpreter, cobra, from the %FASTSEARCH%\bin\ directory, along with a simple script to write the desired term (in this example “myterm”) to the new frequency list:
PS C:\FASTSearch\bin> cobra
Cobra fs14/281 (2010-01-08 11:09:33)
>>> import os, os.path
>>> import gzip
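>>> fname = r"C:\FASTSearch\data\spelltuner\en_frequencies_txt2.gz" #NOTE: raw string, so the backslashes are kept literally
>>> of = gzip.open(fname, 'wb') #NOTE: 'wb' overwrites the copy, so the file will contain only the line written below
>>> of.write("en myterm 999\n") #NOTE: the syntax here is: ("<language> <term to add/boost> <new frequency>")
>>> of.close() #NOTE: closing the file flushes the data and finalizes the gzip stream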
3. Using PowerShell, copy the updated frequency list to the resourcestore (this is where the spelltuner will recognize updates):
PS C:\FASTSearch\bin> Copy-Item "C:\FASTSearch\data\spelltuner\en_frequencies_txt2.gz" "C:\FASTSearch\components\resourcestore\dictionaries\spelltuner\en_frequencies_txt2.gz"
4. Run the spelltuner. The spelltuner can be forced to run immediately with the -f flag. This step will also require changing the threshold (--wordcount-threshold) and the max word count (-m) so the spelltuner will pick up the simple, single change. It is also recommended to turn on debugging (-l debug):
PS C:\FASTSearch\bin> spelltuner --wordcount-threshold 1 -m -1 -f -l debug
[2011-08-05 10:35:45.533] DEBUG systemmsg Reading 'en' dictionary
[2011-08-05 10:35:47.034] DEBUG systemmsg Merging 'en' dictionary
[2011-08-05 10:35:47.081] DEBUG systemmsg Saving 'en' dictionary
[2011-08-05 10:35:47.127] DEBUG systemmsg Converting ...
[2011-08-05 10:35:47.143] DEBUG systemmsg Sorting ...
[2011-08-05 10:35:47.143] DEBUG systemmsg Inserting ...
[2011-08-05 10:35:47.159] DEBUG systemmsg Saving ...
[2011-08-05 10:35:47.409] DEBUG systemmsg Converting ...
[2011-08-05 10:35:47.549] DEBUG systemmsg Sorting ...
[2011-08-05 10:35:47.628] DEBUG systemmsg Inserting ...
[2011-08-05 10:35:47.971] DEBUG systemmsg Saving ...
[2011-08-05 10:35:49.472] DEBUG systemmsg Got 'searchservice::configuration
' list from nameserver: [('FS14.contoso.com', 13390, 1312534907000000004L, 'fds/
5. Review the spelltuner.log file located in the %FASTSEARCH%\var\log\spelltuner\ directory. The log contains entries indicating that tuning has started, information about the dictionary updates, and when tuning has completed:
[2011-08-05 10:35:38.391] INFO systemmsg Running forced update.
[2011-08-05 10:35:45.236] VERBOSE systemmsg Got 1 File(s)
[2011-08-05 10:35:45.439] VERBOSE systemmsg Persisted to C:\FASTSearch\data\spelltuner\en_frequencies_txt.gz (update=1)
[2011-08-05 10:35:45.439] VERBOSE systemmsg Updating dictionaries for language 'en'
[2011-08-05 10:35:45.565] VERBOSE systemmsg No spellcheck dictionary for language 'en', will skip in update.
[2011-08-05 10:35:47.127] VERBOSE systemmsg Saving 'SPW' dictionary with 4942 entries
[2011-08-05 10:35:47.393] VERBOSE systemmsg Saved 'en' SPW dictionary to C:\FASTSearch\tmp\tmpv66v-0\en_spell_iso8859_1.xml.gz
[2011-08-05 10:35:47.409] VERBOSE systemmsg Uploaded 'en' SPW dictionary to DICTIONARIES/SPELLCHECK/en_spell_iso8859_1.xml.gz
[2011-08-05 10:35:47.409] VERBOSE systemmsg Saving 'EXC' dictionary with 57455 entries
[2011-08-05 10:35:49.425] VERBOSE systemmsg Saved 'en' EXC dictionary to C:\FASTSearch\tmp\tmpv66v-0\en_exceptions_iso8859_1.xml.gz
[2011-08-05 10:35:49.472] VERBOSE systemmsg Uploaded 'en' EXC dictionary to DICTIONARIES/SPELLCHECK/en_exceptions_iso8859_1.xml.gz
[2011-08-05 10:35:49.472] VERBOSE systemmsg Updated DYM for the languages ['en'] (accumulated data for 1 languages).
[2011-08-05 10:35:49.472] VERBOSE systemmsg Trying to reconfigure qrserver ...
[2011-08-05 10:35:52.191] VERBOSE systemmsg Success.
[2011-08-05 10:35:52.191] INFO systemmsg Updated 1 dictionaries, but activation of the new dictionaries might be delayed.
[2011-08-05 10:35:52.191] INFO systemmsg Forced update completed.
6. Test the change by running a query with a misspelling of the new term, such as "mytermm". The spell suggestion should now reflect the added or boosted term. If the spell suggestion still does not produce the desired results, repeat steps 1-6, giving a higher frequency number in step 2: of.write("en myterm 999\n").
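The file-writing part of step 2 and the log review in step 5 can also be scripted. The following is a minimal sketch for a standalone Python 3 installation (not the cobra interpreter). All function names are hypothetical, UTF-8 encoding is an assumption, and the demo writes to a temporary directory rather than to the FAST Search installation. The log check only looks for the "Forced update completed." message shown in the log excerpt above.

```python
import gzip
import os
import tempfile


def format_entry(language, term, frequency):
    """Build one line in the '<language> <term> <frequency>' syntax from step 2."""
    if frequency < 1 or not term or any(ch.isspace() for ch in term):
        raise ValueError("term must be a single token and frequency must be positive")
    return "%s %s %d\n" % (language, term, frequency)


def write_frequency_list(path, entries):
    """Write entries to a new gzip file; leaving the 'with' block closes it."""
    with gzip.open(path, "wb") as out:
        for language, term, frequency in entries:
            # UTF-8 is an assumption; the article does not state the encoding.
            out.write(format_entry(language, term, frequency).encode("utf-8"))


def read_frequency_list(path):
    """Read the gzip file back so its content can be checked before copying it."""
    with gzip.open(path, "rb") as src:
        return src.read().decode("utf-8").splitlines()


def forced_update_completed(log_lines):
    """Return True if any log line reports that the forced update finished."""
    return any("Forced update completed." in line for line in log_lines)


# Demo in a temporary directory (hypothetical location, not the real data folder).
demo_path = os.path.join(tempfile.mkdtemp(), "en_frequencies_txt2.gz")
write_frequency_list(demo_path, [("en", "myterm", 999)])
print(read_frequency_list(demo_path))
```

Reading the file back before copying it to the resourcestore catches the most common mistake in the interactive version, which is leaving the gzip file unclosed so that it is empty or truncated.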
Article ID: 2592062 - Last Review: 08/24/2011 16:45:00 - Revision: 5.0
Microsoft FAST Search Server 2010 for SharePoint