The ESP Advanced Linguistics Guide introduces dictionary alignment and provides the basic steps to align spellcheck dictionaries with indexed content. This article builds on the content provided in the guide.
As noted in the ESP Advanced Linguistics Guide, dictionary alignment has two phases. The first phase is gathering the terms and their frequencies from the index. The second phase is adding and aligning the terms to a dictionary.
The createspelldata tool is used in the first phase of alignment. This tool creates frequency lists based on the index content of an ESP installation. The resulting frequency list is used in the second phase.
The tunespellcheckfl Dictman command is used in the second phase of alignment. This command modifies the contents of a specified dictionary based on the terms and frequencies gathered by createspelldata.
A summary of each of the options for createspelldata can be found by running createspelldata --help from the %FASTSEARCH%\bin\ directory on the command line. A summary of each of the options for tunespellcheckfl can be found in the Dictman tool by using the question mark (?) command. The ESP Advanced Linguistics Guide explains some of these parameters for each tool in further detail. This article discusses some of the most common uses.
A very basic usage of the createspelldata command is as follows:
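For example, the following gathers English terms from the index and writes the resulting frequency list to the cache (this matches the "typical use" shown in the tool's help output later in this article):

```
createspelldata -l en -a
```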
It is good practice to clear the cache before running createspelldata, because frequency information from past runs will remain in the cache. Not clearing the cache can cause undesired terms and frequencies to appear in the frequency list.
To clear the cache, one must run cmctrl remove <frequency list>. For example:
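For an English frequency list, whose default name is spelltuning_en (as noted in the tunespellcheckfl section later in this article):

```
cmctrl remove spelltuning_en
```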
The createspelldata tool gathers terms from each bconbody field within the fixml. A separate bconbody field will be present in the fixml for any composite field that contains a reference to the body field (as defined in the index-profile).
For example, if the body field contains the word “test” and the body field is present in two composite fields, there will be two bconbody fields in the fixml, each containing the word “test”. In this case, createspelldata counts the word “test” twice in this document and gives it a frequency of 2.
There are two ways to review the contents of the frequency list created by createspelldata.
1. The createspelldata command can be run without the -a parameter, so that the frequency list is written to a file instead of to the cache. An example is as follows:
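A sketch of this invocation for English content (simply the typical usage with the -a parameter omitted):

```
createspelldata -l en
```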
2. When the -a parameter is used, the frequency list is written to the cache and its contents can be reviewed from within the Dictman tool, which produces output such as the following:
querying for '', returning max. 50 entries starting from offset 0
car X 14
truck X 17
boat X 18
returned 3 entries
It is possible to use createspelldata to gather terms from fields other than the body field. To gather data from another field (for example, title), the --word parameter must be used. The syntax would look like the following:
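A sketch of the syntax, using the bsrctitle:sField attribute shown in the tool's help output as the fixml representation of the title field; the actual attribute and tag names should be verified against the fixml for your index profile:

```
createspelldata -l en -a --word=bsrctitle:sField
```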
Gathering Terms from a Specified Collection
It is very common to require a unique dictionary for each collection. With the createspelldata tool, it is possible to gather terms from a specified collection. To gather data from only a single collection (for example, MyTestCollection), the --collection or -c parameter must be used. The syntax would look like the following:
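For example, to gather English terms from only the MyTestCollection collection and write them to the cache:

```
createspelldata -l en -a -c MyTestCollection
```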
The createspelldata command can gather phrases from the index by passing the --phrase option; however, the --phrase option does have some limitations.
First, the phrase option requires the user to specify a multi-valued field and its separator as defined in the index-profile. For example, in a default index-profile, the “companies” field is defined as a multi-value field with a semicolon as a separator:
<field name="companies" separator=";"/>
The syntax for the --phrase option is as follows:
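For example, using the companies field defined above (the default separator is ';', which matches the index-profile definition, so it does not need to be specified):

```
createspelldata -l en --phrase=companies
```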
Notice the command examples above do NOT include the -a parameter, which writes the frequency list to the cache. When the -a parameter is omitted, the frequency list is written to the %FASTSEARCH%\data\spell\ directory. The file will have a name such as <hostname>.spelltuning.P.<lang>, for example:
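For example, for English content on a host named indexer01 (hostname assumed for illustration):

```
indexer01.spelltuning.P.en
```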
The frequency list written to this file can still be very useful, even though it cannot be aligned with a dictionary via the tunespellcheckfl Dictman command.
This frequency list still contains all of the phrases from the specified multi-value field gathered by createspelldata; however, one would not want to simply import this frequency list into a dictionary, because it contains frequencies instead of dictionary weights. Term frequencies and dictionary weights are very different; these two values are effectively inverses of each other. When tunespellcheckfl is run against a single word dictionary, terms with high frequencies get low weight values and terms with low frequencies get high weights or do not get added to the dictionary at all (depending on the tunespellcheckfl thresholds).
If it is required to build a phrase spellcheck dictionary from the data gathered by createspelldata, one could parse the frequency list created by the --phrase option and convert the frequency values into weights, which can then be imported into the dictionary. ESP does not provide an out-of-the-box solution for this process; such an implementation would require a custom script.
As an example, the default propername/phrase dictionary has weights that range from 1-10. A custom script could potentially be implemented to assign terms with high frequency values a low weight and terms with low frequency values a high weight, relative to the 1-10 range. A custom range may also be used, if desired. The resulting file could then be imported into a dictionary within Dictman; however, such a script would be considered a custom implementation and would not be supported by Microsoft.
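Such a frequency-to-weight conversion could be sketched as follows. This is a hypothetical illustration only: the function name and the linear inverse mapping onto the 1-10 range are assumptions, and the exact file format Dictman expects for import is not shown here.

```python
# Hypothetical sketch: convert createspelldata frequencies into dictionary
# weights in the 1-10 range used by the default propername/phrase dictionary.
# High frequencies map to low weights and low frequencies to high weights.

def frequencies_to_weights(freqs, min_weight=1, max_weight=10):
    """Map each term's frequency to an inverse weight in [min_weight, max_weight]."""
    lo, hi = min(freqs.values()), max(freqs.values())
    span = hi - lo or 1  # avoid division by zero when all frequencies are equal
    weights = {}
    for term, freq in freqs.items():
        # Linear inverse mapping: the most frequent term gets min_weight,
        # the least frequent term gets max_weight.
        scaled = (hi - freq) / span  # 1.0 for rarest term, 0.0 for most frequent
        weights[term] = round(min_weight + scaled * (max_weight - min_weight))
    return weights

# Example with the sample frequencies shown earlier in this article
freqs = {"car": 14, "truck": 17, "boat": 18}
print(frequencies_to_weights(freqs))  # → {'car': 10, 'truck': 3, 'boat': 1}
```

The resulting term/weight pairs would then need to be written out in whatever import format the target Dictman dictionary requires.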
Multiple Indexer Columns
When an ESP environment consists of multiple indexer columns, createspelldata must be run on each column, one at a time. When createspelldata is run on an indexer node, it must be able to connect to the cachemanager (usually found on the admin node), because it writes the data to memory on the cachemanager node.
The following steps explain how to:
A. Use the createspelldata tool to gather terms from a single field (title) within a single collection (MyTestCollection1) on a two column ESP environment (indexer01 and indexer02).
B. Create a new dictionary (MyDictionary) in which to add these terms.
C. Tune the dictionary with the terms gathered from createspelldata.
D. Deploy the dictionary to:
a. The collection view.
b. The Search Business Center search profile (MyView).
c. The global view (espsystemwebcluster).
1. Clear the cache data from all previous createspelldata runs:
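For example, on the admin node (spelltuning_en is the English frequency-list name noted later in this article):

```
cmctrl remove spelltuning_en
```

Then, on each indexer node in turn (indexer01, then indexer02), gather the title terms from MyTestCollection1 into the cache. The bsrctitle:sField attribute is taken from the tool's help output and should be verified against the fixml for the actual index profile:

```
createspelldata -l en -a -c MyTestCollection1 --word=bsrctitle:sField
```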
4. Open dictman and create a custom dictionary:
command (? for help)>create SPW MyDictionary en
Tune the dictionary with the terms gathered from createspelldata:
5. Run tunespellcheckfl on the admin node to add the terms gathered by createspelldata to the custom dictionary:
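For example, at the Dictman prompt (a threshold of 1 admits every gathered term, as explained in the tunespellcheckfl section later in this article):

```
command (? for help)>tunespellcheckfl SPW MyDictionary spelltuning_en 1
```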
The tuned dictionary can then be reviewed from within Dictman; the output will look similar to the following:
querying for 'a', returning max. 50 entries starting from offset 0
1. Export the view for the collection to an XML file:
2. In the exported XML file, locate the didyoumean element and update it to reference the custom dictionary:
<didyoumean mode="suggest" >
3. Save the changes to the XML file.
4. Undeploy the collection views:
1. Navigate to the “FAST Home” page and click the Search Business Center link for the MyView search profile.
2. Click the "Search Profile Settings" tab.
3. Click the "Query Handling" sub tab.
4. In the "'Did You Mean?' Dictionaries" section, locate the "Spell Correction for Words" option and click the "Remove" button to remove the default dictionary.
5. Select the custom dictionary (MyDictionary) from the list of available dictionaries.
6. Click the "Add" button.
7. Click the "Publishing" tab at the top.
8. Click the "Search Profile Settings & Synonyms" sub tab.
9. Click the "Publish Search Profile" button to publish the changes.
Deploy the dictionary to the global view (espsystemwebcluster):
1. Stop the qrservers by running the following on each qrserver node (WARNING: This will bring down search on the node where the qrserver is stopped):
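Assuming the standard nctrl node-controller tool is used to manage ESP processes:

```
nctrl stop qrserver
```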
3. Clear the qrserver cache on each qrserver node (WARNING: This will remove all view information until the views are deployed):
Manually rename the %FASTSEARCH%\var\qrserver\ directory.
4. Start the qrservers by running the following on each node:
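Again assuming the standard nctrl node-controller tool:

```
nctrl start qrserver
```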
In order to update existing collection-based views, follow these steps (NOTE: This should ONLY be done for collection-based views):
1. Undeploy the collection views:
Usage of createspelldata
-h, --help
Show this help message and exit
-d DIRECTORY, --directory=DIRECTORY
Directory with fixml content.
-p PREFIX, --prefix=PREFIX
Prefix of spelltuning files.
Prune caches to reduce memory usage.
Separator in CSV spelltuning files.
Hostname for building significant filenames
-o OUTDIR, --out-dir=OUTDIR
Directory to use for merge operations and final output. Should not be too small. Will be created if it doesn't exist.
Store the spell data to a central storage instance. Thus it can be used from the dictman command line tool.
-c COLLECTION, --collection=COLLECTION
The collection to process. Several collections can be processed by specifying this switch several times.
The attribute to collect words from. --word=name extracts names from the default tag <context>. Example: --word=bconbody
--word=name:tag extracts names from the given tag. Example: --word=bsrctitle:sField
Several word fields can be processed by specifying this switch several times.
The attribute to collect phrases (propernames) from. Default separator is ';'.
--phrase=name extracts the default tag <sField> with the given name. Example: --phrase=bsrcartist
--phrase=name:tag extracts from the given tag with name.
--phrase=name:tag:separator extracts from the given tag with name and separator.
Several phrase fields can be processed by specifying this switch several times.
Cascades phrase sources into different buckets according to the number of words: .P.1. with one word, .P.2. with two words, .P.3 for the rest.
-l LANGUAGE, --lang=LANGUAGE
The language to process. Several languages can be processed by specifying this switch several times.
Maximum length of words that are accepted. Default is not used
Minimum length of words that are accepted. Default is not used
Filters abnormal words
Add the regular expression REGEX as rejector: words that match are omitted.
Prints additional statements for debugging purposes
Limiting the read fixml files for test purposes
Uses a disc-based data structure for handling mass data
Typical use of this command:
createspelldata -l en -a
Usage of tunespellcheckfl
tunespellcheckfl <target type> <name> <freq. list name> <threshold(abs)>
<target type>
This denotes the dictionary type being used. For single word spellcheck dictionaries, the type will be “SPW” (without quotes). SPW stands for “spellcheck word”. This tunes a spellcheck dictionary.
<name>
This is the name of the dictionary being tuned. For example, if using the default English dictionary, the name will be “en_spell_iso8859_1” (without quotes).
<freq. list name>
This is the name of the frequency list added to memory by the createspelldata command. For English, the value will be “spelltuning_en” (without quotes).
<threshold(abs)>
This is the threshold of the values to add from the frequency list. Any value below the threshold will not be considered. For example, if the index contains the word “car” three times and a threshold value of 4 is used, the term will not be considered during tuning.
Setting the threshold to 1 will allow all terms gathered by createspelldata to be tuned into the dictionary.
Typical use of this command:
tunespellcheckfl SPW en_spell_iso8859_1 spelltuning_en 4
Article ID: 2733531 - Last Review: 1 Nov 2012 - Revision: 1