org.apache.lucene.index.memory

Class SynonymMap


public class SynonymMap
extends Object

Loads the WordNet prolog file wn_s.pl into a thread-safe main-memory hash map that can be used for fast high-frequency lookups of synonyms for any given (lowercase) word string.

The synonym relation is symmetric: if B is a synonym for A (A -> B), then A is also a synonym for B (B -> A). It is not necessarily transitive: A -> B and B -> C do not imply A -> C.
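The symmetric but non-transitive behaviour follows naturally if words count as synonyms only when they share a synset. The sketch below illustrates the idea and is not this class's actual implementation. It assumes facts of the form s(synset_id,w_num,'word',ss_type,sense_number,tag_count). as found in wn_s.pl. The class name SynsetDemo, the toy facts, and their synset IDs are invented for illustration, and Prolog quote escaping is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
class SynsetDemo {

    /** Maps each word to the sorted set of words that share at least one synset with it. */
    static Map<String, TreeSet<String>> build(List<String> facts) {
        // Step 1: group words by synset id (the first argument of each fact).
        Map<String, List<String>> groups = new HashMap<>();
        for (String fact : facts) {
            if (!fact.startsWith("s(")) continue;
            int comma = fact.indexOf(',');
            int open = fact.indexOf('\'');
            int close = fact.lastIndexOf('\'');
            if (comma < 0 || open < 0 || close <= open) continue; // skip malformed lines
            String id = fact.substring(2, comma);
            String word = fact.substring(open + 1, close).toLowerCase(Locale.ROOT);
            groups.computeIfAbsent(id, k -> new ArrayList<>()).add(word);
        }
        // Step 2: every pair of distinct words within one synset is linked in both directions.
        Map<String, TreeSet<String>> synonyms = new HashMap<>();
        for (List<String> group : groups.values()) {
            for (String a : group) {
                for (String b : group) {
                    if (!a.equals(b)) {
                        synonyms.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
                    }
                }
            }
        }
        return synonyms;
    }

    public static void main(String[] args) {
        // Toy facts with invented synset ids 1 and 2.
        List<String> facts = Arrays.asList(
            "s(1,1,'forest',n,1,0).",
            "s(1,2,'woods',n,1,0).",
            "s(2,1,'forest',n,2,0).",
            "s(2,2,'timberland',n,1,0).");
        Map<String, TreeSet<String>> map = build(facts);
        System.out.println("woods:" + map.get("woods"));   // woods:[forest]
        System.out.println("forest:" + map.get("forest")); // forest:[timberland, woods]
    }
}
```

Here woods -> forest and forest -> timberland both hold, yet timberland is not a synonym of woods, because the two words never appear in the same synset.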

Loading typically takes about 1.5 seconds, so it should be done only once per (server) program execution, using a singleton pattern. Once loaded, a synonym lookup via getSynonyms(String) takes constant time O(1). A loaded default synonym map consumes about 10 MB of main memory. An instance is immutable, hence thread-safe.
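One common way to get load-once behaviour in Java is the initialization-on-demand holder idiom, sketched below. To stay self-contained, the sketch stores a small stand-in Map instead of a real SynonymMap; the class name SynonymsHolder, the LOADS counter, and the load() helper are hypothetical. In a real program, load() would construct the SynonymMap from the wn_s.pl stream.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder, not part of Lucene.
final class SynonymsHolder {

    // Counts how often load() ran; present only to demonstrate the idiom.
    static final AtomicInteger LOADS = new AtomicInteger();

    private SynonymsHolder() {}

    // The JVM initializes this nested class on first access, exactly once,
    // and guarantees that the initialization is thread-safe.
    private static final class Lazy {
        static final Map<String, String[]> INSTANCE = load();
    }

    /** Returns the shared instance, loading it on the first call only. */
    static Map<String, String[]> get() {
        return Lazy.INSTANCE;
    }

    // Stand-in for the expensive load; a real program would build a SynonymMap here.
    private static Map<String, String[]> load() {
        LOADS.incrementAndGet();
        Map<String, String[]> map = new HashMap<>();
        map.put("woods", new String[] { "forest", "wood" });
        return Collections.unmodifiableMap(map);
    }

    public static void main(String[] args) {
        get();
        get();
        System.out.println("loads: " + LOADS.get()); // loads: 1
    }
}
```

Because the shared instance is never modified after loading, callers on any thread can use the result of get() without further synchronization.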

This implementation borrows some ideas from the Lucene Syns2Index demo that Dave Spencer originally contributed to Lucene. Dave's approach involved a persistent Lucene index, which is suitable for occasional lookups or very large synonym tables but is considered unsuitable for high-frequency lookups of medium-size synonym tables.

Example Usage:

 String[] words = new String[] { "hard", "woods", "forest", "wolfish", "xxxx"};
 SynonymMap map = new SynonymMap(new FileInputStream("samples/fulltext/wn_s.pl"));
 for (int i = 0; i < words.length; i++) {
     String[] synonyms = map.getSynonyms(words[i]);
     System.out.println(words[i] + ":" + java.util.Arrays.asList(synonyms).toString());
 }
 
 Example output:
 hard:[arduous, backbreaking, difficult, fermented, firmly, grueling, gruelling, heavily, heavy, intemperately, knockout, laborious, punishing, severe, severely, strong, toilsome, tough]
 woods:[forest, wood]
 forest:[afforest, timber, timberland, wood, woodland, woods]
 wolfish:[edacious, esurient, rapacious, ravening, ravenous, voracious, wolflike]
 xxxx:[]
 
See Also:
prologdb man page, Dave's synonym demo site

Constructor Summary

SynonymMap(InputStream input)
Constructs an instance, loading WordNet synonym data from the given input stream.

Method Summary

protected String
analyze(String word)
Analyzes/transforms the given word on input stream loading.
String[]
getSynonyms(String word)
Returns the synonym set for the given word, sorted ascending.
String
toString()
Returns a String representation of the index data for debugging purposes.

Methods inherited from class java.lang.Object

clone, equals, getClass, finalize, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details

SynonymMap

public SynonymMap(InputStream input)
            throws IOException
Constructs an instance, loading WordNet synonym data from the given input stream. The stream is closed once loading has finished. The words in the stream must be encoded in UTF-8 or a compatible subset such as ASCII.
Parameters:
input - the stream to read from (null indicates an empty synonym map)
Throws:
IOException - if an error occurred while reading the stream.

Method Details

analyze

protected String analyze(String word)
Analyzes/transforms the given word on input stream loading. This default implementation simply lowercases the word. Override this method with a custom stemming algorithm or similar, if desired.
Parameters:
word - the word to analyze
Returns:
the same word, or a different word (or null to indicate that the word should be ignored)
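
The contract of this hook (transform each word at load time, and drop it when the hook returns null) can be sketched with a stand-in base class. WordLoader and LettersOnlyLoader below are hypothetical and only mirror the analyze(String) signature; they are not SynonymMap itself, and lowercasing with Locale.ROOT is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a loader with an overridable analyze hook.
class WordLoader {

    /** Default hook: lowercases the word. */
    protected String analyze(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    /** Applies analyze to every raw word and skips words for which it returns null. */
    final List<String> load(List<String> rawWords) {
        List<String> kept = new ArrayList<>();
        for (String raw : rawWords) {
            String analyzed = analyze(raw);
            if (analyzed != null) {
                kept.add(analyzed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("Hard", "hard_drive", "Wood");
        System.out.println(new WordLoader().load(raw));        // [hard, hard_drive, wood]
        System.out.println(new LettersOnlyLoader().load(raw)); // [hard, wood]
    }
}

// Custom hook: ignore any word containing a non-letter, otherwise defer to the default.
class LettersOnlyLoader extends WordLoader {
    @Override
    protected String analyze(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (!Character.isLetter(word.charAt(i))) {
                return null; // null signals that the word should be ignored
            }
        }
        return super.analyze(word);
    }
}
```

A stemming override would follow the same shape: return the stemmed form instead of the lowercased word, or null to skip the word.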

getSynonyms

public String[] getSynonyms(String word)
Returns the synonym set for the given word, sorted ascending.
Parameters:
word - the word to look up (must be in lowercase).
Returns:
the synonyms; a set of zero or more words, sorted ascending, each word containing lowercase characters that satisfy Character.isLetter().

toString

public String toString()
Returns a String representation of the index data for debugging purposes.
Overrides:
toString in class Object
Returns:
a String representation