ConfigurableAnalyzerFactory (Blazegraph Database Platform 2.1.5 API)

java.lang.Object
- com.bigdata.search.ConfigurableAnalyzerFactory

All Implemented Interfaces:

IAnalyzerFactory
```
public class ConfigurableAnalyzerFactory
extends Object
implements IAnalyzerFactory
```
This class can be used with the bigdata properties file to specify which Analyzers are used for which languages. Languages are specified by the language tag on RDF literals, which conforms to RFC 5646. Within bigdata, plain literals are assigned to the default locale's language. The bigdata properties map language ranges, as specified by RFC 4647, to classes that extend Analyzer. Supported classes include all the natural-language-specific classes from Lucene, and also:
- PatternAnalyzer
- TermCompletionAnalyzer
- KeywordAnalyzer
- SimpleAnalyzer
- StopAnalyzer
- WhitespaceAnalyzer
- StandardAnalyzer
More generally any subclass of Analyzer that has at least one constructor matching:
- no arguments
- Version
- Version, Set
is usable. If the class has a static method named getDefaultStopSet(), it is assumed to return the default stop word set. Some of the Lucene analyzers store their default stop words elsewhere, and such stop words are also usable by this class. If no stop word set can be found, and the class has both a constructor without stop words and a constructor with stop words, then the former is assumed to use a default stop word set.
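The constructor and stop word rules above can be illustrated with plain reflection. This is a minimal sketch and not the factory's actual code: `Version` and `DemoAnalyzer` are hypothetical stand-ins so that the example compiles without Lucene on the classpath, and the helper names `isUsable` and `defaultStopSet` are invented for illustration.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Optional;
import java.util.Set;

public class AnalyzerReflectionSketch {

    /** Hypothetical stand-in for Lucene's Version class (assumption). */
    public static final class Version {}

    /** Hypothetical analyzer-like class used only to exercise the checks. */
    public static class DemoAnalyzer {
        public DemoAnalyzer(Version v) {}
        public DemoAnalyzer(Version v, Set<?> stopWords) {}
        public static Set<?> getDefaultStopSet() { return Set.of("a", "the"); }
    }

    /** Looks up a public constructor with exactly the given parameter types. */
    public static Optional<Constructor<?>> find(Class<?> cls, Class<?>... params) {
        try {
            Constructor<?> c = cls.getConstructor(params);
            return Optional.of(c);
        } catch (NoSuchMethodException e) {
            return Optional.empty();
        }
    }

    /** True if the class has a (), (Version) or (Version, Set) constructor. */
    public static boolean isUsable(Class<?> cls) {
        return find(cls).isPresent()
                || find(cls, Version.class).isPresent()
                || find(cls, Version.class, Set.class).isPresent();
    }

    /** Calls a public static getDefaultStopSet() if the class declares one. */
    public static Optional<Set<?>> defaultStopSet(Class<?> cls) {
        try {
            Method m = cls.getMethod("getDefaultStopSet");
            if (Modifier.isStatic(m.getModifiers())
                    && Set.class.isAssignableFrom(m.getReturnType())) {
                return Optional.of((Set<?>) m.invoke(null));
            }
        } catch (ReflectiveOperationException e) {
            // No such method, or it could not be invoked: treat as "no stop set found".
        }
        return Optional.empty();
    }
}
```

With real Lucene classes, `Version` would be Lucene's own class; the sketch does not cover analyzers that keep their default stop words somewhere other than `getDefaultStopSet()`.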
Configuration is by means of the bigdata properties file. All relevant properties start with com.bigdata.search.ConfigurableAnalyzerFactory, which we abbreviate to c.b.s.C in this documentation. Properties from ConfigurableAnalyzerFactory.Options apply to the factory.
Other properties, from ConfigurableAnalyzerFactory.AnalyzerOptions, start with c.b.s.C.analyzer.language-range, where language-range conforms to the extended language range construct from RFC 4647, section 2.2. Because bigdata does not allow '*' in property names, the character '_' substitutes for '*' in extended language ranges in property names. These properties specify an analyzer for the given language range.
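As an illustration of the naming rule, the following sketch builds the property-name prefix for a language range by substituting '_' for '*'. The class and method names are hypothetical; the option name that follows the prefix comes from ConfigurableAnalyzerFactory.AnalyzerOptions and is deliberately not shown here.

```java
public class AnalyzerPropertyKeySketch {

    /** The c.b.s.C.analyzer prefix, written out in full. */
    static final String PREFIX =
            "com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.";

    /** Returns the property-name prefix for an extended language range. */
    public static String keyPrefixFor(String languageRange) {
        // '*' is not allowed in bigdata property names, so '_' stands in for it.
        return PREFIX + languageRange.replace('*', '_');
    }
}
```

For example, the range `*-CH` maps to a prefix ending in `analyzer._-CH`, and the catch-all range `*` maps to one ending in `analyzer._`.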
If no analyzer is specified for the language range * then the StandardAnalyzer is used.
For any specific language, getAnalyzer(String, boolean) returns the analyzer matching the longest configured language range, measured in number of subtags. In the event of a tie, the alphabetically first language range is used. The algorithm to find a match is "Extended Filtering" as defined in section 3.3.2 of RFC 4647.
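The selection rule can be sketched as follows. This is an independent illustration of Extended Filtering plus the longest-range and tie-break rules, not the factory's own implementation. Two details are assumptions: a `*` subtag counts toward the length of a range, and "alphabetically first" is taken to mean the natural `String` ordering. A null language code (the default Locale case) is not handled.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Locale;
import java.util.Optional;

public class LanguageRangeMatchSketch {

    /** Extended Filtering, RFC 4647 section 3.3.2: does the range match the tag? */
    public static boolean matches(String range, String tag) {
        String[] r = range.toLowerCase(Locale.ROOT).split("-");
        String[] t = tag.toLowerCase(Locale.ROOT).split("-");
        // The first subtags must match; a wildcard matches anything.
        if (!r[0].equals("*") && !r[0].equals(t[0])) {
            return false;
        }
        int ri = 1;
        int ti = 1;
        while (ri < r.length) {
            if (r[ri].equals("*")) {
                ri++;                       // wildcard: skip it in the range
            } else if (ti >= t.length) {
                return false;               // tag exhausted before the range
            } else if (r[ri].equals(t[ti])) {
                ri++;                       // subtags agree: advance both
                ti++;
            } else if (t[ti].length() == 1) {
                return false;               // a singleton in the tag blocks skipping
            } else {
                ti++;                       // skip a non-matching tag subtag
            }
        }
        return true;
    }

    /** Picks the matching range with the most subtags; ties go to natural order. */
    public static Optional<String> bestRange(Collection<String> configured, String tag) {
        return configured.stream()
                .filter(range -> matches(range, tag))
                .min(Comparator
                        .comparingInt((String range) -> -range.split("-").length)
                        .thenComparing(Comparator.<String>naturalOrder()));
    }
}
```

With the configured ranges `*`, `de` and `de-CH`, the tag `de-CH-1996` selects `de-CH`, while the tag `fr` falls back to `*`.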
Some useful analyzers are as follows:

KeywordAnalyzer

This treats every lexical value as a single search token

WhitespaceAnalyzer

This uses whitespace to tokenize

PatternAnalyzer

This uses a regular expression to tokenize

TermCompletionAnalyzer

This uses up to three regular expressions to specify multiple tokens for each word, to address term completion use cases.

EmptyAnalyzer

This suppresses the functionality, by treating every expression as a stop word.

In addition, the language-specific analyzers are included by using the option ConfigurableAnalyzerFactory.Options.NATURAL_LANGUAGE_SUPPORT.
Author:

jeremycarroll

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static interface`	`ConfigurableAnalyzerFactory.AnalyzerOptions` Options understood by analyzers created by `ConfigurableAnalyzerFactory`.
`static interface`	`ConfigurableAnalyzerFactory.Options` Options understood by the `ConfigurableAnalyzerFactory`.

Constructor Summary

Constructors
Constructor and Description
`ConfigurableAnalyzerFactory(FullTextIndex<?> fullTextIndex)` Builds a new ConfigurableAnalyzerFactory.

Method Summary

Methods
Modifier and Type	Method and Description
`org.apache.lucene.analysis.Analyzer`	`getAnalyzer(String languageCode, boolean filterStopwords)` Return the token analyzer to be used for the given language code.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ConfigurableAnalyzerFactory
```
public ConfigurableAnalyzerFactory(FullTextIndex<?> fullTextIndex)
```
    Builds a new ConfigurableAnalyzerFactory.
    
    Parameters:
    fullTextIndex -
- Method Detail
  - getAnalyzer
```
public org.apache.lucene.analysis.Analyzer getAnalyzer(String languageCode,
                                              boolean filterStopwords)
```
    Description copied from interface: IAnalyzerFactory
    
    Return the token analyzer to be used for the given language code.
    
    Specified by:
    
    getAnalyzer in interface IAnalyzerFactory
    
    Parameters:
    languageCode - The language code or null to use the default Locale.
    filterStopwords - if false, return an analyzer with no stopwords
    
    Returns:
    The token analyzer best suited to the indicated language family.
