public class ConfigurableAnalyzerFactory extends Object implements IAnalyzerFactory
This class specifies which Analyzers are used for which languages.
Languages are specified by the language tag on RDF literals, which conforms to RFC 5646.
Within bigdata, plain literals are assigned to the default locale's language.
The bigdata properties are used to map language ranges, as specified by RFC 4647, to classes which extend Analyzer.
Supported classes include all the natural language specific classes from Lucene, and also:
PatternAnalyzer
TermCompletionAnalyzer
KeywordAnalyzer
SimpleAnalyzer
StopAnalyzer
WhitespaceAnalyzer
StandardAnalyzer
More generally, any subclass of Analyzer that has at least one constructor matching one of the following is usable:
Version
Version, Set
If the class has a getDefaultStopSet() method, then this is assumed
to do what it says on the can; some of the Lucene analyzers store their default stop words elsewhere,
and such stopwords are usable by this class. If no stop word set can be found, and there is a constructor without
stopwords and a constructor with stopwords, then the former is assumed to use a default stop word set.
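The stop word lookup described above can be illustrated with plain reflection. The following is a minimal sketch, not the factory's actual code: the class `StopSetProbe`, the helper `defaultStopSetOrNull`, and the stand-in `DemoAnalyzer` are hypothetical names, and the sketch assumes that `getDefaultStopSet()` is a public static method taking no arguments.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StopSetProbe {

    /** Hypothetical stand-in for an analyzer class; not a Lucene class. */
    public static class DemoAnalyzer {
        public static Set<String> getDefaultStopSet() {
            return new HashSet<>(Arrays.asList("a", "the"));
        }
    }

    /**
     * Returns whatever a public static no-argument getDefaultStopSet()
     * yields, or null if the class has no such method (assumption: a
     * missing or non-static method simply means "no default stop set").
     */
    static Object defaultStopSetOrNull(Class<?> cls) {
        try {
            Method m = cls.getMethod("getDefaultStopSet");
            if (!Modifier.isStatic(m.getModifiers())) {
                return null;
            }
            return m.invoke(null); // static method, so no receiver object
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }
}
```

Calling `StopSetProbe.defaultStopSetOrNull(StopSetProbe.DemoAnalyzer.class)` returns the stand-in's stop word set, while a class without the method, such as `String.class`, yields `null`.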
Configuration is by means of the bigdata properties file.
All relevant properties start with com.bigdata.search.ConfigurableAnalyzerFactory, which we abbreviate to c.b.s.C in this documentation.
Properties from ConfigurableAnalyzerFactory.Options apply to the factory.
Other properties, from ConfigurableAnalyzerFactory.AnalyzerOptions, start with c.b.s.C.analyzer.language-range, where language-range conforms to the extended language range construct from RFC 4647, section 2.2.
Because bigdata does not allow '*' in property names, the character '_' is used to substitute for '*' in extended language ranges in property names.
These properties are used to specify an analyzer for the given language range.
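As an illustration of the naming rule above, the following sketch builds the property-name prefix for a given extended language range. The class `AnalyzerPropertyKeys` and the method `analyzerPrefix` are hypothetical helpers, not part of bigdata; only the prefix string and the '_' for '*' substitution come from this documentation. The option names that follow the prefix are defined by ConfigurableAnalyzerFactory.AnalyzerOptions and are not shown here.

```java
class AnalyzerPropertyKeys {

    /** The full prefix that this documentation abbreviates to c.b.s.C. */
    static final String FACTORY_PREFIX =
            "com.bigdata.search.ConfigurableAnalyzerFactory";

    /**
     * Builds the property-name prefix for one extended language range,
     * writing '_' wherever the range contains '*'.
     */
    static String analyzerPrefix(String languageRange) {
        return FACTORY_PREFIX + ".analyzer." + languageRange.replace('*', '_');
    }
}
```

For example, the range `de-*-DE` maps to the prefix `com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.de-_-DE`, and the catch-all range `*` maps to a prefix ending in `.analyzer._`.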
If no analyzer is specified for the language range *, then the StandardAnalyzer is used.
Given any specific language, getAnalyzer(String, boolean) returns the analyzer matching the longest configured language range, measured in number of subtags.
In the event of a tie, the alphabetically first language range is used.
The algorithm to find a match is "Extended Filtering" as defined in section 3.3.2 of RFC 4647.
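The selection rule above can be sketched as follows. This is an illustrative reading of RFC 4647 section 3.3.2 plus the tie-break described here, not the factory's actual implementation. The class and method names are hypothetical, and two details are assumptions: wildcard subtags count toward the length of a range, and "alphabetically first" is taken to mean `String.compareTo` order.

```java
import java.util.Collection;
import java.util.Locale;

class LanguageRangeSketch {

    /** Extended Filtering (RFC 4647, section 3.3.2) for one range and one tag. */
    static boolean matches(String range, String tag) {
        String[] r = range.toLowerCase(Locale.ROOT).split("-");
        String[] t = tag.toLowerCase(Locale.ROOT).split("-");
        // Step 1: the first subtags must match, unless the range starts with '*'.
        if (!r[0].equals("*") && !r[0].equals(t[0])) {
            return false;
        }
        int ri = 1;
        int ti = 1;
        while (ri < r.length) {
            if (r[ri].equals("*")) {
                ri++;                       // a wildcard is simply skipped
            } else if (ti >= t.length) {
                return false;               // the tag ran out of subtags
            } else if (r[ri].equals(t[ti])) {
                ri++;
                ti++;                       // matched, advance both
            } else if (t[ti].length() == 1) {
                return false;               // a singleton in the tag blocks skipping
            } else {
                ti++;                       // skip a non-matching tag subtag
            }
        }
        return true;                        // every range subtag was consumed
    }

    static int subtagCount(String range) {
        return range.split("-").length;     // assumption: '*' counts as a subtag
    }

    /** Longest matching range wins; ties go to the alphabetically first range. */
    static String bestRange(Collection<String> configured, String tag) {
        String best = null;
        for (String range : configured) {
            if (!matches(range, tag)) {
                continue;
            }
            if (best == null) {
                best = range;
                continue;
            }
            int cmp = Integer.compare(subtagCount(range), subtagCount(best));
            if (cmp > 0 || (cmp == 0 && range.compareTo(best) < 0)) {
                best = range;
            }
        }
        return best;                        // null when nothing matches
    }
}
```

With the ranges `*`, `en` and `en-GB` configured, the tag `en-GB` selects `en-GB` and the tag `fr` falls back to `*`. The range `de-*-DE` matches `de-Latn-DE` but not `de-x-DE`, in line with the examples given in RFC 4647.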
Some useful analyzers are as follows:
KeywordAnalyzer
WhitespaceAnalyzer
PatternAnalyzer
TermCompletionAnalyzer
EmptyAnalyzer
ConfigurableAnalyzerFactory.Options.NATURAL_LANGUAGE_SUPPORT
Modifier and Type | Class and Description |
---|---|
static interface | ConfigurableAnalyzerFactory.AnalyzerOptions: Options understood by analyzers created by ConfigurableAnalyzerFactory. |
static interface | ConfigurableAnalyzerFactory.Options: Options understood by the ConfigurableAnalyzerFactory. |
Constructor and Description |
---|
ConfigurableAnalyzerFactory(FullTextIndex<?> fullTextIndex): Builds a new ConfigurableAnalyzerFactory. |
Modifier and Type | Method and Description |
---|---|
org.apache.lucene.analysis.Analyzer | getAnalyzer(String languageCode, boolean filterStopwords): Return the token analyzer to be used for the given language code. |
public ConfigurableAnalyzerFactory(FullTextIndex<?> fullTextIndex)
Parameters:
fullTextIndex -
public org.apache.lucene.analysis.Analyzer getAnalyzer(String languageCode, boolean filterStopwords)
Specified by:
getAnalyzer in interface IAnalyzerFactory
Parameters:
languageCode - The language code, or null to use the default Locale.
filterStopwords - if false, return an analyzer with no stopwords.
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.