public class TermCompletionAnalyzer
extends org.apache.lucene.analysis.Analyzer
This analyzer generates several index terms for each word in the input. These are intended to match short sequences (e.g. three or more characters) of user input, so that the user can then be given a drop-down list of matching terms.
This can be set up to address issues like matching half-time when the user types tim, or if the user types halft (treating the hyphen as a soft hyphen); or to match TermCompletionAnalyzer when the user types Ana.
In contrast, the Lucene Analyzers are mainly geared around the free text search use case.
The intended use cases will typically involve a prefix query of the form:
?t bds:search "prefix*" .
to find all literals in the selected graphs that are indexed by a term starting with prefix. The problem this class addresses is therefore finding the appropriate index terms to allow matching, at sensible points, mid-way through words (such as at hyphens).
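
The term generation described above can be illustrated with a minimal sketch in plain Java. This is not the class's actual Lucene token stream. The names TermSketch, indexTerms and matchesPrefix are hypothetical, and skipping empty or duplicate suffixes is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Blazegraph API.
class TermSketch {

    // Emit each whole word, then the remainder of the word after each
    // successive match of subWordBoundary (prefix removed up to and
    // including the match).
    static List<String> indexTerms(String input, Pattern wordBoundary, Pattern subWordBoundary) {
        List<String> terms = new ArrayList<>();
        for (String word : wordBoundary.split(input)) {
            if (word.isEmpty()) {
                continue;
            }
            terms.add(word);
            Matcher m = subWordBoundary.matcher(word);
            while (m.find()) {
                String suffix = word.substring(m.end());
                // Assumption: empty suffixes and repeats of the whole word are skipped.
                if (!suffix.isEmpty() && !suffix.equals(word)) {
                    terms.add(suffix);
                }
            }
        }
        return terms;
    }

    // A prefix query such as "tim*" matches if any index term starts with the typed text.
    static boolean matchesPrefix(List<String> terms, String typed) {
        for (String term : terms) {
            if (term.startsWith(typed)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, a subWordBoundary of "-" turns half-time into the terms half-time and time, so typing tim matches, while halft does not match until soft hyphens are also removed. A zero-width pattern such as (?<=\p{Ll})(?=\p{Lu}), also an assumption, would split TermCompletionAnalyzer at its capital letters so that typing Ana matches.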
To get maximum effectiveness it may be best to use private language subtags (see RFC 5646), e.g. "x-term", which are mapped to this class by ConfigurableAnalyzerFactory for the data being loaded into the store, and linked to some very simple process like KeywordAnalyzer for queries, which are tagged with a different language tag that is only used for bds:search, e.g. "x-query".
The above prefix query then becomes:
?t bds:search "prefix*"@x-query .
Constructor and Description |
---|
TermCompletionAnalyzer(Pattern wordBoundary, Pattern subWordBoundary) Divide the input into words, separated by the wordBoundary, and return a token for each whole word, and then generate further tokens for each word by removing prefixes up to and including each successive match of subWordBoundary. |
TermCompletionAnalyzer(Pattern wordBoundary, Pattern subWordBoundary, Pattern softHyphens, boolean alwaysRemoveSoftHypens) Divide the input into words and short tokens as with TermCompletionAnalyzer(Pattern, Pattern). |
Modifier and Type | Method and Description |
---|---|
protected org.apache.lucene.analysis.Analyzer.TokenStreamComponents | createComponents(String fieldName) |
public TermCompletionAnalyzer(Pattern wordBoundary, Pattern subWordBoundary, Pattern softHyphens, boolean alwaysRemoveSoftHypens)
Divide the input into words and short tokens as with TermCompletionAnalyzer(Pattern, Pattern). Each term is generated, and then an additional term is generated with soft hyphens (defined by the softHyphens pattern) removed. If the alwaysRemoveSoftHypens flag is true, then the first term (before the removal) is suppressed.
wordBoundary - The definition of space (e.g. " ")
subWordBoundary - Also index after matches to this (e.g. "-")
softHyphens - Discard these characters from matches
alwaysRemoveSoftHypens - If false, the discard step is optional.
public TermCompletionAnalyzer(Pattern wordBoundary, Pattern subWordBoundary)
wordBoundary -
subWordBoundary -
protected org.apache.lucene.analysis.Analyzer.TokenStreamComponents createComponents(String fieldName)
createComponents in class org.apache.lucene.analysis.Analyzer
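
The soft-hyphen step of the four-argument constructor can be sketched in plain Java as follows. This is an illustration, not the class's actual implementation: SoftHyphenSketch and withSoftHyphens are hypothetical names, and not emitting a second copy of a term that contains no soft hyphen is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Blazegraph API.
class SoftHyphenSketch {

    // For each already-generated term, also produce the term with all
    // matches of softHyphens removed. If alwaysRemoveSoftHypens is true,
    // the original (unstripped) term is suppressed.
    static List<String> withSoftHyphens(List<String> terms, Pattern softHyphens,
                                        boolean alwaysRemoveSoftHypens) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            String stripped = softHyphens.matcher(term).replaceAll("");
            if (!alwaysRemoveSoftHypens) {
                out.add(term); // keep the term as generated
                if (stripped.equals(term)) {
                    continue; // assumption: nothing was removed, so no duplicate
                }
            }
            if (!stripped.isEmpty()) {
                out.add(stripped);
            }
        }
        return out;
    }
}
```

With the terms half-time and time and a softHyphens pattern of "-", this sketch yields half-time, halftime and time when the flag is false, and halftime and time when it is true. In both cases a prefix query for halft now finds a matching term.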
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.