Type Parameters:
V - The generic type of the document identifier.

public class FullTextIndex<V extends Comparable<V>>
extends AbstractRelation
The basic data model consists of documents, fields in documents, and tokens extracted by an analyzer from those fields.
The frequency distributions may be normalized to account for a variety of effects, producing "term weights": for example, normalizing for document length or for the relative frequency of a term in the overall collection. Therefore the logical model is:
token : {docId, freq?, weight?}+

(For RDF, docId is the term identifier as assigned by the term:id index.)
The freq and weight are optional values that are representative of the kinds of statistical data that are kept on a per-token-document basis. The freq is the token frequency (the frequency of occurrence of the token in the document). The weight is generally a normalized token frequency weight for the token in that document in the context of the overall collection.
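For concreteness, here is a minimal sketch of one common weight normalization (an illustration only, not necessarily the normalization applied by this class): each raw token frequency is divided by the Euclidean norm of the document's frequency vector, yielding a unit-length weight vector.

```java
import java.util.HashMap;
import java.util.Map;

public class TermWeightsSketch {

    /**
     * Given raw token frequencies for one document, compute unit-length
     * normalized weights (freq / ||freq||). A sketch of one common
     * normalization; the index may apply other collection-level adjustments.
     */
    public static Map<String, Double> normalize(Map<String, Integer> freq) {
        double sumOfSquares = 0d;
        for (int f : freq.values()) {
            sumOfSquares += (double) f * f;
        }
        final double norm = Math.sqrt(sumOfSquares);
        final Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            weights.put(e.getKey(), norm == 0d ? 0d : e.getValue() / norm);
        }
        return weights;
    }
}
```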
In practice, we represent the data as follows:
{sortKey(token), weight, docId, fldId} : {freq?, sorted(pos)+}

That is, there is a distinct entry in the full text B+Tree for each field in each document in which a given token was recognized. The text of the token is not stored in the key, just the Unicode sort key generated from the token text. The value associated with the B+Tree entry is optional - it is simply not used unless we are storing statistics for the token-document pair. The advantages of this approach are: (a) it reuses the existing B+Tree data structures efficiently; (b) we are never faced with the possibility of overflow when a token is used in a large number of documents - the entries for the token will simply be spread across several leaves in the B+Tree; (c) leading key compression makes the resulting B+Tree very efficient; and (d) in a scale-out range partitioned index we can load balance the resulting index partitions by choosing the partition based on an even token boundary.
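The sketch below illustrates how such an entry key might be composed, following the {sortKey(token), weight, docId, fldId} layout given above. It uses java.text.Collator and java.nio.ByteBuffer directly for illustration; the class itself composes keys with its configured IKeyBuilder (see getKeyBuilder()), and a real key builder must encode the numeric components so that unsigned byte order agrees with numeric order.

```java
import java.nio.ByteBuffer;
import java.text.Collator;
import java.util.Locale;

public class EntryKeySketch {

    /**
     * Compose an illustrative index entry key: {sortKey(token), weight,
     * docId, fldId}. Only the Unicode sort key of the token is stored,
     * never the token text itself.
     */
    public static byte[] entryKey(String token, double weight, long docId, int fieldId) {
        final Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.PRIMARY); // few distinctions, short keys
        final byte[] sortKey = collator.getCollationKey(token).toByteArray();
        // NOTE: the raw IEEE-754 / two's-complement encodings used here do
        // NOT preserve numeric order under unsigned byte comparison; a real
        // key builder applies order-preserving encodings instead.
        final ByteBuffer buf = ByteBuffer.allocate(sortKey.length + 8 + 8 + 4);
        buf.put(sortKey);       // groups all entries for the token together
        buf.putDouble(weight);
        buf.putLong(docId);
        buf.putInt(fieldId);
        return buf.array();
    }
}
```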
A field is any pre-identified text container within a document. Field identifiers are 32-bit integers, so there are 2^32 distinct possible field identifiers. It is possible to manage the field identifiers through a secondary index, but that has no direct bearing on the structure of the full text index itself. Field identifiers appear after the token in the key so that queries may be expressed that will be matched against any field in the document. Likewise, field identifiers occur before the document identifier in the key since we always search across documents (in a search key, the document identifier is always Long.MIN_VALUE and the field identifier is always Integer.MIN_VALUE). There are many applications for fields: for example, distinct fields may be used for the title, abstract, and full text of a document, or for the CDATA section of each distinct element in documents corresponding to some DTD. The application is responsible for recognizing the fields in the document and producing the appropriate token stream, each of which must be tagged by the field.
A query is tokenized, producing a (possibly normalized) token-frequency vector. The relevance of documents to the query is generally taken as the cosine between the query's and each document's (possibly normalized) token-frequency vectors. The main effort of search is assembling a token-frequency vector for just those documents that overlap with the query. This is done using a key-range scan for each token in the query against the full text index:
fromKey := token, Long.MIN_VALUE
toKey   := successor(token), Long.MIN_VALUE

The scan extracts the appropriate token frequency, normalized token weight, or other statistic from each visited entry. When no value is associated with the entry we follow the convention of assuming a token frequency of ONE (1) for each document in which the token appears.
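As a sketch of the relevance computation itself (an illustration, not this class's internal code), the cosine between the query vector and a document vector assembled by those scans can be computed as follows:

```java
import java.util.Map;

public class CosineSketch {

    /**
     * Cosine between two sparse term vectors. Only tokens present in both
     * vectors contribute to the dot product, which is why search only needs
     * to visit documents that overlap the query.
     */
    public static double cosine(Map<String, Double> query, Map<String, Double> doc) {
        double dot = 0d;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            final Double d = doc.get(e.getKey());
            if (d != null) {
                dot += e.getValue() * d;
            }
        }
        final double denom = norm(query) * norm(doc);
        return denom == 0d ? 0d : dot / denom;
    }

    private static double norm(Map<String, Double> v) {
        double ss = 0d;
        for (double x : v.values()) {
            ss += x * x;
        }
        return Math.sqrt(ss);
    }
}
```

When the vectors are already unit-normalized, as described above, the denominator is one and the cosine reduces to the dot product.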
Tokenization is informed by the language code (when declared) and by the configured Locale for the database otherwise. An appropriate Analyzer is chosen based on the language code or Locale, and the "document" is broken into a token-frequency distribution (alternatively, a set of tokens). The same process is used to tokenize queries, and the API allows the caller to specify the language code used to select the Analyzer to tokenize the query.
Once the tokens are formed, the language code / Locale used to produce the token is discarded (it is not represented in the index). The reason for this is that we never utilize the total ordering of the full text index, merely the manner in which it groups tokens that map onto the same Unicode sort key together. Further, we use only a single Unicode collator configuration regardless of the language family in which the token was originally expressed. Unlike the collator used by the terms index (which often is set at IDENTICAL strength), the collator used by the full text index should be chosen such that it makes relatively few distinctions in order to increase recall (e.g., set at PRIMARY strength). Since a total order over the full text index is not critical from the perspective of its IR application, the Locale for the collator is likewise not critical, and PRIMARY strength will produce significantly shorter Unicode sort keys.
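The effect of collator strength can be seen with the standard java.text.Collator (shown for illustration; the index's actual collator configuration is governed by its options). At PRIMARY strength, case and accent differences collapse onto the same sort key, which is what increases recall:

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengthDemo {
    public static void main(String[] args) {
        final Collator primary = Collator.getInstance(Locale.ROOT);
        primary.setStrength(Collator.PRIMARY);
        // Case and accent differences are ignored at PRIMARY strength, so
        // these tokens map onto the same (short) Unicode sort key.
        System.out.println(primary.compare("resume", "Résumé")); // prints 0

        final Collator identical = Collator.getInstance(Locale.ROOT);
        identical.setStrength(Collator.IDENTICAL);
        // At IDENTICAL strength (as often used by the terms index) the same
        // two tokens remain distinct.
        System.out.println(identical.compare("resume", "Résumé")); // non-zero
    }
}
```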
For RDF, the term frequency within the indexed literal is an optional property associated with each term identifier, as is the computed weight for the token in that literal.
Note: Documents should be tokenized using an Analyzer appropriate for their declared language code (if any). However, once tokenized, the language code is discarded and we perform search purely on the Unicode sort keys resulting from the extracted tokens.
Because the first component in the key is the token, both updates (when indexing documents) and queries (reading against different tokens) will be scattered across shards. Therefore it is not necessary to register a split handler for the full text index.
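Before the API reference below, a hypothetical end-to-end sketch of indexing and searching. The FullTextIndex constructor follows the signature documented below; the namespace, the ITx.UNISOLATED timestamp constant, the TokenBuffer constructor arguments, and the single-argument FullTextQuery constructor are assumptions for illustration (Blazegraph imports omitted for brevity).

```java
import java.io.StringReader;
import java.util.Properties;

public class FullTextIndexUsageSketch {

    public static void demo(IIndexManager indexManager) {
        // Hypothetical setup; real configuration comes via FullTextIndex.Options.
        final FullTextIndex<Long> ndx = new FullTextIndex<Long>(
                indexManager, "kb.search" /* assumed namespace */,
                ITx.UNISOLATED /* assumed timestamp constant */,
                new Properties());
        ndx.create(); // conditionally registers the index

        // Index field 0 of document 1 (the TokenBuffer ctor args are assumed).
        final TokenBuffer<Long> buffer = new TokenBuffer<Long>(1000, ndx);
        ndx.index(buffer, 1L, 0 /* fieldId */, "en",
                new StringReader("The quick brown fox"));
        buffer.flush(); // MUST flush remaining buffered data to the indices

        // Search; the FullTextQuery ctor shown here is an assumption.
        final Hiterator<Hit<Long>> hits =
                ndx.search(new ITextIndexer.FullTextQuery("quick fox"));
        while (hits.hasNext()) {
            System.out.println(hits.next());
        }
    }
}
```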
Modifier and Type | Class and Description |
---|---|
static interface | FullTextIndex.Options: Options understood by the FullTextIndex. |
Modifier and Type | Field and Description |
---|---|
static String | NAME_SEARCH: The basename of the search index. |
Fields inherited from class AbstractResource: indexManager
Constructor and Description |
---|
FullTextIndex(IIndexManager indexManager, String namespace, Long timestamp, Properties properties): Ctor specified by DefaultResourceLocator. |
Modifier and Type | Method and Description |
---|---|
Hit<V>[] | _search(ITextIndexer.FullTextQuery query) |
protected Hit<V>[] | applyRegex(Hit<V>[] hits, Pattern regex): Subclasses can override this method to do regex post-processing. |
int | count(ITextIndexer.FullTextQuery query): Perform a range count on a full text query. |
void | create(): Conditionally registers the necessary index(s). |
long | delete(IChunkedOrderedIterator itr): Remove elements from the relation. |
void | destroy(): Destroy any logically contained resources (relations, indices). |
protected Hit<V>[] | executeQuery(TermFrequencyData<V> qdata, boolean prefixMatch, long timeout, TimeUnit unit) |
protected org.apache.lucene.analysis.Analyzer | getAnalyzer(String languageCode, boolean filterStopwords): Return the token analyzer to be used for the given language code. |
Class<?> | getElementClass(): Return the class for the generic type of this relation. |
IIndex | getIndex(): The index used to associate term identifiers with tokens parsed from documents. |
Set<String> | getIndexNames(): Return the fully qualified name of each index maintained by this relation. |
protected IKeyBuilder | getKeyBuilder(): Return a ThreadLocal IKeyBuilder instance configured to support full text indexing and search. |
IKeyOrder | getKeyOrder(IPredicate p): Return the IKeyOrder for the predicate corresponding to the perfect access path. |
Iterator | getKeyOrders(): Return the IKeyOrders corresponding to the registered indices for this relation. |
IKeyOrder | getPrimaryKeyOrder(): Return the IKeyOrder for the primary index for the relation. |
protected org.apache.lucene.analysis.TokenStream | getTokenStream(String languageCode, Reader r, boolean filterStopwords): Tokenize text using an Analyzer that is appropriate to the specified language family. |
void | index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r): See index(TokenBuffer, long, int, String, Reader, boolean). |
void | index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r, boolean filterStopwords): Index a field in a document. |
long | insert(IChunkedOrderedIterator itr): Write elements on the relation. |
boolean | isOverwrite(): Return the value configured by the FullTextIndex.Options.OVERWRITE property. |
boolean | isReadOnly() |
protected Hit<V>[] | matchExact(Hit<V>[] hits, String query): Subclasses can override this method to do exact match processing. |
Object | newElement(List a, IBindingSet bindingSet): Create and return a new element. |
Hiterator<Hit<V>> | search(ITextIndexer.FullTextQuery query): Performs a full text search against indexed documents returning a hit list. |
protected Hit<V>[] | slice(ITextIndexer.FullTextQuery query, Hit<V>[] a) |
protected TermFrequencyData<V> | tokenize(ITextIndexer.FullTextQuery query) |
Methods inherited from class AbstractRelation: getAccessPath, getAccessPath, getAccessPath, getFQN, getFQN, getFQN, getIndex, getIndex, getIndex, newAccessPath, newIndexMetadata

Methods inherited from class AbstractResource: acquireExclusiveLock, assertWritable, getBareProperties, getChunkCapacity, getChunkOfChunksCapacity, getChunkTimeout, getCommitTime, getContainer, getContainerNamespace, getExecutorService, getFullyBufferedReadThreshold, getIndexManager, getMaxParallelSubqueries, getNamespace, getProperties, getProperty, getProperty, getTimestamp, init, isForceSerialExecution, toString, unlock

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface IRelation: getExecutorService, getIndexManager

Methods inherited from interface ILocatableResource: getContainerNamespace, getNamespace, getTimestamp, init
public static final transient String NAME_SEARCH
public FullTextIndex(IIndexManager indexManager, String namespace, Long timestamp, Properties properties)

Ctor specified by DefaultResourceLocator.

Parameters:
client - The client. Configuration information is obtained from the client. See FullTextIndex.Options.

See Also: FullTextIndex.Options
public IIndex getIndex()
public boolean isOverwrite()

Return the value configured by the FullTextIndex.Options.OVERWRITE property.

public final boolean isReadOnly()

Overrides: isReadOnly in class AbstractResource
public void create()

Specified by: create in interface IMutableResource
Overrides: create in class AbstractResource
Throws: IllegalStateException - if the client does not have write access.

public void destroy()

Description copied from interface: IMutableResource
Specified by: destroy in interface IMutableResource
Overrides: destroy in class AbstractResource
protected org.apache.lucene.analysis.Analyzer getAnalyzer(String languageCode, boolean filterStopwords)
Return the token analyzer to be used for the given language code.

Parameters:
languageCode - The language code or null to use the default Locale.

protected final IKeyBuilder getKeyBuilder()

Return a ThreadLocal IKeyBuilder instance configured to support full text indexing and search.

public void index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r)

See index(TokenBuffer, long, int, String, Reader, boolean). Uses a default filterStopwords value of true.
public void index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r, boolean filterStopwords)
Note: This method does NOT force a write on the indices. If the buffer overflows, then there will be an index write. Once the caller is done indexing, they MUST invoke TokenBuffer.flush() to force any data remaining in their buffer to the indices.

Note: If a document is pre-existing, then the existing data for that document MUST be removed unless you know that the fields to be found in the document will not have changed (they may have different contents, but the same fields exist in the old and new versions of the document).
Parameters:
buffer - Used to buffer writes onto the text index.
docId - The document identifier.
fieldId - The field identifier.
languageCode - The language code -or- null to use the default Locale.
r - A reader on the text to be indexed.
filterStopwords - if true, filter stopwords from the token stream

See Also: TokenBuffer.flush()
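A minimal sketch of the indexing pattern this method implies (the TokenBuffer constructor arguments and the field identifier assignments are assumptions for illustration; the flush() requirement is from the note above):

```java
import java.io.StringReader;

public class IndexFieldsSketch {

    /**
     * Index two fields of one document, then flush. The TokenBuffer
     * constructor arguments (capacity, index) are assumed for illustration.
     */
    public static <V extends Comparable<V>> void indexDocument(
            FullTextIndex<V> ndx, V docId, String title, String body) {
        final TokenBuffer<V> buffer = new TokenBuffer<V>(1000, ndx);
        // Distinct field identifiers for the title and the full text.
        ndx.index(buffer, docId, 0 /* title */, "en", new StringReader(title));
        ndx.index(buffer, docId, 1 /* body */, "en", new StringReader(body));
        // index(...) does NOT force a write; flush() pushes buffered data.
        buffer.flush();
    }
}
```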
protected org.apache.lucene.analysis.TokenStream getTokenStream(String languageCode, Reader r, boolean filterStopwords)
Tokenize text using an Analyzer that is appropriate to the specified language family.

Parameters:
languageCode - The language code -or- null to use the default Locale.
r - A reader on the text to be indexed.
filterStopwords - if true, filter stopwords from the token stream

public Hiterator<Hit<V>> search(ITextIndexer.FullTextQuery query)
The basic algorithm computes the cosine between the term-frequency vector of the query and that of each indexed "document". The cosine may be directly interpreted as the "relevance" of a "document" to the query. The query and document term-frequency vectors are normalized, so the cosine values are bounded in [0.0:1.0]. The higher the cosine, the more relevant the document is to the query. A cosine of less than .4 is rarely of any interest.
The implementation creates and runs a set of parallel tasks, one for each distinct token found in the query, and waits for those tasks to complete or for a timeout to occur. Each task uses a key-range scan on the terms index, collecting metadata for the matching "documents" and aggregating it on a "hit" for that document. Since the tasks run concurrently, there are concurrent writers on the "hits". On a timeout, the remaining tasks are interrupted.
The collection of hits is scored and hits that fail a threshold are discarded. The remaining hits are placed into a total order and the caller is returned an iterator which can read from that order. If the operation is interrupted, then only those IHits that have already been computed will be returned.
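The described evaluation strategy can be sketched as follows (a simplification, not the class's internal code; the per-token key-range scan is left as a placeholder):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class QueryEvaluationSketch {

    /**
     * One task per distinct query token; each task performs a key-range scan
     * and aggregates per-document metadata onto a shared "hit" map with
     * concurrent writers. A simplification of the strategy described above.
     */
    static Map<Long, Double> evaluate(ExecutorService service,
            Set<String> queryTokens, long timeout, TimeUnit unit)
            throws InterruptedException {
        final ConcurrentHashMap<Long, Double> hits = new ConcurrentHashMap<>();
        final List<Callable<Void>> tasks = queryTokens.stream()
                .map(token -> (Callable<Void>) () -> {
                    // Placeholder for the key-range scan over one token:
                    // for each (docId, weight) visited, fold into the hit:
                    // hits.merge(docId, weight, Double::sum);
                    return null;
                })
                .collect(Collectors.toList());
        // Wait for completion or timeout; invokeAll cancels tasks that have
        // not completed when the timeout expires, matching the
        // interrupt-on-timeout behavior described above.
        service.invokeAll(tasks, timeout, unit);
        return hits;
    }
}
```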
Parameters:
query - The query (it will be parsed into tokens).
languageCode - The language code that should be used when tokenizing the query -or- null to use the default Locale.
minCosine - The minimum cosine that will be returned.
maxCosine - The maximum cosine that will be returned.
minRank - The min rank of the search result.
maxRank - The max rank of the search result.
prefixMatch - When true, the matches will be on tokens which include the query tokens as a prefix. This includes exact matches as a special case when the prefix is the entire token, but it also allows longer matches. For example, free will be an exact match on free but a partial match on freedom. When false, only exact matches will be made.
matchAllTerms - if true, return only hits that match all search terms
timeout - The timeout -or- ZERO (0) for NO timeout (this is equivalent to using Long.MAX_VALUE).
unit - The unit in which the timeout is expressed.

public int count(ITextIndexer.FullTextQuery query)

Perform a range count on a full text query.
protected TermFrequencyData<V> tokenize(ITextIndexer.FullTextQuery query)
public Hit<V>[] _search(ITextIndexer.FullTextQuery query)
protected Hit<V>[] slice(ITextIndexer.FullTextQuery query, Hit<V>[] a)
protected Hit<V>[] executeQuery(TermFrequencyData<V> qdata, boolean prefixMatch, long timeout, TimeUnit unit)
protected Hit<V>[] matchExact(Hit<V>[] hits, String query)
protected Hit<V>[] applyRegex(Hit<V>[] hits, Pattern regex)
public long delete(IChunkedOrderedIterator itr)

Description copied from interface: IMutableRelation

Parameters:
itr - An iterator visiting the elements to be removed. Existing elements in the relation having a key equal to the key formed from the visited elements will be removed from the relation.

public long insert(IChunkedOrderedIterator itr)

Description copied from interface: IMutableRelation

Parameters:
itr - An iterator visiting the elements to be written.

public Set<String> getIndexNames()
Description copied from interface: IRelation

public Iterator getKeyOrders()

Description copied from interface: IRelation

Return the IKeyOrders corresponding to the registered indices for this relation. [rather than getIndexNames?]

public Object newElement(List a, IBindingSet bindingSet)
Description copied from interface: IRelation

Parameters:
a - An ordered list of variables and/or constants.
bindingSet - A set of bindings.

public Class<?> getElementClass()

Description copied from interface: IRelation

public IKeyOrder getPrimaryKeyOrder()

Description copied from interface: IRelation

Return the IKeyOrder for the primary index for the relation.

public IKeyOrder getKeyOrder(IPredicate p)

Description copied from interface: IRelation

Return the IKeyOrder for the predicate corresponding to the perfect access path. A perfect access path is one where the bound values in the predicate form a prefix in the key space of the corresponding index.

Returns: The IKeyOrder for the perfect access path -or- null if there is no index which provides a perfect access path for that predicate.

Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.